GitHub user ndwangsen opened a pull request:
https://github.com/apache/carbondata/pull/3059

[HOTFIX][DataLoad] Fix task assignment issue using NODE_MIN_SIZE_FIRST block assignment strategy

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

- [ ] Any interfaces changed? NA
- [ ] Any backward compatibility impacted? NA
- [ ] Document update required? NA
- [ ] Testing done: tested OK in local env
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ndwangsen/incubator-carbondata fix_load_min_size_bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/3059.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3059

----

commit 04d6bff55a5c9120ae8d5c4899a82bc63f1e2e37
Author: ndwangsen <luffy.wang@...>
Date: 2019-01-09T07:10:21Z

    [HOTFIX][DataLoad] Fix task assignment issue using NODE_MIN_SIZE_FIRST block assignment strategy.

----

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2231/

---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246286475

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -1164,4 +1156,35 @@ private static void deleteFiles(List<String> filesToBeDeleted) throws IOExceptio
       FileFactory.deleteFile(filePath, FileFactory.getFileType(filePath));
     }
   }
+
+  /**
+   * This method will calculate the average expected size for each node
+   *
+   * @param blockInfos blocks
+   * @param uniqueBlocks unique blocks
+   * @param numOfNodes if number of nodes has to be decided
+   *          based on block location information
+   * @param blockAssignmentStrategy strategy used to assign blocks
+   * @return the average expected size for each node
+   */
+  private static long calcAvgLoadSizePerNode(List<Distributable> blockInfos,

--- End diff --

Please separate the code that computes numberOfBlocksPerNode from the code that computes dataSizePerNode, and use the required variable based on the BlockAssignmentStrategy. That way, I think, this function is no longer required. Using the same name for both purposes is confusing.

---
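A minimal sketch of the separation being suggested here, using the types shown in the diff; the class name, helper names, and import paths are illustrative assumptions, not code from the actual patch:

```java
import java.util.List;

// Import paths are assumed from the file under review; adjust to the
// actual CarbonData module layout. Both helpers assume numOfNodes > 0,
// as the original code does.
import org.apache.carbondata.core.datastore.block.Distributable;
import org.apache.carbondata.core.datastore.block.TableBlockInfo;

public final class PerNodeAverages {

  // Expected number of blocks per node; used when the strategy is
  // BLOCK_NUM_FIRST. Clamped to at least 1, mirroring the patch.
  static long calcBlocksPerNode(List<Distributable> blockInfos, int numOfNodes) {
    long blocksPerNode = blockInfos.size() / numOfNodes;
    return blocksPerNode <= 0 ? 1 : blocksPerNode;
  }

  // Expected data size in bytes per node; used when the strategy is
  // BLOCK_SIZE_FIRST. Sums the unique block lengths, then averages.
  static long calcDataSizePerNode(List<Distributable> uniqueBlocks, int numOfNodes) {
    long totalFileSize = 0;
    for (Distributable blockInfo : uniqueBlocks) {
      totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
    }
    return totalFileSize / numOfNodes;
  }
}
```

With the two quantities computed separately, the caller can pick whichever one the active BlockAssignmentStrategy needs, which is what makes a combined calcAvgLoadSizePerNode helper unnecessary.

---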
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246286723

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -609,6 +597,10 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
     } else {
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
+      // fall back to BLOCK_NUM_FIRST strategy need to recalculate
+      // the average expected size for each node
+      sizePerNode = calcAvgLoadSizePerNode(blockInfos,uniqueBlocks,

--- End diff --

Avoid calculating the same values multiple times. Calculate once and reuse based on the strategy.

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2451/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10488/

---
Github user ndwangsen commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246299700

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -609,6 +597,10 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
     } else {
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
+      // fall back to BLOCK_NUM_FIRST strategy need to recalculate
+      // the average expected size for each node
+      sizePerNode = calcAvgLoadSizePerNode(blockInfos,uniqueBlocks,

--- End diff --

OK, I will modify it.

---
Github user ndwangsen commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246299802

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -1164,4 +1156,35 @@ private static void deleteFiles(List<String> filesToBeDeleted) throws IOExceptio
       FileFactory.deleteFile(filePath, FileFactory.getFileType(filePath));
     }
   }
+
+  /**
+   * This method will calculate the average expected size for each node
+   *
+   * @param blockInfos blocks
+   * @param uniqueBlocks unique blocks
+   * @param numOfNodes if number of nodes has to be decided
+   *          based on block location information
+   * @param blockAssignmentStrategy strategy used to assign blocks
+   * @return the average expected size for each node
+   */
+  private static long calcAvgLoadSizePerNode(List<Distributable> blockInfos,

--- End diff --

OK, I will modify it.

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2235/

---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246309331

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -575,19 +575,23 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
     }

     // calculate the average expected size for each node
-    long sizePerNode = 0;
+    long numberOfBlocksPerNode = 0;
+    if (blockInfos.size() > 0) {
+      numberOfBlocksPerNode = blockInfos.size() / numOfNodes;
+    }
+    numberOfBlocksPerNode = numberOfBlocksPerNode <= 0 ? 1 : numberOfBlocksPerNode;
+    long dataSizePerNode = 0;
     long totalFileSize = 0;
+    for (Distributable blockInfo : uniqueBlocks) {
+      totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
+    }
+    dataSizePerNode = totalFileSize / numOfNodes;
+    long sizePerNode = 0;
     if (BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy) {
-      if (blockInfos.size() > 0) {
-        sizePerNode = blockInfos.size() / numOfNodes;
-      }
-      sizePerNode = sizePerNode <= 0 ? 1 : sizePerNode;
+      sizePerNode = numberOfBlocksPerNode;

--- End diff --

Please don't change the sizePerNode variable.

---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246311168

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -609,6 +613,9 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
     } else {
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
+      // fall back to BLOCK_NUM_FIRST strategy need to reset
+      // the average expected size for each node
+      sizePerNode = numberOfBlocksPerNode;

--- End diff --

Instead of reassigning the same variable, can assignBlocksByDataLocality() use numberOfBlocksPerNode directly?

---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246311819

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -575,19 +575,23 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
     }

     // calculate the average expected size for each node
-    long sizePerNode = 0;
+    long numberOfBlocksPerNode = 0;
+    if (blockInfos.size() > 0) {
+      numberOfBlocksPerNode = blockInfos.size() / numOfNodes;
+    }
+    numberOfBlocksPerNode = numberOfBlocksPerNode <= 0 ? 1 : numberOfBlocksPerNode;
+    long dataSizePerNode = 0;
     long totalFileSize = 0;
+    for (Distributable blockInfo : uniqueBlocks) {
+      totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
+    }
+    dataSizePerNode = totalFileSize / numOfNodes;
+    long sizePerNode = 0;
     if (BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy) {
-      if (blockInfos.size() > 0) {
-        sizePerNode = blockInfos.size() / numOfNodes;
-      }
-      sizePerNode = sizePerNode <= 0 ? 1 : sizePerNode;
+      sizePerNode = numberOfBlocksPerNode;

--- End diff --

This if-else can be avoided completely by using the correct variable in the method call for block allocation.

---
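A sketch of what the reviewer describes: select the correct variable once and pass it into the allocation call. The real parameter list of assignBlocksByDataLocality() is not shown in this thread, so the call shape and the nodeBlocksMap name below are hypothetical:

```java
// Pick the per-node expectation based on the strategy, instead of
// funneling both meanings through a single sizePerNode variable.
long expectedPerNode = BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy
    ? numberOfBlocksPerNode   // expected block count per node
    : dataSizePerNode;        // expected bytes per node

// Hypothetical call shape -- the actual signature may differ.
assignBlocksByDataLocality(nodeBlocksMap, uniqueBlocks, expectedPerNode,
    blockAssignmentStrategy);
```

---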
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2454/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10492/

---
Github user ndwangsen commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246317595

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -575,19 +575,23 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
     }

     // calculate the average expected size for each node
-    long sizePerNode = 0;
+    long numberOfBlocksPerNode = 0;
+    if (blockInfos.size() > 0) {
+      numberOfBlocksPerNode = blockInfos.size() / numOfNodes;
+    }
+    numberOfBlocksPerNode = numberOfBlocksPerNode <= 0 ? 1 : numberOfBlocksPerNode;
+    long dataSizePerNode = 0;
     long totalFileSize = 0;
+    for (Distributable blockInfo : uniqueBlocks) {
+      totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
+    }
+    dataSizePerNode = totalFileSize / numOfNodes;
+    long sizePerNode = 0;
     if (BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy) {
-      if (blockInfos.size() > 0) {
-        sizePerNode = blockInfos.size() / numOfNodes;
-      }
-      sizePerNode = sizePerNode <= 0 ? 1 : sizePerNode;
+      sizePerNode = numberOfBlocksPerNode;

--- End diff --

I think this modification is OK when using the BLOCK_NUM_FIRST block assignment strategy.

---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246317841

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -609,6 +613,9 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
     } else {
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
+      // fall back to BLOCK_NUM_FIRST strategy need to reset
+      // the average expected size for each node
+      sizePerNode = numberOfBlocksPerNode;

--- End diff --

assignLeftOverBlocks also needs similar if-else self-checks. I think it's OK; you can take a call on whether to refactor now or later.

---
Github user qiuchenjian commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246352937

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -609,6 +609,14 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
     } else {
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
+      // fall back to BLOCK_NUM_FIRST strategy need to reset
+      // the average expected size for each node
+      if (blockInfos.size() > 0) {

--- End diff --

```suggestion
      if (numOfNodes > 0) {
```

If blockInfos.size() == 0, sizePerNode will be 0, so there is no need for the if ... else. Does the case numOfNodes == 0 need to be considered?

---
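To make the two guards being compared concrete, a small sketch using the locals from the diff above (both variants assume the surrounding method's variables):

```java
// Guard in the patch: skip the division when there are no blocks.
long sizePerNode = 0;
if (blockInfos.size() > 0) {
  sizePerNode = blockInfos.size() / numOfNodes;
}

// Guard in the suggestion: protect the divisor instead. An empty
// blockInfos list already yields 0 without a special case; the open
// question is whether numOfNodes can legitimately be 0 here.
long sizePerNodeGuarded = numOfNodes > 0 ? blockInfos.size() / numOfNodes : 0;
```

---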
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2460/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10497/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2241/

---