[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...

classic Classic list List threaded Threaded
22 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...

qiuchenjian-2
GitHub user ndwangsen opened a pull request:

    https://github.com/apache/carbondata/pull/3059

    [HOTFIX][DataLoad]fix task assignment issue using NODE_MIN_SIZE_FIRST block assignment strategy

   
    fix task assignment issue using NODE_MIN_SIZE_FIRST block assignment strategy
   
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed?
     NA
     - [ ] Any backward compatibility impacted?
     NA
     - [ ] Document update required?
    NA
     - [ ] Testing done
     Test OK in local env
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
    NA


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ndwangsen/incubator-carbondata fix_load_min_size_bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/3059.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3059
   
----
commit 04d6bff55a5c9120ae8d5c4899a82bc63f1e2e37
Author: ndwangsen <luffy.wang@...>
Date:   2019-01-09T07:10:21Z

    [HOTFIX][DataLoad]fix task assignment issue using NODE_MIN_SIZE_FIRST
   
    block assignment strategy.

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #3059: [HOTFIX][DataLoad]fix task assignment issue using NO...

qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3059
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2231/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user KanakaKumar commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3059#discussion_r246286475
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -1164,4 +1156,35 @@ private static void deleteFiles(List<String> filesToBeDeleted) throws IOExceptio
           FileFactory.deleteFile(filePath, FileFactory.getFileType(filePath));
         }
       }
    +
    +  /**
    +   * This method will calculate the average expected size for each node
    +   *
    +   * @param blockInfos blocks
    +   * @param uniqueBlocks unique blocks
    +   * @param numOfNodes if number of nodes has to be decided
    +   *                   based on block location information
    +   * @param blockAssignmentStrategy strategy used to assign blocks
    +   * @return the average expected size for each node
    +   */
    +  private static long calcAvgLoadSizePerNode(List<Distributable> blockInfos,
    --- End diff --
   
    Please separate the code for identifying the numberOfBlocksPerNode and dataSizeperNode.
    Use the required variable based on the BlockAssignmentStartegy. This way this function also not required I think.
   
    Using same name for both purposes is confusing.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user KanakaKumar commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3059#discussion_r246286723
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -609,6 +597,10 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
               blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
             } else {
               blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
    +          // fall back to BLOCK_NUM_FIRST strategy need to recalculate
    +          // the average expected size for each node
    +          sizePerNode = calcAvgLoadSizePerNode(blockInfos,uniqueBlocks,
    --- End diff --
   
    Avoid calculating times same values. Calculate once and reuse based on the strategy.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #3059: [HOTFIX][DataLoad]fix task assignment issue using NO...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3059
 
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2451/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #3059: [HOTFIX][DataLoad]fix task assignment issue using NO...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3059
 
    Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10488/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ndwangsen commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3059#discussion_r246299700
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -609,6 +597,10 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
               blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
             } else {
               blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
    +          // fall back to BLOCK_NUM_FIRST strategy need to recalculate
    +          // the average expected size for each node
    +          sizePerNode = calcAvgLoadSizePerNode(blockInfos,uniqueBlocks,
    --- End diff --
   
    ok, i modify it.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ndwangsen commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3059#discussion_r246299802
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -1164,4 +1156,35 @@ private static void deleteFiles(List<String> filesToBeDeleted) throws IOExceptio
           FileFactory.deleteFile(filePath, FileFactory.getFileType(filePath));
         }
       }
    +
    +  /**
    +   * This method will calculate the average expected size for each node
    +   *
    +   * @param blockInfos blocks
    +   * @param uniqueBlocks unique blocks
    +   * @param numOfNodes if number of nodes has to be decided
    +   *                   based on block location information
    +   * @param blockAssignmentStrategy strategy used to assign blocks
    +   * @return the average expected size for each node
    +   */
    +  private static long calcAvgLoadSizePerNode(List<Distributable> blockInfos,
    --- End diff --
   
    ok, i modify it


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #3059: [HOTFIX][DataLoad]fix task assignment issue using NO...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3059
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2235/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user KanakaKumar commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3059#discussion_r246309331
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -575,19 +575,23 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
         }
     
         // calculate the average expected size for each node
    -    long sizePerNode = 0;
    +    long numberOfBlocksPerNode = 0;
    +    if (blockInfos.size() > 0) {
    +      numberOfBlocksPerNode = blockInfos.size() / numOfNodes;
    +    }
    +    numberOfBlocksPerNode = numberOfBlocksPerNode <= 0 ? 1 : numberOfBlocksPerNode;
    +    long dataSizePerNode = 0;
         long totalFileSize = 0;
    +    for (Distributable blockInfo : uniqueBlocks) {
    +      totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
    +    }
    +    dataSizePerNode = totalFileSize / numOfNodes;
    +    long sizePerNode = 0;
         if (BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy) {
    -      if (blockInfos.size() > 0) {
    -        sizePerNode = blockInfos.size() / numOfNodes;
    -      }
    -      sizePerNode = sizePerNode <= 0 ? 1 : sizePerNode;
    +      sizePerNode = numberOfBlocksPerNode;
    --- End diff --
   
    Please don't change sizePerNode variable


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user KanakaKumar commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3059#discussion_r246311168
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -609,6 +613,9 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
               blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
             } else {
               blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
    +          // fall back to BLOCK_NUM_FIRST strategy need to reset
    +          // the average expected size for each node
    +          sizePerNode = numberOfBlocksPerNode;
    --- End diff --
   
    instead of reassigning the same variable, assignBlocksByDataLocality () can use numberOfBlocksPerNode directly?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user KanakaKumar commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3059#discussion_r246311819
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -575,19 +575,23 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
         }
     
         // calculate the average expected size for each node
    -    long sizePerNode = 0;
    +    long numberOfBlocksPerNode = 0;
    +    if (blockInfos.size() > 0) {
    +      numberOfBlocksPerNode = blockInfos.size() / numOfNodes;
    +    }
    +    numberOfBlocksPerNode = numberOfBlocksPerNode <= 0 ? 1 : numberOfBlocksPerNode;
    +    long dataSizePerNode = 0;
         long totalFileSize = 0;
    +    for (Distributable blockInfo : uniqueBlocks) {
    +      totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
    +    }
    +    dataSizePerNode = totalFileSize / numOfNodes;
    +    long sizePerNode = 0;
         if (BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy) {
    -      if (blockInfos.size() > 0) {
    -        sizePerNode = blockInfos.size() / numOfNodes;
    -      }
    -      sizePerNode = sizePerNode <= 0 ? 1 : sizePerNode;
    +      sizePerNode = numberOfBlocksPerNode;
    --- End diff --
   
    This if else can be complete avoided and use the correct variable in the method call for blocks allocation


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #3059: [HOTFIX][DataLoad]fix task assignment issue using NO...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3059
 
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2454/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #3059: [HOTFIX][DataLoad]fix task assignment issue using NO...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3059
 
    Build Failed  with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10492/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ndwangsen commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3059#discussion_r246317595
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -575,19 +575,23 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
         }
     
         // calculate the average expected size for each node
    -    long sizePerNode = 0;
    +    long numberOfBlocksPerNode = 0;
    +    if (blockInfos.size() > 0) {
    +      numberOfBlocksPerNode = blockInfos.size() / numOfNodes;
    +    }
    +    numberOfBlocksPerNode = numberOfBlocksPerNode <= 0 ? 1 : numberOfBlocksPerNode;
    +    long dataSizePerNode = 0;
         long totalFileSize = 0;
    +    for (Distributable blockInfo : uniqueBlocks) {
    +      totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
    +    }
    +    dataSizePerNode = totalFileSize / numOfNodes;
    +    long sizePerNode = 0;
         if (BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy) {
    -      if (blockInfos.size() > 0) {
    -        sizePerNode = blockInfos.size() / numOfNodes;
    -      }
    -      sizePerNode = sizePerNode <= 0 ? 1 : sizePerNode;
    +      sizePerNode = numberOfBlocksPerNode;
    --- End diff --
   
    this modify i think is ok , if using BLOCK_NUM_FIRST block assignment strategy


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user KanakaKumar commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3059#discussion_r246317841
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -609,6 +613,9 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
               blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
             } else {
               blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
    +          // fall back to BLOCK_NUM_FIRST strategy need to reset
    +          // the average expected size for each node
    +          sizePerNode = numberOfBlocksPerNode;
    --- End diff --
   
    assignLeftOverBlocks also needs this similar if else self checks. I think its ok, you can   take a call to refactor now or later.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user qiuchenjian commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/3059#discussion_r246352937
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -609,6 +609,14 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
               blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
             } else {
               blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
    +          // fall back to BLOCK_NUM_FIRST strategy need to reset
    +          // the average expected size for each node
    +          if (blockInfos.size() > 0) {
    --- End diff --
   
    ```suggestion
              if (numOfNodes > 0) {
    ```
    if  blockInfos.size() = 0 , sizePerNode will be 0, so no need to add if ... else ...
    Do numOfNodes need to consider be 0?  


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #3059: [HOTFIX][DataLoad]fix task assignment issue using NO...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3059
 
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2460/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #3059: [HOTFIX][DataLoad]fix task assignment issue using NO...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3059
 
    Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10497/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #3059: [HOTFIX][DataLoad]fix task assignment issue using NO...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/3059
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2241/



---
12