[GitHub] carbondata pull request #2314: [CarbonData-2309][DataLoad] Add strategy to g...

classic Classic list List threaded Threaded
22 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2314: [CarbonData-2309][DataLoad] Add strategy to g...

qiuchenjian-2
GitHub user ndwangsen opened a pull request:

    https://github.com/apache/carbondata/pull/2314

    [CarbonData-2309][DataLoad] Add strategy to generate bigger carbondata files in case of small amo…

    In some scenario, the input amount of loading data is small, but carbondata still distribute them to each executors (nodes) to do local-sort, thus resulting to small carbondata files generated by each executor.
   
    In  some extreme conditions, if the cluster is big enough or if the amount of data is small enough, the carbondata file contains only one blocklet or page.
   
    I  think a new strategy should be introduced to solve the above problem.
   
    The new strategy should:
   
    be able to control the minimum amount of input data for each node
    ignore data locality otherwise it may always choose a small portion of particular nodes
   
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed?
       NO
     - [ ] Any backward compatibility impacted?
       NO
     - [ ] Document update required?
       YES
     - [ ] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
               YES
            - How it is tested? Please attach test report.
              Tested in local
            - Is it a performance related change? Please attach the performance test report.mac
              After this PR, performance is as we expected
            - Any additional information to help reviewers in testing this change.
              NO
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
              NO


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ndwangsen/incubator-carbondata load_min

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2314.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2314
   
----
commit 987921ef4d1c16e01b5c46384b8b1c356e3abe8a
Author: ndwangsen <luffy.wang@...>
Date:   2018-05-17T09:26:00Z

    Add strategy to generate bigger carbondata files in case of small amount of input data

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2314: [CARBONDATA-2309][DataLoad] Add strategy to g...

qiuchenjian-2
Github user ndwangsen closed the pull request at:

    https://github.com/apache/carbondata/pull/2314


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2314: [CARBONDATA-2309][DataLoad] Add strategy to g...

qiuchenjian-2
In reply to this post by qiuchenjian-2
GitHub user ndwangsen reopened a pull request:

    https://github.com/apache/carbondata/pull/2314

    [CARBONDATA-2309][DataLoad] Add strategy to generate bigger carbondata files in case of small amo…

    In some scenario, the input amount of loading data is small, but carbondata still distribute them to each executors (nodes) to do local-sort, thus resulting to small carbondata files generated by each executor.
   
    In  some extreme conditions, if the cluster is big enough or if the amount of data is small enough, the carbondata file contains only one blocklet or page.
   
    I  think a new strategy should be introduced to solve the above problem.
   
    The new strategy should:
   
    be able to control the minimum amount of input data for each node
    ignore data locality otherwise it may always choose a small portion of particular nodes
   
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed?
       NO
     - [ ] Any backward compatibility impacted?
       NO
     - [ ] Document update required?
       YES
     - [ ] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
               YES
            - How it is tested? Please attach test report.
              Tested in local
            - Is it a performance related change? Please attach the performance test report.mac
              After this PR, performance is as we expected
            - Any additional information to help reviewers in testing this change.
              NO
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
              NO


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ndwangsen/incubator-carbondata load_min

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2314.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2314
   
----
commit 987921ef4d1c16e01b5c46384b8b1c356e3abe8a
Author: ndwangsen <luffy.wang@...>
Date:   2018-05-17T09:26:00Z

    Add strategy to generate bigger carbondata files in case of small amount of input data

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    Can one of the admins verify this patch?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    Can one of the admins verify this patch?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    Can one of the admins verify this patch?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    retest this please


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4843/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6002/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    @ndwangsen This pr will help during concurrent query performance as number of task will be less,  but it will impact data loading performance as it will not use all the nodes and data locality is not considered.
   
    But adding a property is not the correct way. It's better to expose one data load strategy interface and add multiple implementation..like distribute the data loading across nodes or your pr scenario when user wants to distribute the data based on size. Please expose one data loading strategy interface and  concrete implementation for each type (existing+ size based distribution) and user can configure or pass in load parameter which strategy they want to opt based on their use case. If user is not passing any strategy implementation, it should take default implementation.  By this way user can add some custom strategy based on their use case


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    @kumarvishal09 yeah, you are right. I've communicated with @ndwangsen and a new strategy will be added.
    Besides, the intention is mainly for enhancing data loading performance.
    With this strategy, carbondata will generate bigger carbon files to avoid too many small files. We ignore the data locality on purpose to avoid the situation that tasks may always run part of the executors.
    In a word, this strategy targets to loading small amount of input data.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ndwangsen commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    @kumarvishal09 Yeah, I Has been modified in accordance with xuchuanyin's proposal, adding a strategy,this strategy targets to loading small amount of input data, Avoid generating a large number of small files.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    @ndwangsen What about the existing strategy???
     Can u please explain how u are planing to handle this??


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ndwangsen commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    @kumarvishal09  If the user specifies or default that the minimum data load of the node is less than the average data amount of each node, the existing strategy is used to handle


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2314: [CARBONDATA-2309][DataLoad] Add strategy to g...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2314#discussion_r190455207
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -596,20 +599,50 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
     
         // calculate the average expected size for each node
         long sizePerNode = 0;
    +    long totalFileSize = 0;
         if (BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy) {
           sizePerNode = blockInfos.size() / noofNodes;
           sizePerNode = sizePerNode <= 0 ? 1 : sizePerNode;
    -    } else if (BlockAssignmentStrategy.BLOCK_SIZE_FIRST == blockAssignmentStrategy) {
    -      long totalFileSize = 0;
    +    } else if (BlockAssignmentStrategy.BLOCK_SIZE_FIRST == blockAssignmentStrategy
    +        || BlockAssignmentStrategy.NODE_MIN_SIZE_FIRST == blockAssignmentStrategy) {
           for (Distributable blockInfo : uniqueBlocks) {
             totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
           }
           sizePerNode = totalFileSize / noofNodes;
         }
     
    -    // assign blocks to each node
    -    assignBlocksByDataLocality(rtnNode2Blocks, sizePerNode, uniqueBlocks, originNode2Blocks,
    -        activeNodes, blockAssignmentStrategy);
    +    // if enable to control the minimum amount of input data for each node
    +    if (BlockAssignmentStrategy.NODE_MIN_SIZE_FIRST == blockAssignmentStrategy) {
    +      long iLoadMinSize = 0;
    +      // validate the property load_min_size_inmb specified by the user
    +      if (CarbonUtil.validateValidIntType(loadMinSize)) {
    +        iLoadMinSize = Integer.parseInt(loadMinSize);
    +      } else {
    +        LOGGER.warn("Invalid load_min_size_inmb value found: " + loadMinSize
    +            + ", only int value greater than 0 is supported.");
    +        iLoadMinSize = CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_DEFAULT;
    +      }
    +      // If the average expected size for each node greater than load min size,
    +      // then fall back to default strategy
    +      if (iLoadMinSize * 1024 * 1024 < sizePerNode) {
    +        if (CarbonProperties.getInstance().isLoadSkewedDataOptimizationEnabled()) {
    +          blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
    +        } else {
    +          blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
    +        }
    +      } else {
    --- End diff --
   
    Better to add log
    ```
    LOG.info("Specified minimum data size to load is less than the average size for each node, fallback to default strategy" + blockAssignmentStrategy);
    ```


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2314: [CARBONDATA-2309][DataLoad] Add strategy to g...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2314#discussion_r190455538
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -575,11 +577,12 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
        * @param noOfNodesInput -1 if number of nodes has to be decided
        *                       based on block location information
        * @param blockAssignmentStrategy strategy used to assign blocks
    +   * @param loadMinSize the property load_min_size_inmb specified by the user
        * @return a map that maps node to blocks
        */
       public static Map<String, List<Distributable>> nodeBlockMapping(
           List<Distributable> blockInfos, int noOfNodesInput, List<String> activeNodes,
    -      BlockAssignmentStrategy blockAssignmentStrategy) {
    +      BlockAssignmentStrategy blockAssignmentStrategy, String loadMinSize ) {
    --- End diff --
   
    How about changing the name `loadMinSize` to `expectedMinSizePerNode`?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2314: [CARBONDATA-2309][DataLoad] Add strategy to g...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2314#discussion_r190455345
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/model/LoadOption.java ---
    @@ -196,6 +196,8 @@
         optionsFinal.put("single_pass", String.valueOf(singlePass));
         optionsFinal.put("sort_scope", "local_sort");
         optionsFinal.put("sort_column_bounds", Maps.getOrDefault(options, "sort_column_bounds", ""));
    +    optionsFinal.put(CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB,Maps.getOrDefault(options,CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB, CarbonCommonConstants
    --- End diff --
   
    add a space after `,`


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    @ndwangsen please resolve the conflicts and review comments


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2314: [CARBONDATA-2309][DataLoad] Add strategy to g...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ndwangsen commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2314#discussion_r190776632
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
    @@ -575,11 +577,12 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
        * @param noOfNodesInput -1 if number of nodes has to be decided
        *                       based on block location information
        * @param blockAssignmentStrategy strategy used to assign blocks
    +   * @param loadMinSize the property load_min_size_inmb specified by the user
        * @return a map that maps node to blocks
        */
       public static Map<String, List<Distributable>> nodeBlockMapping(
           List<Distributable> blockInfos, int noOfNodesInput, List<String> activeNodes,
    -      BlockAssignmentStrategy blockAssignmentStrategy) {
    +      BlockAssignmentStrategy blockAssignmentStrategy, String loadMinSize ) {
    --- End diff --
   
    ok,I modify it according to your reviewed message,thanks.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2314
 
    @ndwangsen I think still load is based on property....and based on enum you are checking which type of strategy it will choose , as already mentioned in earlier comment
    Please expose one data loading strategy interface and add concrete implementation for each type (existing+ size based distribution) and user can configure or pass in load parameter which strategy they want to opt based on their use case. If user is not passing any strategy implementation, it should take default implementation. By this way user can add some custom strategy based on their use case**


---
12