Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] carbondata pull request #2314: [CarbonData-2309][DataLoad] Add strategy to g...

Classic

List

22 messages Options

Options

12

[GitHub] carbondata pull request #2314: [CarbonData-2309][DataLoad] Add strategy to g...

GitHub user ndwangsen opened a pull request:

https://github.com/apache/carbondata/pull/2314

[CarbonData-2309][DataLoad] Add strategy to generate bigger carbondata files in case of small amoâ¦

In some scenario, the input amount of loading data is small, but carbondata still distribute them to each executors (nodes) to do local-sort, thus resulting to small carbondata files generated by each executor.

In some extreme conditions, if the cluster is big enough or if the amount of data is small enough, the carbondata file contains only one blocklet or page.

I think a new strategy should be introduced to solve the above problem.

The new strategy should:

be able to control the minimum amount of input data for each node
ignore data locality otherwise it may always choose a small portion of particular nodes

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

- [ ] Any interfaces changed?
NO
- [ ] Any backward compatibility impacted?
NO
- [ ] Document update required?
YES
- [ ] Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests are required?
YES
- How it is tested? Please attach test report.
Tested in local
- Is it a performance related change? Please attach the performance test report.mac
After this PR, performance is as we expected
- Any additional information to help reviewers in testing this change.
NO
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
NO

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ndwangsen/incubator-carbondata load_min

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/carbondata/pull/2314.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2314

----
commit 987921ef4d1c16e01b5c46384b8b1c356e3abe8a
Author: ndwangsen <luffy.wang@...>
Date: 2018-05-17T09:26:00Z

Add strategy to generate bigger carbondata files in case of small amount of input data

----

---

[GitHub] carbondata pull request #2314: [CARBONDATA-2309][DataLoad] Add strategy to g...

Github user ndwangsen closed the pull request at:

https://github.com/apache/carbondata/pull/2314

---

[GitHub] carbondata pull request #2314: [CARBONDATA-2309][DataLoad] Add strategy to g...

In reply to this post by qiuchenjian-2

GitHub user ndwangsen reopened a pull request:

https://github.com/apache/carbondata/pull/2314

[CARBONDATA-2309][DataLoad] Add strategy to generate bigger carbondata files in case of small amoâ¦

In some scenario, the input amount of loading data is small, but carbondata still distribute them to each executors (nodes) to do local-sort, thus resulting to small carbondata files generated by each executor.

In some extreme conditions, if the cluster is big enough or if the amount of data is small enough, the carbondata file contains only one blocklet or page.

I think a new strategy should be introduced to solve the above problem.

The new strategy should:

be able to control the minimum amount of input data for each node
ignore data locality otherwise it may always choose a small portion of particular nodes

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

- [ ] Any interfaces changed?
NO
- [ ] Any backward compatibility impacted?
NO
- [ ] Document update required?
YES
- [ ] Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests are required?
YES
- How it is tested? Please attach test report.
Tested in local
- Is it a performance related change? Please attach the performance test report.mac
After this PR, performance is as we expected
- Any additional information to help reviewers in testing this change.
NO
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
NO

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ndwangsen/incubator-carbondata load_min

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/carbondata/pull/2314.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2314

----
commit 987921ef4d1c16e01b5c46384b8b1c356e3abe8a
Author: ndwangsen <luffy.wang@...>
Date: 2018-05-17T09:26:00Z

Add strategy to generate bigger carbondata files in case of small amount of input data

----

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2314

Can one of the admins verify this patch?

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2314

Can one of the admins verify this patch?

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2314

Can one of the admins verify this patch?

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on the issue:

https://github.com/apache/carbondata/pull/2314

retest this please

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2314

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4843/

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2314

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6002/

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on the issue:

https://github.com/apache/carbondata/pull/2314

@ndwangsen This pr will help during concurrent query performance as number of task will be less, but it will impact data loading performance as it will not use all the nodes and data locality is not considered.

But adding a property is not the correct way. It's better to expose one data load strategy interface and add multiple implementation..like distribute the data loading across nodes or your pr scenario when user wants to distribute the data based on size. Please expose one data loading strategy interface and concrete implementation for each type (existing+ size based distribution) and user can configure or pass in load parameter which strategy they want to opt based on their use case. If user is not passing any strategy implementation, it should take default implementation. By this way user can add some custom strategy based on their use case

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/2314

@kumarvishal09 yeah, you are right. I've communicated with @ndwangsen and a new strategy will be added.
Besides, the intention is mainly for enhancing data loading performance.
With this strategy, carbondata will generate bigger carbon files to avoid too many small files. We ignore the data locality on purpose to avoid the situation that tasks may always run part of the executors.
In a word, this strategy targets to loading small amount of input data.

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user ndwangsen commented on the issue:

https://github.com/apache/carbondata/pull/2314

@kumarvishal09 Yeah, I Has been modified in accordance with xuchuanyin's proposal, adding a strategy,this strategy targets to loading small amount of input data, Avoid generating a large number of small files.

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on the issue:

https://github.com/apache/carbondata/pull/2314

@ndwangsen What about the existing strategy???
Can u please explain how u are planing to handle this??

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user ndwangsen commented on the issue:

https://github.com/apache/carbondata/pull/2314

@kumarvishal09 If the user specifies or default that the minimum data load of the node is less than the average data amount of each node, the existing strategy is used to handle

---

[GitHub] carbondata pull request #2314: [CARBONDATA-2309][DataLoad] Add strategy to g...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2314#discussion_r190455207

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
@@ -596,20 +599,50 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden

// calculate the average expected size for each node
long sizePerNode = 0;
+ long totalFileSize = 0;
if (BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy) {
sizePerNode = blockInfos.size() / noofNodes;
sizePerNode = sizePerNode <= 0 ? 1 : sizePerNode;
- } else if (BlockAssignmentStrategy.BLOCK_SIZE_FIRST == blockAssignmentStrategy) {
- long totalFileSize = 0;
+ } else if (BlockAssignmentStrategy.BLOCK_SIZE_FIRST == blockAssignmentStrategy
+ || BlockAssignmentStrategy.NODE_MIN_SIZE_FIRST == blockAssignmentStrategy) {
for (Distributable blockInfo : uniqueBlocks) {
totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
}
sizePerNode = totalFileSize / noofNodes;
}

- // assign blocks to each node
- assignBlocksByDataLocality(rtnNode2Blocks, sizePerNode, uniqueBlocks, originNode2Blocks,
- activeNodes, blockAssignmentStrategy);
+ // if enable to control the minimum amount of input data for each node
+ if (BlockAssignmentStrategy.NODE_MIN_SIZE_FIRST == blockAssignmentStrategy) {
+ long iLoadMinSize = 0;
+ // validate the property load_min_size_inmb specified by the user
+ if (CarbonUtil.validateValidIntType(loadMinSize)) {
+ iLoadMinSize = Integer.parseInt(loadMinSize);
+ } else {
+ LOGGER.warn("Invalid load_min_size_inmb value found: " + loadMinSize
+ + ", only int value greater than 0 is supported.");
+ iLoadMinSize = CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_DEFAULT;
+ }
+ // If the average expected size for each node greater than load min size,
+ // then fall back to default strategy
+ if (iLoadMinSize * 1024 * 1024 < sizePerNode) {
+ if (CarbonProperties.getInstance().isLoadSkewedDataOptimizationEnabled()) {
+ blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
+ } else {
+ blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
+ }
+ } else {
--- End diff --

Better to add log
```
LOG.info("Specified minimum data size to load is less than the average size for each node, fallback to default strategy" + blockAssignmentStrategy);
```

---

[GitHub] carbondata pull request #2314: [CARBONDATA-2309][DataLoad] Add strategy to g...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2314#discussion_r190455538

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
@@ -575,11 +577,12 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
* @param noOfNodesInput -1 if number of nodes has to be decided
* based on block location information
* @param blockAssignmentStrategy strategy used to assign blocks
+ * @param loadMinSize the property load_min_size_inmb specified by the user
* @return a map that maps node to blocks
*/
public static Map<String, List<Distributable>> nodeBlockMapping(
List<Distributable> blockInfos, int noOfNodesInput, List<String> activeNodes,
- BlockAssignmentStrategy blockAssignmentStrategy) {
+ BlockAssignmentStrategy blockAssignmentStrategy, String loadMinSize ) {
--- End diff --

How about changing the name `loadMinSize` to `expectedMinSizePerNode`?

---

[GitHub] carbondata pull request #2314: [CARBONDATA-2309][DataLoad] Add strategy to g...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2314#discussion_r190455345

--- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/model/LoadOption.java ---
@@ -196,6 +196,8 @@
optionsFinal.put("single_pass", String.valueOf(singlePass));
optionsFinal.put("sort_scope", "local_sort");
optionsFinal.put("sort_column_bounds", Maps.getOrDefault(options, "sort_column_bounds", ""));
+ optionsFinal.put(CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB,Maps.getOrDefault(options,CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB, CarbonCommonConstants
--- End diff --

add a space after `,`

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/2314

@ndwangsen please resolve the conflicts and review comments

---

[GitHub] carbondata pull request #2314: [CARBONDATA-2309][DataLoad] Add strategy to g...

In reply to this post by qiuchenjian-2

Github user ndwangsen commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2314#discussion_r190776632

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---
@@ -575,11 +577,12 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
* @param noOfNodesInput -1 if number of nodes has to be decided
* based on block location information
* @param blockAssignmentStrategy strategy used to assign blocks
+ * @param loadMinSize the property load_min_size_inmb specified by the user
* @return a map that maps node to blocks
*/
public static Map<String, List<Distributable>> nodeBlockMapping(
List<Distributable> blockInfos, int noOfNodesInput, List<String> activeNodes,
- BlockAssignmentStrategy blockAssignmentStrategy) {
+ BlockAssignmentStrategy blockAssignmentStrategy, String loadMinSize ) {
--- End diff --

okï¼I modify it according to your reviewed messageï¼thanks.

---

[GitHub] carbondata issue #2314: [CARBONDATA-2309][DataLoad] Add strategy to generate...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on the issue:

https://github.com/apache/carbondata/pull/2314

@ndwangsen I think still load is based on property....and based on enum you are checking which type of strategy it will choose , as already mentioned in earlier comment
Please expose one data loading strategy interface and add concrete implementation for each type (existing+ size based distribution) and user can configure or pass in load parameter which strategy they want to opt based on their use case. If user is not passing any strategy implementation, it should take default implementation. By this way user can add some custom strategy based on their use case**

---

12