GitHub user ndwangsen opened a pull request:
https://github.com/apache/carbondata/pull/3059

[HOTFIX][DataLoad] Fix task assignment issue using NODE_MIN_SIZE_FIRST block assignment strategy

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

- [ ] Any interfaces changed? NA
- [ ] Any backward compatibility impacted? NA
- [ ] Document update required? NA
- [ ] Testing done: tested OK in local env
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ndwangsen/incubator-carbondata fix_load_min_size_bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/3059.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3059

----

commit 04d6bff55a5c9120ae8d5c4899a82bc63f1e2e37
Author: ndwangsen <luffy.wang@...>
Date: 2019-01-09T07:10:21Z

    [HOTFIX][DataLoad] Fix task assignment issue using NODE_MIN_SIZE_FIRST block assignment strategy.

----

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2231/

---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246286475

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -1164,4 +1156,35 @@ private static void deleteFiles(List<String> filesToBeDeleted) throws IOExceptio
       FileFactory.deleteFile(filePath, FileFactory.getFileType(filePath));
     }
   }
+
+  /**
+   * This method will calculate the average expected size for each node
+   *
+   * @param blockInfos blocks
+   * @param uniqueBlocks unique blocks
+   * @param numOfNodes if number of nodes has to be decided
+   *          based on block location information
+   * @param blockAssignmentStrategy strategy used to assign blocks
+   * @return the average expected size for each node
+   */
+  private static long calcAvgLoadSizePerNode(List<Distributable> blockInfos,

--- End diff --

Please separate the code that computes numberOfBlocksPerNode from the code that computes dataSizePerNode, and use the required variable based on the BlockAssignmentStrategy. That way, I think, this function is no longer required. Using the same name for both purposes is confusing.

---
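A minimal sketch of the separation being suggested here, using the types shown in the diff; the class name, helper names, and import paths are illustrative assumptions, not code from the actual patch:

```java
import java.util.List;

// Import paths are assumed from the file under review; adjust to the
// actual CarbonData module layout. Both helpers assume numOfNodes > 0,
// as the original code does.
import org.apache.carbondata.core.datastore.block.Distributable;
import org.apache.carbondata.core.datastore.block.TableBlockInfo;

public final class PerNodeAverages {

  // Expected number of blocks per node; used when the strategy is
  // BLOCK_NUM_FIRST. Clamped to at least 1, mirroring the patch.
  static long calcBlocksPerNode(List<Distributable> blockInfos, int numOfNodes) {
    long blocksPerNode = blockInfos.size() / numOfNodes;
    return blocksPerNode <= 0 ? 1 : blocksPerNode;
  }

  // Expected data size in bytes per node; used when the strategy is
  // BLOCK_SIZE_FIRST. Sums the unique block lengths, then averages.
  static long calcDataSizePerNode(List<Distributable> uniqueBlocks, int numOfNodes) {
    long totalFileSize = 0;
    for (Distributable blockInfo : uniqueBlocks) {
      totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
    }
    return totalFileSize / numOfNodes;
  }
}
```

With the two quantities computed separately, the caller can pick whichever one the active BlockAssignmentStrategy needs, which is what makes a combined calcAvgLoadSizePerNode helper unnecessary.

---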
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246286723

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -609,6 +597,10 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
     } else {
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
+      // fall back to BLOCK_NUM_FIRST strategy need to recalculate
+      // the average expected size for each node
+      sizePerNode = calcAvgLoadSizePerNode(blockInfos,uniqueBlocks,

--- End diff --

Avoid calculating the same values multiple times. Calculate once and reuse based on the strategy.

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2451/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10488/

---
Github user ndwangsen commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246299700

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -609,6 +597,10 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
     } else {
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
+      // fall back to BLOCK_NUM_FIRST strategy need to recalculate
+      // the average expected size for each node
+      sizePerNode = calcAvgLoadSizePerNode(blockInfos,uniqueBlocks,

--- End diff --

OK, I will modify it.

---
Github user ndwangsen commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246299802

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -1164,4 +1156,35 @@ private static void deleteFiles(List<String> filesToBeDeleted) throws IOExceptio
       FileFactory.deleteFile(filePath, FileFactory.getFileType(filePath));
     }
   }
+
+  /**
+   * This method will calculate the average expected size for each node
+   *
+   * @param blockInfos blocks
+   * @param uniqueBlocks unique blocks
+   * @param numOfNodes if number of nodes has to be decided
+   *          based on block location information
+   * @param blockAssignmentStrategy strategy used to assign blocks
+   * @return the average expected size for each node
+   */
+  private static long calcAvgLoadSizePerNode(List<Distributable> blockInfos,

--- End diff --

OK, I will modify it.

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2235/

---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246309331

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -575,19 +575,23 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
     }

     // calculate the average expected size for each node
-    long sizePerNode = 0;
+    long numberOfBlocksPerNode = 0;
+    if (blockInfos.size() > 0) {
+      numberOfBlocksPerNode = blockInfos.size() / numOfNodes;
+    }
+    numberOfBlocksPerNode = numberOfBlocksPerNode <= 0 ? 1 : numberOfBlocksPerNode;
+    long dataSizePerNode = 0;
     long totalFileSize = 0;
+    for (Distributable blockInfo : uniqueBlocks) {
+      totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
+    }
+    dataSizePerNode = totalFileSize / numOfNodes;
+    long sizePerNode = 0;
     if (BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy) {
-      if (blockInfos.size() > 0) {
-        sizePerNode = blockInfos.size() / numOfNodes;
-      }
-      sizePerNode = sizePerNode <= 0 ? 1 : sizePerNode;
+      sizePerNode = numberOfBlocksPerNode;

--- End diff --

Please don't change the sizePerNode variable.

---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246311168

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -609,6 +613,9 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
     } else {
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
+      // fall back to BLOCK_NUM_FIRST strategy need to reset
+      // the average expected size for each node
+      sizePerNode = numberOfBlocksPerNode;

--- End diff --

Instead of reassigning the same variable, can assignBlocksByDataLocality() use numberOfBlocksPerNode directly?

---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246311819

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -575,19 +575,23 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
     }

     // calculate the average expected size for each node
-    long sizePerNode = 0;
+    long numberOfBlocksPerNode = 0;
+    if (blockInfos.size() > 0) {
+      numberOfBlocksPerNode = blockInfos.size() / numOfNodes;
+    }
+    numberOfBlocksPerNode = numberOfBlocksPerNode <= 0 ? 1 : numberOfBlocksPerNode;
+    long dataSizePerNode = 0;
     long totalFileSize = 0;
+    for (Distributable blockInfo : uniqueBlocks) {
+      totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
+    }
+    dataSizePerNode = totalFileSize / numOfNodes;
+    long sizePerNode = 0;
     if (BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy) {
-      if (blockInfos.size() > 0) {
-        sizePerNode = blockInfos.size() / numOfNodes;
-      }
-      sizePerNode = sizePerNode <= 0 ? 1 : sizePerNode;
+      sizePerNode = numberOfBlocksPerNode;

--- End diff --

This if-else can be avoided completely by using the correct variable in the method call for block allocation.

---
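A sketch of what the reviewer describes: select the correct variable once and pass it into the allocation call. The real parameter list of assignBlocksByDataLocality() is not shown in this thread, so the call shape and the nodeBlocksMap name below are hypothetical:

```java
// Pick the per-node expectation based on the strategy, instead of
// funneling both meanings through a single sizePerNode variable.
long expectedPerNode = BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy
    ? numberOfBlocksPerNode   // expected block count per node
    : dataSizePerNode;        // expected bytes per node

// Hypothetical call shape -- the actual signature may differ.
assignBlocksByDataLocality(nodeBlocksMap, uniqueBlocks, expectedPerNode,
    blockAssignmentStrategy);
```

---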
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2454/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10492/

---
Github user ndwangsen commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246317595

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -575,19 +575,23 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
     }

     // calculate the average expected size for each node
-    long sizePerNode = 0;
+    long numberOfBlocksPerNode = 0;
+    if (blockInfos.size() > 0) {
+      numberOfBlocksPerNode = blockInfos.size() / numOfNodes;
+    }
+    numberOfBlocksPerNode = numberOfBlocksPerNode <= 0 ? 1 : numberOfBlocksPerNode;
+    long dataSizePerNode = 0;
     long totalFileSize = 0;
+    for (Distributable blockInfo : uniqueBlocks) {
+      totalFileSize += ((TableBlockInfo) blockInfo).getBlockLength();
+    }
+    dataSizePerNode = totalFileSize / numOfNodes;
+    long sizePerNode = 0;
     if (BlockAssignmentStrategy.BLOCK_NUM_FIRST == blockAssignmentStrategy) {
-      if (blockInfos.size() > 0) {
-        sizePerNode = blockInfos.size() / numOfNodes;
-      }
-      sizePerNode = sizePerNode <= 0 ? 1 : sizePerNode;
+      sizePerNode = numberOfBlocksPerNode;

--- End diff --

I think this modification is OK when using the BLOCK_NUM_FIRST block assignment strategy.

---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246317841

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -609,6 +613,9 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
     } else {
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
+      // fall back to BLOCK_NUM_FIRST strategy need to reset
+      // the average expected size for each node
+      sizePerNode = numberOfBlocksPerNode;

--- End diff --

assignLeftOverBlocks also needs similar if-else self-checks. I think it's OK; you can take a call on whether to refactor now or later.

---
Github user qiuchenjian commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3059#discussion_r246352937

--- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java ---

@@ -609,6 +609,14 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST;
     } else {
       blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST;
+      // fall back to BLOCK_NUM_FIRST strategy need to reset
+      // the average expected size for each node
+      if (blockInfos.size() > 0) {

--- End diff --

```suggestion
      if (numOfNodes > 0) {
```

If blockInfos.size() == 0, sizePerNode will be 0, so there is no need for the if ... else. Does the case numOfNodes == 0 need to be considered?

---
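To make the two guards being compared concrete, a small sketch using the locals from the diff above (both variants assume the surrounding method's variables):

```java
// Guard in the patch: skip the division when there are no blocks.
long sizePerNode = 0;
if (blockInfos.size() > 0) {
  sizePerNode = blockInfos.size() / numOfNodes;
}

// Guard in the suggestion: protect the divisor instead. An empty
// blockInfos list already yields 0 without a special case; the open
// question is whether numOfNodes can legitimately be 0 here.
long sizePerNodeGuarded = numOfNodes > 0 ? blockInfos.size() / numOfNodes : 0;
```

---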
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2460/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10497/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3059

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2241/

---