[GitHub] [carbondata] marchpure opened a new pull request #3620: [CARBONDATA-3700] Optimize prune performance when prunning with multi…

GitBox

[GitHub] [carbondata] marchpure opened a new pull request #3620: [CARBONDATA-3700] Optimize prune performance when prunning with multi…

marchpure opened a new pull request #3620: [CARBONDATA-3700] Optimize prune performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620

…-threads

Why is this PR needed?
When pruning with multiple threads, a bug severely hampers pruning performance.
When pruning leaves no blocklets that match the query filter, the getExtendblocklet function is still triggered to fetch the extended blocklet metadata. When its input is an empty blocklet list, this function is expected to return an empty extendblocklet list directly, but a bug currently incurs the overhead of a meaningless "hashset add operation".
Meanwhile, when pruning with multiple threads, the getExtendblocklet function is triggered for each blocklet; this should be avoided by triggering the function once per segment instead.

What changes were proposed in this PR?
1) If the input to the getExtendblocklet function is an empty blocklet list, we return an empty extendblocklet list directly.
2) We trigger the getExtendblocklet function for each segment instead of for each blocklet.
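
The two changes above can be illustrated with a minimal sketch. It is not the CarbonData implementation: the class `ExtendedBlockletSketch`, the method `getExtendedBlocklets`, the `expensiveCalls` counter, and the use of plain strings for blocklets are all hypothetical stand-ins chosen to show the early return on empty input and a single call per segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the blocklet conversion step; none of these
// names are taken from the CarbonData code base.
final class ExtendedBlockletSketch {
  // Counts how often the costly conversion path runs (illustration only).
  static int expensiveCalls = 0;

  static List<String> getExtendedBlocklets(List<String> prunedBlocklets) {
    // Change 1: nothing survived pruning, so return before any setup work.
    if (prunedBlocklets.isEmpty()) {
      return Collections.emptyList();
    }
    expensiveCalls++;
    List<String> extended = new ArrayList<>(prunedBlocklets.size());
    for (String blocklet : prunedBlocklets) {
      extended.add("extended:" + blocklet);
    }
    return extended;
  }

  public static void main(String[] args) {
    // Change 2: one call per segment, passing all of that segment's blocklets,
    // rather than one call per blocklet.
    List<String> emptySegment = Collections.emptyList();
    List<String> segmentWithHits = Arrays.asList("b0", "b1");
    System.out.println(getExtendedBlocklets(emptySegment));
    System.out.println(getExtendedBlocklets(segmentWithHits));
    System.out.println("expensive calls: " + expensiveCalls);
  }
}
```

In this sketch the empty segment never reaches the costly path, and the segment with two blocklets reaches it once instead of twice.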

Does this PR introduce any user interface change?
No.

Is any new testcase added?
Yes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

CarbonDataQA1 commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#issuecomment-586338942

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/293/

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#issuecomment-586339554

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1997/

GitBox

[GitHub] [carbondata] marchpure commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

marchpure commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#issuecomment-586341516

retest this please

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#issuecomment-586351917

Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/294/

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#issuecomment-586385984

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1998/

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#discussion_r379987630

##########
File path: core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java
##########
@@ -138,8 +138,12 @@ public CarbonTable getTable() {
}
}
int numOfThreadsForPruning = CarbonProperties.getNumOfThreadsForPruning();
+ int carbonDriverPruningMultiThreadEnableFilesCount =
+ Integer.parseInt(CarbonProperties.getInstance().getProperty(
+ CarbonCommonConstants.CARBON_DRIVER_PRUNING_MULTI_THREAD_ENABLE_FILES_COUNT,

Review comment:
The documentation needs to be updated for the newly added property.

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#discussion_r379987838

##########
File path: core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java
##########
@@ -138,8 +138,12 @@ public CarbonTable getTable() {
}
}
int numOfThreadsForPruning = CarbonProperties.getNumOfThreadsForPruning();
+ int carbonDriverPruningMultiThreadEnableFilesCount =

Review comment:
Need to add validation for this carbon property: if someone configures a negative value, the default value should be used instead.
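
One way the requested validation could look is sketched below. This is only an illustration under stated assumptions: `PruningThresholdSketch` and `parseThreshold` are hypothetical names, the default of 1000 used in `main` is a placeholder rather than the real CarbonData default, and falling back to the default for non-numeric input is an extra assumption beyond what the comment asks for.

```java
// Hypothetical helper, not part of CarbonData: parses a configured files-count
// threshold and falls back to a default when the value is unusable.
final class PruningThresholdSketch {
  static int parseThreshold(String configured, int defaultValue) {
    if (configured == null) {
      return defaultValue; // property not set
    }
    try {
      int value = Integer.parseInt(configured.trim());
      // A negative count is meaningless, so use the default instead.
      return value < 0 ? defaultValue : value;
    } catch (NumberFormatException e) {
      // Assumption: non-numeric input is also treated as invalid.
      return defaultValue;
    }
  }

  public static void main(String[] args) {
    int placeholderDefault = 1000; // placeholder, not the real default
    System.out.println(parseThreshold("-5", placeholderDefault));
    System.out.println(parseThreshold("42", placeholderDefault));
    System.out.println(parseThreshold("abc", placeholderDefault));
  }
}
```

Parsing and validating in one helper keeps the call site in the pruning code free of exception handling.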

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#discussion_r379988045

##########
File path: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/blockprune/BlockPruneQueryTestCase.scala
##########
@@ -18,16 +18,30 @@ package org.apache.carbondata.spark.testsuite.blockprune

import java.io.DataOutputStream

+import org.apache.carbondata.core.constants.CarbonCommonConstants
import org.apache.spark.sql.Row
import org.scalatest.BeforeAndAfterAll
import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.util.CarbonProperties
import org.apache.spark.sql.test.util.QueryTest

/**
* This class contains test cases for block prune query
*/
class BlockPruneQueryTestCase extends QueryTest with BeforeAndAfterAll {
val outputPath = s"$resourcesPath/block_prune_test.csv"
+ val MULTI_THREAD_ENABLE_FILES_COUNT = "1";

Review comment:
use small case for variable names

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#discussion_r379988045

##########
File path: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/blockprune/BlockPruneQueryTestCase.scala
##########
@@ -18,16 +18,30 @@ package org.apache.carbondata.spark.testsuite.blockprune

import java.io.DataOutputStream

+import org.apache.carbondata.core.constants.CarbonCommonConstants
import org.apache.spark.sql.Row
import org.scalatest.BeforeAndAfterAll
import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.util.CarbonProperties
import org.apache.spark.sql.test.util.QueryTest

/**
* This class contains test cases for block prune query
*/
class BlockPruneQueryTestCase extends QueryTest with BeforeAndAfterAll {
val outputPath = s"$resourcesPath/block_prune_test.csv"
+ val MULTI_THREAD_ENABLE_FILES_COUNT = "1";

Review comment:
use camel case for variable names

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#discussion_r379988354

##########
File path: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/blockprune/BlockPruneQueryTestCase.scala
##########
@@ -18,16 +18,30 @@ package org.apache.carbondata.spark.testsuite.blockprune

import java.io.DataOutputStream

+import org.apache.carbondata.core.constants.CarbonCommonConstants
import org.apache.spark.sql.Row
import org.scalatest.BeforeAndAfterAll
import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.util.CarbonProperties
import org.apache.spark.sql.test.util.QueryTest

/**
* This class contains test cases for block prune query
*/
class BlockPruneQueryTestCase extends QueryTest with BeforeAndAfterAll {
val outputPath = s"$resourcesPath/block_prune_test.csv"
+ val MULTI_THREAD_ENABLE_FILES_COUNT = "1";

Review comment:
Even with this setting, it still won't prune with multiple threads, as the other conditions may not be satisfied.

GitBox

[GitHub] [carbondata] ajantha-bhat commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

ajantha-bhat commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#issuecomment-586821309

Good finding. It can avoid unnecessary creation of `TableBlockIndexUniqueIdentifierWrapper` if the pruned blocklet list is empty.

GitBox

[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

Indhumathi27 commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#discussion_r382530139

##########
File path: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/blockprune/BlockPruneQueryTestCase.scala
##########
@@ -18,16 +18,30 @@ package org.apache.carbondata.spark.testsuite.blockprune

import java.io.DataOutputStream

+import org.apache.carbondata.core.constants.CarbonCommonConstants
import org.apache.spark.sql.Row
import org.scalatest.BeforeAndAfterAll
import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.util.CarbonProperties
import org.apache.spark.sql.test.util.QueryTest

/**
* This class contains test cases for block prune query
*/
class BlockPruneQueryTestCase extends QueryTest with BeforeAndAfterAll {
val outputPath = s"$resourcesPath/block_prune_test.csv"
+ val MULTI_THREAD_ENABLE_FILES_COUNT = "1";
+ val MULTI_THREAD_DISABLE_FILES_COUNT
+ = CarbonCommonConstants.CARBON_DRIVER_PRUNING_MULTI_THREAD_ENABLE_FILES_COUNT_DEFAULT;
+
+ def perpareCarbonProperty(propertyName:String,
+ propertyValue:String): Unit ={
+ val properties = CarbonProperties.getInstance()
+ properties.removeProperty(propertyName)

Review comment:
removeProperty may not be required, as addProperty on the next line will update the key with the new value if it is already present.
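
The overwrite behaviour described in this comment can be shown with `java.util.Properties` as a stand-in. This sketch does not call CarbonProperties; whether `addProperty` behaves the same way is the reviewer's statement and is not verified here. The class `PropertyOverwriteSketch`, the method `setAndRead`, and the key `example.key` are hypothetical.

```java
import java.util.Properties;

// Stand-in using java.util.Properties to show that setting an existing key
// replaces its value, so removing the key first changes nothing.
final class PropertyOverwriteSketch {
  static String setAndRead(Properties props, String key, String value) {
    props.setProperty(key, value); // overwrites any existing value for key
    return props.getProperty(key);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("example.key", "old");
    // No remove() call beforehand: the second set replaces the first.
    System.out.println(setAndRead(props, "example.key", "new"));
    System.out.println(props.size());
  }
}
```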

GitBox

[GitHub] [carbondata] marchpure commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

marchpure commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#discussion_r384321352

##########
File path: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/blockprune/BlockPruneQueryTestCase.scala
##########
@@ -18,16 +18,30 @@ package org.apache.carbondata.spark.testsuite.blockprune

import java.io.DataOutputStream

+import org.apache.carbondata.core.constants.CarbonCommonConstants
import org.apache.spark.sql.Row
import org.scalatest.BeforeAndAfterAll
import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.util.CarbonProperties
import org.apache.spark.sql.test.util.QueryTest

/**
* This class contains test cases for block prune query
*/
class BlockPruneQueryTestCase extends QueryTest with BeforeAndAfterAll {
val outputPath = s"$resourcesPath/block_prune_test.csv"
+ val MULTI_THREAD_ENABLE_FILES_COUNT = "1";
+ val MULTI_THREAD_DISABLE_FILES_COUNT
+ = CarbonCommonConstants.CARBON_DRIVER_PRUNING_MULTI_THREAD_ENABLE_FILES_COUNT_DEFAULT;
+
+ def perpareCarbonProperty(propertyName:String,
+ propertyValue:String): Unit ={
+ val properties = CarbonProperties.getInstance()
+ properties.removeProperty(propertyName)

Review comment:
resolved

GitBox

[GitHub] [carbondata] marchpure commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

marchpure commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#discussion_r384321402

##########
File path: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/blockprune/BlockPruneQueryTestCase.scala
##########
@@ -18,16 +18,30 @@ package org.apache.carbondata.spark.testsuite.blockprune

import java.io.DataOutputStream

+import org.apache.carbondata.core.constants.CarbonCommonConstants
import org.apache.spark.sql.Row
import org.scalatest.BeforeAndAfterAll
import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.util.CarbonProperties
import org.apache.spark.sql.test.util.QueryTest

/**
* This class contains test cases for block prune query
*/
class BlockPruneQueryTestCase extends QueryTest with BeforeAndAfterAll {
val outputPath = s"$resourcesPath/block_prune_test.csv"
+ val MULTI_THREAD_ENABLE_FILES_COUNT = "1";

Review comment:
resolved

GitBox

[GitHub] [carbondata] marchpure commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

marchpure commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#discussion_r384321422

##########
File path: core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java
##########
@@ -138,8 +138,12 @@ public CarbonTable getTable() {
}
}
int numOfThreadsForPruning = CarbonProperties.getNumOfThreadsForPruning();
+ int carbonDriverPruningMultiThreadEnableFilesCount =

Review comment:
resolved

GitBox

[GitHub] [carbondata] marchpure commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

marchpure commented on a change in pull request #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#discussion_r384321440

##########
File path: core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java
##########
@@ -138,8 +138,12 @@ public CarbonTable getTable() {
}
}
int numOfThreadsForPruning = CarbonProperties.getNumOfThreadsForPruning();
+ int carbonDriverPruningMultiThreadEnableFilesCount =
+ Integer.parseInt(CarbonProperties.getInstance().getProperty(
+ CarbonCommonConstants.CARBON_DRIVER_PRUNING_MULTI_THREAD_ENABLE_FILES_COUNT,

Review comment:
resolved

GitBox

[GitHub] [carbondata] marchpure commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

marchpure commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#issuecomment-591288619

retest this please

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3620: [CARBONDATA-3700] Optimize pruning performance when prunning with multi…
URL: https://github.com/apache/carbondata/pull/3620#issuecomment-591295234

Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/487/
