[GitHub] [carbondata] maheshrajus opened a new pull request #3912: [WIP] Global sort partitions should be determined dynamically

classic Classic list List threaded Threaded
31 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] maheshrajus opened a new pull request #3912: [WIP] Global sort partitions should be determined dynamically

GitBox

maheshrajus opened a new pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912


    ### Why is this PR needed?
   
    [WIP] Global sort partitions should be determined dynamically
    ### What changes were proposed in this PR?
   
       
    ### Does this PR introduce any user interface change?
    - No
    - Yes. (please explain the change and update document)
   
    ### Is any new testcase added?
    - No
    - Yes
   
       
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [WIP] Global sort partitions should be determined dynamically

GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-687139428


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2235/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [WIP] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-687140085


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3975/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [WIP] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-688912501


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4004/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [WIP] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-688914915


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2264/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-689101595


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4012/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-689105224


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2274/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] maheshrajus commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

maheshrajus commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-705339174


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-705381449


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4333/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-705383450


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2583/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-705381449






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] maheshrajus commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

maheshrajus commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-705339174


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on a change in pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

akashrn5 commented on a change in pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#discussion_r503160924



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##########
@@ -227,9 +235,19 @@ object DataLoadProcessBuilderOnSpark {
     // 2. sort
     var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
       configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
-    if (numPartitions <= 0) {
+
+    // if numPartitions user does not specify then dynamically calculate
+    if (numPartitions == 0) {
+      // get the size in bytes and convert to size in MB
+      val sizeOfDataFrame = SizeEstimator.estimate(rdd)/1000000
+      // data frame size can not be more than Int size
+      numPartitions = sizeOfDataFrame.toInt/rdd.getNumPartitions

Review comment:
       @maheshrajus can you please explain the logic used to determine the partitions in the comment.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] QiangCai commented on a change in pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

QiangCai commented on a change in pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#discussion_r510558896



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##########
@@ -143,10 +143,18 @@ object DataLoadProcessBuilderOnSpark {
 
     var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
       configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
-    if (numPartitions <= 0) {
-      numPartitions = convertRDD.partitions.length
+
+    // if numPartitions user does not specify and not specified in config then dynamically calculate
+    if (numPartitions == 0) {
+      // get the size in bytes and convert to size in MB
+      val sizeOfDataFrame = SizeEstimator.estimate(inputRDD)/1000000
+      // data frame size can not be more than Int size
+      numPartitions = sizeOfDataFrame.toInt/inputRDD.getNumPartitions

Review comment:
       it should be   numberOfParitions = totalSize/partitionSize




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-716994724


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4696/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-716994984


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2939/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-717141636


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2947/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-717143281


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4704/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on a change in pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

akashrn5 commented on a change in pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#discussion_r503187128



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##########
@@ -143,10 +143,18 @@ object DataLoadProcessBuilderOnSpark {
 
     var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
       configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
-    if (numPartitions <= 0) {
-      numPartitions = convertRDD.partitions.length
+
+    // if numPartitions user does not specify and not specified in config then dynamically calculate
+    if (numPartitions == 0) {
+      // get the size in bytes and convert to size in MB
+      val sizeOfDataFrame = SizeEstimator.estimate(inputRDD)/1000000
+      // data frame size can not be more than Int size
+      numPartitions = sizeOfDataFrame.toInt/inputRDD.getNumPartitions

Review comment:
       i think here, you should try to get the partitions based on the total load size and the partition size and not the partition number.Please check again and handle correctly.
   
   @QiangCai please have a look once




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] maheshrajus commented on a change in pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

GitBox
In reply to this post by GitBox

maheshrajus commented on a change in pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#discussion_r512579906



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##########
@@ -143,10 +143,18 @@ object DataLoadProcessBuilderOnSpark {
 
     var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
       configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
-    if (numPartitions <= 0) {
-      numPartitions = convertRDD.partitions.length
+
+    // if numPartitions user does not specify and not specified in config then dynamically calculate
+    if (numPartitions == 0) {
+      // get the size in bytes and convert to size in MB
+      val sizeOfDataFrame = SizeEstimator.estimate(inputRDD)/1000000
+      // data frame size can not be more than Int size
+      numPartitions = sizeOfDataFrame.toInt/inputRDD.getNumPartitions

Review comment:
       @akashrn5 you reviewed old code, now i pushed again the new code and testing is in progress




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


12