[GitHub] [carbondata] shunlean opened a new pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

[GitHub] [carbondata] shunlean opened a new pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox

shunlean opened a new pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847


    ### Why is this PR needed?
   
   Currently, the write of a sort temp file can start only after the sort has finished, so the two steps run serially.
   For better performance, we want the writeDataToFile and SortDataRows operations to run in parallel.
   
    ### What changes were proposed in this PR?
   
   In (Unsafe)SortDataRows, we add new threads to run the file write operation.
   In one test case, the parallel operation reduced the time by about 10%.
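
   The idea of overlapping the sort with the file write can be illustrated with a minimal sketch. It is not the code from this PR: the class `ParallelSortWriteSketch`, its methods, the `int[]` pages, and the file names are all hypothetical stand-ins for the real row pages and sort temp files. The caller thread sorts each page and hands it to a writer pool, so sorting the next page can proceed while the previous one is still being written.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; names do not come from CarbonData.
class ParallelSortWriteSketch {

  // Sorts each page on the calling thread and submits the sorted page to a
  // writer pool, so sorting page N+1 can overlap with writing page N.
  static List<Path> sortAndWrite(List<int[]> pages, Path dir, int writerThreads)
      throws IOException, InterruptedException {
    ExecutorService writeService = Executors.newFixedThreadPool(writerThreads);
    List<Future<Path>> pending = new ArrayList<>();
    try {
      for (int p = 0; p < pages.size(); p++) {
        int[] page = pages.get(p).clone();
        Arrays.sort(page); // sort step, runs on the caller thread
        Path file = dir.resolve("sorttemp-" + p + ".tmp");
        // write step, runs on a pool thread
        pending.add(writeService.submit(() -> writePage(page, file)));
      }
      // Wait for every write and surface the first failure to the caller.
      List<Path> written = new ArrayList<>();
      for (Future<Path> future : pending) {
        try {
          written.add(future.get());
        } catch (ExecutionException e) {
          throw new IOException("writing a sort temp file failed", e.getCause());
        }
      }
      return written;
    } finally {
      writeService.shutdown();
    }
  }

  // Writes the entry count followed by the entries, mirroring the layout
  // visible in the diff (count first, then rows).
  static Path writePage(int[] sortedPage, Path file) throws IOException {
    try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
      out.writeInt(sortedPage.length);
      for (int value : sortedPage) {
        out.writeInt(value);
      }
    }
    return file;
  }

  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("sort-sketch");
    List<int[]> pages = Arrays.asList(new int[] {3, 1, 2}, new int[] {9, 7, 8});
    List<Path> files = sortAndWrite(pages, dir, 2);
    System.out.println("wrote " + files.size() + " sort temp files to " + dir);
  }
}
```

   Collecting the `Future` results is one possible way to keep write failures visible to the caller once the write no longer runs inline; whether that fits this code path is a design decision for the PR.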
       
    ### Does this PR introduce any user interface change?
    - No
    - Yes. (please explain the change and update document)
   
    ### Is any new testcase added?
    - No
    - Yes
   
       
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-659300018


   Can one of the admins verify this patch?



[GitHub] [carbondata] ajantha-bhat commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

ajantha-bhat commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-659307713


   Add to whitelist



[GitHub] [carbondata] ajantha-bhat commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

ajantha-bhat commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-659307892


   retest this please



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-659309836


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3402/
   



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-659311646


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1661/
   



[GitHub] [carbondata] Zhangshunyu commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

Zhangshunyu commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-659810429


   please check the build failure info



[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

Zhangshunyu commented on a change in pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#discussion_r456193818



##########
File path: processing/src/main/java/org/apache/carbondata/processing/loading/sort/unsafe/UnsafeSortDataRows.java
##########
@@ -200,25 +203,44 @@ public void startSorting() {
    * @param file file
    * @throws CarbonSortKeyAndGroupByException
    */
-  private void writeDataToFile(UnsafeCarbonRowPage rowPage, File file)
-      throws CarbonSortKeyAndGroupByException {
-    DataOutputStream stream = null;
-    try {
-      // open stream
-      stream = FileFactory.getDataOutputStream(file.getPath(),
-          parameters.getFileWriteBufferSize(), parameters.getSortTempCompressorName());
-      int actualSize = rowPage.getBuffer().getActualSize();
-      // write number of entries to the file
-      stream.writeInt(actualSize);
-      for (int i = 0; i < actualSize; i++) {
-        rowPage.writeRow(
-            rowPage.getBuffer().get(i) + rowPage.getDataBlock().getBaseOffset(), stream);
+  private void writeDataToFile(UnsafeCarbonRowPage rowPage, File file) {
+    writeService.submit(new WriteThread(rowPage, file));
+  }
+
+  public class WriteThread implements Runnable {
+    private File file;
+    private UnsafeCarbonRowPage rowPage;
+
+    public WriteThread(UnsafeCarbonRowPage rowPage, File file) {
+      this.rowPage = rowPage;
+      this.file = file;
+
+    }
+
+    @Override
+    public void run() {
+      DataOutputStream stream = null;
+      try {
+        // open stream
+        stream = FileFactory.getDataOutputStream(this.file.getPath(),
+                parameters.getFileWriteBufferSize(), parameters.getSortTempCompressorName());
+        int actualSize = rowPage.getBuffer().getActualSize();
+        // write number of entries to the file
+        stream.writeInt(actualSize);
+        for (int i = 0; i < actualSize; i++) {
+          rowPage.writeRow(
+                  rowPage.getBuffer().get(i) + rowPage.getDataBlock().getBaseOffset(), stream);
+        }
+        // add sort temp filename to and arrayList. When the list size reaches 20 then
+        // intermediate merging of sort temp files will be triggered
+        unsafeInMemoryIntermediateFileMerger.addFileToMerge(file);
+      } catch (IOException | MemoryException e) {
+        e.printStackTrace();

Review comment:
       Use log4j instead of printStackTrace.
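
       A minimal sketch of what this could look like follows. It uses `java.util.logging` from the JDK only so that the example is self-contained; the project itself uses log4j, and the class `LoggedWriteTaskSketch`, the `WriteTask` shape, and the `firstError` holder are hypothetical names, not code from the PR. The task logs the exception through a logger and also records it, so a failure on the writer thread is not silently lost.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration only; java.util.logging stands in for log4j.
class LoggedWriteTaskSketch {
  private static final Logger LOGGER =
      Logger.getLogger(LoggedWriteTaskSketch.class.getName());

  static final class WriteTask implements Runnable {
    private final boolean shouldFail; // simulates an I/O failure
    private final AtomicReference<Throwable> firstError;

    WriteTask(boolean shouldFail, AtomicReference<Throwable> firstError) {
      this.shouldFail = shouldFail;
      this.firstError = firstError;
    }

    @Override
    public void run() {
      try {
        if (shouldFail) {
          throw new IOException("simulated write failure");
        }
      } catch (IOException e) {
        // Log with the stack trace attached instead of e.printStackTrace().
        LOGGER.log(Level.SEVERE, "Failed to write sort temp file", e);
        // Remember only the first failure so the submitting thread can check it.
        firstError.compareAndSet(null, e);
      }
    }
  }

  // Runs one task per flag and returns the first recorded failure, or null.
  static Throwable runTasks(boolean... failures) throws InterruptedException {
    AtomicReference<Throwable> firstError = new AtomicReference<>();
    ExecutorService pool = Executors.newFixedThreadPool(2);
    for (boolean fail : failures) {
      pool.submit(new WriteTask(fail, firstError));
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.SECONDS);
    return firstError.get();
  }

  public static void main(String[] args) throws InterruptedException {
    Throwable error = runTasks(false, true);
    System.out.println("first error: " + error);
  }
}
```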





[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

Zhangshunyu commented on a change in pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#discussion_r456193999



##########
File path: processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SortParameters.java
##########
@@ -37,6 +40,13 @@
 import org.apache.log4j.Logger;
 
 public class SortParameters implements Serializable {
+  
+  private ExecutorService writeService = Executors.newFixedThreadPool(5,

Review comment:
       Suggest making the core pool size of the thread pool configurable.
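
       One way to do this, sketched under assumptions: the pool size is read from a property and falls back to the value 5 that is hard-coded in the diff. The property key `sort.temp.writer.threads` and the class `ConfigurableWriterPoolSketch` are hypothetical, and plain `java.util.Properties` stands in for whatever configuration mechanism the project would actually use.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only; the key below is not an existing property.
class ConfigurableWriterPoolSketch {
  static final String WRITER_THREADS_KEY = "sort.temp.writer.threads";
  static final int DEFAULT_WRITER_THREADS = 5; // value hard-coded in the diff

  // Returns the configured thread count, or the default when the value is
  // missing, not a number, or not positive.
  static int resolveWriterThreads(Properties props) {
    String raw = props.getProperty(WRITER_THREADS_KEY);
    if (raw == null) {
      return DEFAULT_WRITER_THREADS;
    }
    try {
      int threads = Integer.parseInt(raw.trim());
      return threads > 0 ? threads : DEFAULT_WRITER_THREADS;
    } catch (NumberFormatException e) {
      return DEFAULT_WRITER_THREADS;
    }
  }

  static ExecutorService newWriterPool(Properties props) {
    return Executors.newFixedThreadPool(resolveWriterThreads(props));
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(WRITER_THREADS_KEY, "8");
    ExecutorService pool = newWriterPool(props);
    System.out.println("writer threads: " + resolveWriterThreads(props));
    pool.shutdown();
  }
}
```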





[GitHub] [carbondata] shunlean commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

shunlean commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-660848771


   retest this please



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-660857142


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1693/
   



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-660858429


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3435/
   



[GitHub] [carbondata] shunlean commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

shunlean commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-660882567


   retest this please



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-660909683


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1695/
   



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-660910195


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3437/
   



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-661617787


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3447/
   



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-661618161


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1705/
   



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-662324632


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1718/
   



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-662325092


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3460/
   



[GitHub] [carbondata] shunlean commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

shunlean commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-662356431


   retest this please

