[GitHub] [carbondata] shunlean opened a new pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

classic Classic list List threaded Threaded
29 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-662414323


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1722/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-662414760


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3464/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

ajantha-bhat commented on pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#issuecomment-664141006


   @shunlean : please handle the comments given by @Zhangshunyu


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#discussion_r460667652



##########
File path: processing/src/main/java/org/apache/carbondata/processing/loading/sort/impl/UnsafeParallelReadMergeSorterWithColumnRangeImpl.java
##########
@@ -99,6 +101,8 @@ public void initialize(SortParameters sortParameters) {
     UnsafeSortDataRows[] sortDataRows = new UnsafeSortDataRows[columnRangeInfo.getNumOfRanges()];
     intermediateFileMergers = new UnsafeIntermediateMerger[columnRangeInfo.getNumOfRanges()];
     SortParameters[] sortParameterArray = new SortParameters[columnRangeInfo.getNumOfRanges()];
+    this.writeService = Executors.newFixedThreadPool(originSortParameters.getNumberOfCores(),

Review comment:
       If you increase `carbon.number.of.cores.while.loading`, there will be more UnsafeSortDataRows and writing temp files can finish faster without any of these changes.
   
   Is it necessary to introduce another multi-thread here ?
   please tell your opinion @kevinjmh @kumarvishal09




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#discussion_r460667652



##########
File path: processing/src/main/java/org/apache/carbondata/processing/loading/sort/impl/UnsafeParallelReadMergeSorterWithColumnRangeImpl.java
##########
@@ -99,6 +101,8 @@ public void initialize(SortParameters sortParameters) {
     UnsafeSortDataRows[] sortDataRows = new UnsafeSortDataRows[columnRangeInfo.getNumOfRanges()];
     intermediateFileMergers = new UnsafeIntermediateMerger[columnRangeInfo.getNumOfRanges()];
     SortParameters[] sortParameterArray = new SortParameters[columnRangeInfo.getNumOfRanges()];
+    this.writeService = Executors.newFixedThreadPool(originSortParameters.getNumberOfCores(),

Review comment:
       If we increase `carbon.number.of.cores.while.loading`, there will be more UnsafeSortDataRows and writing temp files can finish faster without any of these changes.
   
   Is it necessary to introduce another multi-thread here ?
   please tell your opinion @kevinjmh @kumarvishal09




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] shunlean commented on a change in pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

shunlean commented on a change in pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#discussion_r460741367



##########
File path: processing/src/main/java/org/apache/carbondata/processing/loading/sort/unsafe/UnsafeSortDataRows.java
##########
@@ -200,25 +203,44 @@ public void startSorting() {
    * @param file file
    * @throws CarbonSortKeyAndGroupByException
    */
-  private void writeDataToFile(UnsafeCarbonRowPage rowPage, File file)
-      throws CarbonSortKeyAndGroupByException {
-    DataOutputStream stream = null;
-    try {
-      // open stream
-      stream = FileFactory.getDataOutputStream(file.getPath(),
-          parameters.getFileWriteBufferSize(), parameters.getSortTempCompressorName());
-      int actualSize = rowPage.getBuffer().getActualSize();
-      // write number of entries to the file
-      stream.writeInt(actualSize);
-      for (int i = 0; i < actualSize; i++) {
-        rowPage.writeRow(
-            rowPage.getBuffer().get(i) + rowPage.getDataBlock().getBaseOffset(), stream);
+  private void writeDataToFile(UnsafeCarbonRowPage rowPage, File file) {
+    writeService.submit(new WriteThread(rowPage, file));
+  }
+
+  public class WriteThread implements Runnable {
+    private File file;
+    private UnsafeCarbonRowPage rowPage;
+
+    public WriteThread(UnsafeCarbonRowPage rowPage, File file) {
+      this.rowPage = rowPage;
+      this.file = file;
+
+    }
+
+    @Override
+    public void run() {
+      DataOutputStream stream = null;
+      try {
+        // open stream
+        stream = FileFactory.getDataOutputStream(this.file.getPath(),
+                parameters.getFileWriteBufferSize(), parameters.getSortTempCompressorName());
+        int actualSize = rowPage.getBuffer().getActualSize();
+        // write number of entries to the file
+        stream.writeInt(actualSize);
+        for (int i = 0; i < actualSize; i++) {
+          rowPage.writeRow(
+                  rowPage.getBuffer().get(i) + rowPage.getDataBlock().getBaseOffset(), stream);
+        }
+        // add sort temp filename to and arrayList. When the list size reaches 20 then
+        // intermediate merging of sort temp files will be triggered
+        unsafeInMemoryIntermediateFileMerger.addFileToMerge(file);
+      } catch (IOException | MemoryException e) {
+        e.printStackTrace();

Review comment:
       ok, done.

##########
File path: processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SortParameters.java
##########
@@ -37,6 +40,13 @@
 import org.apache.log4j.Logger;
 
 public class SortParameters implements Serializable {
+  
+  private ExecutorService writeService = Executors.newFixedThreadPool(5,

Review comment:
       ok,done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] kevinjmh commented on a change in pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

kevinjmh commented on a change in pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#discussion_r460750438



##########
File path: processing/src/main/java/org/apache/carbondata/processing/loading/sort/impl/UnsafeParallelReadMergeSorterWithColumnRangeImpl.java
##########
@@ -99,6 +101,8 @@ public void initialize(SortParameters sortParameters) {
     UnsafeSortDataRows[] sortDataRows = new UnsafeSortDataRows[columnRangeInfo.getNumOfRanges()];
     intermediateFileMergers = new UnsafeIntermediateMerger[columnRangeInfo.getNumOfRanges()];
     SortParameters[] sortParameterArray = new SortParameters[columnRangeInfo.getNumOfRanges()];
+    this.writeService = Executors.newFixedThreadPool(originSortParameters.getNumberOfCores(),

Review comment:
       @ajantha-bhat Good point. So the only difference is adding threads horizontally or vertically.  If each thread takes same time to process the data and writes at same time, performance may degrade caused by IO preemption. But the different may not big when number of input split is large enough. @shunlean could you please do some test to confirm ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#discussion_r460765555



##########
File path: processing/src/main/java/org/apache/carbondata/processing/loading/sort/impl/UnsafeParallelReadMergeSorterWithColumnRangeImpl.java
##########
@@ -99,6 +101,8 @@ public void initialize(SortParameters sortParameters) {
     UnsafeSortDataRows[] sortDataRows = new UnsafeSortDataRows[columnRangeInfo.getNumOfRanges()];
     intermediateFileMergers = new UnsafeIntermediateMerger[columnRangeInfo.getNumOfRanges()];
     SortParameters[] sortParameterArray = new SortParameters[columnRangeInfo.getNumOfRanges()];
+    this.writeService = Executors.newFixedThreadPool(originSortParameters.getNumberOfCores(),

Review comment:
       @kevinjmh : Yes, If cores are available adding threads horizontally can speedup not just sort, but other steps in data loading also.
   If cores are not available, adding threads vertically also no use as they will end up waiting for cpu.
   
   so, I felt. This PR changes not required and user can increase `carbon.number.of.cores.while.loading`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3847: [CARBONDATA-3906] Optimize sort performance in writting file

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3847:
URL: https://github.com/apache/carbondata/pull/3847#discussion_r460765555



##########
File path: processing/src/main/java/org/apache/carbondata/processing/loading/sort/impl/UnsafeParallelReadMergeSorterWithColumnRangeImpl.java
##########
@@ -99,6 +101,8 @@ public void initialize(SortParameters sortParameters) {
     UnsafeSortDataRows[] sortDataRows = new UnsafeSortDataRows[columnRangeInfo.getNumOfRanges()];
     intermediateFileMergers = new UnsafeIntermediateMerger[columnRangeInfo.getNumOfRanges()];
     SortParameters[] sortParameterArray = new SortParameters[columnRangeInfo.getNumOfRanges()];
+    this.writeService = Executors.newFixedThreadPool(originSortParameters.getNumberOfCores(),

Review comment:
       @kevinjmh : Yes, If cores are available, adding threads horizontally can speedup not just sort, but other steps in data loading also.
   If cores are not available, adding threads vertically also no use as they will end up waiting for cpu.
   
   so, I felt. This PR changes not required and user can increase `carbon.number.of.cores.while.loading`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


12