[GitHub] [carbondata] shenjiayu17 opened a new pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

classic Classic list List threaded Threaded
103 messages Options
123456
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox

shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510064286



##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,36 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
   }
 
   /**
-   * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
-   * if the number of delete delta files present in the segment is more than
-   * numberDeltaFilesThreshold.
+   * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+   * i.e., if the number of delete delta files present in the segment is more than
+   * numberDeltaFilesThreshold, this segment will be selected.
    *
-   * @param seg
-   * @param segmentUpdateStatusManager
-   * @param numberDeltaFilesThreshold
-   * @return
+   * @param seg segment to be qualified
+   * @param segmentUpdateStatusManager segments & blocks details management
+   * @param numberDeltaFilesThreshold threshold of delete delta files
+   * @return block list of the segment
    */
-  private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+  private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
       SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
 
+    List<String> blockLists = new ArrayList<>();
     Set<String> uniqueBlocks = new HashSet<String>();
     List<String> blockNameList =
         segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
-    for (final String blockName : blockNameList) {
-
-      CarbonFile[] deleteDeltaFiles =
+    for (String blockName : blockNameList) {
+      List<String> deleteDeltaFiles =
           segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
-      if (null != deleteDeltaFiles) {
-        // The Delete Delta files may have Spill over blocks. Will consider multiple spill over

Review comment:
       Recovered the existing comments. I will pay attention to it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510065227



##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,36 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
   }
 
   /**
-   * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
-   * if the number of delete delta files present in the segment is more than
-   * numberDeltaFilesThreshold.
+   * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+   * i.e., if the number of delete delta files present in the segment is more than
+   * numberDeltaFilesThreshold, this segment will be selected.
    *
-   * @param seg
-   * @param segmentUpdateStatusManager
-   * @param numberDeltaFilesThreshold
-   * @return
+   * @param seg segment to be qualified
+   * @param segmentUpdateStatusManager segments & blocks details management
+   * @param numberDeltaFilesThreshold threshold of delete delta files
+   * @return block list of the segment
    */
-  private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+  private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
       SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
 
+    List<String> blockLists = new ArrayList<>();
     Set<String> uniqueBlocks = new HashSet<String>();
     List<String> blockNameList =
         segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
-    for (final String blockName : blockNameList) {
-
-      CarbonFile[] deleteDeltaFiles =
+    for (String blockName : blockNameList) {
+      List<String> deleteDeltaFiles =
           segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
-      if (null != deleteDeltaFiles) {
-        // The Delete Delta files may have Spill over blocks. Will consider multiple spill over
-        // blocks as one. Currently DeleteDeltaFiles array contains Delete Delta Block name which
-        // lies within Delete Delta Start TimeStamp and End TimeStamp. In order to eliminate
-        // Spill Over Blocks will choose files with unique taskID.
-        for (CarbonFile blocks : deleteDeltaFiles) {
-          // Get Task ID and the Timestamp from the Block name for e.g.
-          // part-0-3-1481084721319.carbondata => "3-1481084721319"
-          String task = CarbonTablePath.DataFileUtil.getTaskNo(blocks.getName());
-          String timestamp =
-              CarbonTablePath.DataFileUtil.getTimeStampFromDeleteDeltaFile(blocks.getName());
-          String taskAndTimeStamp = task + "-" + timestamp;
+      if (null != deleteDeltaFiles && deleteDeltaFiles.size() > numberDeltaFilesThreshold) {
+        for (String file : deleteDeltaFiles) {

Review comment:
       It was formatted




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510065323



##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1246,20 +1197,16 @@ public static boolean isHorizontalCompactionEnabled() {
     // set the update status.
     segmentUpdateStatusManager.setUpdateStatusDetails(segmentUpdateDetails);
 
-    CarbonFile[] deleteDeltaFiles =
+    List<String> deleteFilePathList =
         segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg), blockName);
 
     String destFileName =
         blockName + "-" + timestamp.toString() + CarbonCommonConstants.DELETE_DELTA_FILE_EXT;
-    List<String> deleteFilePathList = new ArrayList<>();
-    if (null != deleteDeltaFiles && deleteDeltaFiles.length > 0 && null != deleteDeltaFiles[0]
-        .getParentFile()) {
-      String fullBlockFilePath = deleteDeltaFiles[0].getParentFile().getCanonicalPath()
-          + CarbonCommonConstants.FILE_SEPARATOR + destFileName;
-
-      for (CarbonFile cFile : deleteDeltaFiles) {
-        deleteFilePathList.add(cFile.getCanonicalPath());
-      }
+    if (deleteFilePathList.size() > 0) {
+      String deleteDeltaFilePath = deleteFilePathList.get(0);
+      String fullBlockFilePath =
+          deleteDeltaFilePath.substring(0, deleteDeltaFilePath.lastIndexOf("/")) +

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510123816



##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,36 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
   }
 
   /**
-   * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
-   * if the number of delete delta files present in the segment is more than
-   * numberDeltaFilesThreshold.
+   * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+   * i.e., if the number of delete delta files present in the segment is more than
+   * numberDeltaFilesThreshold, this segment will be selected.
    *
-   * @param seg
-   * @param segmentUpdateStatusManager
-   * @param numberDeltaFilesThreshold
-   * @return
+   * @param seg segment to be qualified
+   * @param segmentUpdateStatusManager segments & blocks details management
+   * @param numberDeltaFilesThreshold threshold of delete delta files
+   * @return block list of the segment
    */
-  private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+  private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
       SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
 
+    List<String> blockLists = new ArrayList<>();
     Set<String> uniqueBlocks = new HashSet<String>();
     List<String> blockNameList =
         segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
-    for (final String blockName : blockNameList) {
-
-      CarbonFile[] deleteDeltaFiles =
+    for (String blockName : blockNameList) {
+      List<String> deleteDeltaFiles =
           segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
-      if (null != deleteDeltaFiles) {
-        // The Delete Delta files may have Spill over blocks. Will consider multiple spill over
-        // blocks as one. Currently DeleteDeltaFiles array contains Delete Delta Block name which
-        // lies within Delete Delta Start TimeStamp and End TimeStamp. In order to eliminate
-        // Spill Over Blocks will choose files with unique taskID.
-        for (CarbonFile blocks : deleteDeltaFiles) {
-          // Get Task ID and the Timestamp from the Block name for e.g.
-          // part-0-3-1481084721319.carbondata => "3-1481084721319"
-          String task = CarbonTablePath.DataFileUtil.getTaskNo(blocks.getName());
-          String timestamp =
-              CarbonTablePath.DataFileUtil.getTimeStampFromDeleteDeltaFile(blocks.getName());
-          String taskAndTimeStamp = task + "-" + timestamp;
+      if (null != deleteDeltaFiles && deleteDeltaFiles.size() > numberDeltaFilesThreshold) {
+        for (String file : deleteDeltaFiles) {

Review comment:
       i gave comment to change the variable name and remove spaces after and before :, please check




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-714490521


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2873/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-714520638


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4629/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510677675



##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,36 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
   }
 
   /**
-   * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
-   * if the number of delete delta files present in the segment is more than
-   * numberDeltaFilesThreshold.
+   * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+   * i.e., if the number of delete delta files present in the segment is more than
+   * numberDeltaFilesThreshold, this segment will be selected.
    *
-   * @param seg
-   * @param segmentUpdateStatusManager
-   * @param numberDeltaFilesThreshold
-   * @return
+   * @param seg segment to be qualified
+   * @param segmentUpdateStatusManager segments & blocks details management
+   * @param numberDeltaFilesThreshold threshold of delete delta files
+   * @return block list of the segment
    */
-  private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+  private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
       SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
 
+    List<String> blockLists = new ArrayList<>();
     Set<String> uniqueBlocks = new HashSet<String>();
     List<String> blockNameList =
         segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
-    for (final String blockName : blockNameList) {
-
-      CarbonFile[] deleteDeltaFiles =
+    for (String blockName : blockNameList) {
+      List<String> deleteDeltaFiles =
           segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
-      if (null != deleteDeltaFiles) {
-        // The Delete Delta files may have Spill over blocks. Will consider multiple spill over
-        // blocks as one. Currently DeleteDeltaFiles array contains Delete Delta Block name which
-        // lies within Delete Delta Start TimeStamp and End TimeStamp. In order to eliminate
-        // Spill Over Blocks will choose files with unique taskID.
-        for (CarbonFile blocks : deleteDeltaFiles) {
-          // Get Task ID and the Timestamp from the Block name for e.g.
-          // part-0-3-1481084721319.carbondata => "3-1481084721319"
-          String task = CarbonTablePath.DataFileUtil.getTaskNo(blocks.getName());
-          String timestamp =
-              CarbonTablePath.DataFileUtil.getTimeStampFromDeleteDeltaFile(blocks.getName());
-          String taskAndTimeStamp = task + "-" + timestamp;
+      if (null != deleteDeltaFiles && deleteDeltaFiles.size() > numberDeltaFilesThreshold) {
+        for (String file : deleteDeltaFiles) {

Review comment:
       formatted and modified the variable name




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-715196563


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4660/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-715197754


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2905/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

akashrn5 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-715336355






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] shenjiayu17 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

shenjiayu17 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717024597


   We tested with 10 segments total 10G and update more than 20 times to see the cost, update row count of each time is 176973.
   The result and comparison before and after improving is shown below,  we can see that time-consuming of UPDATE reduces by about 30% after improving horizontal compaction.
   
   |   | Time Before Improving | Time after Improving |
   | ------- | ------------- | ------------- |
   | Average time <br> from 2nd to 21st UPDATE | **99.19** | **67.55** |
   | UPDATE  |     |     |
   | 1st     | 38.397  |  71.91   |
   | 2nd     | 74.918  | 67.386   |
   | 3rd     | 43.851  | 46.171   |
   | 4th     | 52.133  | 82.974   |
   | 5th     | 86.021  | 44.499   |
   | 6th     | 62.992  | 39.973   |
   | 7th     | 99.077  | 69.292   |
   | 8th     | 77.815  | 60.416   |
   | 9th     | 74.949  | 50.187   |
   | 10th     | 85.874  | 50.973   |
   | 11th     | 103.902  | 58.005   |
   | 12th     | 77.534  | 86.087   |
   | 13th     | 79.154  | 76.283   |
   | 14th     | 116.117  | 72.029   |
   | 15th     | 112.846  | 74.182   |
   | 16th     | 124.707  | 69.282   |
   | 17th     | 124.677  | 67.546   |
   | 18th     | 147.15   | 66.652   |
   | 19th     | 135.127  | 109.385   |
   | 20th     | 133.94   | 80.849   |
   | 21st     | 171.112  | 78.92  |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

Zhangshunyu commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717025630


   > We tested with 10 segments total 10G and update more than 20 times to see the cost, update row count of each time is 176973.
   > The result and comparison before and after improving is shown below, we can see that time-consuming of UPDATE reduces by about 30% after improving horizontal compaction.
   >
   > Time Before Improving Time after Improving
   > Average time
   > from 2nd to 21st UPDATE **99.19** **67.55**
   > UPDATE
   > 1st 38.397 71.91
   > 2nd 74.918 67.386
   > 3rd 43.851 46.171
   > 4th 52.133 82.974
   > 5th 86.021 44.499
   > 6th 62.992 39.973
   > 7th 99.077 69.292
   > 8th 77.815 60.416
   > 9th 74.949 50.187
   > 10th 85.874 50.973
   > 11th 103.902 58.005
   > 12th 77.534 86.087
   > 13th 79.154 76.283
   > 14th 116.117 72.029
   > 15th 112.846 74.182
   > 16th 124.707 69.282
   > 17th 124.677 67.546
   > 18th 147.15 66.652
   > 19th 135.127 109.385
   > 20th 133.94 80.849
   > 21st 171.112 78.92
   
   greate!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

Zhangshunyu commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717027548


   LGTM


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

akashrn5 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717054658


   > We tested with 10 segments total 10G and update more than 20 times to see the cost, update row count of each time is 176973.
   > The result and comparison before and after improving is shown below, we can see that time-consuming of UPDATE reduces by about 30% after improving horizontal compaction.
   >
   > Time Before Improving Time after Improving
   > Average time
   > from 2nd to 21st UPDATE **99.19** **67.55**
   > UPDATE
   > 1st 38.397 71.91
   > 2nd 74.918 67.386
   > 3rd 43.851 46.171
   > 4th 52.133 82.974
   > 5th 86.021 44.499
   > 6th 62.992 39.973
   > 7th 99.077 69.292
   > 8th 77.815 60.416
   > 9th 74.949 50.187
   > 10th 85.874 50.973
   > 11th 103.902 58.005
   > 12th 77.534 86.087
   > 13th 79.154 76.283
   > 14th 116.117 72.029
   > 15th 112.846 74.182
   > 16th 124.707 69.282
   > 17th 124.677 67.546
   > 18th 147.15 66.652
   > 19th 135.127 109.385
   > 20th 133.94 80.849
   > 21st 171.112 78.92
   
   Cool


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] QiangCai commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

QiangCai commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r513147277



##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
   }
 
   /**
-   * Return all delta file for a block.
-   * @param segmentId
-   * @param blockName
-   * @return
+   * Get all delete delta files of the block of specified segment.
+   * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+   *
+   * @param seg the segment which is to find block and its delete delta files
+   * @param blockName the specified block of the segment
+   * @return delete delta file list of the block
    */
-  public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+  public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+    List<String> deleteDeltaFileList = new ArrayList<>();
     String segmentPath = CarbonTablePath.getSegmentPath(
-        identifier.getTablePath(), segmentId.getSegmentNo());
-    CarbonFile segDir =
-        FileFactory.getCarbonFile(segmentPath);
+        identifier.getTablePath(), seg.getSegmentNo());
+
     for (SegmentUpdateDetails block : updateDetails) {
       if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
-          (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
-          && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
-        final long deltaStartTimestamp =
-            getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-        final long deltaEndTimeStamp =
-            getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
-        return segDir.listFiles(new CarbonFileFilter() {
-
-          @Override
-          public boolean accept(CarbonFile pathName) {
-            String fileName = pathName.getName();
-            if (pathName.getSize() > 0
-                && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
-              String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
-              long timestamp =
-                  Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
-              return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
-                  && timestamp >= deltaStartTimestamp;
-            }
-            return false;
-          }
-        });
+          (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+          !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+        Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+        String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+            blockName + CarbonCommonConstants.HYPHEN;
+        if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {

Review comment:
       please add test case:
   1.  set CarbonCommonConstants.CARBON_HORIZONTAL_COMPACTION_ENABLE to false
   2. update different row in same file three times (it should generate three delete delta files for same blockname)
   3. set CarbonCommonConstants.CARBON_HORIZONTAL_COMPACTION_ENABLE to true
   4. update different row in same file again
   5. query table to check result

##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
   }
 
   /**
-   * Return all delta file for a block.
-   * @param segmentId
-   * @param blockName
-   * @return
+   * Get all delete delta files of the block of specified segment.
+   * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+   *
+   * @param seg the segment which is to find block and its delete delta files
+   * @param blockName the specified block of the segment
+   * @return delete delta file list of the block
    */
-  public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+  public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+    List<String> deleteDeltaFileList = new ArrayList<>();
     String segmentPath = CarbonTablePath.getSegmentPath(
-        identifier.getTablePath(), segmentId.getSegmentNo());
-    CarbonFile segDir =
-        FileFactory.getCarbonFile(segmentPath);
+        identifier.getTablePath(), seg.getSegmentNo());
+
     for (SegmentUpdateDetails block : updateDetails) {
       if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
-          (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
-          && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
-        final long deltaStartTimestamp =
-            getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-        final long deltaEndTimeStamp =
-            getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
-        return segDir.listFiles(new CarbonFileFilter() {
-
-          @Override
-          public boolean accept(CarbonFile pathName) {
-            String fileName = pathName.getName();
-            if (pathName.getSize() > 0
-                && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
-              String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
-              long timestamp =
-                  Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
-              return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
-                  && timestamp >= deltaStartTimestamp;
-            }
-            return false;
-          }
-        });
+          (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+          !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+        Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+        String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+            blockName + CarbonCommonConstants.HYPHEN;
+        if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {
+          deltaFileTimestamps.forEach(timestamp -> deleteDeltaFileList.add(
+              deleteDeltaFilePrefix + timestamp + CarbonCommonConstants.DELETE_DELTA_FILE_EXT));

Review comment:
       better to reuse CarbonUpdateUtil.getDeleteDeltaFilePath
   

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -177,6 +178,7 @@ object HorizontalCompaction {
       absTableIdentifier,
       segmentUpdateStatusManager,
       compactionTypeIUD)
+    LOG.debug(s"The segment list for Horizontal Update Compaction is $deletedBlocksList")

Review comment:
       same above

##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
   }
 
   /**
-   * Return all delta file for a block.
-   * @param segmentId
-   * @param blockName
-   * @return
+   * Get all delete delta files of the block of specified segment.
+   * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+   *
+   * @param seg the segment which is to find block and its delete delta files
+   * @param blockName the specified block of the segment
+   * @return delete delta file list of the block
    */
-  public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+  public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {

Review comment:
       change seg to segment

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -130,6 +130,7 @@ object HorizontalCompaction {
       absTableIdentifier,
       segmentUpdateStatusManager,
       compactionTypeIUD)
+    LOG.debug(s"The segment list for Horizontal Update Compaction is $validSegList")

Review comment:
       avoid converting the list to string
   if (LOG.isDebugEnabled) {
     ...
   }

##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,43 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
   }
 
   /**
-   * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
-   * if the number of delete delta files present in the segment is more than
-   * numberDeltaFilesThreshold.
+   * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+   * i.e., if the number of delete delta files present in the segment is more than
+   * numberDeltaFilesThreshold, this segment will be selected.
    *
-   * @param seg
-   * @param segmentUpdateStatusManager
-   * @param numberDeltaFilesThreshold
-   * @return
+   * @param seg segment to be qualified
+   * @param segmentUpdateStatusManager segments & blocks details management
+   * @param numberDeltaFilesThreshold threshold of delete delta files
+   * @return block list of the segment
    */
-  private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+  private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
       SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
 
+    List<String> blockLists = new ArrayList<>();
     Set<String> uniqueBlocks = new HashSet<String>();
     List<String> blockNameList =
         segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
-    for (final String blockName : blockNameList) {
-
-      CarbonFile[] deleteDeltaFiles =
+    for (String blockName : blockNameList) {
+      List<String> deleteDeltaFiles =
           segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
-      if (null != deleteDeltaFiles) {
+      if (null != deleteDeltaFiles && deleteDeltaFiles.size() > numberDeltaFilesThreshold) {

Review comment:
       add test case for case when CarbonCommonConstants.DELETE_DELTAFILE_COUNT_THRESHOLD_IUD_COMPACTION  >= 3




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r513248146



##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
   }
 
   /**
-   * Return all delta file for a block.
-   * @param segmentId
-   * @param blockName
-   * @return
+   * Get all delete delta files of the block of specified segment.
+   * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+   *
+   * @param seg the segment which is to find block and its delete delta files
+   * @param blockName the specified block of the segment
+   * @return delete delta file list of the block
    */
-  public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+  public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+    List<String> deleteDeltaFileList = new ArrayList<>();
     String segmentPath = CarbonTablePath.getSegmentPath(
-        identifier.getTablePath(), segmentId.getSegmentNo());
-    CarbonFile segDir =
-        FileFactory.getCarbonFile(segmentPath);
+        identifier.getTablePath(), seg.getSegmentNo());
+
     for (SegmentUpdateDetails block : updateDetails) {
       if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
-          (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
-          && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
-        final long deltaStartTimestamp =
-            getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-        final long deltaEndTimeStamp =
-            getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
-        return segDir.listFiles(new CarbonFileFilter() {
-
-          @Override
-          public boolean accept(CarbonFile pathName) {
-            String fileName = pathName.getName();
-            if (pathName.getSize() > 0
-                && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
-              String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
-              long timestamp =
-                  Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
-              return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
-                  && timestamp >= deltaStartTimestamp;
-            }
-            return false;
-          }
-        });
+          (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+          !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+        Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+        String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+            blockName + CarbonCommonConstants.HYPHEN;
+        if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {

Review comment:
       Added this test case "test IUD Horizontal Compaction after updating without compaction"
   in HorizontalCompactionTestCase




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r513248715



##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
   }
 
   /**
-   * Return all delta file for a block.
-   * @param segmentId
-   * @param blockName
-   * @return
+   * Get all delete delta files of the block of specified segment.
+   * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+   *
+   * @param seg the segment which is to find block and its delete delta files
+   * @param blockName the specified block of the segment
+   * @return delete delta file list of the block
    */
-  public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+  public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+    List<String> deleteDeltaFileList = new ArrayList<>();
     String segmentPath = CarbonTablePath.getSegmentPath(
-        identifier.getTablePath(), segmentId.getSegmentNo());
-    CarbonFile segDir =
-        FileFactory.getCarbonFile(segmentPath);
+        identifier.getTablePath(), seg.getSegmentNo());
+
     for (SegmentUpdateDetails block : updateDetails) {
       if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
-          (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
-          && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
-        final long deltaStartTimestamp =
-            getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-        final long deltaEndTimeStamp =
-            getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
-        return segDir.listFiles(new CarbonFileFilter() {
-
-          @Override
-          public boolean accept(CarbonFile pathName) {
-            String fileName = pathName.getName();
-            if (pathName.getSize() > 0
-                && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
-              String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
-              long timestamp =
-                  Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
-              return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
-                  && timestamp >= deltaStartTimestamp;
-            }
-            return false;
-          }
-        });
+          (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+          !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+        Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+        String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+            blockName + CarbonCommonConstants.HYPHEN;
+        if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {
+          deltaFileTimestamps.forEach(timestamp -> deleteDeltaFileList.add(
+              deleteDeltaFilePrefix + timestamp + CarbonCommonConstants.DELETE_DELTA_FILE_EXT));

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r513249194



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -130,6 +130,7 @@ object HorizontalCompaction {
       absTableIdentifier,
       segmentUpdateStatusManager,
       compactionTypeIUD)
+    LOG.debug(s"The segment list for Horizontal Update Compaction is $validSegList")

Review comment:
       Done

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -177,6 +178,7 @@ object HorizontalCompaction {
       absTableIdentifier,
       segmentUpdateStatusManager,
       compactionTypeIUD)
+    LOG.debug(s"The segment list for Horizontal Update Compaction is $deletedBlocksList")

Review comment:
       Done

##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
   }
 
   /**
-   * Return all delta file for a block.
-   * @param segmentId
-   * @param blockName
-   * @return
+   * Get all delete delta files of the block of specified segment.
+   * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+   *
+   * @param seg the segment which is to find block and its delete delta files
+   * @param blockName the specified block of the segment
+   * @return delete delta file list of the block
    */
-  public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+  public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r513249713



##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,43 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
   }
 
   /**
-   * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
-   * if the number of delete delta files present in the segment is more than
-   * numberDeltaFilesThreshold.
+   * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+   * i.e., if the number of delete delta files present in the segment is more than
+   * numberDeltaFilesThreshold, this segment will be selected.
    *
-   * @param seg
-   * @param segmentUpdateStatusManager
-   * @param numberDeltaFilesThreshold
-   * @return
+   * @param seg segment to be qualified
+   * @param segmentUpdateStatusManager segments & blocks details management
+   * @param numberDeltaFilesThreshold threshold of delete delta files
+   * @return block list of the segment
    */
-  private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+  private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
       SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
 
+    List<String> blockLists = new ArrayList<>();
     Set<String> uniqueBlocks = new HashSet<String>();
     List<String> blockNameList =
         segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
-    for (final String blockName : blockNameList) {
-
-      CarbonFile[] deleteDeltaFiles =
+    for (String blockName : blockNameList) {
+      List<String> deleteDeltaFiles =
           segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
-      if (null != deleteDeltaFiles) {
+      if (null != deleteDeltaFiles && deleteDeltaFiles.size() > numberDeltaFilesThreshold) {

Review comment:
       Added test case "test IUD Horizontal Compaction when CarbonCommonConstants.DELETE_DELTAFILE_COUNT_THRESHOLD_IUD_COMPACTION >= 3"
   in HorizontalCompactionTestCase




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717880461


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4716/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


123456