shenjiayu17 opened a new pull request #3986: URL: https://github.com/apache/carbondata/pull/3986 ### Why is this PR needed? The horizontal compaction flow will be too slow when updating with lots of segments(or lots of blocks), so we try to analyze and optimize it for time-consuming problem. ### What changes were proposed in this PR? In performDeleteDeltaCompaction, optimize the method getSegListIUDCompactionQualified. Combine two traversals of segments which have same process, and move listFiles process outside the traversal of blocks. ### Does this PR introduce any user interface change? - No ### Is any new testcase added? - No ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
CarbonDataQA1 commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709323928 Can one of the admins verify this patch? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
ajantha-bhat commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709454528 Add to whitelist ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709457343 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2712/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
marchpure commented on a change in pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#discussion_r505670979 ########## File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java ########## @@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg, return blockLists; } + private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg, + SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) { + + List<String> blockLists = new ArrayList<>(); + + Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = + segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg); + + List<String> blockNameList = + segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo()); + + Set<String> uniqueBlocks = new HashSet<String>(); + for (final String blockName : blockNameList) { + + List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName); + + if (null != deleteDeltaFiles) { + for (CarbonFile blocks : deleteDeltaFiles) { Review comment: if (delteDeltaFiles.size < threshold) continue ########## File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java ########## @@ -1039,22 +1039,10 @@ private static boolean isSegmentValid(LoadMetadataDetails seg) { if (CompactionType.IUD_DELETE_DELTA == compactionTypeIUD) { int numberDeleteDeltaFilesThreshold = CarbonProperties.getInstance().getNoDeleteDeltaFilesThresholdForIUDCompaction(); - List<Segment> deleteSegments = new ArrayList<>(); for (Segment seg : segments) { - if (checkDeleteDeltaFilesInSeg(seg, segmentUpdateStatusManager, Review comment: remove checkDeleteDeltaFilesInSeg function ########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java ########## @@ -455,6 +455,51 @@ public boolean accept(CarbonFile pathName) { return null; } + public Map<String, List<CarbonFile>> getDeleteDeltaFilesForSegment(final Segment seg) { Review comment: remove getDeleteDeltaFilesList function ########## File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java ########## @@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg, return blockLists; } + private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg, + SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) { + + List<String> blockLists = new ArrayList<>(); + + Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = + segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg); + + List<String> blockNameList = + segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo()); + + Set<String> uniqueBlocks = new HashSet<String>(); + for (final String blockName : blockNameList) { + + List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName); + + if (null != deleteDeltaFiles) { + for (CarbonFile blocks : deleteDeltaFiles) { Review comment: blocks -> block ########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java ########## @@ -455,6 +455,51 @@ public boolean accept(CarbonFile pathName) { return null; } + public Map<String, List<CarbonFile>> getDeleteDeltaFilesForSegment(final Segment seg) { + String segmentPath = CarbonTablePath.getSegmentPath( + identifier.getTablePath(), seg.getSegmentNo()); + CarbonFile segDir = FileFactory.getCarbonFile(segmentPath); + CarbonFile[] allDeleteDeltaFilesOfSegment = segDir.listFiles(new CarbonFileFilter() { + @Override + public boolean accept(CarbonFile pathName) { + String fileName = pathName.getName(); + return (pathName.getSize() > 0) && Review comment: getSize() will trigger one S3 IO. remove getsSize() ########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java ########## @@ -455,6 +455,51 @@ public boolean accept(CarbonFile pathName) { return null; } + public Map<String, List<CarbonFile>> getDeleteDeltaFilesForSegment(final Segment seg) { + String segmentPath = CarbonTablePath.getSegmentPath( + identifier.getTablePath(), seg.getSegmentNo()); Review comment: if SegmentUpdateDetails donot contains seg, we shall return empty result directly. which can save a lot of IO overhead ########## File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java ########## @@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg, return blockLists; } + private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg, + SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) { + + List<String> blockLists = new ArrayList<>(); + + Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = + segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg); + + List<String> blockNameList = + segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo()); + + Set<String> uniqueBlocks = new HashSet<String>(); + for (final String blockName : blockNameList) { + + List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName); + + if (null != deleteDeltaFiles) { + for (CarbonFile blocks : deleteDeltaFiles) { + String task = CarbonTablePath.DataFileUtil.getTaskNo(blocks.getName()); + String timestamp = + CarbonTablePath.DataFileUtil.getTimeStampFromDeleteDeltaFile(blocks.getName()); + String taskAndTimeStamp = task + "-" + timestamp; + uniqueBlocks.add(taskAndTimeStamp); Review comment: if (uniqueBlocks.size() > numberDeltaFilesThreshold) { blockLists.add(seg.getSegmentNo() + "/" + blockName); break; } ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709459321 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4466/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
marchpure commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709459684 checkstyle failes. you can have a checkstyple test in your local env by run: mvn clean install -DskipTests metions: don't use mvn clean package -DskipTests ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
marchpure commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709674242 add some logs and comments. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
Zhangshunyu commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709698583 Have you ever tested this optimization? Could you pls give a comparison result for this change? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
shenjiayu17 commented on a change in pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#discussion_r506020436 ########## File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java ########## @@ -1039,22 +1039,10 @@ private static boolean isSegmentValid(LoadMetadataDetails seg) { if (CompactionType.IUD_DELETE_DELTA == compactionTypeIUD) { int numberDeleteDeltaFilesThreshold = CarbonProperties.getInstance().getNoDeleteDeltaFilesThresholdForIUDCompaction(); - List<Segment> deleteSegments = new ArrayList<>(); for (Segment seg : segments) { - if (checkDeleteDeltaFilesInSeg(seg, segmentUpdateStatusManager, Review comment: Done ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
shenjiayu17 commented on a change in pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#discussion_r506020565 ########## File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java ########## @@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg, return blockLists; } + private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg, + SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) { + + List<String> blockLists = new ArrayList<>(); + + Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = + segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg); + + List<String> blockNameList = + segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo()); + + Set<String> uniqueBlocks = new HashSet<String>(); + for (final String blockName : blockNameList) { + + List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName); + + if (null != deleteDeltaFiles) { + for (CarbonFile blocks : deleteDeltaFiles) { + String task = CarbonTablePath.DataFileUtil.getTaskNo(blocks.getName()); + String timestamp = + CarbonTablePath.DataFileUtil.getTimeStampFromDeleteDeltaFile(blocks.getName()); + String taskAndTimeStamp = task + "-" + timestamp; + uniqueBlocks.add(taskAndTimeStamp); Review comment: Done ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
shenjiayu17 commented on a change in pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#discussion_r506020781 ########## File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java ########## @@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg, return blockLists; } + private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg, + SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) { + + List<String> blockLists = new ArrayList<>(); + + Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = + segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg); + + List<String> blockNameList = + segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo()); + + Set<String> uniqueBlocks = new HashSet<String>(); + for (final String blockName : blockNameList) { + + List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName); + + if (null != deleteDeltaFiles) { + for (CarbonFile blocks : deleteDeltaFiles) { Review comment: Done ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
shenjiayu17 commented on a change in pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#discussion_r506020858 ########## File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java ########## @@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg, return blockLists; } + private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg, + SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) { + + List<String> blockLists = new ArrayList<>(); + + Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = + segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg); + + List<String> blockNameList = + segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo()); + + Set<String> uniqueBlocks = new HashSet<String>(); + for (final String blockName : blockNameList) { + + List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName); + + if (null != deleteDeltaFiles) { + for (CarbonFile blocks : deleteDeltaFiles) { Review comment: Done ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
shenjiayu17 commented on a change in pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#discussion_r506020950 ########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java ########## @@ -455,6 +455,51 @@ public boolean accept(CarbonFile pathName) { return null; } + public Map<String, List<CarbonFile>> getDeleteDeltaFilesForSegment(final Segment seg) { + String segmentPath = CarbonTablePath.getSegmentPath( + identifier.getTablePath(), seg.getSegmentNo()); + CarbonFile segDir = FileFactory.getCarbonFile(segmentPath); + CarbonFile[] allDeleteDeltaFilesOfSegment = segDir.listFiles(new CarbonFileFilter() { + @Override + public boolean accept(CarbonFile pathName) { + String fileName = pathName.getName(); + return (pathName.getSize() > 0) && Review comment: Done ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709707004 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4474/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709707683 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2720/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709711202 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4475/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709712170 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2721/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709785844 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4476/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709805701 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2722/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
Free forum by Nabble | Edit this page |