marchpure opened a new pull request #3880: URL: https://github.com/apache/carbondata/pull/3880 ### Why is this PR needed? During filter segments flow, there are a lot of LIST.CONTAINS, which has heavy time overhead when there are tens of thousands segments. For example, if there are 50000 segments. it will trigger LIST.CONTAINS for each segment, the LIST also has about 50000 elements. so the time complexity will be O(50000 * 50000 ) ### What changes were proposed in this PR? Change List.CONTAINS to MAP.containsKEY ### Does this PR introduce any user interface change? - No ### Is any new testcase added? - No ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
CarbonDataQA1 commented on pull request #3880: URL: https://github.com/apache/carbondata/pull/3880#issuecomment-669675655 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1884/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3880: URL: https://github.com/apache/carbondata/pull/3880#issuecomment-669677665 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3623/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3880: URL: https://github.com/apache/carbondata/pull/3880#discussion_r466243016 ########## File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ########## @@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments, List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) { + Map<String, Segment> validSegmentsMap = validSegments.stream() Review comment: creating a map everytime for query is also overhead when the valid segments is in thousands, I suggest we can use `SET` for `validSegments` when it is formed originally instead of 'LIST' ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
marchpure commented on a change in pull request #3880: URL: https://github.com/apache/carbondata/pull/3880#discussion_r466265360 ########## File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ########## @@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments, List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) { + Map<String, Segment> validSegmentsMap = validSegments.stream() Review comment: Appreciate for that good suggestion. But here we can only use Map instead of Set. Reason: The The pseudo-code for this function is shown as below. // **0. segments.hashcode is segmentno, to when we compare 2 segments, only segmentno will be compared** if (validSegments.contains(segmentInSplit)) { **1. fetch the segment from validSegments** Segment segmentInValidSegment <- fetch the segment from validSegments 2. segmentInSplit.setLoadMetadataDetails(segmentInValidSegment.getLoadMetadataDetails); } For SET. it's hard to fetch the segment from validSegments. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3880: URL: https://github.com/apache/carbondata/pull/3880#discussion_r466270144 ########## File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ########## @@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments, List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) { + Map<String, Segment> validSegmentsMap = validSegments.stream() Review comment: if you see the `equals` implementation of `Segment.java`, it is based on only segment number comparison. so, I think you can still use `SET` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
marchpure commented on a change in pull request #3880: URL: https://github.com/apache/carbondata/pull/3880#discussion_r466275337 ########## File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ########## @@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments, List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) { + Map<String, Segment> validSegmentsMap = validSegments.stream() Review comment: Yeah. But if use SET. it's hard to read a element with specified segmentno. In this code. we need to read a segment with specified segmentno from validSegments. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3880: URL: https://github.com/apache/carbondata/pull/3880#discussion_r466277616 ########## File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ########## @@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments, List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) { + Map<String, Segment> validSegmentsMap = validSegments.stream() Review comment: oh, I got what you mean, we still need to get the element from valid segment to read its LoadMetadataDetails. So, SET can help only for contains, but not for getting the valid segment. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
ajantha-bhat commented on pull request #3880: URL: https://github.com/apache/carbondata/pull/3880#issuecomment-669819275 LGTM ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
asfgit closed pull request #3880: URL: https://github.com/apache/carbondata/pull/3880 ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
Free forum by Nabble | Edit this page |