[GitHub] [carbondata] marchpure opened a new pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure opened a new pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

GitBox

marchpure opened a new pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880


    ### Why is this PR needed?
   During filter segments flow, there are a lot of LIST.CONTAINS, which has heavy time overhead when there are tens of thousands segments.
   
   For example, if there are 50000 segments. it will trigger LIST.CONTAINS  for each segment, the LIST also has about 50000 elements. so the time complexity will be O(50000 * 50000 )
   
    ### What changes were proposed in this PR?
   Change List.CONTAINS to MAP.containsKEY
       
    ### Does this PR introduce any user interface change?
    - No
   
    ### Is any new testcase added?
    - No
   
       
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

GitBox

CarbonDataQA1 commented on pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#issuecomment-669675655


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1884/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#issuecomment-669677665


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3623/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#discussion_r466243016



##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon
 
   public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments,
       List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) {
+    Map<String, Segment> validSegmentsMap = validSegments.stream()

Review comment:
       creating a map everytime for query is also overhead when the valid segments is in thousands, I suggest we can use `SET` for `validSegments` when it is formed originally instead of 'LIST'




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on a change in pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

GitBox
In reply to this post by GitBox

marchpure commented on a change in pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#discussion_r466265360



##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon
 
   public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments,
       List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) {
+    Map<String, Segment> validSegmentsMap = validSegments.stream()

Review comment:
       Appreciate for that good suggestion. But here we can only use Map instead of Set.
   
   Reason: The The pseudo-code for this function is shown as below.
   
           // **0. segments.hashcode is segmentno, to when we compare 2 segments, only segmentno will be compared**
           if (validSegments.contains(segmentInSplit)) {
     **1. fetch the segment from validSegments**
     Segment segmentInValidSegment <- fetch the segment from validSegments
     2.
             segmentInSplit.setLoadMetadataDetails(segmentInValidSegment.getLoadMetadataDetails);
           }
   
   For SET. it's hard to fetch the segment from validSegments.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#discussion_r466270144



##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon
 
   public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments,
       List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) {
+    Map<String, Segment> validSegmentsMap = validSegments.stream()

Review comment:
       if you see the `equals` implementation of `Segment.java`, it is based on only segment number comparison. so, I think you can still use `SET`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on a change in pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

GitBox
In reply to this post by GitBox

marchpure commented on a change in pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#discussion_r466275337



##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon
 
   public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments,
       List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) {
+    Map<String, Segment> validSegmentsMap = validSegments.stream()

Review comment:
       Yeah. But if use SET. it's hard to read a element with specified segmentno.
   In this code. we need to read a segment with specified segmentno from validSegments.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#discussion_r466277616



##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon
 
   public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments,
       List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) {
+    Map<String, Segment> validSegmentsMap = validSegments.stream()

Review comment:
       oh, I got what you mean, we still need to get the element from valid segment to read its LoadMetadataDetails. So, SET can help only for contains, but not for getting the valid segment.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

GitBox
In reply to this post by GitBox

ajantha-bhat commented on pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#issuecomment-669819275


   LGTM


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] asfgit closed pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

GitBox
In reply to this post by GitBox

asfgit closed pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]