[GitHub] [carbondata] marchpure opened a new pull request #3816: [CARBONDATA-3879] Filtering Segmets Optimazation

classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure opened a new pull request #3816: [CARBONDATA-3879] Filtering Segmets Optimazation

GitBox

marchpure opened a new pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816


    ### Why is this PR needed?
    During filter segments flow, there are a lot of LIST.CONTAINS, which has heavy time overhead when there are tens of thousands segments.
    For example, if there are 50000 segments. it will trigger LIST.CONTAINS  for each segment, the LIST also has about 50000 elements. so the time complexity will be O(50000 * 50000 )
   
    ### What changes were proposed in this PR?
   Change List.CONTAINS to MAP.containsKEY
       
    ### Does this PR introduce any user interface change?
    - No
   
    ### Is any new testcase added?
    - No
   
       
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox

CarbonDataQA1 commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651264615


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3261/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651265133


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1525/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651350898






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] niuge01 commented on a change in pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

niuge01 commented on a change in pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#discussion_r447366192



##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -232,47 +233,40 @@
   private List<Segment> getFilteredSegment(JobContext job, List<Segment> validSegments,
       boolean validationRequired, ReadCommittedScope readCommittedScope) {
     Segment[] segmentsToAccess = getSegmentsToAccess(job, readCommittedScope);
-    List<Segment> segmentToAccessSet =
-        new ArrayList<>(new HashSet<>(Arrays.asList(segmentsToAccess)));
-    List<Segment> filteredSegmentToAccess = new ArrayList<>();
+    Map<String, Segment> filteredSegmentToAccess = new HashMap<>();

Review comment:
       Move this line to line 244

##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -232,47 +233,40 @@
   private List<Segment> getFilteredSegment(JobContext job, List<Segment> validSegments,
       boolean validationRequired, ReadCommittedScope readCommittedScope) {
     Segment[] segmentsToAccess = getSegmentsToAccess(job, readCommittedScope);
-    List<Segment> segmentToAccessSet =
-        new ArrayList<>(new HashSet<>(Arrays.asList(segmentsToAccess)));
-    List<Segment> filteredSegmentToAccess = new ArrayList<>();
+    Map<String, Segment> filteredSegmentToAccess = new HashMap<>();
     if (segmentsToAccess.length == 0 || segmentsToAccess[0].getSegmentNo().equalsIgnoreCase("*")) {
-      filteredSegmentToAccess.addAll(validSegments);
+      return validSegments;
     } else {
+      Map<String, Segment> segmentToAccessMap = new HashMap<>();

Review comment:
       should initialize map capacity with segmentsToAssess size.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] QiangCai commented on a change in pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

QiangCai commented on a change in pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#discussion_r447373586



##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -232,47 +233,40 @@
   private List<Segment> getFilteredSegment(JobContext job, List<Segment> validSegments,
       boolean validationRequired, ReadCommittedScope readCommittedScope) {
     Segment[] segmentsToAccess = getSegmentsToAccess(job, readCommittedScope);
-    List<Segment> segmentToAccessSet =
-        new ArrayList<>(new HashSet<>(Arrays.asList(segmentsToAccess)));
-    List<Segment> filteredSegmentToAccess = new ArrayList<>();
     if (segmentsToAccess.length == 0 || segmentsToAccess[0].getSegmentNo().equalsIgnoreCase("*")) {
-      filteredSegmentToAccess.addAll(validSegments);
+      return validSegments;
     } else {

Review comment:
       remove this line

##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -232,47 +233,40 @@
   private List<Segment> getFilteredSegment(JobContext job, List<Segment> validSegments,
       boolean validationRequired, ReadCommittedScope readCommittedScope) {
     Segment[] segmentsToAccess = getSegmentsToAccess(job, readCommittedScope);
-    List<Segment> segmentToAccessSet =
-        new ArrayList<>(new HashSet<>(Arrays.asList(segmentsToAccess)));
-    List<Segment> filteredSegmentToAccess = new ArrayList<>();
     if (segmentsToAccess.length == 0 || segmentsToAccess[0].getSegmentNo().equalsIgnoreCase("*")) {
-      filteredSegmentToAccess.addAll(validSegments);
+      return validSegments;
     } else {
+      Map<String, Segment> segmentToAccessMap = new HashMap<>(segmentsToAccess.length);
+      for (Segment segment : segmentsToAccess) {
+        segmentToAccessMap.put(segment.getSegmentNo(), segment);
+      }
+      Map<String, Segment> filteredSegmentToAccess = new HashMap<>();
       for (Segment validSegment : validSegments) {
-        int index = segmentToAccessSet.indexOf(validSegment);
-        if (index > -1) {
-          // In case of in progress reading segment, segment file name is set to the property itself
-          if (segmentToAccessSet.get(index).getSegmentFileName() != null
-              && validSegment.getSegmentFileName() == null) {
-            filteredSegmentToAccess.add(segmentToAccessSet.get(index));
+        String segmentNoOfValidSegment = validSegment.getSegmentNo();
+        if (segmentToAccessMap.containsKey(segmentNoOfValidSegment)) {
+          if (segmentToAccessMap.get(segmentNoOfValidSegment)
+              .getSegmentFileName() != null && validSegment.getSegmentFileName() == null) {
+            filteredSegmentToAccess.put(segmentNoOfValidSegment,
+                    segmentToAccessMap.get(segmentNoOfValidSegment));
           } else {
-            filteredSegmentToAccess.add(validSegment);
-          }
-        }
-      }
-      if (filteredSegmentToAccess.size() != segmentToAccessSet.size() && !validationRequired) {
-        for (Segment segment : segmentToAccessSet) {
-          if (!filteredSegmentToAccess.contains(segment)) {
-            filteredSegmentToAccess.add(segment);
+            filteredSegmentToAccess.put(segmentNoOfValidSegment, validSegment);
           }

Review comment:
       String validSegmentNo = validSegment.getSegmentNo();
         Segment segmentToAccess = segmentToAccessMap.get(validSegmentNo);
         if (null != segmentToAccess) {
           if (validSegment.getSegmentFileName() != null ||
               segmentToAccess.getSegmentFileName() == null) {
             segmentToAccess = validSegment;
           }
           filteredSegmentToAccess.put(validSegmentNo, segmentToAccess);
         }

##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -232,47 +233,40 @@
   private List<Segment> getFilteredSegment(JobContext job, List<Segment> validSegments,
       boolean validationRequired, ReadCommittedScope readCommittedScope) {
     Segment[] segmentsToAccess = getSegmentsToAccess(job, readCommittedScope);
-    List<Segment> segmentToAccessSet =
-        new ArrayList<>(new HashSet<>(Arrays.asList(segmentsToAccess)));
-    List<Segment> filteredSegmentToAccess = new ArrayList<>();
     if (segmentsToAccess.length == 0 || segmentsToAccess[0].getSegmentNo().equalsIgnoreCase("*")) {
-      filteredSegmentToAccess.addAll(validSegments);
+      return validSegments;
     } else {
+      Map<String, Segment> segmentToAccessMap = new HashMap<>(segmentsToAccess.length);
+      for (Segment segment : segmentsToAccess) {
+        segmentToAccessMap.put(segment.getSegmentNo(), segment);
+      }
+      Map<String, Segment> filteredSegmentToAccess = new HashMap<>();
       for (Segment validSegment : validSegments) {
-        int index = segmentToAccessSet.indexOf(validSegment);
-        if (index > -1) {
-          // In case of in progress reading segment, segment file name is set to the property itself
-          if (segmentToAccessSet.get(index).getSegmentFileName() != null
-              && validSegment.getSegmentFileName() == null) {
-            filteredSegmentToAccess.add(segmentToAccessSet.get(index));
+        String segmentNoOfValidSegment = validSegment.getSegmentNo();
+        if (segmentToAccessMap.containsKey(segmentNoOfValidSegment)) {
+          if (segmentToAccessMap.get(segmentNoOfValidSegment)
+              .getSegmentFileName() != null && validSegment.getSegmentFileName() == null) {
+            filteredSegmentToAccess.put(segmentNoOfValidSegment,
+                    segmentToAccessMap.get(segmentNoOfValidSegment));
           } else {
-            filteredSegmentToAccess.add(validSegment);
-          }
-        }
-      }
-      if (filteredSegmentToAccess.size() != segmentToAccessSet.size() && !validationRequired) {
-        for (Segment segment : segmentToAccessSet) {
-          if (!filteredSegmentToAccess.contains(segment)) {
-            filteredSegmentToAccess.add(segment);
+            filteredSegmentToAccess.put(segmentNoOfValidSegment, validSegment);
           }
         }
       }
-      // TODO: add validation for set segments access based on valid segments in table status
-      if (filteredSegmentToAccess.size() != segmentToAccessSet.size() && !validationRequired) {
-        for (Segment segment : segmentToAccessSet) {
-          if (!filteredSegmentToAccess.contains(segment)) {
-            filteredSegmentToAccess.add(segment);
+      if (filteredSegmentToAccess.size() != segmentToAccessMap.size() && !validationRequired) {

Review comment:
       !validationRequired && filteredSegmentToAccess.size() != segmentToAccessMap.size()

##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -232,47 +233,40 @@
   private List<Segment> getFilteredSegment(JobContext job, List<Segment> validSegments,
       boolean validationRequired, ReadCommittedScope readCommittedScope) {
     Segment[] segmentsToAccess = getSegmentsToAccess(job, readCommittedScope);
-    List<Segment> segmentToAccessSet =
-        new ArrayList<>(new HashSet<>(Arrays.asList(segmentsToAccess)));
-    List<Segment> filteredSegmentToAccess = new ArrayList<>();
     if (segmentsToAccess.length == 0 || segmentsToAccess[0].getSegmentNo().equalsIgnoreCase("*")) {
-      filteredSegmentToAccess.addAll(validSegments);
+      return validSegments;
     } else {
+      Map<String, Segment> segmentToAccessMap = new HashMap<>(segmentsToAccess.length);
+      for (Segment segment : segmentsToAccess) {
+        segmentToAccessMap.put(segment.getSegmentNo(), segment);
+      }

Review comment:
       Map<String, Segment> segmentToAccessMap = Arrays.stream(segmentsToAccess).collect(Collectors.toMap(Segment::getSegmentNo, segment -> segment));




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651530149


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3268/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651530514


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1532/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651538060


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1533/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651538493


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3269/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651669395


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1535/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651671108


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3271/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] QiangCai commented on a change in pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

QiangCai commented on a change in pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#discussion_r447634675



##########
File path: core/src/main/java/org/apache/carbondata/core/index/TableIndex.java
##########
@@ -206,7 +206,8 @@ public CarbonTable getTable() {
       Set<Path> partitionLocations, List<ExtendedBlocklet> blocklets,
       Map<Segment, List<Index>> indexes) throws IOException {
     for (Segment segment : segments) {
-      if (indexes.get(segment).isEmpty() || indexes.get(segment) == null) {
+      if (segment == null ||
+          indexes.get(segment).isEmpty() || indexes.get(segment) == null) {

Review comment:
       change to:
   segment == null || indexes.get(segment) == null || indexes.get(segment).isEmpty()




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651763251


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3275/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651763696


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1539/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

marchpure commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651771840


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3816: [CARBONDATA-3879] Filtering Segments Optimazation

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3816:
URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651858622


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3278/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]