marchpure opened a new pull request #3816: URL: https://github.com/apache/carbondata/pull/3816 ### Why is this PR needed? During filter segments flow, there are a lot of LIST.CONTAINS, which has heavy time overhead when there are tens of thousands segments. For example, if there are 50000 segments. it will trigger LIST.CONTAINS for each segment, the LIST also has about 50000 elements. so the time complexity will be O(50000 * 50000 ) ### What changes were proposed in this PR? Change List.CONTAINS to MAP.containsKEY ### Does this PR introduce any user interface change? - No ### Is any new testcase added? - No ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
CarbonDataQA1 commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651264615 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3261/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651265133 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1525/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651350898 ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
niuge01 commented on a change in pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#discussion_r447366192 ########## File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ########## @@ -232,47 +233,40 @@ private List<Segment> getFilteredSegment(JobContext job, List<Segment> validSegments, boolean validationRequired, ReadCommittedScope readCommittedScope) { Segment[] segmentsToAccess = getSegmentsToAccess(job, readCommittedScope); - List<Segment> segmentToAccessSet = - new ArrayList<>(new HashSet<>(Arrays.asList(segmentsToAccess))); - List<Segment> filteredSegmentToAccess = new ArrayList<>(); + Map<String, Segment> filteredSegmentToAccess = new HashMap<>(); Review comment: Move this line to line 244 ########## File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ########## @@ -232,47 +233,40 @@ private List<Segment> getFilteredSegment(JobContext job, List<Segment> validSegments, boolean validationRequired, ReadCommittedScope readCommittedScope) { Segment[] segmentsToAccess = getSegmentsToAccess(job, readCommittedScope); - List<Segment> segmentToAccessSet = - new ArrayList<>(new HashSet<>(Arrays.asList(segmentsToAccess))); - List<Segment> filteredSegmentToAccess = new ArrayList<>(); + Map<String, Segment> filteredSegmentToAccess = new HashMap<>(); if (segmentsToAccess.length == 0 || segmentsToAccess[0].getSegmentNo().equalsIgnoreCase("*")) { - filteredSegmentToAccess.addAll(validSegments); + return validSegments; } else { + Map<String, Segment> segmentToAccessMap = new HashMap<>(); Review comment: should initialize map capacity with segmentsToAssess size. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
QiangCai commented on a change in pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#discussion_r447373586 ########## File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ########## @@ -232,47 +233,40 @@ private List<Segment> getFilteredSegment(JobContext job, List<Segment> validSegments, boolean validationRequired, ReadCommittedScope readCommittedScope) { Segment[] segmentsToAccess = getSegmentsToAccess(job, readCommittedScope); - List<Segment> segmentToAccessSet = - new ArrayList<>(new HashSet<>(Arrays.asList(segmentsToAccess))); - List<Segment> filteredSegmentToAccess = new ArrayList<>(); if (segmentsToAccess.length == 0 || segmentsToAccess[0].getSegmentNo().equalsIgnoreCase("*")) { - filteredSegmentToAccess.addAll(validSegments); + return validSegments; } else { Review comment: remove this line ########## File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ########## @@ -232,47 +233,40 @@ private List<Segment> getFilteredSegment(JobContext job, List<Segment> validSegments, boolean validationRequired, ReadCommittedScope readCommittedScope) { Segment[] segmentsToAccess = getSegmentsToAccess(job, readCommittedScope); - List<Segment> segmentToAccessSet = - new ArrayList<>(new HashSet<>(Arrays.asList(segmentsToAccess))); - List<Segment> filteredSegmentToAccess = new ArrayList<>(); if (segmentsToAccess.length == 0 || segmentsToAccess[0].getSegmentNo().equalsIgnoreCase("*")) { - filteredSegmentToAccess.addAll(validSegments); + return validSegments; } else { + Map<String, Segment> segmentToAccessMap = new HashMap<>(segmentsToAccess.length); + for (Segment segment : segmentsToAccess) { + segmentToAccessMap.put(segment.getSegmentNo(), segment); + } + Map<String, Segment> filteredSegmentToAccess = new HashMap<>(); for (Segment validSegment : validSegments) { - int index = segmentToAccessSet.indexOf(validSegment); - if (index > -1) { - // In case of in progress reading segment, segment file name is set to the property itself - if (segmentToAccessSet.get(index).getSegmentFileName() != null - && validSegment.getSegmentFileName() == null) { - filteredSegmentToAccess.add(segmentToAccessSet.get(index)); + String segmentNoOfValidSegment = validSegment.getSegmentNo(); + if (segmentToAccessMap.containsKey(segmentNoOfValidSegment)) { + if (segmentToAccessMap.get(segmentNoOfValidSegment) + .getSegmentFileName() != null && validSegment.getSegmentFileName() == null) { + filteredSegmentToAccess.put(segmentNoOfValidSegment, + segmentToAccessMap.get(segmentNoOfValidSegment)); } else { - filteredSegmentToAccess.add(validSegment); - } - } - } - if (filteredSegmentToAccess.size() != segmentToAccessSet.size() && !validationRequired) { - for (Segment segment : segmentToAccessSet) { - if (!filteredSegmentToAccess.contains(segment)) { - filteredSegmentToAccess.add(segment); + filteredSegmentToAccess.put(segmentNoOfValidSegment, validSegment); } Review comment: String validSegmentNo = validSegment.getSegmentNo(); Segment segmentToAccess = segmentToAccessMap.get(validSegmentNo); if (null != segmentToAccess) { if (validSegment.getSegmentFileName() != null || segmentToAccess.getSegmentFileName() == null) { segmentToAccess = validSegment; } filteredSegmentToAccess.put(validSegmentNo, segmentToAccess); } ########## File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ########## @@ -232,47 +233,40 @@ private List<Segment> getFilteredSegment(JobContext job, List<Segment> validSegments, boolean validationRequired, ReadCommittedScope readCommittedScope) { Segment[] segmentsToAccess = getSegmentsToAccess(job, readCommittedScope); - List<Segment> segmentToAccessSet = - new ArrayList<>(new HashSet<>(Arrays.asList(segmentsToAccess))); - List<Segment> filteredSegmentToAccess = new ArrayList<>(); if (segmentsToAccess.length == 0 || segmentsToAccess[0].getSegmentNo().equalsIgnoreCase("*")) { - filteredSegmentToAccess.addAll(validSegments); + return validSegments; } else { + Map<String, Segment> segmentToAccessMap = new HashMap<>(segmentsToAccess.length); + for (Segment segment : segmentsToAccess) { + segmentToAccessMap.put(segment.getSegmentNo(), segment); + } + Map<String, Segment> filteredSegmentToAccess = new HashMap<>(); for (Segment validSegment : validSegments) { - int index = segmentToAccessSet.indexOf(validSegment); - if (index > -1) { - // In case of in progress reading segment, segment file name is set to the property itself - if (segmentToAccessSet.get(index).getSegmentFileName() != null - && validSegment.getSegmentFileName() == null) { - filteredSegmentToAccess.add(segmentToAccessSet.get(index)); + String segmentNoOfValidSegment = validSegment.getSegmentNo(); + if (segmentToAccessMap.containsKey(segmentNoOfValidSegment)) { + if (segmentToAccessMap.get(segmentNoOfValidSegment) + .getSegmentFileName() != null && validSegment.getSegmentFileName() == null) { + filteredSegmentToAccess.put(segmentNoOfValidSegment, + segmentToAccessMap.get(segmentNoOfValidSegment)); } else { - filteredSegmentToAccess.add(validSegment); - } - } - } - if (filteredSegmentToAccess.size() != segmentToAccessSet.size() && !validationRequired) { - for (Segment segment : segmentToAccessSet) { - if (!filteredSegmentToAccess.contains(segment)) { - filteredSegmentToAccess.add(segment); + filteredSegmentToAccess.put(segmentNoOfValidSegment, validSegment); } } } - // TODO: add validation for set segments access based on valid segments in table status - if (filteredSegmentToAccess.size() != segmentToAccessSet.size() && !validationRequired) { - for (Segment segment : segmentToAccessSet) { - if (!filteredSegmentToAccess.contains(segment)) { - filteredSegmentToAccess.add(segment); + if (filteredSegmentToAccess.size() != segmentToAccessMap.size() && !validationRequired) { Review comment: !validationRequired && filteredSegmentToAccess.size() != segmentToAccessMap.size() ########## File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ########## @@ -232,47 +233,40 @@ private List<Segment> getFilteredSegment(JobContext job, List<Segment> validSegments, boolean validationRequired, ReadCommittedScope readCommittedScope) { Segment[] segmentsToAccess = getSegmentsToAccess(job, readCommittedScope); - List<Segment> segmentToAccessSet = - new ArrayList<>(new HashSet<>(Arrays.asList(segmentsToAccess))); - List<Segment> filteredSegmentToAccess = new ArrayList<>(); if (segmentsToAccess.length == 0 || segmentsToAccess[0].getSegmentNo().equalsIgnoreCase("*")) { - filteredSegmentToAccess.addAll(validSegments); + return validSegments; } else { + Map<String, Segment> segmentToAccessMap = new HashMap<>(segmentsToAccess.length); + for (Segment segment : segmentsToAccess) { + segmentToAccessMap.put(segment.getSegmentNo(), segment); + } Review comment: Map<String, Segment> segmentToAccessMap = Arrays.stream(segmentsToAccess).collect(Collectors.toMap(Segment::getSegmentNo, segment -> segment)); ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651530149 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3268/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651530514 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1532/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651538060 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1533/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651538493 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3269/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651669395 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1535/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651671108 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3271/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
QiangCai commented on a change in pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#discussion_r447634675 ########## File path: core/src/main/java/org/apache/carbondata/core/index/TableIndex.java ########## @@ -206,7 +206,8 @@ public CarbonTable getTable() { Set<Path> partitionLocations, List<ExtendedBlocklet> blocklets, Map<Segment, List<Index>> indexes) throws IOException { for (Segment segment : segments) { - if (indexes.get(segment).isEmpty() || indexes.get(segment) == null) { + if (segment == null || + indexes.get(segment).isEmpty() || indexes.get(segment) == null) { Review comment: change to: segment == null || indexes.get(segment) == null || indexes.get(segment).isEmpty() ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651763251 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3275/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651763696 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1539/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
marchpure commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651771840 retest this please ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3816: URL: https://github.com/apache/carbondata/pull/3816#issuecomment-651858622 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3278/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
Free forum by Nabble | Edit this page |