Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1497/
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9754/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1706/
Github user ajantha-bhat commented on the issue:
https://github.com/apache/carbondata/pull/2936 @manishgupta88, @ravipesala: please review
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1498/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1707/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9755/
Github user ajantha-bhat commented on the issue:
https://github.com/apache/carbondata/pull/2936 retest this please
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1500/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1709/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9757/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9762/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1714/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1505/
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2936#discussion_r235611496

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java ---
@@ -120,37 +132,166 @@ public BlockletDetailsFetcher getBlockletDetailsFetcher() {
    * @param filterExp
    * @return
    */
-  public List<ExtendedBlocklet> prune(List<Segment> segments, FilterResolverIntf filterExp,
-      List<PartitionSpec> partitions) throws IOException {
-    List<ExtendedBlocklet> blocklets = new ArrayList<>();
-    SegmentProperties segmentProperties;
-    Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+  public List<ExtendedBlocklet> prune(List<Segment> segments, final FilterResolverIntf filterExp,
+      final List<PartitionSpec> partitions) throws IOException {
+    final List<ExtendedBlocklet> blocklets = new ArrayList<>();
+    final Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+    // for non-filter queries
+    if (filterExp == null) {
+      // if filter is not passed, then return all the blocklets.
+      return pruneWithoutFilter(segments, partitions, blocklets);

--- End diff ---

Please check the time taken to get all blocks when there are millions of files. If it takes too long, we may need to parallelize this path as well.
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2936#discussion_r235611698

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java ---
@@ -120,37 +132,166 @@ public BlockletDetailsFetcher getBlockletDetailsFetcher() {
    * @param filterExp
    * @return
    */
-  public List<ExtendedBlocklet> prune(List<Segment> segments, FilterResolverIntf filterExp,
-      List<PartitionSpec> partitions) throws IOException {
-    List<ExtendedBlocklet> blocklets = new ArrayList<>();
-    SegmentProperties segmentProperties;
-    Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+  public List<ExtendedBlocklet> prune(List<Segment> segments, final FilterResolverIntf filterExp,
+      final List<PartitionSpec> partitions) throws IOException {
+    final List<ExtendedBlocklet> blocklets = new ArrayList<>();
+    final Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+    // for non-filter queries
+    if (filterExp == null) {
+      // if filter is not passed, then return all the blocklets.
+      return pruneWithoutFilter(segments, partitions, blocklets);
+    }
+    // for filter queries
+    int totalFiles = 0;
+    boolean isBlockDataMapType = true;
+    for (Segment segment : segments) {
+      for (DataMap dataMap : dataMaps.get(segment)) {
+        if (!(dataMap instanceof BlockDataMap)) {

--- End diff ---

This flow can be used by all datamaps; why restrict it to BlockDataMap?
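One way to lift the `instanceof BlockDataMap` restriction would be to let every datamap report its own entry count. The sketch below is purely illustrative, not CarbonData's actual API: all interface, class, and method names here (`PrunableIndex`, `getTotalEntries`, `PruningDecision`) are hypothetical.

```java
import java.util.List;

// Hypothetical interface: a count method on the datamap abstraction would let
// every datamap type opt in to multi-threaded pruning, not just BlockDataMap.
interface PrunableIndex {
    // 0 means "count unknown"; callers then fall back to the single-threaded flow.
    default int getTotalEntries() { return 0; }
}

// A datamap that knows how many files/blocks it indexes.
class FileLevelIndex implements PrunableIndex {
    private final int files;
    FileLevelIndex(int files) { this.files = files; }
    @Override public int getTotalEntries() { return files; }
}

// A datamap like lucene or bloom that cannot cheaply report a count.
class OpaqueIndex implements PrunableIndex { }

public class PruningDecision {
    // Decide on multi-threading from counts alone, with no instanceof checks.
    public static boolean useMultiThread(List<PrunableIndex> indexes, int threshold) {
        int total = 0;
        for (PrunableIndex index : indexes) {
            int entries = index.getTotalEntries();
            if (entries == 0) {
                return false; // a datamap cannot report its count: stay single-threaded
            }
            total += entries;
        }
        return total >= threshold;
    }
}
```

With this shape, the file-count threshold check from the PR would stay, but a new datamap type would only need to implement the count method to participate in parallel pruning.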
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2936#discussion_r235612449

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java ---
@@ -120,37 +132,166 @@ public BlockletDetailsFetcher getBlockletDetailsFetcher() {
    * @param filterExp
    * @return
    */
-  public List<ExtendedBlocklet> prune(List<Segment> segments, FilterResolverIntf filterExp,
-      List<PartitionSpec> partitions) throws IOException {
-    List<ExtendedBlocklet> blocklets = new ArrayList<>();
-    SegmentProperties segmentProperties;
-    Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+  public List<ExtendedBlocklet> prune(List<Segment> segments, final FilterResolverIntf filterExp,
+      final List<PartitionSpec> partitions) throws IOException {
+    final List<ExtendedBlocklet> blocklets = new ArrayList<>();
+    final Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+    // for non-filter queries
+    if (filterExp == null) {
+      // if filter is not passed, then return all the blocklets.
+      return pruneWithoutFilter(segments, partitions, blocklets);
+    }
+    // for filter queries
+    int totalFiles = 0;
+    boolean isBlockDataMapType = true;
+    for (Segment segment : segments) {
+      for (DataMap dataMap : dataMaps.get(segment)) {
+        if (!(dataMap instanceof BlockDataMap)) {
+          isBlockDataMapType = false;
+          break;
+        }
+        totalFiles += ((BlockDataMap) dataMap).getTotalBlocks();
+      }
+      if (!isBlockDataMapType) {
+        // totalFiles will be 0 for non-BlockDataMap types, e.g. lucene or bloom datamap. Use old flow.
+        break;
+      }
+    }
+    int numOfThreadsForPruning = getNumOfThreadsForPruning();
+    int filesPerEachThread = totalFiles / numOfThreadsForPruning;
+    if (numOfThreadsForPruning == 1 || filesPerEachThread == 1
+        || segments.size() < numOfThreadsForPruning || totalFiles
+        < CarbonCommonConstants.CARBON_DRIVER_PRUNING_MULTI_THREAD_ENABLE_FILES_COUNT) {
+      // use multi-thread, only if the files are more than 0.1 million.
+      // As 0.1 million files block pruning can take only 1 second.
+      // Doing multi-thread for smaller values is not recommended as
+      // driver should have minimum threads opened to support multiple concurrent queries.
+      return pruneWithFilter(segments, filterExp, partitions, blocklets, dataMaps);
+    }
+    // handle by multi-thread
+    return pruneWithFilterMultiThread(segments, filterExp, partitions, blocklets, dataMaps,
+        totalFiles);
+  }
+
+  private List<ExtendedBlocklet> pruneWithoutFilter(List<Segment> segments,
+      List<PartitionSpec> partitions, List<ExtendedBlocklet> blocklets) throws IOException {
+    for (Segment segment : segments) {
+      List<Blocklet> allBlocklets = blockletDetailsFetcher.getAllBlocklets(segment, partitions);
+      blocklets.addAll(
+          addSegmentId(blockletDetailsFetcher.getExtendedBlocklets(allBlocklets, segment),
+              segment.toString()));
+    }
+    return blocklets;
+  }
+
+  private List<ExtendedBlocklet> pruneWithFilter(List<Segment> segments,
+      FilterResolverIntf filterExp, List<PartitionSpec> partitions,
+      List<ExtendedBlocklet> blocklets, Map<Segment, List<DataMap>> dataMaps) throws IOException {
     for (Segment segment : segments) {
       List<Blocklet> pruneBlocklets = new ArrayList<>();
-      // if filter is not passed then return all the blocklets
-      if (filterExp == null) {
-        pruneBlocklets = blockletDetailsFetcher.getAllBlocklets(segment, partitions);
-      } else {
-        segmentProperties = segmentPropertiesFetcher.getSegmentProperties(segment);
-        for (DataMap dataMap : dataMaps.get(segment)) {
-          pruneBlocklets.addAll(dataMap.prune(filterExp, segmentProperties, partitions));
+      SegmentProperties segmentProperties = segmentPropertiesFetcher.getSegmentProperties(segment);
+      for (DataMap dataMap : dataMaps.get(segment)) {
+        pruneBlocklets.addAll(dataMap.prune(filterExp, segmentProperties, partitions));
+      }
+      blocklets.addAll(
+          addSegmentId(blockletDetailsFetcher.getExtendedBlocklets(pruneBlocklets, segment),
+              segment.toString()));
+    }
+    return blocklets;
+  }
+
+  private List<ExtendedBlocklet> pruneWithFilterMultiThread(List<Segment> segments,
+      final FilterResolverIntf filterExp, final List<PartitionSpec> partitions,
+      List<ExtendedBlocklet> blocklets, final Map<Segment, List<DataMap>> dataMaps,
+      int totalFiles) {
+    int numOfThreadsForPruning = getNumOfThreadsForPruning();
+    int filesPerEachThread = (int) Math.ceil((double) totalFiles / numOfThreadsForPruning);
+    int prev = 0;
+    int filesCount = 0;
+    int processedFileCount = 0;
+    List<List<Segment>> segmentList = new ArrayList<>();

--- End diff ---

I feel it is better that the splitting happens per datamap, not per segment. One segment can have a million files in the case of a big load, so please try parallel execution of datamap pruning at the datamap level.
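The suggested change, distributing pruning work per datamap instead of per segment, could be sketched roughly as below. This is not the PR's actual code: the class name `ParallelPrune` is hypothetical, and generic `Callable` tasks returning blocklet identifiers stand in for the real per-datamap `prune` calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPrune {
    // Submit one pruning task per datamap rather than per segment, so a single
    // huge segment (e.g. a big load with a million files) still spreads across
    // all pruning threads instead of being handled by one thread.
    public static List<String> pruneAll(List<Callable<List<String>>> perDataMapTasks,
            int numThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // invokeAll blocks until every pruning task has completed
            List<Future<List<String>>> futures = pool.invokeAll(perDataMapTasks);
            List<String> blocklets = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                blocklets.addAll(future.get()); // propagates any pruning failure
            }
            return blocklets;
        } finally {
            pool.shutdown();
        }
    }
}
```

A fixed pool bounds the number of driver threads, which matches the concern in the diff that the driver must keep threads free for concurrent queries.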
Github user ajantha-bhat commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2936#discussion_r235615112

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java ---
@@ -120,37 +132,166 @@ public BlockletDetailsFetcher getBlockletDetailsFetcher() {
    * @param filterExp
    * @return
    */
-  public List<ExtendedBlocklet> prune(List<Segment> segments, FilterResolverIntf filterExp,
-      List<PartitionSpec> partitions) throws IOException {
-    List<ExtendedBlocklet> blocklets = new ArrayList<>();
-    SegmentProperties segmentProperties;
-    Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+  public List<ExtendedBlocklet> prune(List<Segment> segments, final FilterResolverIntf filterExp,
+      final List<PartitionSpec> partitions) throws IOException {
+    final List<ExtendedBlocklet> blocklets = new ArrayList<>();
+    final Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+    // for non-filter queries
+    if (filterExp == null) {
+      // if filter is not passed, then return all the blocklets.
+      return pruneWithoutFilter(segments, partitions, blocklets);

--- End diff ---

Yes, this was already tested: for 100k files, pruning with a filter takes around 1 second, but without a filter it takes only about 50 ms. That is very little, so the non-filter path is not handled here; it is the filter pruning that was taking time.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1509/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9767/