Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1497/
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9754/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1706/
Github user ajantha-bhat commented on the issue:
https://github.com/apache/carbondata/pull/2936 @manishgupta88, @ravipesala: please review
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1498/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1707/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9755/
Github user ajantha-bhat commented on the issue:
https://github.com/apache/carbondata/pull/2936 retest this please
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1500/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1709/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9757/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9762/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1714/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1505/
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2936#discussion_r235611496

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java ---
@@ -120,37 +132,166 @@ public BlockletDetailsFetcher getBlockletDetailsFetcher() {
    * @param filterExp
    * @return
    */
-  public List<ExtendedBlocklet> prune(List<Segment> segments, FilterResolverIntf filterExp,
-      List<PartitionSpec> partitions) throws IOException {
-    List<ExtendedBlocklet> blocklets = new ArrayList<>();
-    SegmentProperties segmentProperties;
-    Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+  public List<ExtendedBlocklet> prune(List<Segment> segments, final FilterResolverIntf filterExp,
+      final List<PartitionSpec> partitions) throws IOException {
+    final List<ExtendedBlocklet> blocklets = new ArrayList<>();
+    final Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+    // for non-filter queries
+    if (filterExp == null) {
+      // if filter is not passed, then return all the blocklets.
+      return pruneWithoutFilter(segments, partitions, blocklets);

--- End diff ---

Please check the time taken to get all blocks when there are millions of files. If it takes too long, we may need to parallelize this path as well.
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2936#discussion_r235611698

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java ---
@@ -120,37 +132,166 @@ public BlockletDetailsFetcher getBlockletDetailsFetcher() {
    * @param filterExp
    * @return
    */
-  public List<ExtendedBlocklet> prune(List<Segment> segments, FilterResolverIntf filterExp,
-      List<PartitionSpec> partitions) throws IOException {
-    List<ExtendedBlocklet> blocklets = new ArrayList<>();
-    SegmentProperties segmentProperties;
-    Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+  public List<ExtendedBlocklet> prune(List<Segment> segments, final FilterResolverIntf filterExp,
+      final List<PartitionSpec> partitions) throws IOException {
+    final List<ExtendedBlocklet> blocklets = new ArrayList<>();
+    final Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+    // for non-filter queries
+    if (filterExp == null) {
+      // if filter is not passed, then return all the blocklets.
+      return pruneWithoutFilter(segments, partitions, blocklets);
+    }
+    // for filter queries
+    int totalFiles = 0;
+    boolean isBlockDataMapType = true;
+    for (Segment segment : segments) {
+      for (DataMap dataMap : dataMaps.get(segment)) {
+        if (!(dataMap instanceof BlockDataMap)) {

--- End diff ---

This flow can be used by all datamaps; why restrict it to BlockDataMap?
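One way to lift the `instanceof BlockDataMap` restriction would be to let every datamap report its own entry count. The sketch below is purely illustrative, not CarbonData's actual API: all interface, class, and method names here (`PrunableIndex`, `getTotalEntries`, `PruningDecision`) are hypothetical.

```java
import java.util.List;

// Hypothetical interface: a count method on the datamap abstraction would let
// every datamap type opt in to multi-threaded pruning, not just BlockDataMap.
interface PrunableIndex {
    // 0 means "count unknown"; callers then fall back to the single-threaded flow.
    default int getTotalEntries() { return 0; }
}

// A datamap that knows how many files/blocks it indexes.
class FileLevelIndex implements PrunableIndex {
    private final int files;
    FileLevelIndex(int files) { this.files = files; }
    @Override public int getTotalEntries() { return files; }
}

// A datamap like lucene or bloom that cannot cheaply report a count.
class OpaqueIndex implements PrunableIndex { }

public class PruningDecision {
    // Decide on multi-threading from counts alone, with no instanceof checks.
    public static boolean useMultiThread(List<PrunableIndex> indexes, int threshold) {
        int total = 0;
        for (PrunableIndex index : indexes) {
            int entries = index.getTotalEntries();
            if (entries == 0) {
                return false; // a datamap cannot report its count: stay single-threaded
            }
            total += entries;
        }
        return total >= threshold;
    }
}
```

With this shape, the file-count threshold check from the PR would stay, but a new datamap type would only need to implement the count method to participate in parallel pruning.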
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2936#discussion_r235612449

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java ---
@@ -120,37 +132,166 @@ public BlockletDetailsFetcher getBlockletDetailsFetcher() {
    * @param filterExp
    * @return
    */
-  public List<ExtendedBlocklet> prune(List<Segment> segments, FilterResolverIntf filterExp,
-      List<PartitionSpec> partitions) throws IOException {
-    List<ExtendedBlocklet> blocklets = new ArrayList<>();
-    SegmentProperties segmentProperties;
-    Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+  public List<ExtendedBlocklet> prune(List<Segment> segments, final FilterResolverIntf filterExp,
+      final List<PartitionSpec> partitions) throws IOException {
+    final List<ExtendedBlocklet> blocklets = new ArrayList<>();
+    final Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+    // for non-filter queries
+    if (filterExp == null) {
+      // if filter is not passed, then return all the blocklets.
+      return pruneWithoutFilter(segments, partitions, blocklets);
+    }
+    // for filter queries
+    int totalFiles = 0;
+    boolean isBlockDataMapType = true;
+    for (Segment segment : segments) {
+      for (DataMap dataMap : dataMaps.get(segment)) {
+        if (!(dataMap instanceof BlockDataMap)) {
+          isBlockDataMapType = false;
+          break;
+        }
+        totalFiles += ((BlockDataMap) dataMap).getTotalBlocks();
+      }
+      if (!isBlockDataMapType) {
+        // totalFiles will be 0 for non-BlockDataMap types, e.g. lucene or bloom datamap. Use old flow.
+        break;
+      }
+    }
+    int numOfThreadsForPruning = getNumOfThreadsForPruning();
+    int filesPerEachThread = totalFiles / numOfThreadsForPruning;
+    if (numOfThreadsForPruning == 1 || filesPerEachThread == 1
+        || segments.size() < numOfThreadsForPruning || totalFiles
+        < CarbonCommonConstants.CARBON_DRIVER_PRUNING_MULTI_THREAD_ENABLE_FILES_COUNT) {
+      // use multi-thread, only if the files are more than 0.1 million.
+      // As 0.1 million files block pruning can take only 1 second.
+      // Doing multi-thread for smaller values is not recommended as
+      // driver should have minimum threads opened to support multiple concurrent queries.
+      return pruneWithFilter(segments, filterExp, partitions, blocklets, dataMaps);
+    }
+    // handle by multi-thread
+    return pruneWithFilterMultiThread(segments, filterExp, partitions, blocklets, dataMaps,
+        totalFiles);
+  }
+
+  private List<ExtendedBlocklet> pruneWithoutFilter(List<Segment> segments,
+      List<PartitionSpec> partitions, List<ExtendedBlocklet> blocklets) throws IOException {
+    for (Segment segment : segments) {
+      List<Blocklet> allBlocklets = blockletDetailsFetcher.getAllBlocklets(segment, partitions);
+      blocklets.addAll(
+          addSegmentId(blockletDetailsFetcher.getExtendedBlocklets(allBlocklets, segment),
+              segment.toString()));
+    }
+    return blocklets;
+  }
+
+  private List<ExtendedBlocklet> pruneWithFilter(List<Segment> segments,
+      FilterResolverIntf filterExp, List<PartitionSpec> partitions,
+      List<ExtendedBlocklet> blocklets, Map<Segment, List<DataMap>> dataMaps) throws IOException {
     for (Segment segment : segments) {
       List<Blocklet> pruneBlocklets = new ArrayList<>();
-      // if filter is not passed then return all the blocklets
-      if (filterExp == null) {
-        pruneBlocklets = blockletDetailsFetcher.getAllBlocklets(segment, partitions);
-      } else {
-        segmentProperties = segmentPropertiesFetcher.getSegmentProperties(segment);
-        for (DataMap dataMap : dataMaps.get(segment)) {
-          pruneBlocklets.addAll(dataMap.prune(filterExp, segmentProperties, partitions));
+      SegmentProperties segmentProperties = segmentPropertiesFetcher.getSegmentProperties(segment);
+      for (DataMap dataMap : dataMaps.get(segment)) {
+        pruneBlocklets.addAll(dataMap.prune(filterExp, segmentProperties, partitions));
+      }
+      blocklets.addAll(
+          addSegmentId(blockletDetailsFetcher.getExtendedBlocklets(pruneBlocklets, segment),
+              segment.toString()));
+    }
+    return blocklets;
+  }
+
+  private List<ExtendedBlocklet> pruneWithFilterMultiThread(List<Segment> segments,
+      final FilterResolverIntf filterExp, final List<PartitionSpec> partitions,
+      List<ExtendedBlocklet> blocklets, final Map<Segment, List<DataMap>> dataMaps,
+      int totalFiles) {
+    int numOfThreadsForPruning = getNumOfThreadsForPruning();
+    int filesPerEachThread = (int) Math.ceil((double) totalFiles / numOfThreadsForPruning);
+    int prev = 0;
+    int filesCount = 0;
+    int processedFileCount = 0;
+    List<List<Segment>> segmentList = new ArrayList<>();

--- End diff ---

I feel it is better that the splitting happens per datamap, not per segment. One segment can have a million files in the case of a big load, so please try parallel execution of datamap pruning at the datamap level.
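The suggested change, distributing pruning work per datamap instead of per segment, could be sketched roughly as below. This is not the PR's actual code: the class name `ParallelPrune` is hypothetical, and generic `Callable` tasks returning blocklet identifiers stand in for the real per-datamap `prune` calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPrune {
    // Submit one pruning task per datamap rather than per segment, so a single
    // huge segment (e.g. a big load with a million files) still spreads across
    // all pruning threads instead of being handled by one thread.
    public static List<String> pruneAll(List<Callable<List<String>>> perDataMapTasks,
            int numThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // invokeAll blocks until every pruning task has completed
            List<Future<List<String>>> futures = pool.invokeAll(perDataMapTasks);
            List<String> blocklets = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                blocklets.addAll(future.get()); // propagates any pruning failure
            }
            return blocklets;
        } finally {
            pool.shutdown();
        }
    }
}
```

A fixed pool bounds the number of driver threads, which matches the concern in the diff that the driver must keep threads free for concurrent queries.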
Github user ajantha-bhat commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2936#discussion_r235615112

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java ---
@@ -120,37 +132,166 @@ public BlockletDetailsFetcher getBlockletDetailsFetcher() {
    * @param filterExp
    * @return
    */
-  public List<ExtendedBlocklet> prune(List<Segment> segments, FilterResolverIntf filterExp,
-      List<PartitionSpec> partitions) throws IOException {
-    List<ExtendedBlocklet> blocklets = new ArrayList<>();
-    SegmentProperties segmentProperties;
-    Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+  public List<ExtendedBlocklet> prune(List<Segment> segments, final FilterResolverIntf filterExp,
+      final List<PartitionSpec> partitions) throws IOException {
+    final List<ExtendedBlocklet> blocklets = new ArrayList<>();
+    final Map<Segment, List<DataMap>> dataMaps = dataMapFactory.getDataMaps(segments);
+    // for non-filter queries
+    if (filterExp == null) {
+      // if filter is not passed, then return all the blocklets.
+      return pruneWithoutFilter(segments, partitions, blocklets);

--- End diff ---

Yes, this was already tested: for 100k files, pruning with a filter takes around 1 second, but without a filter it takes only about 50 ms. That is very little, so the non-filter path is not handled here; it is the filter pruning that was taking time.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1509/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2936 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9767/