[GitHub] carbondata pull request #2822: [CARBONDATA-3014] Added support for inverted ...


[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9133/



---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1067/



---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/870/



---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/871/



---

[GitHub] carbondata pull request #2822: [CARBONDATA-3014] Added support for inverted ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kunal642 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2822#discussion_r226301860
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/scanner/impl/BlockletFilterScanner.java ---
    @@ -316,4 +320,167 @@ private BlockletScannedResult executeFilter(RawBlockletColumnChunks rawBlockletC
             readTime.getCount() + dimensionReadTime);
         return scannedResult;
       }
    +
    +  /**
    +   * This method will process the data in the below order:
    +   * 1. First apply min/max on the filter tree and check whether any filter value
    +   * falls within the min/max range; if not, return an empty result.
    +   * 2. If the filter falls within the min/max range, then apply the filter on the actual
    +   * data and get the pruned pages.
    +   * 3. If the pruned pages are not empty, then read only those blocks (measure or dimension)
    +   * which are present in the query but not in the filter; while applying the filter
    +   * some of the blocks were already read and are present in the chunk holder, so there is
    +   * no need to read those blocks again. This avoids re-reading blocks that were already read.
    +   * 4. Set the blocks and filtered pages on the scanned result.
    +   *
    +   * @param rawBlockletColumnChunks blocklet raw chunk of all columns
    +   * @throws FilterUnsupportedException
    +   */
    +  private BlockletScannedResult executeFilterForPages(
    +      RawBlockletColumnChunks rawBlockletColumnChunks)
    +      throws FilterUnsupportedException, IOException {
    +    long startTime = System.currentTimeMillis();
    +    QueryStatistic totalBlockletStatistic = queryStatisticsModel.getStatisticsTypeAndObjMap()
    +        .get(QueryStatisticsConstants.TOTAL_BLOCKLET_NUM);
    +    totalBlockletStatistic.addCountStatistic(QueryStatisticsConstants.TOTAL_BLOCKLET_NUM,
    +        totalBlockletStatistic.getCount() + 1);
    +    // apply filter on actual data, for each page
    +    BitSet pages = this.filterExecuter.prunePages(rawBlockletColumnChunks);
    +    // if filter result is empty then return with empty result
    +    if (pages.isEmpty()) {
    +      CarbonUtil.freeMemory(rawBlockletColumnChunks.getDimensionRawColumnChunks(),
    +          rawBlockletColumnChunks.getMeasureRawColumnChunks());
    +
    +      QueryStatistic scanTime = queryStatisticsModel.getStatisticsTypeAndObjMap()
    +          .get(QueryStatisticsConstants.SCAN_BLOCKlET_TIME);
    +      scanTime.addCountStatistic(QueryStatisticsConstants.SCAN_BLOCKlET_TIME,
    +          scanTime.getCount() + (System.currentTimeMillis() - startTime));
    +
    +      QueryStatistic scannedPages = queryStatisticsModel.getStatisticsTypeAndObjMap()
    +          .get(QueryStatisticsConstants.PAGE_SCANNED);
    +      scannedPages.addCountStatistic(QueryStatisticsConstants.PAGE_SCANNED,
    +          scannedPages.getCount());
    +      return createEmptyResult();
    +    }
    +
    +    BlockletScannedResult scannedResult =
    +        new FilterQueryScannedResult(blockExecutionInfo, queryStatisticsModel);
    +
    +    // valid scanned blocklet
    +    QueryStatistic validScannedBlockletStatistic = queryStatisticsModel.getStatisticsTypeAndObjMap()
    +        .get(QueryStatisticsConstants.VALID_SCAN_BLOCKLET_NUM);
    +    validScannedBlockletStatistic
    +        .addCountStatistic(QueryStatisticsConstants.VALID_SCAN_BLOCKLET_NUM,
    +            validScannedBlockletStatistic.getCount() + 1);
    +    // adding statistics for valid number of pages
    +    QueryStatistic validPages = queryStatisticsModel.getStatisticsTypeAndObjMap()
    +        .get(QueryStatisticsConstants.VALID_PAGE_SCANNED);
    +    validPages.addCountStatistic(QueryStatisticsConstants.VALID_PAGE_SCANNED,
    +        validPages.getCount() + pages.cardinality());
    +    QueryStatistic scannedPages = queryStatisticsModel.getStatisticsTypeAndObjMap()
    +        .get(QueryStatisticsConstants.PAGE_SCANNED);
    +    scannedPages.addCountStatistic(QueryStatisticsConstants.PAGE_SCANNED,
    +        scannedPages.getCount() + pages.cardinality());
    +    // get the page indexes from the bit set of pruned pages
    +    int[] pageFilteredPages = new int[pages.cardinality()];
    +    int index = 0;
    +    for (int i = pages.nextSetBit(0); i >= 0; i = pages.nextSetBit(i + 1)) {
    +      pageFilteredPages[index++] = i;
    +    }
    +    // in the count(*) case there would not be any dimensions or measures selected.
    +    int[] numberOfRows = new int[pages.cardinality()];
    +    for (int i = 0; i < numberOfRows.length; i++) {
    +      numberOfRows[i] = rawBlockletColumnChunks.getDataBlock().getPageRowCount(i);
    --- End diff --
   
    This will fill numberOfRows for the pages incorrectly, because it looks up the row count for
    the loop index (0 .. cardinality-1) instead of the actual pruned page numbers. I think it
    should be:
    for (int i = pages.nextSetBit(0); i >= 0; i = pages.nextSetBit(i + 1)) {
      pageFilteredPages[index] = i;
      numberOfRows[index++] = rawBlockletColumnChunks.getDataBlock().getPageRowCount(i);
    }
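
    For readers following the thread, here is a minimal, self-contained sketch of the behaviour
    being suggested. The 5-page blocklet and the pageRowCount array are hypothetical stand-ins
    for rawBlockletColumnChunks.getDataBlock().getPageRowCount(i); the point is only that both
    arrays must be filled from the set bits of the pruned-page BitSet, not from a 0-based loop
    counter:

        import java.util.Arrays;
        import java.util.BitSet;

        public class PrunedPagesSketch {
          public static void main(String[] args) {
            // Hypothetical per-page row counts for a blocklet with 5 pages.
            int[] pageRowCount = {32000, 32000, 32000, 32000, 1500};

            // Assume the filter pruned everything except pages 1 and 4.
            BitSet pages = new BitSet();
            pages.set(1);
            pages.set(4);

            int[] pageFilteredPages = new int[pages.cardinality()];
            int[] numberOfRows = new int[pages.cardinality()];

            // Fill both arrays from the set bits, so the row count is looked up
            // for the pruned page numbers (1 and 4), not for indexes 0 and 1.
            int index = 0;
            for (int i = pages.nextSetBit(0); i >= 0; i = pages.nextSetBit(i + 1)) {
              pageFilteredPages[index] = i;
              numberOfRows[index++] = pageRowCount[i];
            }

            System.out.println(Arrays.toString(pageFilteredPages)); // [1, 4]
            System.out.println(Arrays.toString(numberOfRows));      // [32000, 1500]
          }
        }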


---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1069/



---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9136/



---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/872/



---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9137/



---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1070/



---

[GitHub] carbondata pull request #2822: [CARBONDATA-3014] Added support for inverted ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2822#discussion_r226830306
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/scanner/impl/BlockletFilterScanner.java ---
    @@ -316,4 +320,167 @@ private BlockletScannedResult executeFilter(RawBlockletColumnChunks rawBlockletC
             readTime.getCount() + dimensionReadTime);
         return scannedResult;
       }
    +
    +  /**
    +   * This method will process the data in the below order:
    +   * 1. First apply min/max on the filter tree and check whether any filter value
    +   * falls within the min/max range; if not, return an empty result.
    +   * 2. If the filter falls within the min/max range, then apply the filter on the actual
    +   * data and get the pruned pages.
    +   * 3. If the pruned pages are not empty, then read only those blocks (measure or dimension)
    +   * which are present in the query but not in the filter; while applying the filter
    +   * some of the blocks were already read and are present in the chunk holder, so there is
    +   * no need to read those blocks again. This avoids re-reading blocks that were already read.
    +   * 4. Set the blocks and filtered pages on the scanned result.
    +   *
    +   * @param rawBlockletColumnChunks blocklet raw chunk of all columns
    +   * @throws FilterUnsupportedException
    +   */
    +  private BlockletScannedResult executeFilterForPages(
    +      RawBlockletColumnChunks rawBlockletColumnChunks)
    +      throws FilterUnsupportedException, IOException {
    +    long startTime = System.currentTimeMillis();
    +    QueryStatistic totalBlockletStatistic = queryStatisticsModel.getStatisticsTypeAndObjMap()
    +        .get(QueryStatisticsConstants.TOTAL_BLOCKLET_NUM);
    +    totalBlockletStatistic.addCountStatistic(QueryStatisticsConstants.TOTAL_BLOCKLET_NUM,
    +        totalBlockletStatistic.getCount() + 1);
    +    // apply filter on actual data, for each page
    +    BitSet pages = this.filterExecuter.prunePages(rawBlockletColumnChunks);
    +    // if filter result is empty then return with empty result
    +    if (pages.isEmpty()) {
    +      CarbonUtil.freeMemory(rawBlockletColumnChunks.getDimensionRawColumnChunks(),
    +          rawBlockletColumnChunks.getMeasureRawColumnChunks());
    +
    +      QueryStatistic scanTime = queryStatisticsModel.getStatisticsTypeAndObjMap()
    +          .get(QueryStatisticsConstants.SCAN_BLOCKlET_TIME);
    +      scanTime.addCountStatistic(QueryStatisticsConstants.SCAN_BLOCKlET_TIME,
    +          scanTime.getCount() + (System.currentTimeMillis() - startTime));
    +
    +      QueryStatistic scannedPages = queryStatisticsModel.getStatisticsTypeAndObjMap()
    +          .get(QueryStatisticsConstants.PAGE_SCANNED);
    +      scannedPages.addCountStatistic(QueryStatisticsConstants.PAGE_SCANNED,
    +          scannedPages.getCount());
    +      return createEmptyResult();
    +    }
    +
    +    BlockletScannedResult scannedResult =
    +        new FilterQueryScannedResult(blockExecutionInfo, queryStatisticsModel);
    +
    +    // valid scanned blocklet
    +    QueryStatistic validScannedBlockletStatistic = queryStatisticsModel.getStatisticsTypeAndObjMap()
    +        .get(QueryStatisticsConstants.VALID_SCAN_BLOCKLET_NUM);
    +    validScannedBlockletStatistic
    +        .addCountStatistic(QueryStatisticsConstants.VALID_SCAN_BLOCKLET_NUM,
    +            validScannedBlockletStatistic.getCount() + 1);
    +    // adding statistics for valid number of pages
    +    QueryStatistic validPages = queryStatisticsModel.getStatisticsTypeAndObjMap()
    +        .get(QueryStatisticsConstants.VALID_PAGE_SCANNED);
    +    validPages.addCountStatistic(QueryStatisticsConstants.VALID_PAGE_SCANNED,
    +        validPages.getCount() + pages.cardinality());
    +    QueryStatistic scannedPages = queryStatisticsModel.getStatisticsTypeAndObjMap()
    +        .get(QueryStatisticsConstants.PAGE_SCANNED);
    +    scannedPages.addCountStatistic(QueryStatisticsConstants.PAGE_SCANNED,
    +        scannedPages.getCount() + pages.cardinality());
    +    // get the page indexes from the bit set of pruned pages
    +    int[] pageFilteredPages = new int[pages.cardinality()];
    +    int index = 0;
    +    for (int i = pages.nextSetBit(0); i >= 0; i = pages.nextSetBit(i + 1)) {
    +      pageFilteredPages[index++] = i;
    +    }
    +    // in the count(*) case there would not be any dimensions or measures selected.
    +    int[] numberOfRows = new int[pages.cardinality()];
    +    for (int i = 0; i < numberOfRows.length; i++) {
    +      numberOfRows[i] = rawBlockletColumnChunks.getDataBlock().getPageRowCount(i);
    --- End diff --
   
    ok


---

[GitHub] carbondata pull request #2822: [CARBONDATA-3014] Added support for inverted ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala closed the pull request at:

    https://github.com/apache/carbondata/pull/2822


---

[GitHub] carbondata pull request #2822: [CARBONDATA-3014] Added support for inverted ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
GitHub user ravipesala reopened a pull request:

    https://github.com/apache/carbondata/pull/2822

    [CARBONDATA-3014] Added support for inverted index and delete delta for direct scan queries

    This PR depends on PR https://github.com/apache/carbondata/pull/2820
   
    Added new classes to support inverted index and delete delta directly on the column vector; a simplified sketch follows the class list below.
    `ColumnarVectorWrapperDirectWithInvertedIndex`
    `ColumnarVectorWrapperDirectWithDeleteDelta`
    `ColumnarVectorWrapperDirectWithDeleteDeltaAndInvertedIndex`
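
    To make the intent of these wrappers easier to picture, here is a greatly simplified,
    hypothetical sketch of the inverted-index case. SimpleIntVector and InvertedIndexVectorSketch
    are invented names for illustration only, not the actual CarbonData interfaces; the delete
    delta variants would additionally skip rows flagged as deleted while filling the vector.

        /** Hypothetical stand-in for a columnar vector; not the CarbonData interface. */
        interface SimpleIntVector {
          void putInt(int rowId, int value);
          void putNull(int rowId);
        }

        /**
         * Sketch of the inverted index idea: values are decoded in stored (sorted) order,
         * and the wrapper redirects each write to the row's original position, so the
         * consumer of the vector sees rows in their original order.
         */
        final class InvertedIndexVectorSketch implements SimpleIntVector {
          private final SimpleIntVector delegate;
          private final int[] invertedIndex; // stored position -> original row position

          InvertedIndexVectorSketch(SimpleIntVector delegate, int[] invertedIndex) {
            this.delegate = delegate;
            this.invertedIndex = invertedIndex;
          }

          @Override public void putInt(int rowId, int value) {
            delegate.putInt(invertedIndex[rowId], value);
          }

          @Override public void putNull(int rowId) {
            delegate.putNull(invertedIndex[rowId]);
          }
        }

    A decoder would then call putInt(storedRow, value) for each value in stored order, and the
    delegate vector ends up filled in the original row order without a separate re-sort pass.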
   
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
   
     - [ ] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/incubator-carbondata perf-inverted-index

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2822.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2822
   
----
commit 42dfd6adec741e9cf98af92e8a4c3d7810a681e8
Author: ravipesala <ravi.pesala@...>
Date:   2018-10-16T05:02:18Z

    Add carbon property to configure vector based row pruning push down

commit d9ae60c8f7b0b90d6b5a113043c5ec4cd3acf726
Author: ravipesala <ravi.pesala@...>
Date:   2018-10-16T06:00:43Z

    Added support for full scan queries for vector direct fill.

commit ff36f4b55f26732b6a669fcd2edd4e958a04818a
Author: ravipesala <ravi.pesala@...>
Date:   2018-10-21T13:44:11Z

    Fix comments

commit 12878a2591795e53826f615dc54fc3d443227a41
Author: ravipesala <ravi.pesala@...>
Date:   2018-10-16T09:23:14Z

    Added support for pruning pages for vector direct fill.

commit 12bed1a2b875962a90621af6f638a41e7e3f6d4f
Author: ravipesala <ravi.pesala@...>
Date:   2018-10-21T15:27:50Z

    Fix comments

commit 1b08711b3c88539267363735884884499f5586f8
Author: ravipesala <ravi.pesala@...>
Date:   2018-10-16T11:07:18Z

    Added support for inverted index and delete delta for direct scan queries

----


---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    retest this please


---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    retest this please


---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    retest this please


---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    retest this please


---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    retest this please


---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    retest this please


---

[GitHub] carbondata issue #2822: [CARBONDATA-3014] Added support for inverted index a...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2822
 
    retest this please


---