GitHub user manishgupta88 opened a pull request:
https://github.com/apache/carbondata/pull/2725

[WIP] Added code to support storing min max for string columns based on number of characters

- [ ] Any interfaces changed?
- [ ] Any backward compatibility impacted?
- [ ] Document update required?
- [ ] Testing done
      Please provide details on
      - Whether new unit test cases have been added or why no new tests are required?
      - How it is tested? Please attach test report.
      - Is it a performance related change? Please attach the performance test report.
      - Any additional information to help reviewers in testing this change.
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishgupta88/carbondata string_min_max_decision

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2725.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2725

----
commit 6ef6d880e68aa3bfc1c704659c916b384e5d545a
Author: manishgupta88 <tomanishgupta18@...>
Date: 2018-09-14T05:13:20Z

    Modified code to support writing the min max flag for all the columns in the metadata.
    This will help in deciding whether min max for a column is written or not.

commit 4ed24f3972cb4c03133912ba8a1c9d35b3122e7a
Author: manishgupta88 <tomanishgupta18@...>
Date: 2018-09-15T06:20:03Z

    Modified code to support filter query using min max flag for a column.

----
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/296/ ---
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8541/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/471/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/298/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8543/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/473/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/302/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8547/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/477/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/311/ ---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218048262

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/DataRefNode.java ---

@@ -125,4 +125,12 @@ DimensionRawColumnChunk readDimensionChunk(FileReader fileReader, int columnInde
    * @return
    */
   BitSetGroup getIndexedData();
+
+  /**
+   * Return the array which contains the flag for each column whether the min max for that column
+   * is written in metadata or not
+   *
+   * @return
+   */
+  boolean[] isMinMaxSet();

--- End diff --

Better change the name to `minMaxFlagArray`

---
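The interface change under review adds a per-column flag array so a reader can tell whether min/max statistics were actually written for a column before relying on them for pruning. The following is a minimal, hypothetical sketch of that idea, using the reviewer's suggested name `minMaxFlagArray`; the `DataBlock`, `SimpleDataBlock`, and `MinMaxFlagDemo` names are illustrative, not the actual CarbonData classes:

```java
// Hypothetical sketch, not the actual CarbonData interfaces: a data block
// exposes one boolean per column recording whether min/max statistics
// were written for that column, and a caller falls back to a full scan
// whenever the flag is false.
interface DataBlock {
  // one entry per column: true if min/max stats exist in the metadata
  boolean[] minMaxFlagArray();
}

class SimpleDataBlock implements DataBlock {
  private final boolean[] flags;

  SimpleDataBlock(boolean[] flags) {
    this.flags = flags;
  }

  @Override public boolean[] minMaxFlagArray() {
    return flags;
  }
}

public class MinMaxFlagDemo {
  // when min/max is absent for the column, pruning is impossible and the
  // block must be scanned; otherwise defer to the real min/max comparison
  static boolean scanRequired(DataBlock block, int columnIndex) {
    return !block.minMaxFlagArray()[columnIndex] || pruneSaysScan(columnIndex);
  }

  // placeholder for the real min/max-vs-filter comparison
  static boolean pruneSaysScan(int columnIndex) {
    return false;
  }

  public static void main(String[] args) {
    DataBlock block = new SimpleDataBlock(new boolean[] { true, false });
    System.out.println(scanRequired(block, 0)); // min/max present, pruned
    System.out.println(scanRequired(block, 1)); // no min/max, must scan
  }
}
```

The key invariant is that a missing flag must degrade to "scan required" rather than "prune", so correctness never depends on absent statistics.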
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218059894

--- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataRefNode.java ---

@@ -133,6 +134,21 @@
     return null;
   }

+  @Override
+  public boolean[] isMinMaxSet() {
+    BlockletIndex blockletIndex =
+        blockInfos.get(index).getDetailInfo().getBlockletInfo().getBlockletIndex();

--- End diff --

No need to pass this information from the driver to the executor.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218060271

--- Diff: core/src/main/java/org/apache/carbondata/core/metadata/blocklet/index/BlockletIndex.java ---

@@ -75,4 +80,12 @@ public void setMinMaxIndex(BlockletMinMaxIndex minMaxIndex) {
     this.minMaxIndex = minMaxIndex;
   }

+  @Override public void write(DataOutput out) throws IOException {

--- End diff --

Please don't serialize this, as it is not required to pass it to the executor.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218061484

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/IncludeFilterExecuterImpl.java ---

@@ -108,10 +108,13 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks,
     BitSetGroup bitSetGroup = new BitSetGroup(dimensionRawColumnChunk.getPagesCount());
     filterValues = dimColumnExecuterInfo.getFilterKeys();
     boolean isDecoded = false;
+    boolean isMinMaxSetForFilterDimension =
+        rawBlockletColumnChunks.getDataBlock().isMinMaxSet()[chunkIndex];
     for (int i = 0; i < dimensionRawColumnChunk.getPagesCount(); i++) {
       if (dimensionRawColumnChunk.getMaxValues() != null) {
-        if (isScanRequired(dimensionRawColumnChunk.getMaxValues()[i],
-            dimensionRawColumnChunk.getMinValues()[i], dimColumnExecuterInfo.getFilterKeys())) {
+        if (!isMinMaxSetForFilterDimension || isScanRequired(

--- End diff --

Please use the min/max flags from the current page and do the scan-required check.

---
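The reviewer's point is that a single blocklet-level flag is coarser than necessary: the flag should be consulted per page, so that pages whose min/max was skipped are always scanned while the remaining pages can still be pruned. A hypothetical sketch of that per-page check; the class name, the `byte[][]` page layout, and the unsigned lexicographic comparison are illustrative assumptions, not the actual CarbonData code:

```java
// Hypothetical sketch of per-page min/max pruning: each page carries its
// own flag; a page without min/max is always scanned, other pages are
// pruned when the filter value exceeds the page max.
public class PageLevelPruning {
  static java.util.BitSet applyFilter(boolean[] pageMinMaxFlags,
                                      byte[][] pageMaxValues, byte[] filterValue) {
    java.util.BitSet pagesToScan = new java.util.BitSet(pageMinMaxFlags.length);
    for (int page = 0; page < pageMinMaxFlags.length; page++) {
      // no min/max written for this page -> cannot prune, must scan
      if (!pageMinMaxFlags[page] || isScanRequired(pageMaxValues[page], filterValue)) {
        pagesToScan.set(page);
      }
    }
    return pagesToScan;
  }

  // scan is needed when the filter value could still be <= the page max
  static boolean isScanRequired(byte[] maxValue, byte[] filterValue) {
    return compareUnsigned(filterValue, maxValue) <= 0;
  }

  // lexicographic comparison treating bytes as unsigned
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return a.length - b.length;
  }

  public static void main(String[] args) {
    boolean[] flags = { true, false };        // page 1 has no min/max
    byte[][] pageMax = { { 5 }, { 5 } };
    byte[] filter = { 9 };
    // page 0: filter (9) > max (5) -> pruned; page 1: no min/max -> scanned
    System.out.println(applyFilter(flags, pageMax, filter)); // {1}
  }
}
```

This keeps pruning effective for the well-behaved pages of a column even when one page (e.g. with very long string values) opted out of min/max.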
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218062043

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RowLevelRangeGrtThanFiterExecuterImpl.java ---

@@ -194,9 +198,12 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks,
     BitSetGroup bitSetGroup = new BitSetGroup(rawColumnChunk.getPagesCount());
     FilterExecuter filterExecuter = null;
     boolean isExclude = false;
+    boolean isMinMaxSetForFilterDimension =
+        rawBlockletColumnChunks.getDataBlock().isMinMaxSet()[chunkIndex];
     for (int i = 0; i < rawColumnChunk.getPagesCount(); i++) {
       if (rawColumnChunk.getMaxValues() != null) {
-        if (isScanRequired(rawColumnChunk.getMaxValues()[i], this.filterRangeValues)) {
+        if (!isMinMaxSetForFilterDimension || isScanRequired(rawColumnChunk.getMaxValues()[i],

--- End diff --

Please use the min/max flags from the current page and do the scan-required check.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218062180

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/scanner/impl/BlockletFilterScanner.java ---

@@ -122,11 +122,11 @@ public boolean isScanRequired(DataRefNode dataBlock) {
       bitSet = ((ImplicitColumnFilterExecutor) filterExecuter)
           .isFilterValuesPresentInBlockOrBlocklet(
               dataBlock.getColumnsMaxValue(),
-              dataBlock.getColumnsMinValue(), blockletId);
+              dataBlock.getColumnsMinValue(), blockletId, dataBlock.isMinMaxSet());
     } else {
       bitSet = this.filterExecuter
           .isScanRequired(dataBlock.getColumnsMaxValue(),
-              dataBlock.getColumnsMinValue());
+              dataBlock.getColumnsMinValue(), dataBlock.isMinMaxSet());

--- End diff --

Please use the min/max flags from the current footer and do the scan-required check.

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/487/ ---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218066737

--- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonMetadataUtil.java ---

@@ -250,6 +275,42 @@ public static BlockletIndex getBlockletIndex(EncodedBlocklet encodedBlocklet,
   }

   /**
+   * This method will combine the writeMinMax flag from all the pages. If any page for a given
+   * dimension has the writeMinMax flag set to false then min max for that dimension will not be
+   * written in any of the pages or in the metadata
+   *
+   * @param blockletMinMaxIndex
+   * @param encodedBlocklet
+   */
+  private static List<Boolean> mergeWriteMinMaxFlagForAllPages(
+      BlockletMinMaxIndex blockletMinMaxIndex, EncodedBlocklet encodedBlocklet) {
+    Boolean[] mergedWriteMinMaxFlag =
+        new Boolean[encodedBlocklet.getNumberOfDimension() + encodedBlocklet.getNumberOfMeasure()];
+    // set writeMinMax flag to true for all the columns by default and then update if stats object
+    // has this flag set to false
+    Arrays.fill(mergedWriteMinMaxFlag, true);
+    for (int i = 0; i < encodedBlocklet.getNumberOfDimension(); i++) {
+      for (int pageIndex = 0; pageIndex < encodedBlocklet.getNumberOfPages(); pageIndex++) {
+        EncodedColumnPage encodedColumnPage =
+            encodedBlocklet.getEncodedDimensionColumnPages().get(i).getEncodedColumnPageList()
+                .get(pageIndex);
+        SimpleStatsResult stats = encodedColumnPage.getStats();
+        if (!stats.writeMinMax()) {
+          mergedWriteMinMaxFlag[i] = stats.writeMinMax();
+          String columnName = encodedColumnPage.getActualPage().getColumnSpec().getFieldName();
+          LOGGER.info("Min Max writing ignored for column " + columnName + " from page 0 to "

--- End diff --

Add a generic message saying `"Min Max writing of blocklet ignored for column with name " + columnName`

---
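The method under review effectively ANDs the per-page writeMinMax flags of a column, so one page that skipped min/max disables min/max for that column in the whole blocklet. A simplified, hypothetical sketch of that merge, using generic column indexes instead of the real dimension/measure split and `System.out` in place of the logger, with the reviewer's suggested generic log message:

```java
// Hypothetical sketch of merging per-page writeMinMax flags into one
// blocklet-level flag per column: the merged flag starts true and flips
// to false as soon as any page for that column skipped min/max.
import java.util.Arrays;

public class MergeMinMaxFlags {
  // pageFlags[column][page] -> true if that page wrote min/max for the column
  static Boolean[] merge(boolean[][] pageFlags) {
    Boolean[] merged = new Boolean[pageFlags.length];
    // default every column to true, then flip on the first false page flag
    Arrays.fill(merged, Boolean.TRUE);
    for (int col = 0; col < pageFlags.length; col++) {
      for (boolean pageFlag : pageFlags[col]) {
        if (!pageFlag) {
          merged[col] = Boolean.FALSE;
          System.out.println(
              "Min Max writing of blocklet ignored for column with index " + col);
          break;
        }
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    boolean[][] pageFlags = {
        { true, true },   // column 0: all pages wrote min/max -> kept
        { true, false },  // column 1: one page skipped -> dropped for blocklet
    };
    System.out.println(Arrays.toString(merge(pageFlags))); // [true, false]
  }
}
```

Merging to the blocklet level keeps the footer metadata consistent: a reader never sees min/max for a column unless every page of that column contributed to it.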
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8557/ ---