Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] carbondata pull request #1265: [CARBONDATA-1128] Add direct string encoding ...

Classic

List

Threaded

41 messages Options

123

qiuchenjian-2

[GitHub] carbondata issue #1265: [CARBONDATA-1128] Add encoding for non-dictionary di...

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1265

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/543/

---

qiuchenjian-2

[GitHub] carbondata issue #1265: [CARBONDATA-1128] Add encoding for non-dictionary di...

In reply to this post by qiuchenjian-2

Github user jackylk commented on the issue:

https://github.com/apache/carbondata/pull/1265

retest this please

---

qiuchenjian-2

[GitHub] carbondata issue #1265: [CARBONDATA-1128] Add encoding for non-dictionary di...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1265

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/544/

---

qiuchenjian-2

[GitHub] carbondata pull request #1265: [CARBONDATA-1128] Add encoding for non-dictio...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1265#discussion_r137449599

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RowLevelRangeGrtThanFiterExecuterImpl.java ---
@@ -366,7 +364,7 @@ private BitSet setFilterdIndexToBitSet(DimensionColumnDataChunk dimensionColumnD
BitSet bitSet = new BitSet(numerOfRows);
byte[][] filterValues = this.filterRangeValues;
// binary search can only be applied if column is sorted
- if (isNaturalSorted) {
+ if (isNaturalSorted && !dimensionColumnDataChunk.isNoDicitionaryColumn()) {
--- End diff --

why this check is required? even no dictionary columns also can be sorted right?

---

qiuchenjian-2

[GitHub] carbondata pull request #1265: [CARBONDATA-1128] Add encoding for non-dictio...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1265#discussion_r137449915

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RowLevelRangeLessThanEqualFilterExecuterImpl.java ---
@@ -262,10 +259,7 @@ private BitSet getFilteredIndexes(DimensionColumnDataChunk dimensionColumnDataCh
int numerOfRows) {
byte[] defaultValue = null;
if (dimColEvaluatorInfoList.get(0).getDimension().hasEncoding(Encoding.DIRECT_DICTIONARY)) {
- DirectDictionaryGenerator directDictionaryGenerator = DirectDictionaryKeyGeneratorFactory
- .getDirectDictionaryGenerator(
- dimColEvaluatorInfoList.get(0).getDimension().getDataType());
- int key = directDictionaryGenerator.generateDirectSurrogateKey(null) + 1;
+ int key = 0;
--- End diff --

I think it is better to get null key from respective interfaces instead of hard code. if the null value key changes in future also no effect here

---

qiuchenjian-2

[GitHub] carbondata pull request #1265: [CARBONDATA-1128] Add encoding for non-dictio...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1265#discussion_r137449951

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RowLevelRangeLessThanEqualFilterExecuterImpl.java ---
@@ -367,7 +360,7 @@ private BitSet setFilterdIndexToBitSet(DimensionColumnDataChunk dimensionColumnD
BitSet bitSet = new BitSet(numerOfRows);
byte[][] filterValues = this.filterRangeValues;
// binary search can only be applied if column is sorted
- if (isNaturalSorted) {
+ if (isNaturalSorted && !dimensionColumnDataChunk.isNoDicitionaryColumn()) {
--- End diff --

why this check required?

---

qiuchenjian-2

[GitHub] carbondata pull request #1265: [CARBONDATA-1128] Add encoding for non-dictio...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1265#discussion_r137450241

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/partition/EqualToFilterImpl.java ---
@@ -50,7 +50,7 @@ public EqualToFilterImpl(EqualToExpression equalTo, PartitionInfo partitionInfo)
literal.getLiteralExpValue().toString(),
partitionInfo.getColumnSchemaList().get(0).getDataType());
if (PartitionType.RANGE == partitionInfo.getPartitionType() && value instanceof String) {
- value = ByteUtil.toBytes((String)value);
+ value = ByteUtil.toBytesForPlainValue((String)value);
--- End diff --

It seems old overloaded method looks good than this name. It is utility method so better have a overloaded method

---

qiuchenjian-2

[GitHub] carbondata pull request #1265: [CARBONDATA-1128] Add encoding for non-dictio...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1265#discussion_r137490368

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/store/ColumnPageWrapper.java ---
@@ -17,46 +17,183 @@

package org.apache.carbondata.core.datastore.chunk.store;

+import java.nio.ByteBuffer;
+import java.util.BitSet;
+
+import org.apache.carbondata.core.constants.CarbonCommonConstants;
+import org.apache.carbondata.core.datastore.ColumnType;
+import org.apache.carbondata.core.datastore.TableSpec;
import org.apache.carbondata.core.datastore.chunk.DimensionColumnDataChunk;
import org.apache.carbondata.core.datastore.page.ColumnPage;
+import org.apache.carbondata.core.keygenerator.directdictionary.DirectDictionaryGenerator;
+import org.apache.carbondata.core.metadata.datatype.DataType;
import org.apache.carbondata.core.scan.executor.infos.KeyStructureInfo;
+import org.apache.carbondata.core.scan.result.vector.CarbonColumnVector;
import org.apache.carbondata.core.scan.result.vector.ColumnVectorInfo;
+import org.apache.carbondata.core.util.ByteUtil;

+/**
+ * ColumnPage wrapper for dimension column reader
+ */
public class ColumnPageWrapper implements DimensionColumnDataChunk {

+ private TableSpec.ColumnSpec columnSpec;
private ColumnPage columnPage;
+ private int columnValueSize;

- public ColumnPageWrapper(ColumnPage columnPage) {
+ public ColumnPageWrapper(ColumnPage columnPage, int columnValueSize) {
this.columnPage = columnPage;
+ this.columnValueSize = columnValueSize;
+ this.columnSpec = columnPage.getColumnSpec();
}

@Override
public int fillChunkData(byte[] data, int offset, int columnIndex,
KeyStructureInfo restructuringInfo) {
- throw new UnsupportedOperationException("internal error");
+ int surrogate = columnPage.getInt(columnIndex);
+ ByteBuffer buffer = ByteBuffer.wrap(data);
+ buffer.putInt(offset, surrogate);
+ return columnValueSize;
}

@Override
public int fillConvertedChunkData(int rowId, int columnIndex, int[] row,
KeyStructureInfo restructuringInfo) {
- throw new UnsupportedOperationException("internal error");
+ row[columnIndex] = columnPage.getInt(rowId);
+ return columnIndex + 1;
}

@Override
public int fillConvertedChunkData(ColumnVectorInfo[] vectorInfo, int column,
KeyStructureInfo restructuringInfo) {
- throw new UnsupportedOperationException("internal error");
+ // fill the vector with data in column page
+ ColumnVectorInfo columnVectorInfo = vectorInfo[column];
+ CarbonColumnVector vector = columnVectorInfo.vector;
+ fillData(null, columnVectorInfo, vector);
+ return column + 1;
}

@Override
public int fillConvertedChunkData(int[] rowMapping, ColumnVectorInfo[] vectorInfo, int column,
KeyStructureInfo restructuringInfo) {
- throw new UnsupportedOperationException("internal error");
+ ColumnVectorInfo columnVectorInfo = vectorInfo[column];
+ CarbonColumnVector vector = columnVectorInfo.vector;
+ fillData(rowMapping, columnVectorInfo, vector);
+ return column + 1;
+ }
+
+ /**
+ * Copy columnar data from internal column page to `vector`, row information described by
+ * `columnVectorInfo`
+ */
+ private void fillData(int[] rowMapping, ColumnVectorInfo columnVectorInfo,
--- End diff --

Since all the code is handled here I think we can remove `FixedLengthDimensionDataChunk` and `VariableLengthDimensionDataChunk` and all related class hierarchy right.

---

qiuchenjian-2

[GitHub] carbondata pull request #1265: [CARBONDATA-1128] Add encoding for non-dictio...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1265#discussion_r137490913

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RowLevelRangeLessThanEqualFilterExecuterImpl.java ---
@@ -262,10 +259,7 @@ private BitSet getFilteredIndexes(DimensionColumnDataChunk dimensionColumnDataCh
int numerOfRows) {
byte[] defaultValue = null;
if (dimColEvaluatorInfoList.get(0).getDimension().hasEncoding(Encoding.DIRECT_DICTIONARY)) {
- DirectDictionaryGenerator directDictionaryGenerator = DirectDictionaryKeyGeneratorFactory
- .getDirectDictionaryGenerator(
- dimColEvaluatorInfoList.get(0).getDimension().getDataType());
- int key = directDictionaryGenerator.generateDirectSurrogateKey(null) + 1;
+ int key = 0;
--- End diff --

And also the null value for dictionary is 1 not 0

---

qiuchenjian-2

[GitHub] carbondata pull request #1265: [CARBONDATA-1128] Add encoding for non-dictio...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1265#discussion_r137805419

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RowLevelRangeGrtThanFiterExecuterImpl.java ---
@@ -366,7 +364,7 @@ private BitSet setFilterdIndexToBitSet(DimensionColumnDataChunk dimensionColumnD
BitSet bitSet = new BitSet(numerOfRows);
byte[][] filterValues = this.filterRangeValues;
// binary search can only be applied if column is sorted
- if (isNaturalSorted) {
+ if (isNaturalSorted && !dimensionColumnDataChunk.isNoDicitionaryColumn()) {
--- End diff --

In this PR, plain value (no dictionary) is using adaptive encoding, it is not sorted. I think it will make store size smaller but for some filter query it will be slower. Any suggestion?

---

qiuchenjian-2

[GitHub] carbondata pull request #1265: [CARBONDATA-1128] Add encoding for non-dictio...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1265#discussion_r137806977

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/partition/EqualToFilterImpl.java ---
@@ -50,7 +50,7 @@ public EqualToFilterImpl(EqualToExpression equalTo, PartitionInfo partitionInfo)
literal.getLiteralExpValue().toString(),
partitionInfo.getColumnSchemaList().get(0).getDataType());
if (PartitionType.RANGE == partitionInfo.getPartitionType() && value instanceof String) {
- value = ByteUtil.toBytes((String)value);
+ value = ByteUtil.toBytesForPlainValue((String)value);
--- End diff --

`ByteUtil.toBytes` will do `val = val ^ Integer.MIN_VALUE;`
So I think it can not be used as normal utility function to convert int to byte[4]

---

qiuchenjian-2

[GitHub] carbondata pull request #1265: [CARBONDATA-1128] Add encoding for non-dictio...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1265#discussion_r137809952

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/store/ColumnPageWrapper.java ---
@@ -17,46 +17,183 @@

package org.apache.carbondata.core.datastore.chunk.store;

+import java.nio.ByteBuffer;
+import java.util.BitSet;
+
+import org.apache.carbondata.core.constants.CarbonCommonConstants;
+import org.apache.carbondata.core.datastore.ColumnType;
+import org.apache.carbondata.core.datastore.TableSpec;
import org.apache.carbondata.core.datastore.chunk.DimensionColumnDataChunk;
import org.apache.carbondata.core.datastore.page.ColumnPage;
+import org.apache.carbondata.core.keygenerator.directdictionary.DirectDictionaryGenerator;
+import org.apache.carbondata.core.metadata.datatype.DataType;
import org.apache.carbondata.core.scan.executor.infos.KeyStructureInfo;
+import org.apache.carbondata.core.scan.result.vector.CarbonColumnVector;
import org.apache.carbondata.core.scan.result.vector.ColumnVectorInfo;
+import org.apache.carbondata.core.util.ByteUtil;

+/**
+ * ColumnPage wrapper for dimension column reader
+ */
public class ColumnPageWrapper implements DimensionColumnDataChunk {

+ private TableSpec.ColumnSpec columnSpec;
private ColumnPage columnPage;
+ private int columnValueSize;

- public ColumnPageWrapper(ColumnPage columnPage) {
+ public ColumnPageWrapper(ColumnPage columnPage, int columnValueSize) {
this.columnPage = columnPage;
+ this.columnValueSize = columnValueSize;
+ this.columnSpec = columnPage.getColumnSpec();
}

@Override
public int fillChunkData(byte[] data, int offset, int columnIndex,
KeyStructureInfo restructuringInfo) {
- throw new UnsupportedOperationException("internal error");
+ int surrogate = columnPage.getInt(columnIndex);
+ ByteBuffer buffer = ByteBuffer.wrap(data);
+ buffer.putInt(offset, surrogate);
+ return columnValueSize;
}

@Override
public int fillConvertedChunkData(int rowId, int columnIndex, int[] row,
KeyStructureInfo restructuringInfo) {
- throw new UnsupportedOperationException("internal error");
+ row[columnIndex] = columnPage.getInt(rowId);
+ return columnIndex + 1;
}

@Override
public int fillConvertedChunkData(ColumnVectorInfo[] vectorInfo, int column,
KeyStructureInfo restructuringInfo) {
- throw new UnsupportedOperationException("internal error");
+ // fill the vector with data in column page
+ ColumnVectorInfo columnVectorInfo = vectorInfo[column];
+ CarbonColumnVector vector = columnVectorInfo.vector;
+ fillData(null, columnVectorInfo, vector);
+ return column + 1;
}

@Override
public int fillConvertedChunkData(int[] rowMapping, ColumnVectorInfo[] vectorInfo, int column,
KeyStructureInfo restructuringInfo) {
- throw new UnsupportedOperationException("internal error");
+ ColumnVectorInfo columnVectorInfo = vectorInfo[column];
+ CarbonColumnVector vector = columnVectorInfo.vector;
+ fillData(rowMapping, columnVectorInfo, vector);
+ return column + 1;
+ }
+
+ /**
+ * Copy columnar data from internal column page to `vector`, row information described by
+ * `columnVectorInfo`
+ */
+ private void fillData(int[] rowMapping, ColumnVectorInfo columnVectorInfo,
--- End diff --

For dictionary column, it still using `IndexStorageCodec` to do encoding and using `decodeDimensionLegacy` to decode, so `FixedLengthDimensionDataChunk` is still required. `VariableLengthDimensionDataChunk` is also required for backward compatiblity to read old data file

---

qiuchenjian-2

[GitHub] carbondata issue #1265: [CARBONDATA-1128] Add encoding for non-dictionary di...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1265

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/627/

---

qiuchenjian-2

[GitHub] carbondata issue #1265: [CARBONDATA-1128] Add encoding for non-dictionary di...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1265

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/24/

---

qiuchenjian-2

[GitHub] carbondata issue #1265: [CARBONDATA-1128] Add encoding for non-dictionary di...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1265

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/1205/

---

qiuchenjian-2

[GitHub] carbondata issue #1265: [CARBONDATA-1128] Add encoding for non-dictionary di...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1265

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/574/

---

qiuchenjian-2

[GitHub] carbondata issue #1265: [CARBONDATA-1128] Add encoding for non-dictionary di...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1265

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/3078/

---

qiuchenjian-2

[GitHub] carbondata issue #1265: [CARBONDATA-1128] Add encoding for non-dictionary di...

In reply to this post by qiuchenjian-2

qiuchenjian-2

[GitHub] carbondata issue #1265: [CARBONDATA-1128] Add encoding for non-dictionary di...

In reply to this post by qiuchenjian-2

qiuchenjian-2

[GitHub] carbondata issue #1265: [CARBONDATA-1128] Add encoding for non-dictionary di...

In reply to this post by qiuchenjian-2

123