Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] carbondata pull request #2252: WIP: Support string longer than 32000 charact...

Classic

List

Threaded

80 messages Options

1234

qiuchenjian-2

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2252

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4859/

---

qiuchenjian-2

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2252

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5027/

---

qiuchenjian-2

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/2252

To eliminate if-else on longString as much as possible,
+ For DataMapSchemaType, I changed VARIABLE to VARIABLE_INT and VARIABLE_SHORT, they are used for BlockletDataMap;
+ For DimensionStoreType, I changed VARIABLELENGTH to VARIABLE_INT_LENGTH and VARIABLE_SHORT_LENGTH, they are used for encoding/decoding dimensions;
+ For ColumnPageStatCollector, I changed LVStringStatCollector to LVShortStringStatCollector and LVLongStringStatCollector, they are used for column statistics;

While deep into the code, I found that there is no need to add an internal datatype such as TEXT.
+ The dimensions all are considered as String, we only need a boolean array to indicate whether it is long string. We don't need an array to indicate all the datatype of the dimensions.
+ The procedure for String and Text are nearly the same. A boolean (array) and proper abstraction is enough.

---

qiuchenjian-2

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2252

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5039/

---

qiuchenjian-2

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2252

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4867/

---

qiuchenjian-2

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2252

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6026/

---

qiuchenjian-2

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2252#discussion_r189911364

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/DimensionRawColumnChunk.java ---
@@ -92,8 +93,10 @@ public DimensionColumnPage decodeColumnPage(int pageNumber) {
* @param index
* @return
*/
- public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index) {
+ public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index,
+ boolean isLongStringColumn) {
--- End diff --

why this is required???...as mentioned in earlier comment...reader does not know which type of data it is reading

---

qiuchenjian-2

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2252#discussion_r189912968

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v1/CompressedDimensionChunkFileBasedReaderV1.java ---
@@ -99,6 +99,7 @@ public CompressedDimensionChunkFileBasedReaderV1(final BlockletInfo blockletInfo

@Override public DimensionColumnPage decodeColumnPage(
DimensionRawColumnChunk dimensionRawColumnChunk, int pageNumber) throws IOException {
+ boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
--- End diff --

In V1 case it will be always false...as new encoder type will be only supported for V3 format

---

qiuchenjian-2

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2252#discussion_r189913075

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v2/CompressedDimensionChunkFileBasedReaderV2.java ---
@@ -121,6 +121,7 @@ public DimensionColumnPage decodeColumnPage(
int[] invertedIndexesReverse = new int[0];
int[] rlePage = null;
DataChunk2 dimensionColumnChunk = null;
+ boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
--- End diff --

In V2 case it will be always false...as new encoder type will be only supported for V3 format

---

qiuchenjian-2

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on the issue:

https://github.com/apache/carbondata/pull/2252

@xuchuanyin I think some of the changes are not required ...like method added for converting to LV format..If u check direct compressor it is already present ..you can use the same.
In DimensionRawColumnChunk no need to pass any Boolean based on encoder we can decide which type of dimension store object needs to be created
Changing the existing store chunk implementation is also not required ...add a child class if possible or add complete new implementation for storing int based LV format...
Please check ColumnPage.compress and ColumnPage.decompress for you reference(LV)
V1,V2 reader no need to change anything as its old format code and user will not able to load the data in this format.
Decide based on encoder. while writing add new encoder "TEXT" and while reading use same encoder for creating DimensionDataChunkStore object

---

qiuchenjian-2

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2252#discussion_r190103691

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/DimensionRawColumnChunk.java ---
@@ -92,8 +93,10 @@ public DimensionColumnPage decodeColumnPage(int pageNumber) {
* @param index
* @return
*/
- public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index) {
+ public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index,
+ boolean isLongStringColumn) {
--- End diff --

What does `reader` mean?

---

qiuchenjian-2

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2252#discussion_r190103851

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v1/CompressedDimensionChunkFileBasedReaderV1.java ---
@@ -99,6 +99,7 @@ public CompressedDimensionChunkFileBasedReaderV1(final BlockletInfo blockletInfo

@Override public DimensionColumnPage decodeColumnPage(
DimensionRawColumnChunk dimensionRawColumnChunk, int pageNumber) throws IOException {
+ boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
--- End diff --

Should we support longStringColumn in V1?

Carbondata now only support writing V1 format.

---

qiuchenjian-2

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2252#discussion_r190103945

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v2/CompressedDimensionChunkFileBasedReaderV2.java ---
@@ -121,6 +121,7 @@ public DimensionColumnPage decodeColumnPage(
int[] invertedIndexesReverse = new int[0];
int[] rlePage = null;
DataChunk2 dimensionColumnChunk = null;
+ boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
--- End diff --

The same as above. V2 does not support longStringColumn, so it is fine that the `isLongStringColumn` will always be false.

---

qiuchenjian-2

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2252#discussion_r190104257

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v1/CompressedDimensionChunkFileBasedReaderV1.java ---
@@ -99,6 +99,7 @@ public CompressedDimensionChunkFileBasedReaderV1(final BlockletInfo blockletInfo

@Override public DimensionColumnPage decodeColumnPage(
DimensionRawColumnChunk dimensionRawColumnChunk, int pageNumber) throws IOException {
+ boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
--- End diff --

I know, you mean just leave it false?

---

qiuchenjian-2

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2252#discussion_r190104281

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v2/CompressedDimensionChunkFileBasedReaderV2.java ---
@@ -121,6 +121,7 @@ public DimensionColumnPage decodeColumnPage(
int[] invertedIndexesReverse = new int[0];
int[] rlePage = null;
DataChunk2 dimensionColumnChunk = null;
+ boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
--- End diff --

The same as above

---

qiuchenjian-2

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/2252

@kumarvishal09 You suggested to add a new TEXT encoder and using this encoder while writing&reading.
But currently in CarbonData, all dimensions are considered as string, there is no specified encoder for it.
In the previous description,
```
For DimensionStoreType, I changed VARIABLELENGTH to VARIABLE_INT_LENGTH and VARIABLE_SHORT_LENGTH, they are used for encoding/decoding dimensions
```
We can consider the DimensionStoreType as an encoder. It had two valuesï¼ FIXEDLENGTH and VARIABLELENGTH and I extended it to treeï¼FIXED_LENGTHãVARIABLE_SHORT_LENGTHãVARIABLE_INT_LENGTH.

Does this meet your suggestion?

---

qiuchenjian-2

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2252#discussion_r190150092

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v1/CompressedDimensionChunkFileBasedReaderV1.java ---
@@ -99,6 +99,7 @@ public CompressedDimensionChunkFileBasedReaderV1(final BlockletInfo blockletInfo

@Override public DimensionColumnPage decodeColumnPage(
DimensionRawColumnChunk dimensionRawColumnChunk, int pageNumber) throws IOException {
+ boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
--- End diff --

No we support V3 while writing the data...v1/v2 is supported only for reading to support backward compatibility

---

qiuchenjian-2

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2252#discussion_r190160809

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/block/SegmentProperties.java ---
@@ -849,7 +852,41 @@ public int getNumberOfDictSortColumns() {
return this.numberOfSortColumns - this.numberOfNoDictSortColumns;
}

+ public int getNumberOfLongStringColumns() {
+ return numberOfLongStringColumns;
+ }
+
public int getLastDimensionColOrdinal() {
return lastDimensionColOrdinal;
}
+
+ @Override public String toString() {
--- End diff --

Why this method is required ??...can u add some comment

---

qiuchenjian-2

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2252#discussion_r190161224

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/DimensionRawColumnChunk.java ---
@@ -32,13 +32,14 @@
* by specifying page number.
*/
public class DimensionRawColumnChunk extends AbstractRawColumnChunk {
-
private DimensionColumnPage[] dataChunks;

private DimensionColumnChunkReader chunkReader;

private FileReader fileReader;
+ private boolean isLongStringColumn;
--- End diff --

how u are deciding it is isLongStringColumn or not ??

---

qiuchenjian-2

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on the issue:

https://github.com/apache/carbondata/pull/2252

@xuchuanyin how u are deciding isLongStringColumn is true or false in query??

---

1234