Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2252#discussion_r190220126

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/block/SegmentProperties.java ---

@@ -849,7 +852,41 @@ public int getNumberOfDictSortColumns() {
     return this.numberOfSortColumns - this.numberOfNoDictSortColumns;
   }

+  public int getNumberOfLongStringColumns() {
+    return numberOfLongStringColumns;
+  }
+
   public int getLastDimensionColOrdinal() {
     return lastDimensionColOrdinal;
   }
+
+  @Override public String toString() {

--- End diff --

It's not required. I used it for debug output. Will remove it.

---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2252#discussion_r190222169

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/result/BlockletScannedResult.java ---

@@ -369,6 +379,9 @@ public void fillDataChunks() {
     long startTime = System.currentTimeMillis();
     for (int i = 0; i < dimensionColumnPages.length; i++) {
       if (dimensionColumnPages[i][pageCounter] == null && dimRawColumnChunks[i] != null) {
+        // the long string columns is at the end

--- End diff --

@kumarvishal09 During a query we can get the dimensions that the query requires, and we also know how many of them are longStringColumns. While writing dimension columns, we write the normal short-string columns first and the longStringColumns last. So suppose a query uses `n` dimensions containing `m` longStringColumns: the first `n-m` columns will be normal short-string columns and the last `m` columns will be longStringColumns. This line of code can be found below.

---
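To make the ordering rule above concrete, here is a minimal standalone sketch; the method and parameter names are illustrative and are not the actual BlockletScannedResult fields:

```java
public class DimensionOrdering {
  // Illustrative helper, not CarbonData code: given that the writer puts
  // all long-string columns last, the dimension at `ordinal` is a
  // long-string column exactly when ordinal >= n - m.
  static boolean isLongStringColumn(int ordinal, int totalDimensions,
      int longStringDimensions) {
    // the first n - m projected columns are normal short-string columns,
    // the last m are long-string columns
    return ordinal >= totalDimensions - longStringDimensions;
  }

  public static void main(String[] args) {
    // n = 5 projected dimensions, of which the last m = 2 are long-string
    for (int i = 0; i < 5; i++) {
      System.out.println("dimension " + i + " long-string? "
          + isLongStringColumn(i, 5, 2));
    }
  }
}
```

---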
Github user kumarvishal09 commented on the issue:
https://github.com/apache/carbondata/pull/2252

@xuchuanyin For supporting a String column with more than 32K characters we need the changes below.

**Create**
1. Support the new data type varchar, as already mentioned by @ravipesala.

**Loading**
1. Add a new encoder, and add it for all the varchar columns in DataChunk2 while writing the data to the carbondata file. Please check DataChunk2 in carbondata.thrift; we add an encoder for each column.
2. Use DirectCompressCodec for compressing the data; the code is already present in ColumnPage.getLVFlattenedBytePage().
3. Add a stats collector for computing max/min for varchar columns; implement a new class to handle this.
4. No need to add startkey and endkey for varchar columns.

**Reading**
1. Add a new implementation of DimensionDataChunkStore to store INT LV format data (already handled).
2. Based on the encoder present in DataChunk2, pick the DimensionDataChunkStore implementation, just as we create a fixedLengthStoreChunk object for the dictionary encoder.
3. For a varchar column just uncompress the data and store it in the same LV format in the store (no need to convert LV-formatted data to a 2D byte array). A sketch of this LV layout follows after this message.

**Note:** We need to handle the same changes for complex data types. Please take care of backward compatibility :-)

Please let me know for any clarification. @ravipesala @jackylk, please check if I missed anything.

---
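To make the INT LV (length-value) layout from the loading/reading steps concrete, here is a hedged sketch; it mirrors the idea behind ColumnPage.getLVFlattenedBytePage() rather than its actual implementation, and the class and method names are made up:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LvFlattenExample {
  // Illustrative only: flattens values into LV (length-value) layout,
  // [int length][bytes][int length][bytes]...
  static byte[] flattenLV(String[] values) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    for (String value : values) {
      byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
      out.writeInt(bytes.length); // 4-byte length prefix allows values beyond 32K
      out.write(bytes);
    }
    out.flush();
    return bos.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] page = flattenLV(new String[] {"short value", "another value"});
    // 2 x 4-byte prefixes + 11 + 13 payload bytes = 32
    System.out.println("flattened page size = " + page.length);
  }
}
```

Keeping the page in this LV shape at read time is what point 3 under **Reading** suggests: the reader can walk the length prefixes directly instead of materializing a 2D byte array.

---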
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2252

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6072/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2252

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4913/

---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2252

SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5074/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2252

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6094/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2252

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4933/

---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2252

SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5087/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2252

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6097/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2252

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4935/

---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2252

SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5089/

---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2252

SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5090/

---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2252

SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5091/

---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2252

@kumarvishal09 please review the latest update.
1. Still use 'long_string_columns' instead of a `varchar` data type, to keep it consistent with spark/hive.
2. Internally add a new data type called `text` to represent the long string column.
3. Add a new encoding called DIRECT_COMPRESS_TEXT to the text column page meta.
4. Use an integer (previously a short) to store the length of the bytes content; see the sketch after this message.
5. Add a test covering query/select on the text columns.

---
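Point 4 is what lifts the 32K limit: a short length prefix cannot represent more than Short.MAX_VALUE (32767) bytes. A minimal, standalone illustration using only the JDK:

```java
import java.nio.ByteBuffer;

public class LengthPrefixExample {
  public static void main(String[] args) {
    byte[] value = new byte[40_000]; // a value longer than 32K bytes

    // A short length prefix tops out at Short.MAX_VALUE (32767);
    // casting 40000 to short silently overflows to a negative number.
    System.out.println("length as short: " + (short) 40_000); // prints -25536

    // An int length prefix stores the length losslessly.
    ByteBuffer buffer = ByteBuffer.allocate(4 + value.length);
    buffer.putInt(value.length);
    buffer.put(value);
    buffer.flip();
    System.out.println("length as int: " + buffer.getInt()); // prints 40000
  }
}
```

---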
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2252

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6240/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2252

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5078/

---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2252

SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5210/

---
Github user kumarvishal09 commented on the issue:
https://github.com/apache/carbondata/pull/2252

@xuchuanyin
> 1. still use 'long_string_columns' instead of varchar datatype to make it consistent with spark/hive

Are you facing any problem with varchar?

> 2. use an integer (previously short) to store the length of bytes content.

Only for the text data type?

---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2252

@kumarvishal09
1. There isn't a problem. I've discussed this with Jacky and Ravindra, and they agreed that the user can specify the longStringColumn through the 'long_string_columns' property; they also agreed that we can provide `varchar` for the longStringColumn. In this initial implementation I want to keep it simple. Using `varchar` means more things to deal with: a dataframe only has StringType, so I would also need to consider writing a DF to CarbonData. Besides, per the Hive documentation, Hive truncates varchar/char to the specified length, while Spark handles varchar as String. In a word, if we use varchar, more things need to be considered.
2. Yeah, only for the text data type for now.

@kumarvishal09 As an initial implementation, I think it's already easy for users to use. What do you think?

---
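Assuming the property-based approach described above, usage would look roughly like the sketch below. The table and column names are made up, and the exact DDL (the STORED BY clause, property casing, and whether a Carbon-specific session is required) should be verified against the CarbonData documentation for the release that merged this PR:

```java
import org.apache.spark.sql.SparkSession;

public class LongStringColumnsExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("long-string-columns-example")
        .getOrCreate();

    // Hypothetical table/column names. The column stays a plain STRING in
    // SQL; the table property is what marks it as a long string column,
    // so no new SQL data type is exposed to the user.
    spark.sql("CREATE TABLE pages (id INT, content STRING) "
        + "STORED BY 'carbondata' "
        + "TBLPROPERTIES ('LONG_STRING_COLUMNS'='content')");
  }
}
```

---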