[GitHub] carbondata pull request #2252: WIP: Support string longer than 32000 charact...

classic Classic list List threaded Threaded
80 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4859/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
 
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5027/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2252
 
    To eliminate if-else on longString as much as possible,
    + For DataMapSchemaType, I changed VARIABLE to VARIABLE_INT and VARIABLE_SHORT, they are used for BlockletDataMap;
    + For DimensionStoreType, I changed VARIABLELENGTH to VARIABLE_INT_LENGTH and VARIABLE_SHORT_LENGTH, they are used for encoding/decoding dimensions;
    + For ColumnPageStatCollector, I changed LVStringStatCollector to LVShortStringStatCollector and LVLongStringStatCollector, they are used for column statistics;
   
    While deep into the code, I found that there is no need to add an internal datatype such as TEXT.
    + The dimensions all are considered as String, we only need a boolean array to indicate whether it is long string. We don't need an array to indicate all the datatype of the dimensions.
    + The procedure for String and Text are nearly the same. A boolean (array) and proper abstraction is enough.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
 
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5039/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4867/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6026/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r189911364
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/DimensionRawColumnChunk.java ---
    @@ -92,8 +93,10 @@ public DimensionColumnPage decodeColumnPage(int pageNumber) {
        * @param index
        * @return
        */
    -  public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index) {
    +  public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index,
    +      boolean isLongStringColumn) {
    --- End diff --
   
    why this is required???...as mentioned in earlier comment...reader does not know which type of data it is reading


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r189912968
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v1/CompressedDimensionChunkFileBasedReaderV1.java ---
    @@ -99,6 +99,7 @@ public CompressedDimensionChunkFileBasedReaderV1(final BlockletInfo blockletInfo
     
       @Override public DimensionColumnPage decodeColumnPage(
           DimensionRawColumnChunk dimensionRawColumnChunk, int pageNumber) throws IOException {
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
   
    In V1 case it will be always false...as new encoder type will be only supported for V3 format


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r189913075
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v2/CompressedDimensionChunkFileBasedReaderV2.java ---
    @@ -121,6 +121,7 @@ public DimensionColumnPage decodeColumnPage(
         int[] invertedIndexesReverse = new int[0];
         int[] rlePage = null;
         DataChunk2 dimensionColumnChunk = null;
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
   
    In V2 case it will be always false...as new encoder type will be only supported for V3 format


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2252
 
    @xuchuanyin I think some of the changes are not required ...like method added for converting to LV format..If u check direct compressor  it is already present ..you can use the same.
    In DimensionRawColumnChunk no need to pass any Boolean based on encoder we can decide which type of dimension store object needs to be created
    Changing the existing store chunk implementation is also not required ...add a child class if possible or add complete new implementation for storing int based LV format...
    Please check ColumnPage.compress and ColumnPage.decompress for you reference(LV)
    V1,V2 reader no need to change anything as its old format code and user will not able to load the data in this format.
    Decide based on encoder. while writing add new encoder "TEXT" and while reading use same encoder for creating DimensionDataChunkStore object
   
     


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190103691
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/DimensionRawColumnChunk.java ---
    @@ -92,8 +93,10 @@ public DimensionColumnPage decodeColumnPage(int pageNumber) {
        * @param index
        * @return
        */
    -  public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index) {
    +  public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index,
    +      boolean isLongStringColumn) {
    --- End diff --
   
    What does `reader` mean?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190103851
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v1/CompressedDimensionChunkFileBasedReaderV1.java ---
    @@ -99,6 +99,7 @@ public CompressedDimensionChunkFileBasedReaderV1(final BlockletInfo blockletInfo
     
       @Override public DimensionColumnPage decodeColumnPage(
           DimensionRawColumnChunk dimensionRawColumnChunk, int pageNumber) throws IOException {
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
   
    Should we support longStringColumn in V1?
   
    Carbondata now only support writing V1 format.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190103945
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v2/CompressedDimensionChunkFileBasedReaderV2.java ---
    @@ -121,6 +121,7 @@ public DimensionColumnPage decodeColumnPage(
         int[] invertedIndexesReverse = new int[0];
         int[] rlePage = null;
         DataChunk2 dimensionColumnChunk = null;
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
   
    The same as above. V2 does not support longStringColumn, so it is fine that the `isLongStringColumn` will always be false.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190104257
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v1/CompressedDimensionChunkFileBasedReaderV1.java ---
    @@ -99,6 +99,7 @@ public CompressedDimensionChunkFileBasedReaderV1(final BlockletInfo blockletInfo
     
       @Override public DimensionColumnPage decodeColumnPage(
           DimensionRawColumnChunk dimensionRawColumnChunk, int pageNumber) throws IOException {
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
   
    I know, you mean just leave it false?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190104281
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v2/CompressedDimensionChunkFileBasedReaderV2.java ---
    @@ -121,6 +121,7 @@ public DimensionColumnPage decodeColumnPage(
         int[] invertedIndexesReverse = new int[0];
         int[] rlePage = null;
         DataChunk2 dimensionColumnChunk = null;
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
   
    The same as above


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2252
 
    @kumarvishal09 You suggested to add a new TEXT encoder and using this encoder while writing&reading.
    But currently in CarbonData, all dimensions are considered as string, there is no specified encoder for it.
    In the previous description,
    ```
    For DimensionStoreType, I changed VARIABLELENGTH to VARIABLE_INT_LENGTH and VARIABLE_SHORT_LENGTH, they are used for encoding/decoding dimensions
    ```
    We can consider the DimensionStoreType as an encoder. It had two values: FIXEDLENGTH and VARIABLELENGTH and I extended it to tree:FIXED_LENGTH、VARIABLE_SHORT_LENGTH、VARIABLE_INT_LENGTH.
   
    Does this meet your suggestion?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190150092
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v1/CompressedDimensionChunkFileBasedReaderV1.java ---
    @@ -99,6 +99,7 @@ public CompressedDimensionChunkFileBasedReaderV1(final BlockletInfo blockletInfo
     
       @Override public DimensionColumnPage decodeColumnPage(
           DimensionRawColumnChunk dimensionRawColumnChunk, int pageNumber) throws IOException {
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
   
    No we support V3 while writing the data...v1/v2 is supported only for reading to support backward compatibility


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190160809
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/block/SegmentProperties.java ---
    @@ -849,7 +852,41 @@ public int getNumberOfDictSortColumns() {
         return this.numberOfSortColumns - this.numberOfNoDictSortColumns;
       }
     
    +  public int getNumberOfLongStringColumns() {
    +    return numberOfLongStringColumns;
    +  }
    +
       public int getLastDimensionColOrdinal() {
         return lastDimensionColOrdinal;
       }
    +
    +  @Override public String toString() {
    --- End diff --
   
    Why this method is required ??...can u add some comment


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190161224
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/DimensionRawColumnChunk.java ---
    @@ -32,13 +32,14 @@
      *  by specifying page number.
      */
     public class DimensionRawColumnChunk extends AbstractRawColumnChunk {
    -
       private DimensionColumnPage[] dataChunks;
     
       private DimensionColumnChunkReader chunkReader;
     
       private FileReader fileReader;
    +  private boolean isLongStringColumn;
    --- End diff --
   
    how u are deciding it is isLongStringColumn or not ??


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2252
 
    @xuchuanyin how u are deciding isLongStringColumn is true or false in query??


---
1234