[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

classic Classic list List threaded Threaded
43 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

qiuchenjian-2
GitHub user xuchuanyin opened a pull request:

    https://github.com/apache/carbondata/pull/2379

    [CARBONDATA-2420][32K] Support string longer than 32000 characters

    Add a property in creating table 'long_string_columns' to support string columns that will contains more than 32000 characters.
    Inside carbondata, it use an integer instead of short to store the length of bytes content.
   
    Internally in Carbondata,
    1. add a new datatype called `varchar` to represent the long string column
    2. add a new encoding called `DIRECT_COMPRESS_VARCHAR` to the archer column page meta
    3. use an integer (previously short) to store the length of bytes content.
    4. add 2GB constraint for one column page
   
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [x] Any interfaces changed?
     `Only internal interfaces have been changed`
     - [x] Any backward compatibility impacted?
     `NO`
     - [x] Document update required?
    `YES`
     - [x] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
    `Added tests`
            - How it is tested? Please attach test report.
    `Tested in local machine`
            - Is it a performance related change? Please attach the performance test report.
    `NO`
            - Any additional information to help reviewers in testing this change.
          `NA`
     - [x] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
    `NA`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata 0602_support_long_string

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2379.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2379
   
----
commit b689d66493521452ff9938415e0d0aa66b56c2c5
Author: xuchuanyin <xuchuanyin@...>
Date:   2018-06-02T07:17:04Z

    Support string longer than 32000 characters
   
    Add a table property 'long_string_columns' in create table DDL that
    indicate those columns will contain more than 32000 characters.
   
    Internally in Carbondata,
    1. add a new datatype called `text` to represent the long string column
    2. add a new encoding called `DIRECT_COMPRESS_TEXT` to the text column
    page meta
    3. Use an integer (previously short) to store the length of bytes
    content.

commit f145c6c60238c400b5db6a6bf2696246b698154a
Author: xuchuanyin <xuchuanyin@...>
Date:   2018-06-05T12:46:26Z

    rename datatype name from text to varchar

commit 4180f8118d1ff90205b0f1567bef2cdfee3a1b62
Author: xuchuanyin <xuchuanyin@...>
Date:   2018-06-12T12:35:58Z

    Add 2GB constraint for one column page

commit 710845b155ed5b7a611a900c70b0d766d80ae48d
Author: xuchuanyin <xuchuanyin@...>
Date:   2018-06-14T12:11:40Z

    update tests

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2379: [CARBONDATA-2420][32K] Support string longer than 32...

qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2379
 
    @kumarvishal09
    Can you review and merge it? The previous discussion can be found in #2252


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2379: [CARBONDATA-2420][32K] Support string longer than 32...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2379
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6366/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2379: [CARBONDATA-2420][32K] Support string longer than 32...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2379
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5204/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2379: [CARBONDATA-2420][32K] Support string longer than 32...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2379
 
    retest this please


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2379: [CARBONDATA-2420][32K] Support string longer than 32...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2379
 
    retest it please


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2379: [CARBONDATA-2420][32K] Support string longer than 32...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2379
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6372/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2379: [CARBONDATA-2420][32K] Support string longer than 32...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2379
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5210/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2379#discussion_r196437393
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
    @@ -1601,6 +1602,8 @@
       // As Short data type is used for storing the length of a column during data processing hence
       // the maximum characters that can be supported should be less than Short max value
       public static final int MAX_CHARS_PER_COLUMN_DEFAULT = 32000;
    +  // todo: use infinity first, will switch later
    +  public static final int MAX_CHARS_PER_COLUMN_INFINITY = -1;
    --- End diff --
   
    why this is -1?? as per previous discussion we can support max 67104 chars


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2379#discussion_r196440793
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/VarLengthColumnPageBase.java ---
    @@ -289,6 +289,12 @@ public void putDouble(int rowId, double value) {
     
       @Override
       public void putBytes(int rowId, byte[] bytes) {
    +    // rowId * 4 represents the length of L in LV
    +    if (bytes.length > (Integer.MAX_VALUE - totalLength - rowId * 4)) {
    --- End diff --
   
    Better to add a check in FieldEncoder to validate if it crosses 67104 bytes for varchar datatype, it should be fail first mechanism, here it will fail while writing which is too late as most of the data transformation is done.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2379#discussion_r196451306
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RawColumnChunkUtil.java ---
    @@ -0,0 +1,65 @@
    +/*
    --- End diff --
   
    Is it related to this PR, if not please raise in separate PR


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2379#discussion_r196454746
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/encoding/DefaultEncodingFactory.java ---
    @@ -103,6 +103,7 @@ private ColumnPageEncoder createEncoderForDimensionLegacy(TableSpec.DimensionSpe
             return new HighCardDictDimensionIndexCodec(
                 dimensionSpec.isInSortColumns(),
    --- End diff --
   
    Instead of passing all the properties of dimensionspec individually, pass dimensionspec it self.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2379#discussion_r196455199
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/encoding/EncodingFactory.java ---
    @@ -71,7 +72,7 @@ public ColumnPageDecoder createDecoder(List<Encoding> encodings, List<ByteBuffer
         byte[] encoderMeta = encoderMetas.get(0).array();
         ByteArrayInputStream stream = new ByteArrayInputStream(encoderMeta);
         DataInputStream in = new DataInputStream(stream);
    -    if (encoding == DIRECT_COMPRESS) {
    +    if (encoding == DIRECT_COMPRESS || encoding == DIRECT_COMPRESS_VARCHAR) {
    --- End diff --
   
    why DIRECT_COMPRESS_VARCHAR is required?? you can use DIRECT_COMPRESS for varchar


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2379#discussion_r196455373
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/encoding/compress/DirectCompressCodec.java ---
    @@ -64,7 +64,7 @@ public ColumnPageDecoder createDecoder(ColumnPageEncoderMeta meta) {
         return new DirectDecompressor(meta);
       }
     
    -  private static class DirectCompressor extends ColumnPageEncoder {
    --- End diff --
   
    Why this change is required ??


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2379#discussion_r196456536
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/statistics/LVLongStringStatsCollector.java ---
    @@ -0,0 +1,50 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.core.datastore.page.statistics;
    +
    +/**
    + * This class is for the columns with varchar data type,
    + * a string type which can hold more than 32000 characters
    + */
    +public class LVLongStringStatsCollector extends LVStringStatsCollector {
    +
    +  public static LVLongStringStatsCollector newInstance() {
    +    return new LVLongStringStatsCollector();
    +  }
    +
    +  private LVLongStringStatsCollector() {
    +
    +  }
    +
    +  @Override
    +  protected byte[] getActualValue(byte[] value) {
    +    byte[] actualValue;
    +    assert (value.length >= 4);
    +    if (value.length == 4) {
    +      assert (value[0] == 0 && value[1] == 0);
    +      actualValue = new byte[0];
    +    } else {
    +      // todo: what does this mean?
    +      // int length = (value[0] << 8) + (value[1] & 0xff);
    --- End diff --
   
    Here first four bytes you have to consider to get the length of the data, please refer code in ShortLength collector


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2379#discussion_r196458442
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/blocklet/BlockletInfo.java ---
    @@ -205,7 +205,7 @@ public void setNumberOfPages(int numberOfPages) {
         output.writeLong(dimensionOffset);
         output.writeLong(measureOffsets);
         int dsize = dimensionChunkOffsets != null ? dimensionChunkOffsets.size() : 0;
    -    output.writeShort(dsize);
    +    output.writeInt(dsize);
    --- End diff --
   
    Why you have changed this.
    This is for number of dimension columns, how it will change for varchar data type ??


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2379#discussion_r196458488
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/blocklet/BlockletInfo.java ---
    @@ -268,7 +268,7 @@ private DataChunk deserializeDataChunk(byte[] bytes) throws IOException {
       @Override public void readFields(DataInput input) throws IOException {
         dimensionOffset = input.readLong();
         measureOffsets = input.readLong();
    -    short dimensionChunkOffsetsSize = input.readShort();
    +    int dimensionChunkOffsetsSize = input.readInt();
    --- End diff --
   
    Same as above


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2379: [CARBONDATA-2420][32K] Support string longer than 32...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2379
 
    @xuchuanyin Please remove all the changes which is not related to this pr


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2379#discussion_r196460046
 
    --- Diff: integration/spark-common/src/main/scala/org/apache/spark/sql/catalyst/CarbonDDLSqlParser.scala ---
    @@ -279,7 +279,7 @@ abstract class CarbonDDLSqlParser extends AbstractCarbonSparkSQLParser {
         fields.zipWithIndex.foreach { case (field, index) =>
           field.schemaOrdinal = index
         }
    -    val (dims, msrs, noDictionaryDims, sortKeyDims) = extractDimAndMsrFields(
    +    val (dims, msrs, noDictionaryDims, sortKeyDims, varcharColumns) = extractDimAndMsrFields(
    --- End diff --
   
    why this varcharcolumns is required as varchar columns is nothing but dimension column of string data type


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2379: [CARBONDATA-2420][32K] Support string longer ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2379#discussion_r196633177
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RawColumnChunkUtil.java ---
    @@ -0,0 +1,65 @@
    +/*
    --- End diff --
   
    OK


---
123