Grammar about supporting string longer than 32000 characters

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Grammar about supporting string longer than 32000 characters

xuchuanyin


Hi, community:

I'm implementing supporting string longer than 32000 characters in carbondata and have a question about the grammar of this feature. Here I'd like to explain it and want to receive your feedbacks.

DESCRIPTION:

In previous implementation, carbondata internally uses a short to store the length of a string (char,varchar,string are all treated as string) value. It does save memory and space for the regular use case by using short instead of int, but will cause problem when the length of string exceeds the range of short.

In order to support the above case and save memory/space if possible, we want to distinguish which string columns are SHORT and which are LONG. (SHORT: lengthOfValue<32000, LONG: lengthOfValue>=32000)


SOLUTION:

Solution1. Add a TableProperty in CreateTableStatement
The grammar looks like below:
For SQL case:
```
    sql(
      s"""
         | CREATE TABLE if not exists $longStringTable(
         | id INT, name STRING, description STRING, address STRING
         | ) STORED BY 'carbondata'
         | TBLPROPERTIES('LONG_STRING_COLUMNS'='description', 'SORT_COLUMNS'='name')
         |""".stripMargin)
```

For DataFrame case:
```
df.write.format("carbondata")
  .option("tableName", "longStringTable")
  .option("SORT_COLUMNS", 'name')
  .option('LONG_STRING_COLUMNS', 'description')
  .mode(SaveMode.Overwrite)
  .save()
```

In the above example, 'description' is LONG while 'address' is SHORT.
We add a property called 'LONG_STRING_COLUMNS' in the statement, its value contains all the columns that are LONG.
Note: This solution is available as PR2252.


Solution2. Add a new datatype
The grammar looks like below:
For SQL case:
```
    sql(
      s"""
         | CREATE TABLE if not exists $longStringTable(
         | id INT, name STRING, description TEXT, address STRING
         | ) STORED BY 'carbondata'
         | TBLPROPERTIES('SORT_COLUMNS'='name')
         |""".stripMargin)
```

For DataFrame case:
```
df.write.format("carbondata")
  .option("tableName", "longStringTable")
  .option("SORT_COLUMNS", 'name')
  // SparkSQL does not have TEXT datatype, how to specify it?
  .mode(SaveMode.Overwrite)
  .save()
```

In the above example, 'description' is LONG while 'address' is SHORT.
We add a new datatype called 'TEXT' and treat 'description' as TEXT in the statement.

I'd prefer to solution1 since it is compatible with hive/sparksql while solution2 has problem to migrate with them.

END



At last, how do you think the solutions provided above?

Please give your comments, moreover you can provide other solutions here to improve it.
Reply | Threaded
Open this post in threaded view
|

Re: Grammar about supporting string longer than 32000 characters

ravipesala
Hi,

I agree with option 2 but not new datatype use varchar(size).
There are more optimizations we can do with varchar(size) datatype like
1. if the size is smaller (less than 8 bytes)  then we can write in fixed
length encoder instead of  LV encode it can save a lot of space and memory.
2. If the size is less than 32000 then use current our string datatype.
3. If size is more than 32000 then encode using int as a length in LV
format.

In spark dataframe support we can by default use string as datatype.

Even if we take option 1 also carbon should internally has new datatype
otherwise code will not be good as you need to check this property many
places so ideally new datatype can lead to a new set of implementations and
easier to code and maintain.



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: Grammar about supporting string longer than 32000 characters

sraghunandan
+1 for solution 2
On Wed, 2 May 2018 at 9:09 PM, ravipesala <[hidden email]> wrote:

> Hi,
>
> I agree with option 2 but not new datatype use varchar(size).
> There are more optimizations we can do with varchar(size) datatype like
> 1. if the size is smaller (less than 8 bytes)  then we can write in fixed
> length encoder instead of  LV encode it can save a lot of space and memory.
> 2. If the size is less than 32000 then use current our string datatype.
> 3. If size is more than 32000 then encode using int as a length in LV
> format.
>
> In spark dataframe support we can by default use string as datatype.
>
> Even if we take option 1 also carbon should internally has new datatype
> otherwise code will not be good as you need to check this property many
> places so ideally new datatype can lead to a new set of implementations and
> easier to code and maintain.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
Reply | Threaded
Open this post in threaded view
|

Re: Grammar about supporting string longer than 32000 characters

xuchuanyin
In traditional RDBMS, varchar(N) means the value contains at least N characters, at the DBMS will truncate the value if its length is longer than N.

Will we implement like this too? Truncate the string value to N if its length is longer than N?
Reply | Threaded
Open this post in threaded view
|

Re: Grammar about supporting string longer than 32000 characters

xm_zzc
+1 for solution 2
I think it's ok to truncate the string value to N if its length is longer
than N, it's acceptable.



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: Grammar about supporting string longer than 32000 characters

xuchuanyin
Spark dataframe does not have char/varchar datatype, it only has string type.

Suppose we are saving spark dataframe to carbonata, if the length of value is longer than 32000 and its datatype is string, how will carbondata know to store it as longstring?
In my opinion, a property to indicate which column is longstring is inevitable in the case.
Reply | Threaded
Open this post in threaded view
|

Re: Grammar about supporting string longer than 32000 characters

ravipesala
In case of dataframe we can take the varchar(max) as default.



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/