Apache CarbonData Dev Mailing List archive

Grammar about supporting string longer than 32000 characters

Classic

List

Threaded

7 messages Options

xuchuanyin

Grammar about supporting string longer than 32000 characters

Hi, community:

I'm implementing supporting string longer than 32000 characters in carbondata and have a question about the grammar of this feature. Here I'd like to explain it and want to receive your feedbacks.

DESCRIPTION：

In previous implementation, carbondata internally uses a short to store the length of a string (char,varchar,string are all treated as string) value. It does save memory and space for the regular use case by using short instead of int, but will cause problem when the length of string exceeds the range of short.

In order to support the above case and save memory/space if possible, we want to distinguish which string columns are SHORT and which are LONG. (SHORT: lengthOfValue<32000, LONG: lengthOfValue>=32000)

SOLUTION：

Solution1. Add a TableProperty in CreateTableStatement
The grammar looks like below：
For SQL case：
```
sql(
s"""
| CREATE TABLE if not exists $longStringTable(
| id INT, name STRING, description STRING, address STRING
| ) STORED BY 'carbondata'
| TBLPROPERTIES('LONG_STRING_COLUMNS'='description', 'SORT_COLUMNS'='name')
|""".stripMargin)
```

For DataFrame case:
```
df.write.format("carbondata")
.option("tableName", "longStringTable")
.option("SORT_COLUMNS", 'name')
.option('LONG_STRING_COLUMNS', 'description')
.mode(SaveMode.Overwrite)
.save()
```

In the above example, 'description' is LONG while 'address' is SHORT.
We add a property called 'LONG_STRING_COLUMNS' in the statement, its value contains all the columns that are LONG.
Note: This solution is available as PR2252.

Solution2. Add a new datatype
The grammar looks like below:
For SQL case：
```
sql(
s"""
| CREATE TABLE if not exists $longStringTable(
| id INT, name STRING, description TEXT, address STRING
| ) STORED BY 'carbondata'
| TBLPROPERTIES('SORT_COLUMNS'='name')
|""".stripMargin)
```

For DataFrame case:
```
df.write.format("carbondata")
.option("tableName", "longStringTable")
.option("SORT_COLUMNS", 'name')
// SparkSQL does not have TEXT datatype, how to specify it?
.mode(SaveMode.Overwrite)
.save()
```

In the above example, 'description' is LONG while 'address' is SHORT.
We add a new datatype called 'TEXT' and treat 'description' as TEXT in the statement.

I'd prefer to solution1 since it is compatible with hive/sparksql while solution2 has problem to migrate with them.

END

At last, how do you think the solutions provided above?

Please give your comments, moreover you can provide other solutions here to improve it.

ravipesala

Re: Grammar about supporting string longer than 32000 characters

Hi,

I agree with option 2 but not new datatype use varchar(size).
There are more optimizations we can do with varchar(size) datatype like
1. if the size is smaller (less than 8 bytes) then we can write in fixed
length encoder instead of LV encode it can save a lot of space and memory.
2. If the size is less than 32000 then use current our string datatype.
3. If size is more than 32000 then encode using int as a length in LV
format.

In spark dataframe support we can by default use string as datatype.

Even if we take option 1 also carbon should internally has new datatype
otherwise code will not be good as you need to check this property many
places so ideally new datatype can lead to a new set of implementations and
easier to code and maintain.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

sraghunandan

Re: Grammar about supporting string longer than 32000 characters

+1 for solution 2
On Wed, 2 May 2018 at 9:09 PM, ravipesala <[hidden email]> wrote:

> Hi,
>
> I agree with option 2 but not new datatype use varchar(size).
> There are more optimizations we can do with varchar(size) datatype like
> 1. if the size is smaller (less than 8 bytes) then we can write in fixed
> length encoder instead of LV encode it can save a lot of space and memory.
> 2. If the size is less than 32000 then use current our string datatype.
> 3. If size is more than 32000 then encode using int as a length in LV
> format.
>
> In spark dataframe support we can by default use string as datatype.
>
> Even if we take option 1 also carbon should internally has new datatype
> otherwise code will not be good as you need to check this property many
> places so ideally new datatype can lead to a new set of implementations and
> easier to code and maintain.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>

xuchuanyin

Re: Grammar about supporting string longer than 32000 characters

In traditional RDBMS, varchar(N) means the value contains at least N characters, at the DBMS will truncate the value if its length is longer than N.

Will we implement like this too? Truncate the string value to N if its length is longer than N?

xm_zzc

Re: Grammar about supporting string longer than 32000 characters

+1 for solution 2
I think it's ok to truncate the string value to N if its length is longer
than N, it's acceptable.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

xuchuanyin

Re: Grammar about supporting string longer than 32000 characters

Spark dataframe does not have char/varchar datatype, it only has string type.

Suppose we are saving spark dataframe to carbonata, if the length of value is longer than 32000 and its datatype is string, how will carbondata know to store it as longstring?
In my opinion, a property to indicate which column is longstring is inevitable in the case.

ravipesala

Re: Grammar about supporting string longer than 32000 characters

In case of dataframe we can take the varchar(max) as default.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/