[Discussion] Carbondata Store size optimization


[Discussion] Carbondata Store size optimization

kumarvishal09
Hi All,
I am working on the following carbondata store size optimizations to reduce the
size of the carbondata file, which will improve IO performance during queries.

*1. String/Varchar store size optimization*
*Problem:*
Currently, String/Varchar values are stored in LV (length-value) format in the
carbondata file. During a query, the offset (position) of each cell value in a
page must first be computed by walking the interleaved lengths, which hurts
query performance. Storage size is also high, because no encoding can be
applied to the length part while it is stored inline with the data.
*Solution:*
Store the length part separately from the data part and apply adaptive
encoding to the lengths. This will reduce the store size, and offset
calculation during a query will be much faster because only the length part
needs to be scanned. This will improve query performance.
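To make the idea concrete, here is a minimal sketch of a page layout with
lengths stored separately from the concatenated value bytes. This is
illustrative only, not CarbonData's actual page format; the class and field
names are invented for the example. The point is that offsets become a single
prefix-sum pass over a compact integer array, instead of a walk over
interleaved L/V records.

```java
import java.nio.charset.StandardCharsets;

// Sketch: a string page that keeps lengths separate from data.
// The lengths array is a plain int[] here; in practice it is exactly
// the part that adaptive encoding could shrink (e.g. to 1 or 2 bytes).
public class SeparateLengthPage {
    public final int[] lengths;  // one entry per value, stored apart from data
    public final byte[] data;    // concatenated UTF-8 bytes, no inline lengths

    public SeparateLengthPage(String[] values) {
        lengths = new int[values.length];
        byte[][] encoded = new byte[values.length][];
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
            lengths[i] = encoded[i].length;
            total += lengths[i];
        }
        data = new byte[total];
        int pos = 0;
        for (byte[] e : encoded) {
            System.arraycopy(e, 0, data, pos, e.length);
            pos += e.length;
        }
    }

    // Offsets are a prefix sum over lengths: off[i] = off[i-1] + len[i-1].
    public int[] offsets() {
        int[] off = new int[lengths.length];
        for (int i = 1; i < lengths.length; i++) {
            off[i] = off[i - 1] + lengths[i - 1];
        }
        return off;
    }

    public String get(int rowId) {
        int[] off = offsets();
        return new String(data, off[rowId], lengths[rowId], StandardCharsets.UTF_8);
    }
}
```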

*2. Adaptive encoding for Global/Direct/Local dictionary columns*
*Problem:*
Global/Direct/Local dictionary values are stored in binary format and only
snappy compression is applied. Since these dictionary surrogate values are of
Integer data type, they can be adaptively stored using a data type smaller
than Integer.
*Solution:*
Apply adaptive encoding to global/direct dictionary columns to reduce the
store size.
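A minimal sketch of the adaptive idea for surrogate keys, assuming unsigned
surrogate values (the class name and method signatures are hypothetical, not
CarbonData APIs): pick the smallest byte width that can hold the maximum key
in the page, rather than always spending 4 bytes per key.

```java
import java.nio.ByteBuffer;

// Sketch: adaptively choose 1, 2, or 4 bytes per dictionary surrogate key
// based on the maximum key value in the page.
public class AdaptiveDictEncoder {
    // Returns the number of bytes needed per stored key: 1, 2, or 4.
    public static int chooseWidth(int[] keys) {
        int max = 0;
        for (int k : keys) max = Math.max(max, k);
        if (max <= 0xFF) return 1;       // fits in an unsigned byte
        if (max <= 0xFFFF) return 2;     // fits in an unsigned short
        return 4;                        // fall back to a full int
    }

    public static byte[] encode(int[] keys) {
        int width = chooseWidth(keys);
        ByteBuffer buf = ByteBuffer.allocate(keys.length * width);
        for (int k : keys) {
            switch (width) {
                case 1:  buf.put((byte) k);        break;
                case 2:  buf.putShort((short) k);  break;
                default: buf.putInt(k);
            }
        }
        return buf.array();
    }
}
```

For a typical low-cardinality dictionary column (max key under 65536) this
alone shrinks the key storage by 2-4x before snappy is even applied.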

*3. Local dictionary for Primitive data type columns*
Currently, local dictionary is not supported for primitive columns in
carbondata (it is supported only for String data type columns). For
low-cardinality columns, local dictionary encoding will be effective, and
adaptive encoding can be applied on top of it. This will reduce the store
size.
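A sketch of a page-local dictionary for a primitive (long) column, again with
invented names rather than CarbonData internals: distinct values go into a
small dictionary and each row stores only a surrogate key, which is exactly
the small-range integer that point 2's adaptive encoding can then compact.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: local (per-page) dictionary encoding for a primitive column.
public class LocalDictPage {
    public final long[] dictionary;  // distinct values; index = surrogate key
    public final int[] surrogates;   // per-row keys, candidates for adaptive encoding

    public LocalDictPage(long[] column) {
        Map<Long, Integer> index = new HashMap<>();
        List<Long> distinct = new ArrayList<>();
        surrogates = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            Integer key = index.get(column[i]);
            if (key == null) {
                key = distinct.size();       // next surrogate in insertion order
                index.put(column[i], key);
                distinct.add(column[i]);
            }
            surrogates[i] = key;
        }
        dictionary = new long[distinct.size()];
        for (int i = 0; i < dictionary.length; i++) {
            dictionary[i] = distinct.get(i);
        }
    }

    public long decode(int rowId) {
        return dictionary[surrogates[rowId]];
    }
}
```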

Any suggestions from the community are most welcome.

 -Regards
Kumar Vishal
Re: [Discussion] Carbondata Store size optimization

Jacky Li
+1

Compression is quite important for scan performance. I think all your listed points are valid. Please feel free to contribute.

Regards,
Jacky

> On Sep 12, 2018, at 5:09 PM, Kumar Vishal <[hidden email]> wrote: