
[Discussion] Carbondata Store size optimization

Posted by kumarvishal09 on Sep 12, 2018; 9:09am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Carbondata-Store-size-optimization-tp62283.html

Hi All,
I am working on the below carbondata store size optimizations to reduce the
size of the carbondata files, which will improve IO performance during queries.

*1. String/Varchar store size optimization*
*Problem:*
Currently String/Varchar data type values are stored in LV (length-value)
format in the carbondata file. During a query, the offset (position) of each
cell value in a page must first be calculated, which impacts query
performance. Storage size is also high because no encoding can be applied to
the length part, as it is stored inline with the data.
*Solution:*
Store the length part separately from the data part and apply adaptive
encoding on the lengths. This will optimize store size, and during queries
offset calculation will be much faster because only the length part needs to
be scanned. This will improve query performance.
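The idea can be sketched as follows. This is an illustrative example, not CarbonData's actual page layout or classes: it contrasts the inline LV layout with a layout that keeps the lengths in a separate stream, where each value's offset is just a running sum over that stream.

```python
def encode_lv(values):
    """Inline LV layout: each value prefixed by its 4-byte length."""
    out = bytearray()
    for v in values:
        data = v.encode("utf-8")
        out += len(data).to_bytes(4, "big") + data
    return bytes(out)

def encode_split(values):
    """Split layout: a lengths stream (encodable separately) plus a data stream."""
    lengths = [len(v.encode("utf-8")) for v in values]
    data = b"".join(v.encode("utf-8") for v in values)
    return lengths, data

def offsets_from_lengths(lengths):
    """Offset of value i = running sum of the preceding lengths."""
    offsets, pos = [], 0
    for n in lengths:
        offsets.append(pos)
        pos += n
    return offsets

page = ["apple", "fig", "banana"]
lengths, data = encode_split(page)
offs = offsets_from_lengths(lengths)
# value i is data[offs[i] : offs[i] + lengths[i]]
assert [data[o:o + n].decode() for o, n in zip(offs, lengths)] == page
# Inline LV spends a fixed 4 bytes per length; in the split layout the
# small lengths can be stored with 1 byte each (adaptive encoding).
assert len(encode_lv(page)) == 4 * len(page) + len(data)
```

In the split layout the lengths column looks like any other small-integer column, so the same adaptive encoders can shrink it, and offset reconstruction never touches the data stream.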

*2. Adaptive encoding for Global/Direct/Local dictionary columns*
*Problem:*
Global/Direct/Local dictionary values are stored in binary format and only
snappy compression is applied. As Global/Direct/Local dictionary values are
of Integer data type, they can be adaptively stored with a data type smaller
than Integer.
*Solution:*
Add adaptive encoding for global/direct dictionary columns to reduce the
store size.
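A hedged sketch of the adaptive step (the function names here are illustrative, not CarbonData's encoder API): pick the smallest integer width that covers the min/max of the dictionary surrogate keys in a page, instead of always spending 4 bytes per key.

```python
def adaptive_width(values):
    """Return the smallest bytes-per-value that can hold all values exactly."""
    lo, hi = min(values), max(values)
    for width, (tmin, tmax) in [(1, (-128, 127)),
                                (2, (-32768, 32767)),
                                (4, (-2**31, 2**31 - 1))]:
        if tmin <= lo and hi <= tmax:
            return width
    return 8

surrogate_keys = [1, 2, 2, 7, 3, 1, 5]   # dictionary-encoded column page
width = adaptive_width(surrogate_keys)
assert width == 1                         # fits in one byte: 4x smaller than int32
encoded = b"".join(k.to_bytes(width, "big", signed=True) for k in surrogate_keys)
assert len(encoded) == len(surrogate_keys) * width
```

Snappy can still run on top of the narrowed values; the point is that the pre-compression representation is already 2-4x smaller for typical dictionary key ranges.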

*3. Local dictionary for Primitive data type columns*
Currently carbondata does not support local dictionary for primitive
columns (it is supported only for String datatype columns). For
low-cardinality columns, local dictionary encoding will be effective, and
adaptive encoding can be applied on top of it. This will reduce the store
size.
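The combination can be sketched as below. This is a minimal illustration, assuming a per-page dictionary with a cardinality fallback (the names and threshold are hypothetical, not CarbonData internals): repeated primitive values are replaced by small surrogate keys, which the adaptive step above can then shrink further.

```python
def build_local_dictionary(page, cardinality_threshold=1000):
    """Return (dictionary, surrogate keys), or None if cardinality is too high."""
    dictionary, keys = {}, []
    for value in page:
        if value not in dictionary:
            if len(dictionary) >= cardinality_threshold:
                return None          # too many distinct values: fall back to plain encoding
            dictionary[value] = len(dictionary)
        keys.append(dictionary[value])
    return dictionary, keys

page = [900100, 900100, 900200, 900100, 900200]   # low-cardinality long column
result = build_local_dictionary(page)
assert result is not None
dictionary, keys = result
assert keys == [0, 0, 1, 0, 1]       # 1-byte keys instead of 8-byte longs
# decoding: invert the dictionary and map keys back to values
inverse = {k: v for v, k in dictionary.items()}
assert [inverse[k] for k in keys] == page
```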

Any suggestions from the community are most welcome.

 -Regards
Kumar Vishal