http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Carbondata-Store-size-optimization-tp62283p62446.html
Compression is quite important for scan performance. I think all the points you listed are valid. Please feel free to contribute.
> On 12 Sep 2018, at 5:09 PM, Kumar Vishal <[hidden email]> wrote:
>
> Hi All,
> I am working on below carbondata store size optimization to reduce the size
> of the carbondata file which will improve IO performance during query.
>
> *1. String/Varchar store size optimization*
> *Problem:*
> Currently String/Varchar data type values are stored in LV (length-value)
> format in the carbondata file. During query, the offset (position) of
> each cell value in a page must first be calculated by walking the data,
> which hurts query performance; the storage size is also high because no
> encoding can be applied to the length part while it is interleaved with
> the data.
> *Solution:*
> Store the length part separately from the data part and apply adaptive
> encoding on the lengths. This will optimize the store size, and the
> offset calculation during query will be much faster as it only needs to
> look at the length part (see the sketch below). This will improve query
> performance.
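>
> For illustration, a minimal sketch (hypothetical helper, not actual
> carbondata code) of why a separate length page makes offsets cheap:
> once the lengths are decoded into their own array, the offsets are a
> simple prefix sum, whereas with inline LV format the whole data page
> must be walked to find each value boundary.
>
>   // Minimal sketch (hypothetical): offsets from a separate length page.
>   static int[] buildOffsets(int[] lengths) {
>     int[] offsets = new int[lengths.length + 1];
>     for (int i = 0; i < lengths.length; i++) {
>       offsets[i + 1] = offsets[i] + lengths[i];
>     }
>     return offsets; // value i occupies dataPage[offsets[i], offsets[i+1])
>   }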
>
> *2. Adaptive encoding for Global/Direct/Local dictionary columns*
> *Problem:*
> Global/Direct/Local dictionary values are stored in binary format and
> only snappy compression is applied. Since Global/Direct/Local dictionary
> values are of Integer data type, they can be adaptively stored using a
> data type smaller than Integer.
> *Solution:*
> Add adaptive encoding for global/direct dictionary columns to reduce the
> store size (see the sketch below).
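>
> For illustration, a minimal sketch (hypothetical method name) of the
> adaptive width selection: based on the maximum surrogate key in a page,
> pick the smallest integral type that can hold it.
>
>   // Minimal sketch (hypothetical): smallest width per dictionary key.
>   static int bytesPerKey(int maxSurrogateKey) {
>     if (maxSurrogateKey <= Byte.MAX_VALUE)  return 1; // fits in a byte
>     if (maxSurrogateKey <= Short.MAX_VALUE) return 2; // fits in a short
>     return 4;                                         // fall back to int
>   }
>
> For example, a page with fewer than 128 distinct dictionary values needs
> only 1 byte per key instead of 4, before snappy is even applied.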
>
> *3. Local dictionary for Primitive data type columns*
> Currently in carbondata, local dictionary is not supported for primitive
> columns (it is supported only for String data type columns). For low
> cardinality columns, local dictionary encoding will be effective, and
> adaptive encoding can be applied on top of it (see the sketch below).
> This will reduce the store size.
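>
> For illustration, a minimal sketch (hypothetical helper) of local
> dictionary encoding for an int column: distinct values map to small
> surrogate keys, and the resulting key page can then be adaptively
> encoded as described in point 2.
>
>   // Minimal sketch (hypothetical): local dictionary for an int column.
>   static int[] encodeWithLocalDictionary(
>       int[] values, java.util.Map<Integer, Integer> dict) {
>     int[] keys = new int[values.length];
>     for (int i = 0; i < values.length; i++) {
>       Integer key = dict.get(values[i]);
>       if (key == null) {
>         key = dict.size() + 1; // assign the next surrogate key
>         dict.put(values[i], key);
>       }
>       keys[i] = key;
>     }
>     return keys;
>   }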
>
> Any suggestions from the community are most welcome.
>
> -Regards
> Kumar Vishal
>