Hi All,
I am working on the below CarbonData store size optimizations to reduce the size of the carbondata file, which will improve IO performance during queries.

*1. String/Varchar store size optimization*
*Problem:*
Currently, String/Varchar data type values are stored in LV (length-value) format in the carbondata file. During a query, the offset (position of each cell value) of every value in a page must first be computed, which impacts query performance. Storage size is also high because no encoding can be applied to the length part, as it is stored inline with the data.
*Solution:*
Store the length part separately from the data part and apply adaptive encoding on the lengths. This will optimize store size, and offset calculation during queries will be much faster as only the length part needs to be scanned. It will improve query performance.

*2. Adaptive encoding for Global/Direct/Local dictionary columns*
*Problem:*
Global/Direct/Local dictionary values are stored in binary format and only snappy compression is applied. Since Global/Direct/Local dictionary values are of Integer data type, they can be adaptively stored with a data type smaller than Integer.
*Solution:*
Apply adaptive encoding to global/direct dictionary columns to reduce the store size.

*3. Local dictionary for Primitive data type columns*
Currently, local dictionary is not supported for primitive columns in CarbonData (it is supported only for String data type columns). For low-cardinality columns, local dictionary encoding will be effective, and adaptive encoding can be applied on top of it. This will reduce the store size.

Any suggestions from the community are most welcome.

-Regards
Kumar Vishal
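To illustrate point 1, here is a minimal sketch (hypothetical class and method names, not CarbonData's actual page codecs) contrasting the current LV layout, where each value's length is interleaved with its bytes, against the proposed layout that keeps lengths in a separate array on which adaptive encoding could then be applied:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class LvLayoutSketch {

    // Current style: [len0][bytes0][len1][bytes1]... Lengths are inline,
    // so no encoding can be applied to them independently of the data.
    static byte[] encodeLv(List<String> values) {
        int size = 0;
        for (String v : values) {
            size += 4 + v.getBytes(StandardCharsets.UTF_8).length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (String v : values) {
            byte[] b = v.getBytes(StandardCharsets.UTF_8);
            buf.putInt(b.length);
            buf.put(b);
        }
        return buf.array();
    }

    // Proposed style: lengths live in their own array (which an adaptive
    // codec could shrink to byte/short), data bytes are stored contiguously.
    static int[] encodeLengths(List<String> values) {
        int[] lengths = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            lengths[i] = values.get(i).getBytes(StandardCharsets.UTF_8).length;
        }
        return lengths;
    }

    // Offset of value i is a prefix sum over the compact length array only,
    // without touching the (much larger) data bytes.
    static int offsetOf(int[] lengths, int i) {
        int offset = 0;
        for (int j = 0; j < i; j++) offset += lengths[j];
        return offset;
    }
}
```

In this sketch the offset scan touches only the small length array, which is the source of the claimed query-time speedup.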
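For point 2, a minimal sketch of the adaptive idea (illustrative names only, not CarbonData's encoding classes): pick the narrowest integral type that can hold every dictionary surrogate key, instead of always writing 4-byte ints:

```java
public class AdaptiveDictSketch {

    enum StoreType { BYTE, SHORT, INT }

    // Choose the narrowest type that covers the maximum surrogate key seen.
    static StoreType chooseType(int[] surrogateKeys) {
        int max = 0;
        for (int k : surrogateKeys) max = Math.max(max, k);
        if (max <= Byte.MAX_VALUE) return StoreType.BYTE;
        if (max <= Short.MAX_VALUE) return StoreType.SHORT;
        return StoreType.INT;
    }

    // Storage cost per value for each choice.
    static int bytesPerValue(StoreType t) {
        switch (t) {
            case BYTE: return 1;
            case SHORT: return 2;
            default: return 4;
        }
    }
}
```

A page whose keys all fit in a byte would shrink from 4 bytes per value to 1, before snappy compression is even applied.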
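And for point 3, a minimal sketch of local dictionary encoding applied to a primitive (long) column (again, hypothetical names): distinct values within a page map to small surrogate codes, which an adaptive codec could then store in 1-2 bytes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalDictSketch {
    private final Map<Long, Integer> dict = new HashMap<>();
    private final List<Long> reverse = new ArrayList<>();

    // Encode one page of primitive values into dictionary codes,
    // assigning a new code the first time each distinct value is seen.
    int[] encode(long[] page) {
        int[] codes = new int[page.length];
        for (int i = 0; i < page.length; i++) {
            Integer code = dict.get(page[i]);
            if (code == null) {
                code = dict.size();
                dict.put(page[i], code);
                reverse.add(page[i]);
            }
            codes[i] = code;
        }
        return codes;
    }

    // Look up the original value for a code during decode.
    long decode(int code) {
        return reverse.get(code);
    }

    // Number of distinct values; low cardinality is what makes this pay off.
    int cardinality() {
        return dict.size();
    }
}
```

For a low-cardinality page, the code array plus the small dictionary is far smaller than the raw 8-byte-per-value column.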
+1
Compression is quite important for scan performance. I think all your listed points are valid. Please feel free to contribute.

Regards,
Jacky

> On 12 September 2018 at 5:09 PM, Kumar Vishal <[hidden email]> wrote: