Improving Non-dictionary storage & performance.
Posted by ravipesala on Mar 01, 2017; 12:34pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Improving-Non-dictionary-storage-performance-tp8146.html
Hi,
In order to make storage and performance of non-dictionary columns more
efficient, I am suggesting the following improvements.
1. Always make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary.
Right now only date and timestamp are direct dictionary columns. We can
make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary if these columns
are included in SORT_COLUMNS.
2. Consider delta/value compression while storing direct dictionary values.
Right now direct dictionary values are always stored using the INT datatype.
So we can consider value/delta compression to make the storage more compact.
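To illustrate point 2, here is a minimal sketch of delta compression over
INT values. It is not existing CarbonData code: the class name, method names
and the byte layout (4-byte minimum, 1-byte width, then the deltas) are all
assumptions made for the example.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: store (value - min) in the narrowest width that fits.
final class DeltaEncodingSketch {

    // Smallest byte width (1, 2 or 4) able to hold every (value - min).
    static int deltaWidth(int[] values) {
        if (values.length == 0) return 1;
        long min = values[0], max = values[0];
        for (int v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        long range = max - min; // long arithmetic, cannot overflow for int input
        if (range <= 0xFFL) return 1;
        if (range <= 0xFFFFL) return 2;
        return 4;
    }

    // Assumed layout: [min: 4 bytes][width: 1 byte][deltas: width bytes each].
    static byte[] encode(int[] values) {
        int min = 0;
        if (values.length > 0) {
            min = values[0];
            for (int v : values) min = Math.min(min, v);
        }
        int width = deltaWidth(values);
        ByteBuffer buf = ByteBuffer.allocate(4 + 1 + width * values.length);
        buf.putInt(min);
        buf.put((byte) width);
        for (int v : values) {
            long delta = (long) v - min; // always >= 0
            if (width == 1) buf.put((byte) delta);
            else if (width == 2) buf.putShort((short) delta);
            else buf.putInt((int) delta);
        }
        return buf.array();
    }

    static int[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int min = buf.getInt();
        int width = buf.get();
        int[] out = new int[buf.remaining() / width];
        for (int i = 0; i < out.length; i++) {
            long delta; // read the delta back as an unsigned quantity
            if (width == 1) delta = buf.get() & 0xFFL;
            else if (width == 2) delta = buf.getShort() & 0xFFFFL;
            else delta = buf.getInt() & 0xFFFFFFFFL;
            out[i] = (int) (min + delta);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] values = {17000, 17001, 17005, 17255};
        System.out.println("plain INT bytes: " + 4 * values.length);
        System.out.println("delta bytes:     " + encode(values).length);
    }
}
```

With the four values in main, all deltas fit in one byte, so the sketch
needs 9 bytes (5 header + 4 deltas) against 16 bytes for plain INT storage.
The saving obviously depends on how close the values in a page are.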
3. Use a separator instead of LV format to store String values in
no-dictionary format.
Currently String datatypes for non-dictionary columns are stored in
LV (length-value) format, where we always use a Short (2 bytes) as the length.
In order to keep storage compact we can use a separator (a 0 byte) instead,
which takes just a single byte. While reading we can traverse through the
data and get the offsets like we are doing now.
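A rough sketch of point 3, comparing the two layouts. This is illustrative
only, not the CarbonData writer: the class and method names are hypothetical,
and it assumes values are UTF-8 encoded and never contain a 0 byte themselves
(a value containing U+0000 would need escaping or a different separator).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: LV layout versus 0-byte separated layout.
final class SeparatorFormatSketch {

    // LV layout: 2-byte big-endian length followed by the value bytes.
    static byte[] encodeLV(List<String> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String s : values) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            if (b.length > Short.MAX_VALUE) {
                throw new IllegalArgumentException("value too long for a short length");
            }
            out.write(b.length >>> 8); // high byte of the length
            out.write(b.length);       // low byte of the length
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    // Separated layout: value bytes followed by a single 0 byte.
    static byte[] encodeSeparated(List<String> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String s : values) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            for (byte x : b) {
                if (x == 0) throw new IllegalArgumentException("value contains the separator");
            }
            out.write(b, 0, b.length);
            out.write(0);
        }
        return out.toByteArray();
    }

    // One pass over the data to find the start offset of every value.
    static int[] offsets(byte[] data) {
        List<Integer> starts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0) {
                starts.add(start);
                start = i + 1;
            }
        }
        return starts.stream().mapToInt(Integer::intValue).toArray();
    }

    static List<String> decodeSeparated(byte[] data) {
        List<String> values = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0) {
                values.add(new String(data, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("ab", "", "xyz");
        System.out.println("LV bytes:        " + encodeLV(values).length);
        System.out.println("separated bytes: " + encodeSeparated(values).length);
        System.out.println("offsets:         " + Arrays.toString(offsets(encodeSeparated(values))));
    }
}
```

The saving is one byte per value, so it matters most for columns with many
short strings.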
4. Add range filters for no-dictionary columns.
Currently range filters such as greater than / less than are not implemented
for no-dictionary columns. So we should implement them to avoid the row-level
filter and improve performance.
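One possible shape for point 4, as a sketch only. It assumes the values of
the no-dictionary column are stored sorted within a page (for example
because the column is in SORT_COLUMNS) and are compared as unsigned bytes;
the class and method names are hypothetical and not part of CarbonData.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: evaluate "column > bound" without checking every row.
final class RangeFilterSketch {

    // Lexicographic comparison of two byte arrays, bytes treated as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    // A page can be skipped for "column > bound" when even its max is <= bound.
    static boolean canSkipGreaterThan(byte[] pageMax, byte[] bound) {
        return compareUnsigned(pageMax, bound) <= 0;
    }

    // Binary search for the first row whose value is strictly greater than
    // bound; every row from that index to the end of the page matches.
    static int firstGreaterThan(byte[][] sortedValues, byte[] bound) {
        int lo = 0, hi = sortedValues.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareUnsigned(sortedValues[mid], bound) <= 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[][] page = {utf8("apple"), utf8("banana"), utf8("cherry"), utf8("date")};
        System.out.println("first row > banana: " + firstGreaterThan(page, utf8("banana")));
        System.out.println("skip page for > zzz: " + canSkipGreaterThan(page[3], utf8("zzz")));
    }
}
```

A less-than filter would mirror this with the page minimum and a search for
the first value that is not smaller than the bound. For unsorted pages only
the min/max check applies and the remaining rows still need to be scanned.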
Regards,
Ravindra.