Posted by kumarvishal09 on Jun 04, 2018; 3:10pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Carbon-Local-Dictionary-Support-tp51447.html
Hi Community,
Currently CarbonData supports either global dictionary or
No-Dictionary (plain text stored in LV format) for storing dimension column
data.
*Bottlenecks with Global Dictionary*
1. The dictionary file is mutable, so global dictionary cannot be supported
in storage environments that do not support append.
2. It is difficult for the user to decide whether a column should be a
dictionary column when the table has a large number of columns.
3. Global dictionary generation generally slows down the load process.
*Bottlenecks with No-Dictionary*
1. Storage size is high.
2. Queries on No-Dictionary columns are slower because more data is read
and processed.
3. Filtering is slower on No-Dictionary columns because the number of
comparisons is high.
4. Memory footprint is high.
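To make the No-Dictionary cost concrete, here is a minimal sketch of an LV-style page. It assumes LV means length-value records with a 2-byte length prefix; the function names encode_lv and filter_lv are hypothetical and are not CarbonData APIs. Every row carries its full value, and a filter has to compare the full bytes of every row.

```python
import struct

def encode_lv(values):
    """Encode strings as length-value records.

    Assumption: a 2-byte big-endian length followed by the UTF-8 bytes.
    """
    out = bytearray()
    for v in values:
        raw = v.encode("utf-8")
        out += struct.pack(">H", len(raw))
        out += raw
    return bytes(out)

def filter_lv(buf, target):
    """Return the row ids equal to target by comparing every record's bytes."""
    raw_target = target.encode("utf-8")
    matches, pos, row = [], 0, 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        if buf[pos:pos + length] == raw_target:
            matches.append(row)
        pos += length
        row += 1
    return matches

column = ["india", "china", "india", "usa", "india", "china"]
lv_page = encode_lv(column)
```

Repeated values such as "india" are stored in full each time they occur, which is where the storage and comparison cost comes from.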
The above bottlenecks can be solved by *generating a local dictionary for low
cardinality columns at each blocklet level*, which will help to achieve the
following benefits:
1. It allows dictionary generation on different storage environments,
irrespective of the file operations (such as append) they support.
2. It reduces the extra read/write IO on the dictionary files generated
in the global dictionary case.
3. It removes the need for the user to identify dictionary columns when
a table has a large number of columns.
4. It gives better compression on low cardinality dimension columns.
5. Filter queries on No-Dictionary columns with a local dictionary will be
faster, as the filter is applied on the encoded data.
6. It reduces store size and memory footprint, as only the unique values
are stored in the local dictionary and the column data is stored in
encoded form.
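The idea can be sketched as follows. This is only an illustration of per-blocklet dictionary encoding, not the proposed implementation: encode_blocklet, filter_equals and the max_dictionary_size threshold are hypothetical names, and the fallback to plain values when the cardinality is too high is an assumption.

```python
def encode_blocklet(values, max_dictionary_size=1000):
    """Build a dictionary local to one blocklet and encode its values.

    Returns (dictionary, encoded), where dictionary[i] is the value for
    code i. Returns (None, None) when the number of distinct values
    exceeds max_dictionary_size (assumed fallback to plain storage).
    """
    codes = {}
    encoded = []
    for v in values:
        code = codes.get(v)
        if code is None:
            if len(codes) >= max_dictionary_size:
                return None, None  # cardinality too high for this blocklet
            code = len(codes)
            codes[v] = code
        encoded.append(code)
    # dict preserves insertion order, so position in the list == code
    return list(codes), encoded

def filter_equals(dictionary, encoded, target):
    """Equality filter on encoded data: one lookup, then integer compares."""
    if target not in dictionary:
        return []  # value absent from this blocklet, nothing to scan
    code = dictionary.index(target)
    return [row for row, c in enumerate(encoded) if c == code]

blocklet = ["india", "china", "india", "usa", "india", "china"]
dictionary, encoded = encode_blocklet(blocklet)
rows = filter_equals(dictionary, encoded, "india")
```

Because the dictionary lives inside the blocklet, nothing has to be appended to a shared file, and a filter value missing from the local dictionary lets the whole blocklet be skipped for that predicate.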
Please provide your comments. Any suggestions from the community are most
welcome. Please let me know if any clarification is needed.
-Regards
Kumar Vishal