Apache CarbonData Dev Mailing List archive

Re: [Discussion] Carbon Local Dictionary Support

Posted by kumarvishal09 on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Carbon-Local-Dictionary-Support-tp51447p51541.html

Hi Community,

Please find attached the design document for local dictionary support. Let me know if any part of the design needs further clarification.

Any further inputs or improvements are most welcome.

-Regards

Kumar Vishal

On Tue, Jun 5, 2018 at 6:14 PM, Jacky Li <[hidden email]> wrote:

+1
Good feature to add to CarbonData.

Regards,
Jacky

> On June 4, 2018, at 11:10 PM, Kumar Vishal <[hidden email]> wrote:
>
> Hi Community,
>
> Currently, CarbonData supports either a global dictionary or
> No-Dictionary (plain text stored in LV format) for storing dimension column
> data.
>
> *Bottlenecks with Global Dictionary*
>
> 1. The dictionary file is mutable, so a global dictionary cannot be
>    supported in storage environments that do not support append.
> 2. It is difficult for the user to decide whether a column should be a
>    dictionary column when the table has many columns.
> 3. Global dictionary generation generally slows down the load process.
>
> *Bottlenecks with No-Dictionary*
>
> 1. Storage size is high.
> 2. Queries on No-Dictionary columns are slower because more data is read
>    and processed.
> 3. Filtering on No-Dictionary columns is slower because the number of
>    comparisons is high.
> 4. Memory footprint is high.
>
> The above bottlenecks can be solved by *generating a local dictionary for
> low cardinality columns at the blocklet level*, which gives the following
> benefits:
>
> 1. Dictionary generation can be supported on any storage environment,
>    regardless of which file operations (such as append) it supports.
> 2. It avoids the extra read/write IO on the dictionary files that a
>    global dictionary generates.
> 3. The user no longer has to identify dictionary columns when the table
>    has many columns.
> 4. It gives better compression on low cardinality dimension columns.
> 5. Filter queries on No-Dictionary columns with a local dictionary will be
>    faster, because the filter is applied on encoded data.
> 6. It reduces the store size and memory footprint, because only the unique
>    values are stored in the local dictionary and the corresponding data
>    is stored in encoded form.
>
> Please share your comments. Suggestions from the community are most
> welcome. Let me know if anything needs clarification.
>
> -Regards
> Kumar Vishal
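
For readers who have not opened the design document, the idea in the proposal above can be sketched roughly as follows. This is only an illustration of per-blocklet dictionary encoding and of filtering on the encoded values, not CarbonData's actual implementation. The names `encode_blocklet`, `filter_equals` and `MAX_LOCAL_DICT_SIZE` are hypothetical, and the cut-off value is an assumption, since the mail says only "low cardinality" and gives no number.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical cut-off: the mail only says "low cardinality" and names no number.
MAX_LOCAL_DICT_SIZE = 4


def encode_blocklet(values: List[str]) -> Optional[Tuple[List[str], List[int]]]:
    """Dictionary-encode the column values of one blocklet.

    Returns (dictionary, codes), or None when the blocklet has more distinct
    values than the cut-off and should stay in plain no-dictionary form.
    """
    code_of: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in code_of:
            if len(code_of) == MAX_LOCAL_DICT_SIZE:
                return None  # cardinality too high for this blocklet
            code_of[value] = len(code_of)
        codes.append(code_of[value])
    # dict keeps insertion order, so position i holds the value for code i
    return list(code_of), codes


def filter_equals(dictionary: List[str], codes: List[int], literal: str) -> List[int]:
    """Return the row ids where the column equals `literal`, comparing codes only."""
    if literal not in dictionary:
        return []  # no row in this blocklet can match
    target = dictionary.index(literal)
    return [row for row, code in enumerate(codes) if code == target]


blocklet = ["IN", "US", "IN", "CN", "US", "IN"]
encoded = encode_blocklet(blocklet)
dictionary, codes = encoded
matching_rows = filter_equals(dictionary, codes, "IN")
```

In this sketch each unique string is stored once per blocklet and each row holds a small integer, which is where the storage and memory savings would come from. The equality filter looks up the literal once in the dictionary and then compares integers, and a blocklet whose dictionary does not contain the literal needs no row-level comparison at all.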
