Apache CarbonData Dev Mailing List archive

Re: [Discussion] Carbon Local Dictionary Support

Posted by kumarvishal09 on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Carbon-Local-Dictionary-Support-tp51447p51543.html

Hi All,

Due to some problem, the link in my earlier message is not working. Please find the updated link below.

https://drive.google.com/file/d/10LqtQlrE4jeotmleoMLJ8F91rK2TrN2h/view?usp=sharing

-Regards
Kumar Vishal

On Wed, Jun 6, 2018 at 2:40 PM, Kumar Vishal <[hidden email]>
wrote:

> Hi All,
>
> Please find the link to the design doc.
>
> https://drive.google.com/file/d/1eqfIms2tMi3b63nMbKfGRZYmo7TMyE1_/view?usp=sharing
>
> -Regards
> Kumar Vishal
>
> On Wed, Jun 6, 2018 at 2:25 PM, Kumar Vishal <[hidden email]>
> wrote:
>
>> Hi Community,
>>
>> Please find attached the Local Dictionary support design document. Please
>> let me know if any further clarification on the design document is needed.
>> Any further inputs/improvements are most welcome.
>>
>>
>>
>> -Regards
>> Kumar Vishal
>>
>> On Tue, Jun 5, 2018 at 6:14 PM, Jacky Li <[hidden email]> wrote:
>>
>>> +1
>>> Good feature to add in CarbonData
>>>
>>> Regards,
>>> Jacky
>>>
>>>
>>> > On Jun 4, 2018, at 11:10 PM, Kumar Vishal <[hidden email]> wrote:
>>> >
>>> > Hi Community,
>>> >
>>> > Currently CarbonData supports either a global dictionary or
>>> > No-Dictionary (plain text stored in LV format) for storing dimension
>>> > column data.
>>> >
>>> > *Bottleneck with Global Dictionary*
>>> >
>>> > 1. Because the dictionary file is mutable, a global dictionary cannot
>>> >    be supported in storage environments that do not support append.
>>> > 2. It is difficult for the user to decide whether a column should be a
>>> >    dictionary column when the table has a large number of columns.
>>> > 3. Global dictionary generation generally slows down the load process.
>>> >
>>> > *Bottleneck with No-Dictionary*
>>> >
>>> > 1. Storage size is high.
>>> > 2. Queries on No-Dictionary columns are slower because more data is
>>> >    read and processed.
>>> > 3. Filtering is slower on No-Dictionary columns because the number of
>>> >    comparisons is high.
>>> > 4. Memory footprint is high.
>>> >
>>> > The above bottlenecks can be solved by *generating a local dictionary
>>> > for low-cardinality columns at each blocklet level*, which will help
>>> > achieve the following benefits:
>>> >
>>> > 1. It allows dictionary generation on different storage environments,
>>> >    irrespective of whether they support operations such as append on
>>> >    files.
>>> > 2. It removes the extra read/write IO on the dictionary files that
>>> >    are generated in the case of a global dictionary.
>>> > 3. It removes the need for the user to identify dictionary columns
>>> >    when a table has many columns.
>>> > 4. It gives better compression on low-cardinality dimension columns.
>>> > 5. Filter queries on No-Dictionary columns with a local dictionary will
>>> >    be faster, as filtering will be done on encoded data.
>>> > 6. It reduces the store size and memory footprint, as only the unique
>>> >    values are stored in the local dictionary and the corresponding
>>> >    data is stored in encoded form.
>>> >
>>> > Please share your comments. Suggestions from the community are most
>>> > welcome. Please let me know if any clarification is needed.
>>> >
>>> > -Regards
>>> > Kumar Vishal
>>>
>>>
>>>
>>>
>>
>
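
For readers of the archive, the blocklet-level idea in the quoted proposal can be illustrated with a minimal sketch. It is not CarbonData code. The function names, the MAX_DISTINCT cap and the sample values are assumptions made only for illustration, since the proposal says "low cardinality" without giving a number. The sketch encodes one blocklet's column values into a local dictionary plus integer keys, falls back when there are too many distinct values, and evaluates an equality filter on the encoded keys.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical cap on distinct values per blocklet (assumption; the proposal
# only says "low cardinality" and gives no threshold).
MAX_DISTINCT = 4


def encode_blocklet(values: List[str]) -> Optional[Tuple[List[str], List[int]]]:
    """Return (local dictionary, integer keys), or None if cardinality is too high."""
    index: Dict[str, int] = {}
    keys: List[int] = []
    for value in values:
        key = index.get(value)
        if key is None:
            if len(index) >= MAX_DISTINCT:
                # Too many distinct values: the caller would keep the column
                # in plain No-Dictionary form for this blocklet.
                return None
            key = len(index)
            index[value] = key
        keys.append(key)
    # Python dicts preserve insertion order, so position i holds the value for key i.
    return list(index), keys


def filter_equals(dictionary: List[str], keys: List[int], literal: str) -> List[int]:
    """Return row offsets where the column equals `literal`, comparing integer keys only."""
    if literal not in dictionary:
        return []  # the value never occurs in this blocklet, so no row can match
    target = dictionary.index(literal)
    return [row for row, key in enumerate(keys) if key == target]


blocklet = ["IN", "US", "IN", "CN", "US", "IN"]
encoded = encode_blocklet(blocklet)
assert encoded is not None
dictionary, keys = encoded
matches = filter_equals(dictionary, keys, "IN")
```

Only the unique values are kept in the dictionary, and the filter literal is looked up once so that each row needs a single integer comparison. Because the dictionary lives with the blocklet, no separate mutable dictionary file has to be appended to.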

kumar vishal