[jira] [Created] (CARBONDATA-2584) CarbonData Local Dictionary Support

Akash R Nilugal (Jira)
kumar vishal created CARBONDATA-2584:
----------------------------------------

             Summary: CarbonData Local Dictionary Support
                 Key: CARBONDATA-2584
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-2584
             Project: CarbonData
          Issue Type: New Feature
            Reporter: kumar vishal


Currently, CarbonData supports either global dictionary or No-Dictionary (plain text stored in LV, i.e. length-value, format) encoding for storing dimension column data.

*Bottlenecks with Global Dictionary*

It is difficult for the user to decide whether a column should be a dictionary column when the table has a large number of columns.

Global dictionary generation generally slows down the load process.

Multiple IO operations are made during load even when the dictionary already exists.

During a query, multiple IO operations are needed to read both the dictionary files and the carbondata files.

*Bottlenecks with No-Dictionary*

Storage size is high because the data is stored in LV format.

Queries on No-Dictionary columns are slower because more data has to be read and processed.

Filtering is slower on No-Dictionary columns because the number of comparisons is high.

The memory footprint is high.
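
To make the storage cost of the LV layout concrete, here is a minimal Python sketch. It assumes that each value is written as a length prefix followed by the value bytes, and it assumes a 2-byte big-endian length field. The field width and the names `lv_encode` and `lv_decode` are illustrative choices, not CarbonData's actual implementation.

```python
import struct

def lv_encode(values):
    """Encode strings as length-value pairs (assumed 2-byte big-endian length)."""
    out = bytearray()
    for v in values:
        raw = v.encode("utf-8")
        out += struct.pack(">H", len(raw))  # length prefix
        out += raw                          # value bytes, stored in full every time
    return bytes(out)

def lv_decode(buf):
    """Decode a buffer produced by lv_encode back into a list of strings."""
    values, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        values.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return values

column = ["india", "china", "india", "india", "china"]
encoded = lv_encode(column)
```

Under these assumptions every repeated value is written out again in full, so the encoded size grows with the number of rows rather than with the number of distinct values.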

*The above bottlenecks can be solved by generating a dictionary for low-cardinality columns at the blocklet level, which provides the following benefits:*

It removes the extra read/write IO operations on the dictionary files that are generated in the case of a global dictionary.

It removes the need for the user to identify the dictionary columns when a table has a large number of columns.

It gives better compression on dimension columns with low cardinality.

Filter queries and full scan queries on No-Dictionary columns with a local dictionary will be faster, as filtering is done on the encoded data.

It will help reduce the store size and memory footprint, as only unique values are stored as part of the local dictionary and the corresponding data is stored as encoded data.
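
The idea behind these benefits can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `encode_blocklet`, `filter_equals`, and the `threshold` parameter are hypothetical names, and the fallback behaviour when a column turns out to have too many distinct values is an assumption rather than something this issue specifies.

```python
def encode_blocklet(values, threshold=10000):
    """Dictionary-encode the column values of one blocklet.

    Returns (dictionary, codes). Returns (None, None) when the number of
    distinct values would exceed `threshold` (a hypothetical cut-off),
    signalling that the caller should keep the plain encoding.
    """
    value_to_code = {}
    codes = []
    for v in values:
        code = value_to_code.get(v)
        if code is None:
            if len(value_to_code) >= threshold:
                return None, None
            code = len(value_to_code)
            value_to_code[v] = code
        codes.append(code)
    # dicts preserve insertion order, so list position equals the code
    dictionary = list(value_to_code)
    return dictionary, codes

def filter_equals(dictionary, codes, literal):
    """Return the row indices equal to `literal`, comparing integer codes only."""
    try:
        target = dictionary.index(literal)  # one lookup in the small dictionary
    except ValueError:
        return []  # literal never occurs in this blocklet
    return [i for i, c in enumerate(codes) if c == target]

blocklet = ["india", "china", "india", "india", "china"]
dictionary, codes = encode_blocklet(blocklet)
matches = filter_equals(dictionary, codes, "china")
```

Here each distinct string is stored once in `dictionary`, the rows hold small integers, and the equality filter compares integers instead of strings. Because the dictionary travels with the blocklet, no separate dictionary file has to be read or written.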



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)