Hi Community,

Currently CarbonData supports global dictionary or No-Dictionary (plain text stored in LV format) for storing dimension column data.

*Bottlenecks with Global Dictionary*

1. The dictionary file is mutable, so global dictionary cannot be supported in storage environments that do not support append.
2. It is difficult for the user to decide whether a column should be a dictionary column when the table has many columns.
3. Global dictionary generation generally slows down the load process.

*Bottlenecks with No-Dictionary*

1. Storage size is high.
2. Queries on No-Dictionary columns are slower because more data is read and processed.
3. Filtering is slower on No-Dictionary columns because the number of comparisons is high.
4. Memory footprint is high.

The above bottlenecks can be solved by *generating a local dictionary for low cardinality columns at each blocklet level*, which brings the following benefits:

1. Dictionary generation can be supported on any storage environment, regardless of whether it supports append on files.
2. It removes the extra read/write IO on the dictionary files generated for global dictionary.
3. The user no longer has to identify the dictionary columns when a table has many columns.
4. It gives better compression on low cardinality dimension columns.
5. Filter queries on No-Dictionary columns with a local dictionary will be faster because the filter is applied on encoded data.
6. It reduces store size and memory footprint, because only the unique values are stored in the local dictionary and the column data is stored in encoded form.

Please provide your comments. Any suggestions from the community are most welcome. Please let me know if anything needs clarification.

-Regards
Kumar Vishal
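To make the proposal concrete, the sketch below dictionary-encodes one column of a single blocklet and compares the encoded size with a length-value (LV) layout. It is only an illustration, not CarbonData's implementation: the function names, the 2-byte length prefix and the fixed 4-byte surrogate keys are assumptions chosen for this example.

```python
from typing import Dict, List, Tuple


def encode_blocklet(values: List[str]) -> Tuple[List[str], List[int]]:
    """Dictionary-encode one column of one blocklet.

    Returns the local dictionary (unique values in first-seen order) and
    the column data rewritten as integer surrogate keys.
    """
    key_of: Dict[str, int] = {}
    dictionary: List[str] = []
    encoded: List[int] = []
    for value in values:
        key = key_of.get(value)
        if key is None:
            # First time this value is seen: assign the next surrogate key.
            key = len(dictionary)
            key_of[value] = key
            dictionary.append(value)
        encoded.append(key)
    return dictionary, encoded


def lv_size(values: List[str], length_bytes: int = 2) -> int:
    """Bytes needed to store the values as length-prefixed UTF-8 strings.

    The 2-byte length prefix is an assumption for this example.
    """
    return sum(length_bytes + len(v.encode("utf-8")) for v in values)


def dictionary_size(dictionary: List[str], encoded: List[int], key_bytes: int = 4) -> int:
    """Bytes for the local dictionary (stored as LV) plus fixed-width keys."""
    return lv_size(dictionary) + key_bytes * len(encoded)


if __name__ == "__main__":
    column = ["Bangalore", "Shenzhen", "Bangalore", "Bangalore", "Shenzhen"] * 1000
    dictionary, encoded = encode_blocklet(column)
    print("plain LV bytes:", lv_size(column))
    print("dictionary-encoded bytes:", dictionary_size(dictionary, encoded))
```

A writer could compare the two sizes per column and keep the plain layout whenever the dictionary does not pay off.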
kumar vishal
|
Hi, Kumar:

Local dictionary will be a nice feature, since other formats like Parquet all support this.

My concern is: how will you implement this feature?

1. What is the scope of `local`? Page level (for all contained rows), blocklet level (for all contained pages), or block level (for all contained blocklets)?
2. Where will you store the local dictionary?
3. How do you decide whether to enable the local dictionary for a column?
4. Have you considered falling back to plain encoding if the local dictionary encoding consumes more space?
5. Will you still work on the V3 format or start a new V4 (or V3.1) version?

Also, I am concerned about data loading performance. Please pay attention to it while you implement this feature.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
|
Hi Vishal,

+1

Thank you for starting a discussion on this. It will be a very helpful feature for improving query performance and reducing the memory footprint. Please add the design document for it.

Regards,
Ravindra.

On 5 June 2018 at 09:22, xuchuanyin <[hidden email]> wrote:
|
In reply to this post by xuchuanyin
Hi Xuchuanyin,

I am working on the design document and have already captured all the points you mentioned. I will share it once it is finished.

-Regards
Kumar Vishal

On Tue, Jun 5, 2018 at 9:22 AM, xuchuanyin <[hidden email]> wrote:
kumar vishal
|
+1

It is a good feature to have. Once the design document is uploaded we will get a better idea of how it will be implemented.

Regards
Manish Gupta

On Tue, Jun 5, 2018 at 11:18 AM, Kumar Vishal <[hidden email]> wrote:
|
In reply to this post by kumarvishal09
Hi:
+1. This is an exciting feature; I hope to have it in version 1.5.
|
In reply to this post by kumarvishal09
+1
Good feature to add to CarbonData.

Regards,
Jacky

On June 4, 2018, at 11:10 PM, Kumar Vishal <[hidden email]> wrote:
|
Hi Community,

Please find attached the design document for local dictionary support. Let me know if anything in it needs clarification. Any further inputs or improvements are most welcome.

-Regards
Kumar Vishal

On Tue, Jun 5, 2018 at 6:14 PM, Jacky Li <[hidden email]> wrote:
> +1
kumar vishal
|
Hi All,
Please find the link to the design doc.

https://drive.google.com/file/d/1eqfIms2tMi3b63nMbKfGRZYmo7TMyE1_/view?usp=sharing

-Regards
Kumar Vishal

On Wed, Jun 6, 2018 at 2:25 PM, Kumar Vishal <[hidden email]> wrote:
kumar vishal
|
Hi All,
The above link is not working due to some problem. Please find the updated link.

https://drive.google.com/file/d/10LqtQlrE4jeotmleoMLJ8F91rK2TrN2h/view?usp=sharing

-Regards
Kumar Vishal

On Wed, Jun 6, 2018 at 2:40 PM, Kumar Vishal <[hidden email]> wrote:
kumar vishal
|
Hi All,
Please ignore the above link. Please comment here:

https://docs.google.com/document/d/1y0dJSWOr0ZTPpbNOOUfVfU5SoANL5B1F0l7jhl8BgUs/edit?usp=sharing

-Regards
Kumar Vishal

On Wed, Jun 6, 2018 at 3:06 PM, Kumar Vishal <[hidden email]> wrote:
kumar vishal
|
Hi, Kumar:
Can you raise a JIRA and attach the document? I cannot open the links because they are blocked.
|
Hi Xuchuanyin,
Please find the JIRA link for local dictionary support.

https://issues.apache.org/jira/browse/CARBONDATA-2584

-Regards
Kumar Vishal

On Wed, Jun 6, 2018 at 6:25 PM, xuchuanyin <[hidden email]> wrote:
kumar vishal
|
Hi Vishal,
Thanks for uploading the design document. It is good and gives a detailed picture of the requirement. I have a few questions and suggestions; kindly consider them if applicable.

1. Will the local dictionary be read once and kept in off-heap/on-heap memory, or will it be read for every query?
2. Will the columnCardinality integer array now contain the actual cardinality of no-dictionary columns, in the block footer or in any other metadata? If not, we could store it, since it is a statistic that can help in deciding pushdown for LIKE queries on no-dictionary columns.
3. Apart from the default threshold, we can also define a maximum threshold for the local dictionary (let's say 1 lakh, i.e. 100,000). If the user configures a value greater than the maximum allowed threshold, we can use the maximum and continue.

Regards
Manish Gupta

On Wed, Jun 6, 2018 at 6:54 PM, Kumar Vishal <[hidden email]> wrote:
|
About query filtering
1. “during filter, actual filter values will be generated using column local dictionary values...then filter will be applied on the dictionary encode data”
---
If the filter is not 'equal' but 'like' or 'greater than', can it also run on the encoded data?

2. "As dictionary data will be always of 4 bytes"
---
Why are they 4 bytes?
|
Hi Vishal,
Thanks for sharing the design. I have one question about deciding whether to generate the dictionary or not. If the cardinality is below the threshold in the first few loads, we will create a local dictionary; but if the threshold is breached in subsequent loads, what will happen to the data of the previous loads?

Regards
Bhavya

On Thu, Jun 7, 2018 at 5:28 PM, xuchuanyin <[hidden email]> wrote:

--
Bhavya Aggarwal
Sr. Director, Knoldus Inc.
|
In reply to this post by xuchuanyin
Hi xuchuanyin,
Please find my comments inline.

About query filtering

1. “during filter, actual filter values will be generated using column local dictionary values...then filter will be applied on the dictionary encode data”
---
If the filter is not 'equal' but 'like' or 'greater than', can it also run on the encoded data?

*Range filters will be handled the same way as for a global dictionary column.*

2. "As dictionary data will be always of 4 bytes"
---
Why are they 4 bytes?

*The dictionary value is simply the integer assigned to the dictionary key, so it will be 4 bytes.*
|
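The two answers above can be illustrated with a small sketch. The thread does not spell out how global dictionary columns handle range filters, so the code assumes one common approach: evaluate the predicate once per distinct dictionary value, then compare only integer surrogate keys against the encoded rows. All function names are hypothetical, and the big-endian packing is just one way to show that each key occupies 4 bytes.

```python
import struct
from typing import Callable, List, Set


def matching_keys(dictionary: List[str], predicate: Callable[[str], bool]) -> Set[int]:
    """Evaluate the predicate once per distinct value, not once per row."""
    return {key for key, value in enumerate(dictionary) if predicate(value)}


def filter_rows(encoded: List[int], keys: Set[int]) -> List[int]:
    """Return the row positions whose surrogate key satisfies the filter."""
    return [row for row, key in enumerate(encoded) if key in keys]


def pack_keys(encoded: List[int]) -> bytes:
    """Serialize surrogate keys as fixed-width 4-byte big-endian integers."""
    return struct.pack(f">{len(encoded)}i", *encoded)


if __name__ == "__main__":
    dictionary = ["apple", "banana", "cherry"]   # key = position in the list
    encoded = [0, 2, 1, 2, 0, 1]                 # column data as surrogate keys

    equal_rows = filter_rows(encoded, matching_keys(dictionary, lambda v: v == "cherry"))
    range_rows = filter_rows(encoded, matching_keys(dictionary, lambda v: v > "apple"))
    print(equal_rows, range_rows, len(pack_keys(encoded)))
```

With this approach the string comparison cost depends on the dictionary size rather than on the number of rows, which is where the expected speed-up on filters would come from.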
In reply to this post by bhavya411
Hi bhavya,
Local dictionary generation is at task level. If the threshold is breached during an ongoing load, the local dictionary will not be generated for that column in that load, and there is no dependency on previous loads. A new local dictionary is generated for each load.

Regards,
Akash R Nilugal
|
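The behaviour described in this reply can be sketched as follows. This is a simplified model under stated assumptions, not the actual CarbonData code: the class and function names are hypothetical, "breached" is taken to mean that the number of distinct values exceeds the threshold, and what happens to rows already encoded before the breach is left out.

```python
from typing import Dict, List, Optional


class LocalDictionaryGenerator:
    """Assigns surrogate keys until the distinct-value count exceeds the threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        # Set to None once the threshold is breached for this column.
        self.keys: Optional[Dict[str, int]] = {}

    def add(self, value: str) -> Optional[int]:
        """Return the surrogate key, or None once the threshold has been breached."""
        if self.keys is None:
            return None
        key = self.keys.get(value)
        if key is None:
            if len(self.keys) >= self.threshold:
                # One distinct value too many: give up the dictionary for this load.
                self.keys = None
                return None
            key = len(self.keys)
            self.keys[value] = key
        return key

    @property
    def enabled(self) -> bool:
        return self.keys is not None


def load(values: List[str], threshold: int) -> bool:
    """Run one load with a fresh generator; report whether the dictionary survived."""
    generator = LocalDictionaryGenerator(threshold)
    for value in values:
        generator.add(value)
    return generator.enabled


if __name__ == "__main__":
    print(load(["a", "b", "a"], threshold=2))   # cardinality stays within the threshold
    print(load(["a", "b", "c"], threshold=2))   # third distinct value breaches it
    print(load(["a", "b"], threshold=2))        # a later load starts from scratch
```

Because every call to `load` builds its own generator, a breach in one load has no effect on the loads before or after it.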
In reply to this post by kumarvishal09
Dear Vishal,
Please find my queries/comments on the design doc.

1. If the user gives an invalid value, the default threshold (1000 unique values) will be used. What is the reasoning behind the default value of 1000?
2. No option is mentioned for the user to alter the table once the ENABLE_LOCAL_DICT and CARBON_LOCALDICT_THRESHOLD values are set. Such an option would also help with compatibility if we want to generate a local dictionary for tables created in previous Carbon versions.
3. There should be validation if the user sets ENABLE_LOCAL_DICT to false and tries to set a CARBON_LOCALDICT_THRESHOLD value.
4. The impact of alter table add/drop/change type of column is not mentioned.
5. Would complex types also be considered for local dictionary?
6. "For any column if dictionary values crosses the threshold (carbon_localdict_threshold), then it will drop dictionary for that column." I could not understand "drop dictionary for that column".
7. For better testability, information about the generation and update of the local dictionary can be logged.

Regards
Chetan
|
1. If the user gives an invalid value, the default threshold (1000 unique values) will be used. What is the reasoning behind the default value of 1000?

*1000 is an arbitrary value we mentioned in the design doc. CARBON_LOCALDICT_THRESHOLD is exposed so that users can set the threshold based on their use case.*

2. No option is mentioned for the user to alter the table once the ENABLE_LOCAL_DICT and CARBON_LOCALDICT_THRESHOLD values are set. Such an option would also help with compatibility if we want to generate a local dictionary for tables created in previous versions.

*For an old table, the local dictionary will be generated in new loads, because local dictionary generation is enabled by default. An alter command for setting the CARBON_LOCALDICT_THRESHOLD and ENABLE_LOCAL_DICT properties will be provided for older tables, and the design doc will be updated accordingly. Thank you for pointing this out.*

3. There should be validation if the user sets ENABLE_LOCAL_DICT to false and tries to set a CARBON_LOCALDICT_THRESHOLD value.

*The threshold value will not be considered if ENABLE_LOCAL_DICT is false.*

4. The impact of alter table add/drop/change type of column is not mentioned.

*There is no impact, which is why it is not captured in the design doc's impact analysis section.*

5. Would complex types also be considered for local dictionary?

*It will be handled for primitive no-dictionary String columns of complex types.*

6. "For any column if dictionary values crosses the threshold (carbon_localdict_threshold), then it will drop dictionary for that column." I could not understand "drop dictionary for that column".

*The local dictionary will not be used for the respective column.*

7. For better testability, information about the generation and update of the local dictionary can be logged.

*Logs will be added for each level of local dictionary generation.*
|
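The rules discussed for the two table properties can be summarised in a short sketch. Only the property names, the default of 1000 and the "enabled by default" behaviour come from the thread; the function name, the definition of an invalid value (non-integer or non-positive) and the 100,000 cap (the "1 lakh" maximum suggested earlier, which was not confirmed) are assumptions made for illustration.

```python
from typing import Mapping, Optional

DEFAULT_THRESHOLD = 1000    # default value named in the thread
MAX_THRESHOLD = 100_000     # hypothetical cap, taken from the "1 lakh" suggestion


def resolve_threshold(properties: Mapping[str, str]) -> Optional[int]:
    """Return the effective threshold, or None when local dictionary is disabled."""
    enabled = properties.get("ENABLE_LOCAL_DICT", "true").strip().lower()
    if enabled == "false":
        # The threshold is ignored when the feature is switched off.
        return None
    raw = properties.get("CARBON_LOCALDICT_THRESHOLD")
    if raw is None:
        return DEFAULT_THRESHOLD
    try:
        threshold = int(raw)
    except ValueError:
        # Non-numeric input counts as invalid: fall back to the default.
        return DEFAULT_THRESHOLD
    if threshold <= 0:
        return DEFAULT_THRESHOLD
    # Values above the cap are clamped instead of rejected.
    return min(threshold, MAX_THRESHOLD)


if __name__ == "__main__":
    print(resolve_threshold({}))
    print(resolve_threshold({"ENABLE_LOCAL_DICT": "false", "CARBON_LOCALDICT_THRESHOLD": "50"}))
    print(resolve_threshold({"CARBON_LOCALDICT_THRESHOLD": "not-a-number"}))
```

Clamping to the cap rather than failing follows the "consider max and continue" suggestion; rejecting the value with an error would be an equally reasonable design.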