Dear Community,
This mail is regarding partition optimization.

*Current behaviour:* Currently, partition column information is stored in the data files after load/insert. When we query partition data, we fetch it from the data files and fill the row.

*Proposed optimization:* The idea is to exclude partition column information while writing (load/insert), so the data files do not contain any partition column information. When we query partition data, the readers fill in the partition information using the projected partition columns (passed to and read from BlockExecutionInfo) and the blockId (which has the partition column name and value), then fill the row and return it.

*Benefits:*
1) Query performance should be faster.
2) Store size should be smaller than with the old behaviour.

A *WIP PR [#1]* has been raised for this; please have a look. We are currently working on the CI failures.

#1 https://github.com/apache/carbondata/pull/3695/

Please provide your valuable inputs and suggestions. Thank you in advance!

Thanks & Regards
-Mahesh Raju Somalaraju
github id: maheshrajus
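The read-side idea in the proposal can be sketched roughly as follows. This is only an illustration, not the code in the PR: it assumes the block path uses Hive-style `name=value` partition directories, and the function names `partition_values_from_path` and `fill_partition_columns` are hypothetical. Values are kept as strings; a real reader would also have to convert them to the column's data type.

```python
from urllib.parse import unquote


def partition_values_from_path(block_path):
    """Extract {column: value} from Hive-style `name=value` directory segments.

    Assumption: partition directories look like `country=US`; the actual
    blockId layout in CarbonData may differ.
    """
    values = {}
    # Skip the last segment: it is the data file name, not a partition directory.
    for segment in block_path.strip("/").split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name:
            values[name] = unquote(value)
    return values


def fill_partition_columns(rows, projected_partition_columns, block_path):
    """Append the projected partition values to every row read from the file."""
    values = partition_values_from_path(block_path)
    # The partition values are constant for all rows of one block.
    constants = [values[column] for column in projected_partition_columns]
    return [list(row) + constants for row in rows]


# Hypothetical block path and rows that were stored without partition columns.
path = "/store/sales/country=US/dt=2020-10-15/part-0-0.carbondata"
rows = [[1, "pen"], [2, "ink"]]
filled = fill_partition_columns(rows, ["dt", "country"], path)
```

Because every row in a block shares the same partition values, the lookup happens once per block and the per-row work is only appending constants.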
+1,
Not keeping the partition values as a column (the folder name already has them) is a great way to reduce the store size. We might also have to handle compatibility and support refresh table.

Apache Iceberg has a somewhat more mature concept called *hidden partitioning*, where they also maintain the relationship between columns and support dynamic rollup of partitions based on the query. You can analyze this (https://iceberg.apache.org/partitioning/).

Thanks,
Ajantha

On Thu, Oct 15, 2020 at 2:22 AM Mahesh Raju Somalaraju <[hidden email]> wrote:
> [...]
In reply to this post by maheshrajus
Hi,
+1.

It's long-pending work; good to complete it now.

As Ajantha said, you can have a look at Iceberg hidden partitioning, but this proposal is just about not storing partition data in the files, for faster queries and lower storage. You can analyze it and suggest improvements, such as the time and date relation in partitioning, in another discussion.

Thanks,

Regards,
Akash R Nilugal

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Hi,
To make the .carbondata file self-sufficient, we added the partition column as a data column. A partition column is supposed to be a low-cardinality column, and we apply RLE on top of it, so the size difference should be a matter of bytes. How much difference in size are you getting after removing it?

Regards
Kumar Vishal

On Thu, 29 Oct 2020 at 5:04 PM, akashrn5 <[hidden email]> wrote:
> [...]
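The RLE argument can be illustrated with a small sketch. This is a generic run-length encoder written for illustration, not CarbonData's actual encoder; `run_length_encode` is a hypothetical name. Within one partition the partition column holds a single value, so it collapses to one (value, run length) pair regardless of the row count.

```python
def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            # Same value as the previous one: extend the current run.
            runs[-1][1] += 1
        else:
            # New value: start a new run of length 1.
            runs.append([value, 1])
    return [tuple(run) for run in runs]


# Inside a single partition the partition column is constant.
column = ["US"] * 100_000
runs = run_length_encode(column)
```

Whether the saving is really only a few bytes per file depends on the actual encoding and metadata overhead, which is why measuring the store size before and after is the reliable check.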
Agree with Vishal, better to test and confirm the difference.
Best Regards
David Cai