[Discussion] Partition Optimization

maheshrajus
Dear Community,

This mail is regarding partition optimization.

*Current behaviour:* Currently, partition column values are stored in the
data files after load/insert. When a query reads partition columns, the
values are fetched from the data files to fill each row.

*Proposed optimization:* The idea is to exclude partition column values
when writing data during load/insert, so the data files no longer contain
any partition column information. On the read path, the readers fill in
the partition values using the projected partition columns (passed in
through BlockExecutionInfo) and the blockId (which carries the partition
column names and values), then fill the row and return it.
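
The read-side fill described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the PR: it assumes the blockId embeds Hive-style `column=value` folder segments, and the names `parse_partition_values` and `fill_row`, as well as the sample path, are hypothetical.

```python
from typing import Dict, List


def parse_partition_values(block_id: str) -> Dict[str, str]:
    """Extract partition column names and values from a block id.

    Assumption: the block id contains Hive-style 'column=value' path segments.
    """
    values = {}
    for segment in block_id.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name:
            values[name] = value
    return values


def fill_row(data_row: Dict[str, object], projection: List[str], block_id: str) -> List[object]:
    """Build the output row in projection order.

    Non-partition columns come from the data file row; partition columns,
    which are no longer stored in the file, come from the block id.
    """
    partition_values = parse_partition_values(block_id)
    return [
        partition_values[col] if col in partition_values else data_row[col]
        for col in projection
    ]


# Hypothetical block id and row, for illustration only.
block_id = "store/sales/country=IN/year=2020/part-0-0.carbondata"
row = fill_row({"id": 1, "amount": 250}, ["id", "country", "year", "amount"], block_id)
print(row)  # [1, 'IN', '2020', 250]
```

Note that values parsed from a path are strings; an actual reader would presumably have to convert them to each partition column's data type.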

*Benefits*:
1) Query performance should improve.
2) Store size should be smaller than with the old behavior.

Please have a look: a *WIP PR[#1]* has been raised for this, and we are
currently working on the CI failures.

#1 https://github.com/apache/carbondata/pull/3695/

Please provide your valuable inputs and suggestions. Thank you in advance!

Thanks & Regards
-Mahesh Raju Somalaraju
github id: maheshrajus
Re: [Discussion] Partition Optimization

Ajantha Bhat
+1,

Not keeping the partition values as a column (as the folder name already
has them) is a great way to reduce the store size.
We might also have to handle compatibility and support refresh table.

Apache Iceberg has a somewhat more mature concept called *hidden
partitioning*, where it also maintains the relationship between columns and
supports dynamic rollup of partitions based on the query. You can analyze
this (https://iceberg.apache.org/partitioning/).
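
As a rough illustration of the hidden partitioning idea: the partition value is derived from a source column through a transform, and a filter on the source column is translated into a filter on the derived partitions. The sketch below is plain Python and does not use Iceberg's API; the function names, the `event_time` column, and the sample rows are all hypothetical.

```python
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Partition transform: derive the partition value from a source column."""
    return ts.date()


def partition_for_row(row: dict) -> date:
    # The writer derives the partition from 'event_time'; no separate
    # partition column has to be supplied or stored by the user.
    return day_transform(row["event_time"])


def partitions_to_scan(partitions, ts_from: datetime, ts_to: datetime):
    # A filter on the source column is translated into a filter on the
    # derived partition values, so only matching partitions are read.
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [p for p in partitions if lo <= p <= hi]


rows = [
    {"id": 1, "event_time": datetime(2020, 10, 14, 23, 50)},
    {"id": 2, "event_time": datetime(2020, 10, 15, 2, 22)},
    {"id": 3, "event_time": datetime(2020, 10, 29, 17, 4)},
]
partitions = sorted({partition_for_row(r) for r in rows})
selected = partitions_to_scan(
    partitions, datetime(2020, 10, 15, 0, 0), datetime(2020, 10, 15, 23, 59)
)
print(selected)  # [datetime.date(2020, 10, 15)]
```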

Thanks,
Ajantha

On Thu, Oct 15, 2020 at 2:22 AM Mahesh Raju Somalaraju wrote:

> [...]
Re: [Discussion] Partition Optimization

akashrn5
In reply to this post by maheshrajus
Hi,

+1.

This work has been pending for a long time; it is good to complete it now.

As Ajantha said, you can have a look at Iceberg's hidden partitioning, but
this proposal is only about not storing partition data in the files, which
gives faster queries and lower storage.
You can analyze it and suggest further improvements, such as the relation
between time and date in partitioning, in a separate discussion.

Thanks,

Regards,
Akash R Nilugal



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Re: [Discussion] Partition Optimization

kumarvishal09
Hi,
To make the .carbondata file self-sufficient, we added the partition
column as a data column. A partition column is supposed to be a
low-cardinality column, and we apply RLE on top of it, so the size
difference should be on the order of bytes. How much difference in size
are you getting after removing it?
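
To illustrate why the difference is expected to be small: within one partition folder the partition column holds a single repeated value, which run-length encoding collapses into one run. The sketch below is a generic RLE in Python, not CarbonData's actual encoder; `rle_encode` and the sample data are hypothetical.

```python
from typing import List, Tuple


def rle_encode(values: List[str]) -> List[Tuple[str, int]]:
    """Run-length encode a column as (value, run_length) pairs."""
    runs: List[Tuple[str, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            # Same value as the previous row: extend the current run.
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            # New value: start a new run.
            runs.append((v, 1))
    return runs


# Within one partition folder, the partition column holds a single value.
partition_column = ["IN"] * 1_000_000
encoded = rle_encode(partition_column)
print(encoded)  # [('IN', 1000000)]
```

A million rows reduce to a single (value, count) pair here, which is why measuring the actual store size before and after the change is the useful next step.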

Regards
Kumar Vishal

On Thu, 29 Oct 2020 at 5:04 PM, akashrn5 wrote:

> [...]
Re: [Discussion] Partition Optimization

David CaiQiang
Agree with Vishal; it is better to test and confirm the difference.



-----
Best Regards
David Cai