Re: [Discussion] Partition Optimization

Posted by Ajantha Bhat on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Partition-Optimization-tp101998p102976.html

+1,

Not keeping the partition values as a column (since the folder name already
has them) is a great way to reduce the store size.
We might also have to handle compatibility and support refresh table.
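
To make the idea concrete, here is a minimal sketch of how a reader could rebuild partition values from the folder name instead of the data file. It assumes a Hive-style `column=value` directory layout; the function names and the sample path are hypothetical and are not CarbonData APIs.

```python
from urllib.parse import unquote


def partition_values_from_path(path):
    """Return {column: value} for every 'column=value' segment in a path.

    Assumption: partition folders follow the Hive-style 'column=value'
    naming, with special characters percent-encoded.
    """
    values = {}
    for segment in path.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name:
            values[name] = unquote(value)
    return values


def fill_partition_columns(row, path, projected_partition_columns):
    """Copy the row and add only the partition columns the query projects."""
    filled = dict(row)
    values = partition_values_from_path(path)
    for column in projected_partition_columns:
        # Values come back as strings; a real reader would convert them
        # to the column's declared data type.
        filled[column] = values[column]
    return filled


# Hypothetical data file path and a row read without partition columns.
path = "store/sales/country=IN/year=2020/part-0.carbondata"
row = fill_partition_columns({"name": "a", "amount": 10}, path, ["country"])
```

Old stores still carry the partition columns inside the data files, so a compatible reader would presumably need to detect which layout a segment uses before filling values this way.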

Apache Iceberg has a somewhat more mature concept called *hidden partitioning*,
where it also maintains the relationship between columns and supports dynamic
rollup of partitions based on the query. It may be worth analyzing (
https://iceberg.apache.org/partitioning/)
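
As a rough illustration of the hidden partitioning idea (not Iceberg's actual API), the partition value is derived from a source column through a transform, so a filter on the source column can be mapped to the partitions it touches. The sketch below assumes a day-granularity transform on a timestamp column; all names are hypothetical.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts):
    """Map a timezone-aware timestamp to whole days since 1970-01-01 UTC."""
    return (ts - EPOCH).days


def prune_partitions(partition_days, ts_from, ts_to):
    """Keep only the day partitions that can contain rows in [ts_from, ts_to].

    The query filters on the timestamp column; the partition column itself
    never appears in the query.
    """
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [day for day in partition_days if lo <= day <= hi]


# Three hypothetical day partitions and a query covering part of one day.
days = [day_transform(datetime(2020, 10, d, tzinfo=timezone.utc)) for d in (14, 15, 16)]
selected = prune_partitions(
    days,
    datetime(2020, 10, 15, 8, 0, tzinfo=timezone.utc),
    datetime(2020, 10, 15, 20, 0, tzinfo=timezone.utc),
)
```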

Thanks,
Ajantha

On Thu, Oct 15, 2020 at 2:22 AM Mahesh Raju Somalaraju <
[hidden email]> wrote:

> Dear Community,
>
> This mail is regarding partition optimization.
>
> *Current behaviour:* Currently, partition column information is stored in
> the data files after load/insert. When we query partition data, we fetch
> it from the data files and fill the row.
>
> *Proposed optimization:* In this enhancement, the idea is to exclude
> partition column information while writing (load/insert), so the data
> files do not contain any partition column information. When we query
> partition data, the readers fill in the partition information using the
> projected partition columns [passed to BlockExecutionInfo and read from it]
> and the blockId [which has the partition column name and value], then fill
> the row and return it.
>
> *Benefits*:
> 1) Query performance should be faster.
> 2) Store size should be smaller compared to the old behavior.
>
> Please have a look: a *WIP PR[#1]* has been raised for this, and we are
> currently working on the CI failures.
>
> #1 https://github.com/apache/carbondata/pull/3695/
>
> Please provide your valuable inputs and suggestions. Thank you in advance!
>
> Thanks & Regards
> -Mahesh Raju Somalaraju
> github id: maheshrajus
>