Re: Optimize and refactor insert into command

Posted by ravipesala on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Optimize-and-refactor-insert-into-command-tp88449p89499.html

Hi,

+1
It’s long-pending work. Most welcome.

Regards,
Ravindra.

On Fri, 20 Dec 2019 at 7:55 AM, Ajantha Bhat <[hidden email]> wrote:

> Currently CarbonData "insert into" uses CarbonLoadDataCommand itself.
> The load process has steps such as parsing and a converter step with
> bad-record support.
> Insert into doesn't require these steps, as the data is already validated
> and converted from the source table or dataframe.
>
> Some identified changes are below:
>
> 1. Need to refactor and separate load and insert at the driver side to
> skip the converter step and unify the flow for no-sort and global-sort
> insert.
> 2. Need to avoid reordering each row, by changing the select dataframe's
> projection order itself during insert into.
> 3. For carbon-to-carbon insert, need to provide the ReadSupport and use
> the RecordReader (the vector reader currently doesn't support ReadSupport)
> to handle null values and timestamp cutoff (direct dictionary) from the
> scanRDD result.
> 4. Need to handle insert into partition and non-partition tables in the
> local sort, global sort, no sort, range columns, and compaction flows.
>
> The final goal is to improve insert performance by keeping only the
> required logic, and also to decrease the memory footprint.
>
> If you have any other suggestions or optimizations related to this, let
> me know.
>
> Thanks,
> Ajantha
>
--
Thanks & Regards,
Ravi