[jira] [Resolved] (CARBONDATA-3637) Improve insert into performance and decrease memory foot print

Akash R Nilugal (Jira)

     [ https://issues.apache.org/jira/browse/CARBONDATA-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacky Li resolved CARBONDATA-3637.
----------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

> Improve insert into performance and decrease memory foot print
> --------------------------------------------------------------
>
>                 Key: CARBONDATA-3637
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3637
>             Project: CarbonData
>          Issue Type: Improvement
>            Reporter: Ajantha Bhat
>            Assignee: Ajantha Bhat
>            Priority: Major
>             Fix For: 2.0.0
>
>          Time Spent: 29h 20m
>  Remaining Estimate: 0h
>
> Currently, CarbonData "insert into" reuses the CarbonLoadDataCommand itself.
> The load process includes steps such as parsing and a converter step with
> bad record support.
> Insert into does not require these steps, because the data has already been
> validated and converted by the source table or dataframe.
> Some identified changes are listed below.
> 1. Refactor and separate load and insert on the driver side, so that insert
> skips the converter step, and unify the flow for no sort and global sort
> insert.
> 2. Avoid reordering each row by changing the projection order of the select
> dataframe itself during insert into.
> 3. For carbon to carbon insert, provide the ReadSupport and use
> RecordReader (the vector reader currently does not support ReadSupport) to
> handle null values and the timestamp cutoff (direct dictionary) in the
> scanRDD result.
> 4. Handle insert into partitioned and non-partitioned tables in the local
> sort, global sort, no sort, range column, and compaction flows.
> The final goal is to improve insert performance by keeping only the required
> logic, and also to decrease the memory footprint.
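
Item 2 in the description above can be illustrated with a small sketch. It is not CarbonData code: it uses plain Python lists in place of a dataframe, and the helper names (projection_order, reorder_each_row, select_in_target_order) are hypothetical. It only contrasts building a reordered copy of every row after the scan with reading the columns in the target order once, so that rows arrive already ordered.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[object, ...]


def projection_order(source_cols: Sequence[str], target_cols: Sequence[str]) -> List[int]:
    """For each target column, return its index in the source schema."""
    position = {name: i for i, name in enumerate(source_cols)}
    missing = [name for name in target_cols if name not in position]
    if missing:
        raise ValueError(f"columns missing from source: {missing}")
    return [position[name] for name in target_cols]


def reorder_each_row(rows: Sequence[Row], order: Sequence[int]) -> List[Row]:
    """Per-row approach: allocate a new, reordered tuple for every scanned row."""
    return [tuple(row[i] for i in order) for row in rows]


def select_in_target_order(columns: Dict[str, list], target_cols: Sequence[str]) -> List[Row]:
    """Projection approach: pick the columns in target order once,
    so the rows produced need no further reordering."""
    return list(zip(*(columns[name] for name in target_cols)))


# Hypothetical schemas: the source exposes (name, id, ts), the target expects (id, ts, name).
source_cols = ["name", "id", "ts"]
target_cols = ["id", "ts", "name"]

rows = [("a", 1, 10), ("b", 2, 20)]
columns = {"name": ["a", "b"], "id": [1, 2], "ts": [10, 20]}

order = projection_order(source_cols, target_cols)
per_row = reorder_each_row(rows, order)
projected = select_in_target_order(columns, target_cols)
```

Both paths yield the same rows; the difference is that the projection approach fixes the column order once, up front, instead of creating an extra object per row.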


