[jira] [Created] (CARBONDATA-3637) Improve insert into performance and decrease memory foot print

Akash R Nilugal (Jira)
Ajantha Bhat created CARBONDATA-3637:
----------------------------------------

             Summary: Improve insert into performance and decrease memory foot print
                 Key: CARBONDATA-3637
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3637
             Project: CarbonData
          Issue Type: Improvement
            Reporter: Ajantha Bhat
            Assignee: Ajantha Bhat


Currently, CarbonData "insert into" reuses CarbonLoadDataCommand.
The load process includes a parsing step and a converter step with bad
record support.
Insert into does not need these steps, because the data has already been
validated and converted by the source table or dataframe.

Some identified changes are below.

1. Refactor and separate load and insert on the driver side, so that insert
skips the converter step, and unify the flow for no sort and global sort
insert.
2. Avoid reordering each row by changing the projection order of the select
dataframe itself during insert into.
3. For carbon to carbon insert, provide a ReadSupport and use the
RecordReader (the vector reader currently doesn't support ReadSupport) to
handle null values and timestamp cutoff (direct dictionary) in the scanRDD
result.
4. Handle insert into partition and non-partition tables in the local sort,
global sort, no sort, range column, and compaction flows.
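
As a rough illustration of change 2, the sketch below contrasts permuting
every row with rearranging the projection once. It is plain Python, not
CarbonData or Spark code: the function names, the list-of-dicts table, and
the example column order are all assumptions made for this example.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Tuple[Any, ...]


def scan(table: Sequence[Dict[str, Any]], projection: Sequence[str]) -> List[Row]:
    """Hypothetical scan: emits each record as a tuple in projection order."""
    return [tuple(record[col] for col in projection) for record in table]


def reorder_each_row(rows: Sequence[Row], source_cols: Sequence[str],
                     target_cols: Sequence[str]) -> List[Row]:
    """Baseline: build a new, permuted tuple for every row after the scan."""
    index = {name: i for i, name in enumerate(source_cols)}
    return [tuple(row[index[col]] for col in target_cols) for row in rows]


def reorder_projection(select_cols: Sequence[str],
                       target_cols: Sequence[str]) -> List[str]:
    """Alternative: rearrange the select list once, before any row exists."""
    missing = set(target_cols) - set(select_cols)
    if missing:
        raise ValueError(f"columns not in select list: {sorted(missing)}")
    return list(target_cols)


# Assumed example data: the select order differs from the order the
# (hypothetical) writer expects.
source_table = [
    {"name": "a", "id": 1, "ts": 100},
    {"name": "b", "id": 2, "ts": 200},
]
select_cols = ["name", "id", "ts"]
target_cols = ["id", "ts", "name"]

# Per-row reorder: one extra tuple is allocated per row.
baseline = reorder_each_row(scan(source_table, select_cols),
                            select_cols, target_cols)

# Projection reorder: rows come out of the scan already in target order.
optimized = scan(source_table, reorder_projection(select_cols, target_cols))

assert baseline == optimized
```

Both paths produce the same rows. The second one does the column
permutation once on the select list instead of once per row, which is the
saving that change 2 is after.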

The final goal is to improve insert performance by keeping only the
required logic, and to decrease the memory footprint.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)