[
https://issues.apache.org/jira/browse/CARBONDATA-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Indhumathi Muthumurugesh updated CARBONDATA-2091:
-------------------------------------------------
Fix Version/s: (was: 2.0.2)
2.0.1
> Enhance data loading performance by specifying range bounds for sort columns
> ----------------------------------------------------------------------------
>
> Key: CARBONDATA-2091
> URL: https://issues.apache.org/jira/browse/CARBONDATA-2091
> Project: CarbonData
> Issue Type: Improvement
> Reporter: Chuanyin Xu
> Assignee: Chuanyin Xu
> Priority: Major
> Fix For: 2.0.1
>
> Time Spent: 8h 40m
> Remaining Estimate: 0h
>
> Currently in CarbonData, data loading with node_sort (also known as local_sort) proceeds in the following steps:
> # Convert the input data in batches. (*Convert*)
> # Sort each batch and write it to the sort temp files. (*TempSort*)
> # Combine the sort temp files and merge-sort them into a bigger ordered sort temp file. (*MergeSort*)
> # Combine all the sort temp files and do a final sort; its results feed the next step. (*FinalSort*)
> # Get rows in order and convert them to CarbonData columnar format pages. (*Produce*)
> # Write bundles of pages to files and write the corresponding index file. (*Consume*)
> Step 1 to Step 3 run concurrently using multiple threads. Step 4 uses only one thread. Step 5 uses multiple threads. Step 4 is therefore the bottleneck of the whole procedure: when observing data loading performance, we can see that CPU usage after Step 3 is low.
>
> We can enhance the data loading performance by parallelizing Step4.
>
> Users can specify range bounds for the sort columns; CarbonData then internally distributes the records into the different ranges and processes the data in each range concurrently.
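
The range-bound idea can be sketched outside CarbonData as follows. This is only an illustration of the technique under stated assumptions, not the project's implementation: the names `partition_by_bounds` and `range_sort`, the single integer sort column, and the use of a thread pool are all hypothetical. Because every record in one range orders before every record in the next, each range can be sorted independently and the results simply concatenated, so no single-threaded final sort is needed.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def partition_by_bounds(rows, bounds, key=lambda r: r):
    """Distribute rows into len(bounds) + 1 ranges.

    `bounds` must be sorted ascending. A row whose key equals a bound
    goes into the range that starts at that bound.
    """
    buckets = [[] for _ in range(len(bounds) + 1)]
    for row in rows:
        buckets[bisect_right(bounds, key(row))].append(row)
    return buckets


def range_sort(rows, bounds, key=lambda r: r):
    """Sort each range in its own worker, then concatenate the ranges in order."""
    buckets = partition_by_bounds(rows, bounds, key)
    with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
        sorted_buckets = list(pool.map(lambda b: sorted(b, key=key), buckets))
    # Ranges are already ordered relative to each other, so no merge is needed.
    return [row for bucket in sorted_buckets for row in bucket]


rows = [42, 7, 19, 88, 3, 61, 25, 70]
bounds = [20, 60]  # hypothetical user-specified range bounds
buckets = partition_by_bounds(rows, bounds)
result = range_sort(rows, bounds)
```

With these example bounds the records fall into three ranges (below 20, from 20 up to but not including 60, and 60 or above). Note that CPython threads do not give true CPU parallelism for pure-Python sorting; the pool here only shows the structure. How evenly the work is spread depends on how well the chosen bounds match the data distribution.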
--
This message was sent by Atlassian Jira
(v8.3.4#803005)