[jira] [Updated] (CARBONDATA-2091) Enhance data loading performance by specifying range bounds for sort columns

Akash R Nilugal (Jira)

     [ https://issues.apache.org/jira/browse/CARBONDATA-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Indhumathi Muthumurugesh updated CARBONDATA-2091:
-------------------------------------------------
    Fix Version/s:     (was: 2.0.2)
                   2.0.1

> Enhance data loading performance by specifying range bounds for sort columns
> ----------------------------------------------------------------------------
>
>                 Key: CARBONDATA-2091
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-2091
>             Project: CarbonData
>          Issue Type: Improvement
>            Reporter: Chuanyin Xu
>            Assignee: Chuanyin Xu
>            Priority: Major
>             Fix For: 2.0.1
>
>          Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Currently in CarbonData, data loading with node_sort (also known as local_sort) goes through the following steps:
>  # Convert the input data in batches. (*Convert*)
>  # Sort each batch and write it to a sort temp file. (*TempSort*)
>  # Combine the sort temp files with a merge sort to get a bigger ordered sort temp file. (*MergeSort*)
>  # Combine all the sort temp files in a final sort; its results feed the next step. (*FinalSort*)
>  # Read the rows in order and convert them to CarbonData columnar format pages. (*produce*)
>  # Write bundles of pages to files and write the corresponding index file. (*consume*)
> Steps 1 to 3 run concurrently on multiple threads, Step 4 runs on a single thread, and Step 5 again runs on multiple threads, so Step 4 is the bottleneck of the whole procedure. This shows up when observing data loading performance: CPU usage is low after Step 3.
>  
> We can improve data loading performance by parallelizing Step 4.
>  
> Users can specify range bounds for the sort columns; CarbonData then internally distributes the records into the corresponding ranges and processes the ranges concurrently.
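
The idea in the description can be illustrated with a minimal Python sketch. It is not CarbonData code: the function name `sort_by_range`, the `workers` parameter, and the use of plain integers as sort keys are assumptions made for illustration. Records are first distributed to ranges by the user-supplied bounds, and each range is then sorted by its own task, so no single thread has to order the whole data set. CPython threads do not give real CPU parallelism for this workload; the sketch only shows the structure.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor


def sort_by_range(records, bounds, workers=4):
    """Distribute records to ranges, then sort every range concurrently.

    `bounds` must be sorted in ascending order. With n bounds there are
    n + 1 ranges; range i holds keys k with bounds[i-1] <= k < bounds[i].
    (Hypothetical helper, not part of CarbonData.)
    """
    buckets = [[] for _ in range(len(bounds) + 1)]
    for record in records:
        # bisect_right returns how many bounds are <= record,
        # which is exactly the index of the range it belongs to.
        buckets[bisect.bisect_right(bounds, record)].append(record)

    # Each range is independent, so the ranges can be sorted in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sorted, buckets))


if __name__ == "__main__":
    ranges = sort_by_range([42, 7, 19, 88, 3, 55, 61, 23], bounds=[20, 60])
    print(ranges)  # [[3, 7, 19], [23, 42, 55], [61, 88]]
```

Because the ranges do not overlap and are ordered by their bounds, concatenating the per-range results yields the same order a single final sort would produce. How evenly the work is spread depends on how well the chosen bounds match the data distribution.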



--
This message was sent by Atlassian Jira
(v8.3.4#803005)