[jira] [Updated] (CARBONDATA-754) order by query's performance is very bad

Akash R Nilugal (Jira)

[ https://issues.apache.org/jira/browse/CARBONDATA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jarck updated CARBONDATA-754:
-----------------------------
Request participants: (was: )
Description:
Currently, an order by query on a dimension performs very badly if there is no filter or if the filtered data is still too large.
If I am not mistaken, all related data is read at the carbon scan physical level, the sort dimension's data is decoded, and then all of it is sorted in the Spark SQL sort physical plan.

I think we can optimize as follows:

1. Push down the sort (+ limit) to the carbon scan.

2. Leverage the fact that a dimension's data is stored in its natural order at the blocklet level to get sorted data in each partition.

3. Implement merge-sort/TopN in Spark's sort physical plan.

I have actually already optimized the "order by only 1 dimension + limit" case, based on branch 0.2. The performance is much better:
sorting by 1 dimension with limit 10000 over 100 million records takes less than 1 second to get and print the result.
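
To illustrate steps 2 and 3, here is a minimal sketch in plain Python (not CarbonData or Spark code) of a TopN built on a k-way merge. It assumes each partition already yields its values in ascending order, as step 2 would provide. The function name top_n_from_sorted_partitions and the sample data are hypothetical and only serve the illustration.

```python
import heapq
from itertools import islice


def top_n_from_sorted_partitions(partitions, limit):
    """Return the first `limit` values across partitions.

    Assumption: every partition is already sorted ascending, which is
    what the pushed-down sort in the scan would have to guarantee.
    """
    # With the limit pushed down, no partition needs to hand over more
    # than `limit` values, so each stream is truncated up front.
    truncated = [islice(iter(p), limit) for p in partitions]
    # heapq.merge performs a lazy k-way merge of sorted inputs; it never
    # re-sorts the data, it only compares the current heads.
    merged = heapq.merge(*truncated)
    # Stop as soon as `limit` values have been produced.
    return list(islice(merged, limit))


# Hypothetical sample data: three partitions, each sorted on its own.
partitions = [[1, 4, 9, 12], [2, 3, 10], [0, 5, 6, 7, 8]]
print(top_n_from_sorted_partitions(partitions, 5))  # [0, 1, 2, 3, 4]
```

The point of the sketch is that the merge only touches the heads of the sorted streams, so the work grows with the limit and the number of partitions rather than with the total amount of data, which is consistent with the kind of speedup reported above.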

> order by query's performance is very bad
> ----------------------------------------
>
> Key: CARBONDATA-754
> URL: https://issues.apache.org/jira/browse/CARBONDATA-754
> Project: CarbonData
> Issue Type: Improvement
> Components: core, spark-integration
> Reporter: Jarck
> Assignee: Jarck

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)