[ https://issues.apache.org/jira/browse/CARBONDATA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jarck updated CARBONDATA-754:
-----------------------------
    Request participants:   (was: )
          Description:

Currently, order-by-dimension queries perform very badly when there is no filter, or when the filtered data is still too large. If I am not wrong, the carbon scan physical plan reads all related data, decodes the sort dimension's data, and Spark SQL's sort physical plan then sorts all of it.

I think we can optimize as below:
1. push down sort (+ limit) to carbon scan
2. leverage the fact that dimension data is stored in natural order at the blocklet level to get sorted data in each partition
3. implement merge-sort/TopN in Spark's sort physical plan

Actually, I have already optimized "order by only 1 dimension + limit" based on branch 0.2, and the performance is much better: sorting by 1 dimension with limit 10000 over 100 million rows takes less than 1 second to get and print the result.
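The merge-sort/TopN idea in steps 2 and 3 can be sketched as follows. This is a minimal, hypothetical illustration (not CarbonData or Spark code): once each partition yields rows already sorted by the dimension (step 2), the global order-by reduces to a k-way merge, and a limit can stop after n rows instead of sorting the full dataset (step 3).

```python
import heapq
from itertools import islice

def top_n(sorted_partitions, n):
    """Merge already-sorted per-partition streams and keep the first n rows.

    heapq.merge does a lazy k-way merge, so only n rows are ever
    materialized, rather than sorting all rows from all partitions.
    """
    return list(islice(heapq.merge(*sorted_partitions), n))

# Hypothetical per-partition outputs, each already in sorted order:
partitions = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(top_n(partitions, 5))  # [1, 2, 3, 4, 5]
```

This is why the reported "limit 10000 over 100 million rows" case can be fast: the scan never needs to decode and sort the remaining ~100 million rows once the first 10000 merged rows are produced.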
> order by query's performance is very bad
> ----------------------------------------
>
>                 Key: CARBONDATA-754
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-754
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: core, spark-integration
>            Reporter: Jarck
>            Assignee: Jarck

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)