Posted by
simafengyun on
Sep 23, 2017; 4:26pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-optimization-of-OrderBy-sorted-columns-Limit-Query-tp23057.html
Hi Carbon Dev,
We have faced slow SQL queries when applying CarbonData in business systems.
Our customer info management system has only 2 million rows in CarbonData,
and we need to fetch data sorted by the customer's phone number field and paginate it on the web page.
For example, when loading our system's index web page, it executes a SQL similar to the one below:
Select phone_number, other_field1, other_field2, other_field3, other_field4 from custominfo order by phone_number limit 10;
In our prod env, this SQL takes about 4 seconds, which is slow for a system with only 2 million rows.
Another car info management system has 0.1 billion rows in CarbonData.
It needs to fetch data sorted by the car id field and business date, and paginate it on the web page.
A similar SQL is shown below:
Select car_id, business_date, other_field1, other_field2, other_field3, other_field4 from carinfo order by car_id, business_date limit 10;
I ran a test locally; it takes about 30 seconds to execute this SQL.
In short, the more data, the worse the performance, even though we only need the top 10 rows, because the order by operation requires a full table scan.
Actually, a few months ago I came up with an optimization plan for order by + limit.
Please refer to
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Optimize-Order-By-Limit-Query-td9764.html#a9860
The main idea is to push down order by + limit to the Carbon scan and leverage the fact that sort columns are stored in sorted order to get the top N sorted rows in each block,
reducing I/O. From the test results, it can improve performance by about 80%.
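The per-block idea above can be sketched as follows. This is a minimal illustration with hypothetical names, not actual CarbonData internals: since each block stores the sort column in ascending order, only the first N rows of each block can reach the global top N, and a bounded max-heap merges them.

```java
import java.util.*;

public class TopNMerge {
    // Each block's sort column is stored in ascending order, so the
    // global top N must come from the first N entries of each block.
    static int[] topN(List<int[]> sortedBlocks, int n) {
        // Max-heap bounded to n elements: the root is the current n-th smallest.
        PriorityQueue<Integer> heap = new PriorityQueue<>(Comparator.reverseOrder());
        for (int[] block : sortedBlocks) {
            for (int i = 0; i < Math.min(n, block.length); i++) {
                int v = block[i];
                if (heap.size() < n) {
                    heap.add(v);
                } else if (v < heap.peek()) {
                    heap.poll();
                    heap.add(v);
                } else {
                    // Block is sorted: every later value is at least as large,
                    // so the rest of this block can be skipped entirely.
                    break;
                }
            }
        }
        int[] result = heap.stream().mapToInt(Integer::intValue).toArray();
        Arrays.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<int[]> blocks = Arrays.asList(
            new int[]{1, 4, 9, 12},
            new int[]{2, 3, 15, 20},
            new int[]{5, 6, 7, 8});
        System.out.println(Arrays.toString(topN(blocks, 3))); // prints [1, 2, 3]
    }
}
```

The early `break` is what reduces the I/O and scan cost: once a sorted block's next value cannot beat the current n-th smallest, the remaining rows of that block never need to be read.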
In the previous discussion, we reached the following conclusions.
The optimization solution has these limitations:
1. It cannot work for dictionary columns, as there is no guarantee that
dictionary values are allocated in sorted order.
2. It only works when ordering by one dimension + limit, or by prefix columns of the MDK + limit.
3. It cannot optimize order by measures + limit.
This time I think we can optimize the cases below; please give your suggestions:
1. Order by prefix columns of the sort columns (without dictionary encoding) + limit
2. Order by a dimension without dictionary encoding + limit,
for example a date or string column
3. Order by a dictionary-encoded dimension + limit, if the table is created with pre-defined dictionaries
4. Order by a numeric field which is added to the sort columns + limit,
for example an int or bigint column
5. Order by a dictionary-encoded dimension + limit.
In this case, there is no guarantee that
dictionary values are allocated in sorted order,
so maybe we can calculate the dictionary's order according to the original values' order in memory
and get the top N according to the original values' order.
For example, table t3 has a dictionary-encoded country field with the following dictionary:

Country     Dict value
Australia   6
Canada      2
China       3
Uk          4
Usa         5

Blocklet 1: dict values from 2 to 3, 32000 rows
Blocklet 2: dict values from 3 to 5, 32000 rows
Blocklet 3: dict value 6 only, 32000 rows

For the SQL below, Australia (dict value 6) is the smallest value in sorted order, so we only need to process blocklet 3; this saves the processing time for blocklet 1 and blocklet 2:
Select country, name from t3 order by country limit 100;
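The blocklet pruning above can be sketched as below. This is a minimal sketch with hypothetical structures, not CarbonData internals: dictionary ids are remapped to their rank in original-value order, then blocklets are selected in rank order until enough rows are guaranteed to satisfy the limit.

```java
import java.util.*;

public class DictOrderPrune {
    static class Blocklet {
        final int id, minDict, maxDict, rows;
        Blocklet(int id, int minDict, int maxDict, int rows) {
            this.id = id; this.minDict = minDict; this.maxDict = maxDict; this.rows = rows;
        }
    }

    // Returns the blocklet ids that must be scanned to answer
    // "order by <dict column> asc limit <limit>".
    static List<Integer> prune(Map<Integer, String> dict, List<Blocklet> blocklets, int limit) {
        // Sort dictionary ids by their original value: this is the output order.
        List<Integer> idsByValue = new ArrayList<>(dict.keySet());
        idsByValue.sort(Comparator.comparing(dict::get));
        // rank[dictId] = position of the original value in sorted order
        Map<Integer, Integer> rank = new HashMap<>();
        for (int i = 0; i < idsByValue.size(); i++) rank.put(idsByValue.get(i), i);

        Set<Integer> needed = new LinkedHashSet<>();
        for (int r = 0; r < idsByValue.size(); r++) {
            int currentId = idsByValue.get(r);
            // Any blocklet whose min/max dict range contains the current id
            // may hold rows for this value and must be scanned.
            for (Blocklet b : blocklets)
                if (b.minDict <= currentId && currentId <= b.maxDict) needed.add(b.id);
            // Rows guaranteed to sort <= the current value: blocklets in which
            // every dictionary id has rank <= r contribute all of their rows.
            long guaranteed = 0;
            for (Blocklet b : blocklets) {
                boolean covered = true;
                for (int d = b.minDict; d <= b.maxDict; d++)
                    if (rank.containsKey(d) && rank.get(d) > r) { covered = false; break; }
                if (covered) guaranteed += b.rows;
            }
            if (guaranteed >= limit) break;
        }
        return new ArrayList<>(needed);
    }

    public static void main(String[] args) {
        Map<Integer, String> dict = new HashMap<>();
        dict.put(6, "Australia"); dict.put(2, "Canada"); dict.put(3, "China");
        dict.put(4, "Uk"); dict.put(5, "Usa");
        List<Blocklet> blocklets = Arrays.asList(
            new Blocklet(1, 2, 3, 32000),
            new Blocklet(2, 3, 5, 32000),
            new Blocklet(3, 6, 6, 32000));
        System.out.println(prune(dict, blocklets, 100)); // prints [3]
    }
}
```

In the example, Australia has rank 0, blocklet 3 is the only one containing its dict value, and its 32000 rows already cover limit 100, so blocklets 1 and 2 are never touched.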
Thanks
Jarck