
[DISCUSSION] optimization of OrderBy sorted columns + Limit Query

Posted by simafengyun on Sep 23, 2017; 4:26pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-optimization-of-OrderBy-sorted-columns-Limit-Query-tp23057.html

Hi Carbon Dev,

We have run into slow SQL queries when applying CarbonData in our business systems.

In our customer info management system, there are only 2 million rows in CarbonData.

We need to fetch the data sorted by the customer's phone number field and paginate it on the web page.

For example, when loading our system's index web page, it executes a SQL statement similar to the one below:

Select phone_number, other_field1, other_field2, other_field3, other_field4 from custominfo order by phone_number limit 10;

In our production environment, it takes about 4 seconds to execute this SQL, which is slow for a system with only 2 million rows.


Another car info management system has 0.1 billion rows in CarbonData. It needs to fetch data sorted by the car id field and business date, and paginate it on the web page.

A similar SQL is shown below:

Select car_id, business_date, other_field1, other_field2, other_field3, other_field4 from carinfo order by car_id, business_date limit 10;

I ran a test locally; it takes about 30 seconds to execute this SQL.

In short, the more data there is, the worse the performance, even though we only need the top 10 rows, because the order by operation requires a full table scan.


A few months ago I came up with an optimization plan for order by + limit; please refer to

http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Optimize-Order-By-Limit-Query-td9764.html#a9860
The main idea is to push order by + limit down to the carbon scan and leverage the fact that sort columns are stored in sorted order to get the top N sorted rows from each block, which reduces scan IO (a rough sketch follows below). From the test results, it can improve performance by about 80%.
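To make the idea concrete, below is a minimal Scala sketch of the per-blocklet top-N approach (purely illustrative, not CarbonData's actual scan code; Row, topNFromSortedBlocklet and topNAcrossBlocklets are made-up names). It assumes rows inside each blocklet are already stored in sort-column order, so only the first N rows of each blocklet need to be read before a small final merge:

// Hypothetical sketch of per-blocklet top-N followed by a small merge.
case class Row(sortKey: String, rest: Seq[Any])

// Because a blocklet stores rows sorted on the sort column, its first
// `limit` rows are already that blocklet's top N; no per-blocklet sort
// and no full blocklet read is needed.
def topNFromSortedBlocklet(blocklet: Iterator[Row], limit: Int): Seq[Row] =
  blocklet.take(limit).toSeq

// The final sort touches at most (number of blocklets * limit) rows
// instead of every row in the table.
def topNAcrossBlocklets(blocklets: Seq[Iterator[Row]], limit: Int): Seq[Row] =
  blocklets
    .flatMap(topNFromSortedBlocklet(_, limit))
    .sortBy(_.sortKey)
    .take(limit)

Spark still applies the final limit on top, but the scan only has to return this small candidate set per blocklet instead of all rows.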

In the previous discussion, we reached the conclusions below.

The optimization solution has the following limitations:

1. It cannot work for dictionary columns, as there is no guarantee that dictionary allocation is in sorted order.

2. It only works when ordering by one dimension + limit, or by prefix columns of the MDK (multi-dimensional key).

3. It cannot optimize order by measures + limit.

 
This time I think we can optimize the cases below; please give your suggestions.

1. Order by the prefix columns of the sort columns (with no dictionary encoding) + limit

2. Order by a dimension with no dictionary encoding + limit, for example a date or string type column

3. Order by a dimension with dictionary encoding + limit, if the table is created with pre-defined dictionaries

4. Order by a number field which has been added to the sort columns + limit, for example an int or bigint type column

 
5. Order by a dimension with dictionary encoding + limit

In this case there is no guarantee that the dictionary allocation is in sorted order, so perhaps we can compute the dictionary ids' order according to the original values' order in memory, and then get the top N according to the original values' order (see the sketch after the example below).

For example, suppose table t3 has a country field with dictionary encoding:

Country      Dictionary value
Australia    6
Canada       2
China        3
UK           4
USA          5

Blocklet 1 contains dictionary values from 2 to 3 and has 32,000 rows.

Blocklet 2 contains dictionary values from 3 to 5 and has 32,000 rows.

Blocklet 3 contains dictionary values 6 to 6 and has 32,000 rows.

For the SQL below, we only need to process blocklet 3 (Australia, the smallest country value, maps to dictionary id 6, which only appears in blocklet 3), so the processing time for blocklet 1 and blocklet 2 can be saved.

Select country, name from t3 order by country limit 100;
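A minimal Scala sketch of that idea (again purely illustrative; Blocklet, dictRankByValue and pruneBlockletsForTopN are made-up names, not CarbonData APIs): build an in-memory map from dictionary id to the rank of its original value, then use each blocklet's min/max dictionary ids to skip blocklets that cannot contribute to the top N of an ascending order by + limit:

case class Blocklet(id: Int, minDict: Int, maxDict: Int, rowCount: Long)

// e.g. Map(6 -> "Australia", 2 -> "Canada", 3 -> "China", 4 -> "Uk", 5 -> "Usa")
def dictRankByValue(dictionary: Map[Int, String]): Map[Int, Int] =
  dictionary.toSeq
    .sortBy(_._2)      // sort by the original value, not by the dictionary id
    .map(_._1)
    .zipWithIndex      // dictionary id -> rank in original-value order
    .toMap

def pruneBlockletsForTopN(blocklets: Seq[Blocklet],
                          rank: Map[Int, Int],
                          limit: Long): Seq[Blocklet] = {
  // Upper-bound the rank of the limit-th smallest value: walk blocklets in
  // ascending order of their max-value rank, accumulating row counts.
  val byMaxRank = blocklets.sortBy(b => rank(b.maxDict))
  var rows = 0L
  var cutoff = -1
  val it = byMaxRank.iterator
  while (rows < limit && it.hasNext) {
    val b = it.next()
    rows += b.rowCount
    cutoff = rank(b.maxDict)
  }
  // Blocklets whose smallest value already ranks above the cutoff cannot
  // contribute any of the top N rows and can be skipped entirely.
  blocklets.filter(b => rank(b.minDict) <= cutoff)
}

With the example above, the rank map is 6 -> 0, 2 -> 1, 3 -> 2, 4 -> 3, 5 -> 4, and pruneBlockletsForTopN keeps only blocklet 3, since its 32,000 Australia rows already cover limit 100.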

 
 
Thanks

 Jarck