Posted by
simafengyun on
Sep 23, 2017; 4:26pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-optimization-of-OrderBy-sorted-columns-Limit-Query-tp23057.html
Hi Carbon Dev,
We have faced slow SQL queries when applying CarbonData in business systems.
Our customer info management system has only 2 million rows in CarbonData,
and we need to fetch data sorted by the customer's phone number field and paginate it on the web page.
For example, when loading our system's index web page, it executes a SQL similar to the one below:
Select phone_number, other_field1, other_field2, other_field3, other_field4 from custominfo order by phone_number limit 10;
In our prod env, this SQL takes about 4 seconds, which is slow for a system with only 2 million rows.
Another car info management system has 0.1 billion rows in CarbonData.
It needs to fetch data sorted by the car id field and business date, and paginate it on the web page.
A similar SQL is shown below:
Select car_id, business_date, other_field1, other_field2, other_field3, other_field4 from carinfo order by car_id, business_date limit 10;
I ran a test locally; it takes about 30 seconds to execute this SQL.
In short, the more data, the worse the performance, even though we only need the top 10 rows, because the order by operation requires a full table scan.
Actually, a few months ago I came up with an optimization plan for order by + limit.
Please refer to
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Optimize-Order-By-Limit-Query-td9764.html#a9860
The main idea is to push down order by + limit to the Carbon scan and leverage the fact that sort columns are stored in sorted order to get the top N sorted rows in each block,
reducing I/O. From the test results, it can improve performance by about 80%.
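The per-block idea above can be sketched as follows. This is a minimal illustration with hypothetical names, not actual CarbonData internals: since each block stores the sort column in ascending order, only the first N rows of each block can reach the global top N, and a bounded max-heap merges them.

```java
import java.util.*;

public class TopNMerge {
    // Each block's sort column is stored in ascending order, so the
    // global top N must come from the first N entries of each block.
    static int[] topN(List<int[]> sortedBlocks, int n) {
        // Max-heap bounded to n elements: the root is the current n-th smallest.
        PriorityQueue<Integer> heap = new PriorityQueue<>(Comparator.reverseOrder());
        for (int[] block : sortedBlocks) {
            for (int i = 0; i < Math.min(n, block.length); i++) {
                int v = block[i];
                if (heap.size() < n) {
                    heap.add(v);
                } else if (v < heap.peek()) {
                    heap.poll();
                    heap.add(v);
                } else {
                    // Block is sorted: every later value is at least as large,
                    // so the rest of this block can be skipped entirely.
                    break;
                }
            }
        }
        int[] result = heap.stream().mapToInt(Integer::intValue).toArray();
        Arrays.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<int[]> blocks = Arrays.asList(
            new int[]{1, 4, 9, 12},
            new int[]{2, 3, 15, 20},
            new int[]{5, 6, 7, 8});
        System.out.println(Arrays.toString(topN(blocks, 3))); // prints [1, 2, 3]
    }
}
```

The early `break` is what reduces the I/O and scan cost: once a sorted block's next value cannot beat the current n-th smallest, the remaining rows of that block never need to be read.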
In the previous discussion, we reached the following conclusions.
The optimization solution has these limitations:
1. It cannot work for dictionary columns, as there is no guarantee that
dictionary values are allocated in sorted order.
2. It only works when ordering by one dimension + limit, or by prefix columns of the MDK + limit.
3. It cannot optimize order by measures + limit.
This time I think we can optimize the cases below; please give your suggestions:
1. Order by prefix columns of the sort columns (without dictionary encoding) + limit
2. Order by a dimension without dictionary encoding + limit,
for example a date or string column
3. Order by a dictionary-encoded dimension + limit, if the table is created with pre-defined dictionaries
4. Order by a numeric field which is added to the sort columns + limit,
for example an int or bigint column
5. Order by a dictionary-encoded dimension + limit.
In this case, there is no guarantee that
dictionary values are allocated in sorted order,
so maybe we can calculate the dictionary's order according to the original values' order in memory
and get the top N according to the original values' order.
For example, table t3 has a dictionary-encoded country field with the following dictionary:

Country     Dict value
Australia   6
Canada      2
China       3
Uk          4
Usa         5

Blocklet 1: dict values from 2 to 3, 32000 rows
Blocklet 2: dict values from 3 to 5, 32000 rows
Blocklet 3: dict value 6 only, 32000 rows

For the SQL below, Australia (dict value 6) is the smallest value in sorted order, so we only need to process blocklet 3; this saves the processing time for blocklet 1 and blocklet 2:
Select country, name from t3 order by country limit 100;
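The blocklet pruning above can be sketched as below. This is a minimal sketch with hypothetical structures, not CarbonData internals: dictionary ids are remapped to their rank in original-value order, then blocklets are selected in rank order until enough rows are guaranteed to satisfy the limit.

```java
import java.util.*;

public class DictOrderPrune {
    static class Blocklet {
        final int id, minDict, maxDict, rows;
        Blocklet(int id, int minDict, int maxDict, int rows) {
            this.id = id; this.minDict = minDict; this.maxDict = maxDict; this.rows = rows;
        }
    }

    // Returns the blocklet ids that must be scanned to answer
    // "order by <dict column> asc limit <limit>".
    static List<Integer> prune(Map<Integer, String> dict, List<Blocklet> blocklets, int limit) {
        // Sort dictionary ids by their original value: this is the output order.
        List<Integer> idsByValue = new ArrayList<>(dict.keySet());
        idsByValue.sort(Comparator.comparing(dict::get));
        // rank[dictId] = position of the original value in sorted order
        Map<Integer, Integer> rank = new HashMap<>();
        for (int i = 0; i < idsByValue.size(); i++) rank.put(idsByValue.get(i), i);

        Set<Integer> needed = new LinkedHashSet<>();
        for (int r = 0; r < idsByValue.size(); r++) {
            int currentId = idsByValue.get(r);
            // Any blocklet whose min/max dict range contains the current id
            // may hold rows for this value and must be scanned.
            for (Blocklet b : blocklets)
                if (b.minDict <= currentId && currentId <= b.maxDict) needed.add(b.id);
            // Rows guaranteed to sort <= the current value: blocklets in which
            // every dictionary id has rank <= r contribute all of their rows.
            long guaranteed = 0;
            for (Blocklet b : blocklets) {
                boolean covered = true;
                for (int d = b.minDict; d <= b.maxDict; d++)
                    if (rank.containsKey(d) && rank.get(d) > r) { covered = false; break; }
                if (covered) guaranteed += b.rows;
            }
            if (guaranteed >= limit) break;
        }
        return new ArrayList<>(needed);
    }

    public static void main(String[] args) {
        Map<Integer, String> dict = new HashMap<>();
        dict.put(6, "Australia"); dict.put(2, "Canada"); dict.put(3, "China");
        dict.put(4, "Uk"); dict.put(5, "Usa");
        List<Blocklet> blocklets = Arrays.asList(
            new Blocklet(1, 2, 3, 32000),
            new Blocklet(2, 3, 5, 32000),
            new Blocklet(3, 6, 6, 32000));
        System.out.println(prune(dict, blocklets, 100)); // prints [3]
    }
}
```

In the example, Australia has rank 0, blocklet 3 is the only one containing its dict value, and its 32000 rows already cover limit 100, so blocklets 1 and 2 are never touched.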
Thanks
Jarck