Hi Carbon Dev,
We faced slow sqls when apply carbondata in business system. In our custom info management system, only 2 million data in carbondata. we need to fetch data sorting by the costum’s phone number field and paginate data on the web page. For example, when loading our system index web page, it will execute the below similar sql Select phone_number, other_field1, other_field2 , other_field3 , other_field4 from custominfo order by phone_number limit 10; In our prod env, it takes about 4 seconds to execute this sql, it is slow for only 2 million data system. In another car info management system, it has 0.1 billion data in carbondata. need to fetch data by sorting the car id field and business date, and paginate data on the web page. Similar sql as below Select car_id, business_date, other_field1, other_field2, other_field3, other_field4 from carinfo order by car_info, business_date limit 10; I done test in my local , it takes about 30 seconds to execute the sql. In short, the more data, the worse performance even we just need the top 10 since it used full scan for orderby operation Actually a few months ago, I have come up with the optimization plan for orderby+limit, Please refer to http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Optimize-Order-By-Limit-Query-td9764.html#a9860 <http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Optimize-Order-By-Limit-Query-td9764.html#a9860> mainly, pushdown order by +limit to carbon scan, leverage the sort column’s order stored feature to get topN sorted data in each block and reduce io scan. From test result, it can improve about 80% performance. In previous discussion, we have the below conclusion. The optimization solution has below limitations 1. It cannot work for dictionary columns. As there is no guarantee that dictionary allocation is in sorted order. 2. It only can work when order by one dimension + limit or order by prefix columns of MDK 3. It can’t optimize when orderby measures +limit This time I think we can optimize the below cases, please give your suggestion 1. Order by prefix columns of sort columns if no dictionary encode + limit 2. Order by a dimension with no dictionary encode + limit for example date, string type column 3. Order by a dimension with dictionary encode if creating table with a pre-defined dictionaries + limit 4. Order by a number field which is added to sort colums + limit for example int, bitint type column 5. Order by dimension with dictionary encode + limit In this case, there is no guarantee that dictionary allocation is in sorted order. So maybe we can calculate the dictionary’s order according to the original value’s order in memory And get top N according to the original value’s order For example in table t3 having a country field with dictionary encode, Country dictionary Australia 6 Canada 2 China 3 Uk 4 Usa 5 In Blocklet 1 dict value from 2 to 3 and have 32000 data In Blocklet 2 dict value from 3 to 5 and have 32000 data In Blocklet 3 dict value from 6 to 6 and have 32000 data for the below Sql, apparently we only need to process blocklet3, it can reduce process time for blocklet1 and blocklet2 Select country , name from t3 order by country limit 100; Thanks Jarck |
This post was updated on .
Recently , I used the latest code done test as below
1. Create Table: CREATE TABLE rx5_tbox_parquet_all( carid STRING, inputstime TIMESTAMP, carsyspwrmod INT, cardofrontpas INT, cardofrontdrv INT, cardorearleft INT, cardorearright INT, carbonnet INT, carboot INT, carwinfrontleft INT, carwinrearleft INT, carwinfrontright INT, carwinrearright INT, carsunroof INT, carcsactive INT, carcsenabled INT, carseatbeltdrv INT ) STORED BY 'carbondata' TBLPROPERTIES('SORT_COLUMNS'='carid', 'DICTIONARY_INCLUDE'='carid') 2. Load 0.1 billion data 3. Run the below sql select carid, inputstime, carsyspwrmod, cardofrontpas, cardofrontdrv, cardorearleft, cardorearright, carbonnet, carboot, carwinfrontleft, carwinrearleft, carwinfrontright, carwinrearright from rx5_tbox_parquet_all2 order by carid limit 10 Use carbondata1.2 master code + spark2.1 to run |carid |inputstime|carsyspwrmod|cardofrontpas|cardofrontdrv|cardorearleft|cardorearright|carbonnet|carboot|carwinfrontleft|carwinrearleft|carwinfrontright|carwinrearright| +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ limit 10 query time: 28777 milliseconds Use orderby +limit optimized carbondata1.2 master code + spark1.6.3 to run +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ |carid |inputstime|carsyspwrmod|cardofrontpas|cardofrontdrv|cardorearleft|cardorearright|carbonnet|carboot|carwinfrontleft|carwinrearleft|carwinfrontright|carwinrearright| +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | |LSJA24790HS020662|null |2 |0 |0 |0 |0 |0 |0 |0 |0 |0 |0 | +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ limit 10 query time: 1640 milliseconds Apparently, after optimization, even I use spark1.6.3, it also improved 90% performance Thanks 马云 -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ |
In reply to this post by simafengyun
Hi,
As I understand as per your business scenario you require to sort on dimension and apply a limit on it for pagination on your web application. It is a very specific scenario for your case to handle. It is not even topN queries as it is not applying on measures. If we push down this type of specific scenarios to carbon we will have maintainability issues in long-term. It is recommended that these type of optimizations can be done in the planning phase or integration layer and not to the level of carbon layer. Regards, Ravindra -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ |
Administrator
|
In reply to this post by simafengyun
Hi Jarck
Did this solution use dictionary to do limit , right ? this solution can't make sure the data correctness ------------------------------------------------------------------------------- Use orderby +limit optimized carbondata1.2 master code + spark1.6.3 @Ravindra @Jarck : let us discuss it, see if can find out the common solution, Proposal 1 : Implement TopN solution. Proposal 2 : Decoupled plug-in solution, continue to use Jarck's code, just need change order by from dictionary value to actual dimensional value. Regards Liang simafengyun wrote > Recently , I used the latest code done test as below > 1. Create Table: > CREATE TABLE rx5_tbox_parquet_all( > carid STRING, > inputstime TIMESTAMP, > carsyspwrmod INT, > cardofrontpas INT, > cardofrontdrv INT, > cardorearleft INT, > cardorearright INT, > carbonnet INT, > carboot INT, > carwinfrontleft INT, > carwinrearleft INT, > carwinfrontright INT, > carwinrearright INT, > carsunroof INT, > carcsactive INT, > carcsenabled INT, > carseatbeltdrv INT > ) > STORED BY 'carbondata' > TBLPROPERTIES('SORT_COLUMNS'='carid', > 'DICTIONARY_INCLUDE'='carid') > > 2. Load 0.1 billion data > > 3. Run the below sql > > select > carid, > inputstime, > carsyspwrmod, > cardofrontpas, > cardofrontdrv, > cardorearleft, > cardorearright, > carbonnet, > carboot, > carwinfrontleft, > carwinrearleft, > carwinfrontright, > carwinrearright > from rx5_tbox_parquet_all2 > order by carid > limit 10 > > > Use carbondata1.2 master code + spark2.1 to run > |carid > |inputstime|carsyspwrmod|cardofrontpas|cardofrontdrv|cardorearleft|cardorearright|carbonnet|carboot|carwinfrontleft|carwinrearleft|carwinfrontright|carwinrearright| > +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ > > limit 10 query time: 28777 milliseconds > > Use orderby +limit optimized carbondata1.2 master code + spark1.6.3 to run > > +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ > |carid > |inputstime|carsyspwrmod|cardofrontpas|cardofrontdrv|cardorearleft|cardorearright|carbonnet|carboot|carwinfrontleft|carwinrearleft|carwinfrontright|carwinrearright| > +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > |LSJA24790HS020662|null |2 |0 |0 > |0 > |0 |0 |0 |0 |0 > > |0 |0 | > +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ > > limit 10 query time: 1640 milliseconds > > Apparently, after optimization, even I use spark1.6.3, it also improved > 90% > performance > > > > > > > > > > Thanks > 马云 > > > > > -- > Sent from: > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ |
Hi Liang,
yes, your right, I used the dictionary to do the limit and this solution can’t make sure the data correctness. But, I also done order by test on date field which is the first column of MDK key, date field used the direct dictionary encode, it should can make sure the data correctness. after test , it also can improve more than 80% performance > 在 2017年9月29日,下午3:34,Liang Chen <[hidden email]> 写道: > > Hi Jarck > > Did this solution use dictionary to do limit , right ? this solution can't > make sure the data correctness > ------------------------------------------------------------------------------- > Use orderby +limit optimized carbondata1.2 master code + spark1.6.3 > > @Ravindra @Jarck : let us discuss it, see if can find out the common > solution, > Proposal 1 : Implement TopN solution. > Proposal 2 : Decoupled plug-in solution, continue to use Jarck's code, > just need change order by from dictionary value to actual dimensional value. > > Regards > Liang > > simafengyun wrote >> Recently , I used the latest code done test as below >> 1. Create Table: >> CREATE TABLE rx5_tbox_parquet_all( >> carid STRING, >> inputstime TIMESTAMP, >> carsyspwrmod INT, >> cardofrontpas INT, >> cardofrontdrv INT, >> cardorearleft INT, >> cardorearright INT, >> carbonnet INT, >> carboot INT, >> carwinfrontleft INT, >> carwinrearleft INT, >> carwinfrontright INT, >> carwinrearright INT, >> carsunroof INT, >> carcsactive INT, >> carcsenabled INT, >> carseatbeltdrv INT >> ) >> STORED BY 'carbondata' >> TBLPROPERTIES('SORT_COLUMNS'='carid', >> 'DICTIONARY_INCLUDE'='carid') >> >> 2. Load 0.1 billion data >> >> 3. Run the below sql >> >> select >> carid, >> inputstime, >> carsyspwrmod, >> cardofrontpas, >> cardofrontdrv, >> cardorearleft, >> cardorearright, >> carbonnet, >> carboot, >> carwinfrontleft, >> carwinrearleft, >> carwinfrontright, >> carwinrearright >> from rx5_tbox_parquet_all2 >> order by carid >> limit 10 >> >> >> Use carbondata1.2 master code + spark2.1 to run >> |carid >> |inputstime|carsyspwrmod|cardofrontpas|cardofrontdrv|cardorearleft|cardorearright|carbonnet|carboot|carwinfrontleft|carwinrearleft|carwinfrontright|carwinrearright| >> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ >> >> limit 10 query time: 28777 milliseconds >> >> Use orderby +limit optimized carbondata1.2 master code + spark1.6.3 to run >> >> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ >> |carid >> |inputstime|carsyspwrmod|cardofrontpas|cardofrontdrv|cardorearleft|cardorearright|carbonnet|carboot|carwinfrontleft|carwinrearleft|carwinfrontright|carwinrearright| >> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> |LSJA24790HS020662|null |2 |0 |0 >> |0 >> |0 |0 |0 |0 |0 >> >> |0 |0 | >> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+ >> >> limit 10 query time: 1640 milliseconds >> >> Apparently, after optimization, even I use spark1.6.3, it also improved >> 90% >> performance >> >> >> >> >> >> >> >> >> >> Thanks >> 马云 >> >> >> >> >> -- >> Sent from: >> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ > > > > > > -- > Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ <http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/> |
Free forum by Nabble | Edit this page |