Hi community,
I'm afraid the query plan is broken: the skipped block/blocklet counts are not right. Please see the image in this post and the posts below.

Screen_Shot_2018-09-30_at_5.png
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t357/Screen_Shot_2018-09-30_at_5.png>

My environment is CarbonData 1.5.0 and Spark 2.3.2; I'm not sure whether other versions behave correctly or not.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Hi Aaron:
Can you list your CREATE TABLE SQL and SELECT SQL for us? And is the result correct on Spark 2.2?
Hi xm_zzc,

Thanks for your response. I tested on Spark 2.3.2; I have not tested 2.2.2. And I have merged the fixes from
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Issue-Dictionary-and-S3-td63106.html and
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Serious-Issue-Query-get-inconsistent-result-on-carbon1-5-0-td63691.html

Create statements:

val createStoreTableSql = s"""
  | CREATE TABLE IF NOT EXISTS storev2(
  |   market_code STRING,
  |   device_code STRING,
  |   country_code STRING,
  |   category_id INTEGER,
  |   product_id LONG,
  |   date TIMESTAMP,
  |   est_free_app_download LONG,
  |   est_paid_app_download LONG,
  |   est_revenue LONG
  | )
  | STORED BY 'carbondata'
  | TBLPROPERTIES(
  |   'SORT_COLUMNS'='market_code, device_code, country_code, category_id, date, product_id',
  |   'NO_INVERTED_INDEX'='est_free_app_download, est_paid_app_download, est_revenue',
  |   'DICTIONARY_INCLUDE'='market_code, device_code, country_code, category_id',
  |   'SORT_SCOPE'='GLOBAL_SORT',
  |   'CACHE_LEVEL'='BLOCKLET',
  |   'TABLE_BLOCKSIZE'='256',
  |   'GLOBAL_SORT_PARTITIONS'='3'
  | )
""".stripMargin

val createTimeSeriesDayNoProductTableSql = s"""
  | CREATE DATAMAP IF NOT EXISTS agg_by_day ON TABLE storev2
  | USING 'timeSeries'
  | DMPROPERTIES (
  |   'EVENT_TIME'='date',
  |   'DAY_GRANULARITY'='1')
  | AS SELECT date, market_code, device_code, country_code, category_id,
  |   COUNT(product_id), COUNT(est_free_app_download), COUNT(est_paid_app_download), COUNT(est_revenue),
  |   SUM(est_free_app_download), MIN(est_free_app_download), MAX(est_free_app_download),
  |   SUM(est_paid_app_download), MIN(est_paid_app_download), MAX(est_paid_app_download),
  |   SUM(est_revenue), MIN(est_revenue), MAX(est_revenue)
  | FROM storev2
  | GROUP BY date, market_code, device_code, country_code, category_id
""".stripMargin

carbon.sql(createTimeSeriesDayNoProductTableSql)

One of the queries:

SELECT timeseries(date, 'DAY') as day, market_code, device_code, country_code, category_id,
  COUNT(product_id), COUNT(est_free_app_download), COUNT(est_paid_app_download), COUNT(est_revenue),
  sum(est_free_app_download), min(est_free_app_download), max(est_free_app_download),
  sum(est_paid_app_download), min(est_paid_app_download), max(est_paid_app_download),
  sum(est_revenue), min(est_revenue), max(est_revenue)
FROM storev2
WHERE market_code='apple-store' AND device_code='ios-phone' AND country_code IN ('US', 'CN')
GROUP BY timeseries(date, 'DAY'), market_code, device_code, country_code, category_id
LIMIT 10;
In reply to this post by xm_zzc
I think the query plan info is not right:

1. The total blocklet count from the CarbonData CLI is 233 + 86 = 319.
2. But the query plan tells me that I have 560 blocklets.

I hope the info below helps you locate the issue.

***********************************************************************************
I used the CarbonData CLI to print the blocklet summary, like below:

java -cp "/home/hadoop/carbontool/*:/opt/spark/jars/*" org.apache.carbondata.tool.CarbonCli -cmd summary -a -p hdfs://ec2-dca-aa-p-sdn-16.appannie.org:9000/usr/carbon/data/default/storev3/

## Summary
total: 80 blocks, 9 shards, 233 blocklets, 62,698 pages, 2,006,205,228 rows, 12.40GB
avg: 158.72MB/block, 54.50MB/blocklet, 25,077,565 rows/block, 8,610,322 rows/blocklet

java -cp "/home/hadoop/carbontool/*:/opt/spark/jars/*" org.apache.carbondata.tool.CarbonCli -cmd summary -a -p hdfs://ec2-dca-aa-p-sdn-16.appannie.org:9000/usr/carbon/data/default/usage_basickpi/

## Summary
total: 30 blocks, 14 shards, 86 blocklets, 3,498 pages, 111,719,467 rows, 4.24GB
avg: 144.57MB/block, 50.43MB/blocklet, 3,723,982 rows/block, 1,299,063 rows/blocklet

************************************************************************************
But at the same time, when I ran a SQL query, Carbon reported the info below:

== CarbonData Profiler ==
Table Scan on storev3
 - total: 194 blocks, 560 blocklets
 - filter: (((((((((granularity <> null and date <> null) and date >= 1472688000000000 between date <= 1475280000000000) and true) and granularity = monthly) and country_code in (LiteralExpression(US);LiteralExpression(CN);LiteralExpression(JP);)) and device_code in (LiteralExpression(ios-phone);)) and product_id <> null) and country_code <> null) and device_code <> null)
 - pruned by Main DataMap
 - skipped: 192 blocks, 537 blocklets

************************************************************************************
The SELECT SQL is:

SELECT f.country_code, f.date, f.product_id, f.category_id, f.arpu
FROM (
  SELECT a.country_code, a.date, a.product_id, a.category_id,
         a.revenue/a.average_active_users as arpu
  FROM (
    SELECT r.device_code, r.category_id, r.country_code, r.date, r.product_id, r.revenue, u.average_active_users
    FROM (
      SELECT b.device_code, b.country_code, b.product_id, b.date, b.category_id, sum(b.revenue) as revenue
      FROM (
        SELECT v.device_code, v.country_code, v.product_id, v.revenue, v.date, p.category_id
        FROM (
          SELECT device_code, country_code, product_id, est_revenue as revenue, timeseries(date, 'month') as date
          FROM storev3
          WHERE market_code='apple-store' AND date BETWEEN '2016-09-01' AND '2016-10-01'
            AND device_code in ('ios-phone') AND country_code in ('US', 'CN', 'JP')
        ) as v
        JOIN (
          SELECT DISTINCT product_id, category_id
          FROM storev3
          WHERE market_code='apple-store' AND date BETWEEN '2016-09-01' AND '2016-10-01'
            AND device_code in ('ios-phone') AND category_id in (100000, 100001, 100021)
            AND country_code in ('US', 'CN', 'JP')
        ) as p
        ON p.product_id = v.product_id
      ) as b
      GROUP BY b.device_code, b.country_code, b.product_id, b.date, b.category_id
    ) AS r
    JOIN (
      SELECT country_code, date, product_id,
             (CASE WHEN est_average_active_users is not NULL THEN est_average_active_users ELSE 0 END) as average_active_users,
             device_code
      FROM usage_basickpi
      WHERE date BETWEEN '2016-09-01' AND '2016-10-01' AND granularity='monthly'
        AND country_code in ('US', 'CN', 'JP') AND device_code in ('ios-phone')
    ) AS u
    ON r.country_code=u.country_code AND r.date=u.date AND r.product_id=u.product_id AND r.device_code=u.device_code
  ) AS a
) AS f
ORDER BY f.arpu DESC
LIMIT 10

Thanks,
Aaron
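To make the inconsistency explicit, the two reports above can be cross-checked with simple arithmetic (an illustrative snippet only; the class name is hypothetical and this is not part of CarbonCli or CarbonData):

```java
public class BlockletTotals {
    public static void main(String[] args) {
        // Blocklet totals reported by CarbonCli for the two tables in the query
        int storev3 = 233;
        int usageBasickpi = 86;
        int cliTotal = storev3 + usageBasickpi;

        // Totals reported by the CarbonData Profiler for the same query
        int planTotal = 560;
        int planSkipped = 537;

        System.out.println("CLI total blocklets:  " + cliTotal);   // 319
        System.out.println("Plan total blocklets: " + planTotal);  // 560
        // A scan cannot touch (or skip) more blocklets than exist on disk, so
        // both planTotal and planSkipped exceeding cliTotal indicates the
        // profiler's statistics are wrong, not that extra data exists.
        System.out.println("Plan exceeds CLI by:  " + (planTotal - cliTotal)); // 241
        System.out.println("Skipped exceeds CLI:  " + (planSkipped > cliTotal)); // true
    }
}
```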
Hi Aaron,
The CarbonData profiler is a feature added in an old version that has not been fully tested, so it might be broken or might not add correct information during the EXPLAIN command. We will try to correct it in the next version. Meanwhile, can you please check and make sure that the data you are getting from the query is right?

Regards,
Ravindra.

On Mon, 1 Oct 2018 at 21:23, aaron <[hidden email]> wrote:
> I think the query plan info is not right,
> 1. Total blocklet from carbondata cli is 233 + 86 = 319
> 2. But query plan tell me that I have 560 blocklet
> [...]

--
Thanks & Regards,
Ravi
The data should be right.
Hi, aaron.
Regarding the wrong pruning statistics in the query plan: do you execute the queries concurrently? I noticed that the pruning statistics collector is single-threaded; if you run queries concurrently, the pruning statistics will be incorrect.
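To illustrate why an unsynchronized statistics collector produces wrong totals under concurrent queries, here is a minimal sketch (not CarbonData's actual collector code; the class and field names are hypothetical). A plain int counter updated from several "query" threads can lose increments, while an AtomicInteger cannot:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PruneStatsRace {
    static int plainCount = 0;                        // not thread-safe
    static final AtomicInteger safeCount = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Thread[] queries = new Thread[8];             // simulate 8 concurrent queries
        for (int i = 0; i < queries.length; i++) {
            queries[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    plainCount++;                     // read-modify-write race: updates can be lost
                    safeCount.incrementAndGet();      // atomic: never loses updates
                }
            });
            queries[i].start();
        }
        for (Thread t : queries) {
            t.join();
        }
        // Both counters should reach 800,000; the plain one usually falls short.
        System.out.println("plain=" + plainCount + " atomic=" + safeCount.get());
    }
}
```

The same effect would corrupt counts such as "560 blocklets" if several scans report into one collector that assumes a single caller.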