Hi community,
I'm afraid the query plan is broken: the skipped block/blocklet counts are not right. Please see the image in this post and the posts below.

Screen_Shot_2018-09-30_at_5.png
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t357/Screen_Shot_2018-09-30_at_5.png>

My environment is CarbonData 1.5.0 and Spark 2.3.2; I'm not sure whether other versions behave correctly or not.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Hi Aaron:
Can you list your CREATE TABLE SQL and SELECT SQL for us? And is the result correct on Spark 2.2?
Hi xm_zzc,

Thanks for your response. I tested on Spark 2.3.2; I have not tested 2.2.2. And I have merged the fixes from
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Issue-Dictionary-and-S3-td63106.html and
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Serious-Issue-Query-get-inconsistent-result-on-carbon1-5-0-td63691.html

Create statements:

val createStoreTableSql = s"""
  | CREATE TABLE IF NOT EXISTS storev2(
  |   market_code STRING,
  |   device_code STRING,
  |   country_code STRING,
  |   category_id INTEGER,
  |   product_id LONG,
  |   date TIMESTAMP,
  |   est_free_app_download LONG,
  |   est_paid_app_download LONG,
  |   est_revenue LONG
  | )
  | STORED BY 'carbondata'
  | TBLPROPERTIES(
  |   'SORT_COLUMNS'='market_code, device_code, country_code, category_id, date, product_id',
  |   'NO_INVERTED_INDEX'='est_free_app_download, est_paid_app_download, est_revenue',
  |   'DICTIONARY_INCLUDE'='market_code, device_code, country_code, category_id',
  |   'SORT_SCOPE'='GLOBAL_SORT',
  |   'CACHE_LEVEL'='BLOCKLET',
  |   'TABLE_BLOCKSIZE'='256',
  |   'GLOBAL_SORT_PARTITIONS'='3'
  | )
""".stripMargin

val createTimeSeriesDayNoProductTableSql = s"""
  | CREATE DATAMAP IF NOT EXISTS agg_by_day ON TABLE storev2
  | USING 'timeSeries'
  | DMPROPERTIES (
  |   'EVENT_TIME'='date',
  |   'DAY_GRANULARITY'='1')
  | AS SELECT date, market_code, device_code, country_code, category_id,
  |   COUNT(product_id), COUNT(est_free_app_download), COUNT(est_paid_app_download), COUNT(est_revenue),
  |   SUM(est_free_app_download), MIN(est_free_app_download), MAX(est_free_app_download),
  |   SUM(est_paid_app_download), MIN(est_paid_app_download), MAX(est_paid_app_download),
  |   SUM(est_revenue), MIN(est_revenue), MAX(est_revenue)
  | FROM storev2
  | GROUP BY date, market_code, device_code, country_code, category_id
""".stripMargin

carbon.sql(createTimeSeriesDayNoProductTableSql)

One of the queries:

SELECT timeseries(date, 'DAY') as day, market_code, device_code, country_code, category_id,
  COUNT(product_id), COUNT(est_free_app_download), COUNT(est_paid_app_download), COUNT(est_revenue),
  sum(est_free_app_download), min(est_free_app_download), max(est_free_app_download),
  sum(est_paid_app_download), min(est_paid_app_download), max(est_paid_app_download),
  sum(est_revenue), min(est_revenue), max(est_revenue)
FROM storev2
WHERE market_code='apple-store' AND device_code='ios-phone' AND country_code IN ('US', 'CN')
GROUP BY timeseries(date, 'DAY'), market_code, device_code, country_code, category_id
LIMIT 10;
In reply to this post by xm_zzc
I think the query plan info is not right:

1. The total blocklet count from the CarbonData CLI is 233 + 86 = 319.
2. But the query plan tells me that I have 560 blocklets.

I hope the info below helps you locate the issue.

***********************************************************************************
I used the CarbonData CLI to print the blocklet summary, like below:

java -cp "/home/hadoop/carbontool/*:/opt/spark/jars/*" org.apache.carbondata.tool.CarbonCli -cmd summary -a -p hdfs://ec2-dca-aa-p-sdn-16.appannie.org:9000/usr/carbon/data/default/storev3/

## Summary
total: 80 blocks, 9 shards, 233 blocklets, 62,698 pages, 2,006,205,228 rows, 12.40GB
avg: 158.72MB/block, 54.50MB/blocklet, 25,077,565 rows/block, 8,610,322 rows/blocklet

java -cp "/home/hadoop/carbontool/*:/opt/spark/jars/*" org.apache.carbondata.tool.CarbonCli -cmd summary -a -p hdfs://ec2-dca-aa-p-sdn-16.appannie.org:9000/usr/carbon/data/default/usage_basickpi/

## Summary
total: 30 blocks, 14 shards, 86 blocklets, 3,498 pages, 111,719,467 rows, 4.24GB
avg: 144.57MB/block, 50.43MB/blocklet, 3,723,982 rows/block, 1,299,063 rows/blocklet

************************************************************************************
But at the same time, when I ran a SQL query, Carbon reported the info below:

== CarbonData Profiler ==
Table Scan on storev3
 - total: 194 blocks, 560 blocklets
 - filter: (((((((((granularity <> null and date <> null) and date >= 1472688000000000 between date <= 1475280000000000) and true) and granularity = monthly) and country_code in (LiteralExpression(US);LiteralExpression(CN);LiteralExpression(JP);)) and device_code in (LiteralExpression(ios-phone);)) and product_id <> null) and country_code <> null) and device_code <> null)
 - pruned by Main DataMap
 - skipped: 192 blocks, 537 blocklets

************************************************************************************
The SELECT SQL is:

SELECT f.country_code, f.date, f.product_id, f.category_id, f.arpu
FROM (
  SELECT a.country_code, a.date, a.product_id, a.category_id,
         a.revenue/a.average_active_users as arpu
  FROM (
    SELECT r.device_code, r.category_id, r.country_code, r.date, r.product_id, r.revenue, u.average_active_users
    FROM (
      SELECT b.device_code, b.country_code, b.product_id, b.date, b.category_id, sum(b.revenue) as revenue
      FROM (
        SELECT v.device_code, v.country_code, v.product_id, v.revenue, v.date, p.category_id
        FROM (
          SELECT device_code, country_code, product_id, est_revenue as revenue, timeseries(date, 'month') as date
          FROM storev3
          WHERE market_code='apple-store' AND date BETWEEN '2016-09-01' AND '2016-10-01'
            AND device_code in ('ios-phone') AND country_code in ('US', 'CN', 'JP')
        ) as v
        JOIN (
          SELECT DISTINCT product_id, category_id
          FROM storev3
          WHERE market_code='apple-store' AND date BETWEEN '2016-09-01' AND '2016-10-01'
            AND device_code in ('ios-phone') AND category_id in (100000, 100001, 100021)
            AND country_code in ('US', 'CN', 'JP')
        ) as p
        ON p.product_id = v.product_id
      ) as b
      GROUP BY b.device_code, b.country_code, b.product_id, b.date, b.category_id
    ) AS r
    JOIN (
      SELECT country_code, date, product_id,
             (CASE WHEN est_average_active_users is not NULL THEN est_average_active_users ELSE 0 END) as average_active_users,
             device_code
      FROM usage_basickpi
      WHERE date BETWEEN '2016-09-01' AND '2016-10-01' AND granularity='monthly'
        AND country_code in ('US', 'CN', 'JP') AND device_code in ('ios-phone')
    ) AS u
    ON r.country_code=u.country_code AND r.date=u.date AND r.product_id=u.product_id AND r.device_code=u.device_code
  ) AS a
) AS f
ORDER BY f.arpu DESC
LIMIT 10

Thanks,
Aaron
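To make the inconsistency explicit, the two reports above can be cross-checked with simple arithmetic (an illustrative snippet only; the class name is hypothetical and this is not part of CarbonCli or CarbonData):

```java
public class BlockletTotals {
    public static void main(String[] args) {
        // Blocklet totals reported by CarbonCli for the two tables in the query
        int storev3 = 233;
        int usageBasickpi = 86;
        int cliTotal = storev3 + usageBasickpi;

        // Totals reported by the CarbonData Profiler for the same query
        int planTotal = 560;
        int planSkipped = 537;

        System.out.println("CLI total blocklets:  " + cliTotal);   // 319
        System.out.println("Plan total blocklets: " + planTotal);  // 560
        // A scan cannot touch (or skip) more blocklets than exist on disk, so
        // both planTotal and planSkipped exceeding cliTotal indicates the
        // profiler's statistics are wrong, not that extra data exists.
        System.out.println("Plan exceeds CLI by:  " + (planTotal - cliTotal)); // 241
        System.out.println("Skipped exceeds CLI:  " + (planSkipped > cliTotal)); // true
    }
}
```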
Hi Aaron,
The CarbonData profiler is a feature added in an old version that has not been fully tested, so it might be broken or might not add correct information during the EXPLAIN command. We will try to correct it in the next version. Meanwhile, can you please check and make sure that the data you are getting from the query is right?

Regards,
Ravindra.

On Mon, 1 Oct 2018 at 21:23, aaron <[hidden email]> wrote:
> I think the query plan info is not right,
> 1. Total blocklet from carbondata cli is 233 + 86 = 319
> 2. But query plan tell me that I have 560 blocklet
> [...]

--
Thanks & Regards,
Ravi
The data should be right.
Hi, aaron.
Regarding the wrong pruning statistics in the query plan: do you execute the queries concurrently? I noticed that the pruning statistics collector is single-threaded; if you run queries concurrently, the pruning statistics will be incorrect.
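To illustrate why an unsynchronized statistics collector produces wrong totals under concurrent queries, here is a minimal sketch (not CarbonData's actual collector code; the class and field names are hypothetical). A plain int counter updated from several "query" threads can lose increments, while an AtomicInteger cannot:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PruneStatsRace {
    static int plainCount = 0;                        // not thread-safe
    static final AtomicInteger safeCount = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Thread[] queries = new Thread[8];             // simulate 8 concurrent queries
        for (int i = 0; i < queries.length; i++) {
            queries[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    plainCount++;                     // read-modify-write race: updates can be lost
                    safeCount.incrementAndGet();      // atomic: never loses updates
                }
            });
            queries[i].start();
        }
        for (Thread t : queries) {
            t.join();
        }
        // Both counters should reach 800,000; the plain one usually falls short.
        System.out.println("plain=" + plainCount + " atomic=" + safeCount.get());
    }
}
```

The same effect would corrupt counts such as "560 blocklets" if several scans report into one collector that assumes a single caller.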