Apache CarbonData Dev Mailing List archive

Re: Presto+CarbonData optimization work discussion

Posted by Liang Chen on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Presto-CarbonData-optimization-work-discussion-tp18509p21136.html

Hi

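Based on pull request 1307, the latest test result is below. Performance
improved about 3 times: the query now finishes in about 3 seconds, compared
with 9 seconds in the earlier Presto test quoted below.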

presto:default> select province,sum(age),count(*) from presto_carbon_dict
group by province order by province;
 province |  _col1   |  _col2
----------+----------+---------
 AB       | 57442740 | 1385010
 BC       | 57488826 | 1385580
 MB       | 57564702 | 1386510
 NB       | 57599520 | 1386960
 NL       | 57446592 | 1383774
 NS       | 57448734 | 1384272
 NT       | 57534228 | 1386936
 NU       | 57506844 | 1385346
 ON       | 57484956 | 1384470
 PE       | 57325164 | 1379802
 QC       | 57467886 | 1385076
 SK       | 57385152 | 1382364
 YT       | 57377556 | 1383900
(13 rows)

Query 20170902_033821_00006_h6g24, FINISHED, 1 node
Splits: 50 total, 50 done (100.00%)
0:03 [18M rows, 0B] [6.62M rows/s, 0B/s]

Regards
Liang

Liang Chen wrote

> Hi
>
> Regarding item 4) Lazy decoding of the dictionary: I just tested 18 million
> rows of data with this query:
> "select province,sum(age),count(*) from presto_carbondata group by
> province order by province"
>
> The Spark integration module has "dictionary lazy decode" and the Presto
> integration does not. The performance differs by about 4.5 times, so
> "dictionary lazy decode" might help a lot to improve aggregation
> performance.
>
> The detailed test results are below:

> 1. Presto+CarbonData: 9 seconds

> presto:default> select province,sum(age),count(*) from presto_carbondata
> group by province order by province;
>  province |  _col1   |  _col2
> ----------+----------+---------
>  AB       | 57442740 | 1385010
>  BC       | 57488826 | 1385580
>  MB       | 57564702 | 1386510
>  NB       | 57599520 | 1386960
>  NL       | 57446592 | 1383774
>  NS       | 57448734 | 1384272
>  NT       | 57534228 | 1386936
>  NU       | 57506844 | 1385346
>  ON       | 57484956 | 1384470
>  PE       | 57325164 | 1379802
>  QC       | 57467886 | 1385076
>  SK       | 57385152 | 1382364
>  YT       | 57377556 | 1383900
> (13 rows)
>
> Query 20170720_022833_00004_c9ky2, FINISHED, 1 node
> Splits: 55 total, 55 done (100.00%)
> 0:09 [18M rows, 34.3MB] [1.92M rows/s, 3.65MB/s]

> 2. Spark+CarbonData: 2 seconds

> scala> benchmark { carbon.sql("select province,sum(age),count(*) from
> presto_carbondata group by province order by province").show }
> +--------+--------+--------+
> |province|sum(age)|count(1)|
> +--------+--------+--------+
> |      AB|57442740| 1385010|
> |      BC|57488826| 1385580|
> |      MB|57564702| 1386510|
> |      NB|57599520| 1386960|
> |      NL|57446592| 1383774|
> |      NS|57448734| 1384272|
> |      NT|57534228| 1386936|
> |      NU|57506844| 1385346|
> |      ON|57484956| 1384470|
> |      PE|57325164| 1379802|
> |      QC|57467886| 1385076|
> |      SK|57385152| 1382364|
> |      YT|57377556| 1383900|
> +--------+--------+--------+
>
> 2109.346231ms
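
To illustrate why "dictionary lazy decode" can help a group-by aggregation
like the one above, here is a minimal sketch in Python. It is a toy model
only, not CarbonData or Presto code: the function names (build_dictionary,
aggregate_eager, aggregate_lazy) and the sample data are hypothetical. The
eager version decodes the dictionary code of every row back to a string
before grouping; the lazy version groups on the integer codes and decodes
only the final group keys.

```python
from collections import defaultdict


def build_dictionary(values):
    """Dictionary-encode a column: return (codes, dictionary)."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return [code_of[value] for value in values], dictionary


def aggregate_eager(codes, dictionary, ages):
    """Decode every row to its string first, then group on the strings."""
    sums, counts = defaultdict(int), defaultdict(int)
    for code, age in zip(codes, ages):
        province = dictionary[code]  # one dictionary lookup per row
        sums[province] += age
        counts[province] += 1
    return sorted((p, sums[p], counts[p]) for p in sums)


def aggregate_lazy(codes, dictionary, ages):
    """Group on the integer codes; decode only the final group keys."""
    n = len(dictionary)
    sums, counts = [0] * n, [0] * n
    for code, age in zip(codes, ages):
        sums[code] += age
        counts[code] += 1
    # One dictionary lookup per group instead of one per row.
    return sorted(
        (dictionary[c], sums[c], counts[c]) for c in range(n) if counts[c]
    )


# Hypothetical sample data standing in for the province and age columns.
provinces = ["AB", "BC", "AB", "ON", "BC", "AB"]
ages = [30, 41, 25, 52, 19, 33]

codes, dictionary = build_dictionary(provinces)
eager = aggregate_eager(codes, dictionary, ages)
lazy = aggregate_lazy(codes, dictionary, ages)
print(lazy)
```

Both functions return the same rows. In the test above, the lazy approach
would need 13 decodes (one per province) rather than one per row across 18
million rows, and the grouping itself runs on small integers instead of
strings.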
