Apache CarbonData Dev Mailing List archive

Re: Presto+CarbonData optimization work discussion

Posted by Liang Chen on Jul 20, 2017; 2:34am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Presto-CarbonData-optimization-work-discussion-tp18509p18522.html

Hi

For item 4) Lazy decoding of the dictionary: I just tested 18 million rows of data with this query:
"select province,sum(age),count(*) from presto_carbondata group by province order by province"

The Spark integration module has "dictionary lazy decode"; the Presto integration does not. The difference in performance is about 4.5 times (9 seconds vs. 2 seconds), so "dictionary lazy decode" might help a lot to improve aggregation performance.
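To illustrate the idea, here is a minimal sketch of lazy dictionary decoding for a group-by aggregation. It is not the CarbonData or Spark implementation; the dictionary contents, the sample rows, and the function names are all made up for illustration. The eager version looks up the string for every row before grouping, while the lazy version aggregates on the integer codes and decodes only once per group at the end.

```python
from collections import defaultdict

# Hypothetical dictionary for the "province" column: index = code, value = string.
dictionary = ["AB", "BC", "MB"]

# Hypothetical column data: province stored as dictionary codes, age as plain ints.
province_codes = [0, 1, 0, 2, 1, 0]
ages = [30, 41, 25, 52, 38, 47]


def aggregate_eager(codes, ages, dictionary):
    """Decode every row to its string first, then group by the string."""
    groups = defaultdict(lambda: [0, 0])  # province -> [sum(age), count]
    for code, age in zip(codes, ages):
        key = dictionary[code]  # one dictionary lookup per row
        groups[key][0] += age
        groups[key][1] += 1
    return sorted((k, s, c) for k, (s, c) in groups.items())


def aggregate_lazy(codes, ages, dictionary):
    """Group by the integer code, then decode once per group at the end."""
    sums = [0] * len(dictionary)
    counts = [0] * len(dictionary)
    for code, age in zip(codes, ages):
        sums[code] += age
        counts[code] += 1
    # Decoding happens here: one lookup per distinct group, not per row.
    rows = [
        (dictionary[code], sums[code], counts[code])
        for code in range(len(dictionary))
        if counts[code] > 0
    ]
    return sorted(rows)


eager_result = aggregate_eager(province_codes, ages, dictionary)
lazy_result = aggregate_lazy(province_codes, ages, dictionary)
print(lazy_result)
```

Both functions return the same rows. With only 13 distinct provinces in the test table, the lazy path would do 13 dictionary lookups instead of one per row, which is where the saving would come from in this sketch.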

The detailed test results are below:

1. Presto+CarbonData: 9 seconds
presto:default> select province,sum(age),count(*) from presto_carbondata group by province order by province;
 province |  _col1   |  _col2
----------+----------+---------
 AB       | 57442740 | 1385010
 BC       | 57488826 | 1385580
 MB       | 57564702 | 1386510
 NB       | 57599520 | 1386960
 NL       | 57446592 | 1383774
 NS       | 57448734 | 1384272
 NT       | 57534228 | 1386936
 NU       | 57506844 | 1385346
 ON       | 57484956 | 1384470
 PE       | 57325164 | 1379802
 QC       | 57467886 | 1385076
 SK       | 57385152 | 1382364
 YT       | 57377556 | 1383900
(13 rows)

Query 20170720_022833_00004_c9ky2, FINISHED, 1 node
Splits: 55 total, 55 done (100.00%)
0:09 [18M rows, 34.3MB] [1.92M rows/s, 3.65MB/s]

2. Spark+CarbonData: 2 seconds
scala> benchmark { carbon.sql("select province,sum(age),count(*) from presto_carbondata group by province order by province").show }
+--------+--------+--------+
|province|sum(age)|count(1)|
+--------+--------+--------+
|      AB|57442740| 1385010|
|      BC|57488826| 1385580|
|      MB|57564702| 1386510|
|      NB|57599520| 1386960|
|      NL|57446592| 1383774|
|      NS|57448734| 1384272|
|      NT|57534228| 1386936|
|      NU|57506844| 1385346|
|      ON|57484956| 1384470|
|      PE|57325164| 1379802|
|      QC|57467886| 1385076|
|      SK|57385152| 1382364|
|      YT|57377556| 1383900|
+--------+--------+--------+
+--------+--------+--------+

2109.346231ms
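
As a quick sanity check on the numbers above, the short script below recomputes the total row count from the per-province counts and the ratio between the two timings. The only inputs are the figures printed in the two outputs; the variable names are mine, and the Presto time is the rounded 0:09 shown by the CLI, so the ratio is approximate.

```python
# Per-province counts copied from the query result above.
counts = [
    1385010, 1385580, 1386510, 1386960, 1383774, 1384272, 1386936,
    1385346, 1384470, 1379802, 1385076, 1382364, 1383900,
]

total_rows = sum(counts)  # compare with the "18M rows" reported by Presto

presto_seconds = 9.0                   # rounded wall time from the Presto CLI (0:09)
spark_seconds = 2109.346231 / 1000.0   # time printed by the benchmark helper in the Spark shell

speedup = presto_seconds / spark_seconds

print(f"total rows: {total_rows}")
print(f"Spark is roughly {speedup:.2f}x faster than Presto on this query")
```

Using the measured 2109 ms instead of the rounded 2 seconds gives a ratio somewhat below 4.5, but the conclusion is the same.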