http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Presto-CarbonData-optimization-work-discussion-tp18509p18531.html
Thanks for your comment.
I tested again with province excluded from the dictionary. In Spark, the query
time is around 3 seconds; in Presto, the same query takes 9 seconds. So for this
query case (a short string column), dictionary lazy decode might not be the key
factor.
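
For anyone new to the term, here is a minimal sketch of what "dictionary lazy decode" means for a group-by. It is plain Python for illustration only, not CarbonData or Presto code; the dictionary, the rows, and the function names are all hypothetical. The eager version decodes the province code on every row before grouping; the lazy version groups on the integer codes and decodes only the final group keys.

```python
from collections import defaultdict

# Hypothetical dictionary for the province column: code -> string value.
dictionary = {0: "AB", 1: "BC", 2: "ON"}

# Hypothetical stored rows: (province dictionary code, age).
rows = [(0, 30), (1, 41), (0, 25), (2, 52), (1, 19)]


def aggregate_eager(rows, dictionary):
    """Decode each row's code to a string first, then group by the string."""
    acc = defaultdict(lambda: [0, 0])  # province -> [sum(age), count]
    decodes = 0
    for code, age in rows:
        province = dictionary[code]  # one dictionary lookup per row
        decodes += 1
        acc[province][0] += age
        acc[province][1] += 1
    result = sorted((p, s, c) for p, (s, c) in acc.items())
    return result, decodes


def aggregate_lazy(rows, dictionary):
    """Group by the integer code, then decode only the final group keys."""
    acc = defaultdict(lambda: [0, 0])  # code -> [sum(age), count]
    for code, age in rows:
        acc[code][0] += age
        acc[code][1] += 1
    # One dictionary lookup per distinct group, not per row.
    result = sorted((dictionary[code], s, c) for code, (s, c) in acc.items())
    return result, len(acc)


eager_result, eager_decodes = aggregate_eager(rows, dictionary)
lazy_result, lazy_decodes = aggregate_lazy(rows, dictionary)
print(eager_result == lazy_result, eager_decodes, lazy_decodes)
```

Both versions return the same rows; only the number of dictionary lookups differs (one per row versus one per group). In the sketch the saving is the lookup count alone, which may be why it matters less for a column with few, short values such as province.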
> Hi Liang,
>
> I see that the province column data is not big, so I guess lazy decoding
> hardly makes any impact in this scenario. Can you do one more test by
> excluding province from the dictionary in both the Presto and Spark
> integrations? It will tell us whether it is really a lazy decoding issue or
> not.
>
> Regards,
> Ravindra
>
> On 20 July 2017 at 08:04, Liang Chen <[hidden email]> wrote:
>
> > Hi
> >
> > For 4) lazy decoding of the dictionary: I just tested 18 million rows of
> > data with the query:
> > "select province,sum(age),count(*) from presto_carbondata group by
> > province order by province"
> >
> > The Spark integration module has "dictionary lazy decode" and the Presto
> > integration doesn't. The performance difference is about 4.5 times, so
> > "dictionary lazy decode" might help a lot to improve aggregation
> > performance.
> >
> > The detailed test results are below:
> >
> > *1. Presto+CarbonData: 9 seconds*
> > presto:default> select province,sum(age),count(*) from presto_carbondata
> > group by province order by province;
> >  province |  _col1   |  _col2
> > ----------+----------+---------
> >  AB       | 57442740 | 1385010
> >  BC       | 57488826 | 1385580
> >  MB       | 57564702 | 1386510
> >  NB       | 57599520 | 1386960
> >  NL       | 57446592 | 1383774
> >  NS       | 57448734 | 1384272
> >  NT       | 57534228 | 1386936
> >  NU       | 57506844 | 1385346
> >  ON       | 57484956 | 1384470
> >  PE       | 57325164 | 1379802
> >  QC       | 57467886 | 1385076
> >  SK       | 57385152 | 1382364
> >  YT       | 57377556 | 1383900
> > (13 rows)
> >
> > Query 20170720_022833_00004_c9ky2, FINISHED, 1 node
> > Splits: 55 total, 55 done (100.00%)
> > 0:09 [18M rows, 34.3MB] [1.92M rows/s, 3.65MB/s]
> >
> > *2. Spark+CarbonData: 2 seconds*
> > scala> benchmark { carbon.sql("select province,sum(age),count(*) from
> > presto_carbondata group by province order by province").show }
> > +--------+--------+--------+
> > |province|sum(age)|count(1)|
> > +--------+--------+--------+
> > |      AB|57442740| 1385010|
> > |      BC|57488826| 1385580|
> > |      MB|57564702| 1386510|
> > |      NB|57599520| 1386960|
> > |      NL|57446592| 1383774|
> > |      NS|57448734| 1384272|
> > |      NT|57534228| 1386936|
> > |      NU|57506844| 1385346|
> > |      ON|57484956| 1384470|
> > |      PE|57325164| 1379802|
> > |      QC|57467886| 1385076|
> > |      SK|57385152| 1382364|
> > |      YT|57377556| 1383900|
> > +--------+--------+--------+
> >
> > 2109.346231ms
> >
> >
> >
> >
>
>
>
> --
> Thanks & Regards,
> Ravi
>
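
A quick sanity check on the figures quoted in this thread, as a small Python sketch. The counts and timings are copied from the outputs above; the variable names are only for illustration. Note that the Presto time is the rounded "0:09" from the query summary, so the ratios are approximate.

```python
# count(*) per province, copied from the query output in the thread.
counts = [
    1385010, 1385580, 1386510, 1386960, 1383774, 1384272, 1386936,
    1385346, 1384470, 1379802, 1385076, 1382364, 1383900,
]

total_rows = sum(counts)

# Timings quoted in the thread, in seconds.
presto_seconds = 9.0                      # "0:09", rounded by Presto
spark_seconds = 2109.346231 / 1000.0      # Spark run with dictionary on province
spark_no_dict_seconds = 3.0               # "around 3 seconds" without dictionary

ratio_with_dict = presto_seconds / spark_seconds
ratio_without_dict = presto_seconds / spark_no_dict_seconds

print(total_rows)
print(round(ratio_with_dict, 2), round(ratio_without_dict, 2))
```

The counts add up to 18,000,000 rows, which agrees with the "18M rows" in the Presto summary line. Presto is slower than Spark by a factor of roughly 4.3 in the first test and about 3 in the test without the dictionary, so a sizeable gap remains even when dictionary decoding is taken out of the picture.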