Aggregate performance


Aggregate performance

ffpeng90
Hi, all:
   Recently, I created two tables, one as ORC and one as CarbonData. Each contains one hundred million records.
When I submit an aggregate query to Presto such as [Select count(*) from tableB where attributeA = 'xxx'],
Carbon performs better than ORC.

However, when I submit a query like [Select attributeA, count(*) from tableB group by attributeA], Carbon performs badly. Obviously this query results in a full scan, so the QueryModel needs to rebuild all records for the columns involved, and this step takes a lot of time.

So I want to know: are there any optimization techniques for this kind of problem in Spark?
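The two query shapes above can be reproduced on any SQL engine. A minimal sketch using Python's sqlite3 module (purely an illustrative stand-in for Presto; tableB and attributeA are the names used in this thread, and the sample rows are made up):

```python
import sqlite3

# Tiny in-memory stand-in for the hundred-million-row setup described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableB (attributeA TEXT)")
conn.executemany(
    "INSERT INTO tableB VALUES (?)",
    [("xxx",), ("yyy",), ("xxx",), ("zzz",)],
)

# Query shape 1: filtered count -- only rows matching the predicate matter.
filtered = conn.execute(
    "SELECT count(*) FROM tableB WHERE attributeA = 'xxx'"
).fetchone()[0]

# Query shape 2: group-by aggregate -- every row's attributeA must be scanned.
groups = dict(conn.execute(
    "SELECT attributeA, count(*) FROM tableB GROUP BY attributeA"
).fetchall())

print(filtered)               # 2
print(sorted(groups.items()))  # [('xxx', 2), ('yyy', 1), ('zzz', 1)]
```

The first query can be narrowed by the predicate, while the second must touch the grouping column of every row, which is why it behaves like a full scan.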

Re: Aggregate performance

ravipesala
Hi,

The performance depends on the query plan. When you submit a query
like [Select attributeA, count(*) from tableB group by attributeA], Spark
asks Carbon for only the attributeA column, so Carbon reads just that
column from all files and sends the result to Spark to aggregate.
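The column pruning Ravindra describes can be sketched with a toy columnar layout in Python (the column names follow the thread; the storage layout itself is a simplified illustration of how ORC/CarbonData keep each column separately):

```python
from collections import Counter

# Toy columnar "file": each column stored separately, as in ORC/CarbonData.
columns = {
    "attributeA": ["xxx", "yyy", "xxx", "zzz", "xxx"],
    "attributeB": [1, 2, 3, 4, 5],  # never touched by the group-by query
}

def group_count(col_name):
    # A [SELECT col, count(*) ... GROUP BY col] only has to materialize the
    # values of that one column; all other columns stay on disk untouched.
    return Counter(columns[col_name])

result = group_count("attributeA")
print(result)  # Counter({'xxx': 3, 'yyy': 1, 'zzz': 1})
```

This is why the cost of the group-by query scales with the size of one column rather than the full row width.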

On my laptop with 4 cores, Spark 2 with Carbon on a store of 100 million
records returned a result for a query like the above in about 11 seconds.
On better machines this should be much faster.

Regards,
Ravindra.

On 8 February 2017 at 11:49, ffpeng90 <[hidden email]> wrote:




--
Thanks & Regards,
Ravi

Re: Aggregate performance

ffpeng90
OK, thanks for your answer.
The major logic seems to be the same.
However, on my machine Carbon costs 3-4 times as much as ORC when grouping by a field.
I will try some options in Presto, such as increasing concurrency, and upgrade my test hardware.
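The concurrency idea mentioned here has the same map/reduce shape Spark and Presto already use for GROUP BY: count each partition of the column in parallel, then merge the partial results. A hedged sketch in Python (the 4-way split mirrors the 4-core laptop from the earlier reply; the data is synthetic):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Synthetic attributeA column, split into 4 partitions as on a 4-core box.
data = ["xxx", "yyy", "xxx", "zzz"] * 1000
partitions = [data[i::4] for i in range(4)]

# Map step: count each partition concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(Counter, partitions))

# Reduce step: merge the per-partition counts into the final group counts.
total = sum(partials, Counter())
print(total)  # Counter({'xxx': 2000, 'yyy': 1000, 'zzz': 1000})
```

Since the partial aggregates are small relative to the scanned column, the scan parallelizes well, which is why adding concurrency (or cores) helps this query shape.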