Aggregate performance


Aggregate performance

ffpeng90
Hi, all:
   Recently, I created two tables, one as ORC and one as CarbonData. Each contains one hundred million records.
When I submit an aggregate query to Presto such as [Select count(*) from tableB where attributeA = 'xxx'],
Carbon performs better than ORC.

However, when I submit a query like [Select attributeA, count(*) from tableB group by attributeA], Carbon performs badly. Obviously this query results in a full scan, so the QueryModel needs to rebuild all records for the columns involved, and this step takes a lot of time.

So I want to know: are there any optimization techniques for this kind of problem in Spark?
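The two query shapes above can be reproduced on any SQL engine. A minimal sketch using Python's sqlite3 module (purely an illustrative stand-in for Presto; tableB and attributeA are the names used in this thread, and the sample rows are made up):

```python
import sqlite3

# Tiny in-memory stand-in for the hundred-million-row setup described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableB (attributeA TEXT)")
conn.executemany(
    "INSERT INTO tableB VALUES (?)",
    [("xxx",), ("yyy",), ("xxx",), ("zzz",)],
)

# Query shape 1: filtered count -- only rows matching the predicate matter.
filtered = conn.execute(
    "SELECT count(*) FROM tableB WHERE attributeA = 'xxx'"
).fetchone()[0]

# Query shape 2: group-by aggregate -- every row's attributeA must be scanned.
groups = dict(conn.execute(
    "SELECT attributeA, count(*) FROM tableB GROUP BY attributeA"
).fetchall())

print(filtered)               # 2
print(sorted(groups.items()))  # [('xxx', 2), ('yyy', 1), ('zzz', 1)]
```

The first query can be narrowed by the predicate, while the second must touch the grouping column of every row, which is why it behaves like a full scan.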

Re: Aggregate performance

ravipesala
Hi,

The performance depends on the query plan. When you submit a query
like [Select attributeA, count(*) from tableB group by attributeA], Spark
asks Carbon for only the attributeA column, so Carbon reads just that
column from all files and sends the result to Spark to aggregate.
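The column pruning Ravindra describes can be sketched with a toy columnar layout in Python (the column names follow the thread; the storage layout itself is a simplified illustration of how ORC/CarbonData keep each column separately):

```python
from collections import Counter

# Toy columnar "file": each column stored separately, as in ORC/CarbonData.
columns = {
    "attributeA": ["xxx", "yyy", "xxx", "zzz", "xxx"],
    "attributeB": [1, 2, 3, 4, 5],  # never touched by the group-by query
}

def group_count(col_name):
    # A [SELECT col, count(*) ... GROUP BY col] only has to materialize the
    # values of that one column; all other columns stay on disk untouched.
    return Counter(columns[col_name])

result = group_count("attributeA")
print(result)  # Counter({'xxx': 3, 'yyy': 1, 'zzz': 1})
```

This is why the cost of the group-by query scales with the size of one column rather than the full row width.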

On my laptop with 4 cores, Spark 2 with Carbon on a store of 100 million
records returned a result for a query like the above in about 11 seconds.
On better machines this should be much faster.

Regards,
Ravindra.

On 8 February 2017 at 11:49, ffpeng90 <[hidden email]> wrote:




--
Thanks & Regards,
Ravi

Re: Aggregate performance

ffpeng90
OK, thanks for your answer.
The major logic seems to be the same.
However, on my machine Carbon costs 3-4 times as much as ORC when grouping by a field.
I will try some options in Presto, such as increasing concurrency, and upgrade my test hardware.
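The concurrency idea mentioned here has the same map/reduce shape Spark and Presto already use for GROUP BY: count each partition of the column in parallel, then merge the partial results. A hedged sketch in Python (the 4-way split mirrors the 4-core laptop from the earlier reply; the data is synthetic):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Synthetic attributeA column, split into 4 partitions as on a 4-core box.
data = ["xxx", "yyy", "xxx", "zzz"] * 1000
partitions = [data[i::4] for i in range(4)]

# Map step: count each partition concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(Counter, partitions))

# Reduce step: merge the per-partition counts into the final group counts.
total = sum(partials, Counter())
print(total)  # Counter({'xxx': 2000, 'yyy': 1000, 'zzz': 1000})
```

Since the partial aggregates are small relative to the scanned column, the scan parallelizes well, which is why adding concurrency (or cores) helps this query shape.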