Apache CarbonData Dev Mailing List archive

Aggregate performance

Posted by ffpeng90 on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Aggregate-performace-tp7440.html

Hi all,
Recently I created two tables, one stored as ORC and one as CarbonData. Each of them contains one hundred million records.
Then I submitted aggregate queries to Presto like: [Select count(*) from tableB where attributeA = 'xxx'],
and CarbonData performs better than ORC.

However, when I submit queries like: [Select attributeA, count(*) from tableB group by attributeA], the performance of CarbonData is bad. This query obviously results in a full scan, so the QueryModel needs to rebuild every record with the related columns. This step takes a lot of time.
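
To illustrate the difference between the two query shapes, here is a toy Python sketch. It is not CarbonData or Presto code: the block layout, the per-block min/max metadata, and the function names (make_blocks, count_where_equals, count_group_by) are all made up for illustration, assuming data sorted on attributeA. The filter query can skip blocks whose value range cannot match, while the group-by query has to read every block.

```python
from collections import Counter


def make_blocks(values, block_size):
    """Sort the values and split them into blocks carrying min/max metadata."""
    ordered = sorted(values)
    blocks = []
    for i in range(0, len(ordered), block_size):
        chunk = ordered[i:i + block_size]
        blocks.append({"min": chunk[0], "max": chunk[-1], "rows": chunk})
    return blocks


def count_where_equals(blocks, target):
    """Filter query: only scan blocks whose [min, max] range can contain target."""
    matches = 0
    scanned = 0
    for block in blocks:
        if block["min"] <= target <= block["max"]:
            scanned += 1
            matches += sum(1 for v in block["rows"] if v == target)
    return matches, scanned


def count_group_by(blocks):
    """Group-by query: every block contributes rows, so none can be skipped."""
    counts = Counter()
    for block in blocks:
        counts.update(block["rows"])
    return counts, len(blocks)


# Hypothetical sample data: three distinct values, four rows each.
values = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
blocks = make_blocks(values, block_size=4)

print(count_where_equals(blocks, "b"))  # (matches, blocks scanned)
print(count_group_by(blocks))           # (per-value counts, blocks scanned)
```

In this toy model the filter touches one block out of three, and the group-by touches all three, which is the kind of gap I see between the two queries.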

So I would like to know: are there any optimization techniques for this kind of problem in Spark?