[jira] [Created] (CARBONDATA-840) Limit query performance optimization [Group By]

[jira] [Created] (CARBONDATA-840) Limit query performance optimization [Group By]

Akash R Nilugal (Jira)
Cao, Lionel created CARBONDATA-840:
--------------------------------------

             Summary: Limit query performance optimization [Group By]
                 Key: CARBONDATA-840
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-840
             Project: CarbonData
          Issue Type: Improvement
          Components: data-query
            Reporter: Cao, Lionel
            Assignee: Cao, Lionel


Currently, a limit query still scans all the data first and applies the limit only in the last step. In CarbonData we can convert the limit into filters using the dictionary's distinct value list...




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Re: [jira] [Created] (CARBONDATA-840) Limit query performance optimization [Group By]

Cao Lu 曹鲁
Hi dev,
Please help review this PR: https://github.com/apache/incubator-carbondata/pull/716
It improves the performance of queries that meet all of the following preconditions:
- No sort
- No filters (WHERE clause)
- No HAVING (a special kind of filter)
- Has GROUP BY
- Has LIMIT

For example:
    Select A, B, C, sum(D)
    From t3
    Group by A, B, C
    Limit n
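
The preconditions above can be read as a simple eligibility check. The sketch below is only an illustration of that check: `QueryShape` and `qualifies_for_limit_rewrite` are hypothetical names invented here, not classes or functions from CarbonData or the PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueryShape:
    """Hypothetical, simplified description of a parsed query (not a CarbonData class)."""
    group_by: List[str] = field(default_factory=list)
    order_by: List[str] = field(default_factory=list)
    where: Optional[str] = None
    having: Optional[str] = None
    limit: Optional[int] = None


def qualifies_for_limit_rewrite(q: QueryShape) -> bool:
    """Return True only when all five preconditions from the list hold."""
    return (
        not q.order_by            # no sort
        and q.where is None       # no filters (WHERE clause)
        and q.having is None      # no HAVING
        and bool(q.group_by)      # has GROUP BY
        and q.limit is not None   # has LIMIT
    )


# The example query: GROUP BY A, B, C with a LIMIT and nothing else.
example = QueryShape(group_by=["A", "B", "C"], limit=10)
# A query with a sort does not qualify.
sorted_query = QueryShape(group_by=["A"], order_by=["A"], limit=10)
```

Here `qualifies_for_limit_rewrite(example)` is true, while `qualifies_for_limit_rewrite(sorted_query)` is false because of the sort.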

The solution is to convert the limit condition into filters. It takes advantage of the CarbonData dictionary files to get the distinct value list and generate the filters, which reduces scan I/O.
The optimization happens at the CarbonOptimizer step and does not affect the physical steps.
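
To make the idea concrete, here is a minimal sketch of one way a limit could be turned into a filter. It is not the code from the PR. It assumes the filter is built from the first n dictionary values of a single group-by column (A), that every dictionary value occurs in at least one row, and that the rewrite is skipped when the dictionary holds fewer than n values. The table contents and the names `dictionary_a`, `group_sum`, and `limit_as_filter` are made up for illustration.

```python
from collections import defaultdict

# Toy table t3 with columns (A, B, C, D); the data is made up for illustration.
rows = [
    ("a1", "b1", "c1", 10),
    ("a1", "b2", "c1", 5),
    ("a2", "b1", "c1", 7),
    ("a2", "b1", "c1", 3),
    ("a3", "b1", "c2", 1),
    ("a4", "b3", "c2", 8),
]

# Stand-in for the dictionary of column A: its distinct values.
# Assumption: every dictionary value occurs in at least one row.
dictionary_a = ["a1", "a2", "a3", "a4"]


def group_sum(scanned_rows):
    """SELECT A, B, C, sum(D) ... GROUP BY A, B, C over the given rows."""
    totals = defaultdict(int)
    for a, b, c, d in scanned_rows:
        totals[(a, b, c)] += d
    return dict(totals)


def limit_as_filter(n):
    """Answer the LIMIT n query by filtering on n dictionary values of A."""
    if len(dictionary_a) < n:
        # Too few distinct values to guarantee n groups: fall back to a full scan.
        scanned = rows
    else:
        keep = set(dictionary_a[:n])                  # n distinct values of A
        scanned = [r for r in rows if r[0] in keep]   # the generated filter
    groups = group_sum(scanned)
    return list(groups.items())[:n], len(scanned)


full_result = group_sum(rows)
limited, rows_scanned = limit_as_filter(2)
```

Because A is part of the group key, filtering on A keeps or drops whole groups, so each returned sum equals the sum a full scan would produce. Each of the n chosen values contributes at least one group, so the filtered scan still yields at least n groups while reading fewer rows.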

Test Result:
Environment: Mac OS X El Capitan
IntelliJ IDEA + Spark 1.6.2 + CarbonData 1.1.0

Data volume: 100M rows

Query duration (ms):
Before: 128532, 129150
After: 4012, 1002, 988
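
As a rough reading of these numbers, the snippet below averages the reported runs and computes the ratio. It only does arithmetic on the figures above. Treating the runs as directly comparable and using a plain mean are assumptions made here.

```python
from statistics import mean

# Durations in milliseconds, copied from the results above.
before_ms = [128532, 129150]
after_ms = [4012, 1002, 988]

mean_before = mean(before_ms)
mean_after = mean(after_ms)
speedup = mean_before / mean_after

print(f"mean before: {mean_before:.0f} ms")
print(f"mean after:  {mean_after:.0f} ms")
print(f"speedup:     about {speedup:.0f}x")
```

On these figures the mean drops from roughly 129 s to roughly 2 s, a speedup of about 64x.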

Future:
1. The optimization currently applies only to Spark 1.6; we can extend it to Spark 2.0 later.
2. TBD

Thanks,
Cao, Lu



On 3/31/17, 2:10 PM, "Cao, Lionel (JIRA)" <[hidden email]> wrote:

Cao, Lionel created CARBONDATA-840:
--------------------------------------

             Summary: Limit query performance optimization [Group By]
                 Key: CARBONDATA-840
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-840
             Project: CarbonData
          Issue Type: Improvement
          Components: data-query
            Reporter: Cao, Lionel
            Assignee: Cao, Lionel


Currently, a limit query still scans all the data first and applies the limit only in the last step. In CarbonData we can convert the limit into filters using the dictionary's distinct value list...



