Login  Register

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

Posted by Naresh P R on Nov 03, 2017; 7:41pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Support-pre-aggregate-table-to-improve-OLAP-performance-tp24040p25500.html

Hi,

I was going through the design document & need some clarification.

1) Can we support creating agg tables while creating main table ?

2) Need some help in understanding "Support rollup table for timeseries
data",

i) Do we have any provision to specify timezone of timestamp column ?
ii) Assuming granularity is till hour level, below 4 agg table will be
created
*agg table 1 : *order_time, year, month, day, hour, country, sex,
sum(quantity), max(quantity), count(user_id),
sum(price), avg(price) group by order_time, year, month, day, hour,
country, sex -- here year, month, day & hour are derived from order_time
timestamp column
*agg table 2 : *order_time, year, month, day, country, sex, sum(quantity),
max(quantity), count(user_id),
sum(price), avg(price) group by order_time, year, month, day, country, sex
*agg table 3 :* order_time, year, month, country, sex, sum(quantity),
max(quantity), count(user_id),
sum(price), avg(price) group by order_time, year, month, country, sex
*agg table 4 : *order_time, year, country, sex, sum(quantity),
max(quantity), count(user_id),
sum(price), avg(price) group by order_time, year, country, sex

Please correct me if my understanding is wrong & provide more insight if
possible.

3) All operations like load, Segment LCM, IUD on Agg tables should be
restricted except select & compaction, right?

4) For new loads in parent table, we are creating new segments in agg table.
Can we have segment to segment mapping with parent & agg table, so that
operations like delete segments, update, delete will not be restricted in
parent table.
Or Can we have a table level flag whether to restrict delete segments,
update, delete operations on parent table if agg tables exist

5) I assume filters also will be pushed to agg tables if all conditions
match agg table columns
eg., CREATE TABLE agg_sales TBLPROPERTIES (parent_table=”xx”) AS SELECT c1,
c2, count(distinct c3) FROM source GROUP BY c1, c2
 SELECT c1, c2, count(distinct c3) FROM source where c1 = 'a' and c2 = 'x'
GROUP BY c1, c2 -- This will use agg table
 SELECT c1, c2, count(distinct c3) FROM source where c1 = 'a' and c2 = 'x'
and c3 = 'y' GROUP BY c1, c2 -- This will not use agg table

 Whether below query use agg table ?
 SELECT c1, c2, count(distinct c3) as x FROM source GROUP BY c1, c2 having
x > 10
Whether having clause will be converted to filter on count(distinct c3)
column on agg table or whether spark handles having clause ?

6) Instead of user identifying & creating agg tables, can we try to use any
ML or stats to understand queries/tables & suggest aggregates ?
eg., based on column stats
----
Regards,
Naresh P R

On Fri, Oct 13, 2017 at 8:03 PM, Jacky Li <[hidden email]> wrote:

> Hi community,
>
> In traditional data warehouse, pre-aggregate table or cube is a common
> technology to improve OLAP query performance. To take carbondata support
> for OLAP to next level, I’d like to propose pre-aggregate table support in
> carbondata.
>
> Please refer to CARBONDATA-1516 <https://issues.apache.org/
> jira/browse/CARBONDATA-1516> and the design document attached in the JIRA
> ticket (https://issues.apache.org/jira/browse/CARBONDATA-1516 <
> https://issues.apache.org/jira/browse/CARBONDATA-1516>)
>
> This design is still in initial phase, proposed usage and SQL syntax are
> subject to change. Please provide your comment to improve this feature.
> Any suggestion on the design from community is welcomed.
>
> Regards,
> Jacky Li