Posted by
Naresh P R on
Nov 03, 2017; 7:41pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Support-pre-aggregate-table-to-improve-OLAP-performance-tp24040p25500.html
Hi,
I was going through the design document & need some clarification.
1) Can we support creating agg tables while creating main table ?
2) Need some help in understanding "Support rollup table for timeseries
data",
i) Do we have any provision to specify timezone of timestamp column ?
ii) Assuming granularity is till hour level, below 4 agg table will be
created
*agg table 1 : *order_time, year, month, day, hour, country, sex,
sum(quantity), max(quantity), count(user_id),
sum(price), avg(price) group by order_time, year, month, day, hour,
country, sex -- here year, month, day & hour are derived from order_time
timestamp column
*agg table 2 : *order_time, year, month, day, country, sex, sum(quantity),
max(quantity), count(user_id),
sum(price), avg(price) group by order_time, year, month, day, country, sex
*agg table 3 :* order_time, year, month, country, sex, sum(quantity),
max(quantity), count(user_id),
sum(price), avg(price) group by order_time, year, month, country, sex
*agg table 4 : *order_time, year, country, sex, sum(quantity),
max(quantity), count(user_id),
sum(price), avg(price) group by order_time, year, country, sex
Please correct me if my understanding is wrong & provide more insight if
possible.
3) All operations like load, Segment LCM, IUD on Agg tables should be
restricted except select & compaction, right?
4) For new loads in parent table, we are creating new segments in agg table.
Can we have segment to segment mapping with parent & agg table, so that
operations like delete segments, update, delete will not be restricted in
parent table.
Or Can we have a table level flag whether to restrict delete segments,
update, delete operations on parent table if agg tables exist
5) I assume filters also will be pushed to agg tables if all conditions
match agg table columns
eg., CREATE TABLE agg_sales TBLPROPERTIES (parent_table=”xx”) AS SELECT c1,
c2, count(distinct c3) FROM source GROUP BY c1, c2
SELECT c1, c2, count(distinct c3) FROM source where c1 = 'a' and c2 = 'x'
GROUP BY c1, c2 -- This will use agg table
SELECT c1, c2, count(distinct c3) FROM source where c1 = 'a' and c2 = 'x'
and c3 = 'y' GROUP BY c1, c2 -- This will not use agg table
Whether below query use agg table ?
SELECT c1, c2, count(distinct c3) as x FROM source GROUP BY c1, c2 having
x > 10
Whether having clause will be converted to filter on count(distinct c3)
column on agg table or whether spark handles having clause ?
6) Instead of user identifying & creating agg tables, can we try to use any
ML or stats to understand queries/tables & suggest aggregates ?
eg., based on column stats
----
Regards,
Naresh P R
On Fri, Oct 13, 2017 at 8:03 PM, Jacky Li <
[hidden email]> wrote:
> Hi community,
>
> In traditional data warehouse, pre-aggregate table or cube is a common
> technology to improve OLAP query performance. To take carbondata support
> for OLAP to next level, I’d like to propose pre-aggregate table support in
> carbondata.
>
> Please refer to CARBONDATA-1516 <
https://issues.apache.org/> jira/browse/CARBONDATA-1516> and the design document attached in the JIRA
> ticket (
https://issues.apache.org/jira/browse/CARBONDATA-1516 <
>
https://issues.apache.org/jira/browse/CARBONDATA-1516>)
>
> This design is still in initial phase, proposed usage and SQL syntax are
> subject to change. Please provide your comment to improve this feature.
> Any suggestion on the design from community is welcomed.
>
> Regards,
> Jacky Li