Apache CarbonData Dev Mailing List archive

[Discussion] Support pre-aggregate table to improve OLAP performance

Classic

List

Threaded

12 messages Options

Jacky Li

[Discussion] Support pre-aggregate table to improve OLAP performance

Hi community,

In traditional data warehouse, pre-aggregate table or cube is a common technology to improve OLAP query performance. To take carbondata support for OLAP to next level, I’d like to propose pre-aggregate table support in carbondata.

Please refer to CARBONDATA-1516 <https://issues.apache.org/jira/browse/CARBONDATA-1516> and the design document attached in the JIRA ticket (https://issues.apache.org/jira/browse/CARBONDATA-1516 <https://issues.apache.org/jira/browse/CARBONDATA-1516>)

This design is still in initial phase, proposed usage and SQL syntax are subject to change. Please provide your comment to improve this feature.
Any suggestion on the design from community is welcomed.

Regards,
Jacky Li

Liang Chen

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

Administrator

Hi Jacky

Thanks for you started this discussion, this is a great feature in
carbondata.

One question:
For sub_jar "Handle alter table scenarios for aggregation table", please
give more detail info.
Just i viewed the pdf attachment as below, looks no need to do any handles
for agg table if users do alter for main table. so can you provide more
detail, which scenarios need to be handled?
------------------------------------------------------------------------------------------
Adding of new column will not impact agg table.
Deleting or renaming existing column may invalidate agg tables, if it
invalidate, the operation
will be rejected.
User can manually perform this operation and rebuild pre-aggregate table as
update
scenario.

Regards
Liang

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Jacky Li

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

Hi Liang,

For alter table, data update/delete, and delete segment, they are the same.
So I write in document “ User can manually perform this operation and rebuild pre-aggregate table as
update scenario”
User need to drop the associated aggregate table and perform alter table, or data update/delete, or delete segment operation, then he can create the pre-agg table using CTAS command again, and the pre-aggregate table will be rebuilt.

Regards,
Jacky

> 在 2017年10月15日，下午2:50，Liang Chen <[hidden email]> 写道：
>
> Hi Jacky
>
> Thanks for you started this discussion, this is a great feature in
> carbondata.
>
> One question:
> For sub_jar "Handle alter table scenarios for aggregation table", please
> give more detail info.
> Just i viewed the pdf attachment as below, looks no need to do any handles
> for agg table if users do alter for main table. so can you provide more
> detail, which scenarios need to be handled?
> ------------------------------------------------------------------------------------------
> Adding of new column will not impact agg table.
> Deleting or renaming existing column may invalidate agg tables, if it
> invalidate, the operation
> will be rejected.
> User can manually perform this operation and rebuild pre-aggregate table as
> update
> scenario.
>
> Regards
> Liang
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

lionel061201

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

Hi Jacky,
If user want to create a cube on main table, does he/she have to create
multiple pre-aggregate tables? It will be a heavy workload to write so many
CTAS commands. If user only need create a few pre-agg tables, current
carbon already can support this requirement, user can create table first
and then use insert into select statement. The only different is user need
to query the pre-agg table instead of main table.

So maybe we can enable user to create a cube model( in schema or metafile?)
which contains multiple pre-aggregation definition and carbon can create
those pre-agg tables automatically according to the model. That would be
more easy for using and maintenance.

Regards,
Lionel

On Sun, Oct 15, 2017 at 3:56 PM, Jacky Li <[hidden email]> wrote:

> Hi Liang,
>
> For alter table, data update/delete, and delete segment, they are the same.
> So I write in document “ User can manually perform this operation and
> rebuild pre-aggregate table as
> update scenario”
> User need to drop the associated aggregate table and perform alter table,
> or data update/delete, or delete segment operation, then he can create the
> pre-agg table using CTAS command again, and the pre-aggregate table will be
> rebuilt.
>
> Regards,
> Jacky
>
> > 在 2017年10月15日，下午2:50，Liang Chen <[hidden email]> 写道：
> >
> > Hi Jacky
> >
> > Thanks for you started this discussion, this is a great feature in
> > carbondata.
> >
> > One question:
> > For sub_jar "Handle alter table scenarios for aggregation table", please
> > give more detail info.
> > Just i viewed the pdf attachment as below, looks no need to do any
> handles
> > for agg table if users do alter for main table. so can you provide more
> > detail, which scenarios need to be handled?
> > ------------------------------------------------------------
> ------------------------------
> > Adding of new column will not impact agg table.
> > Deleting or renaming existing column may invalidate agg tables, if it
> > invalidate, the operation
> > will be rejected.
> > User can manually perform this operation and rebuild pre-aggregate table
> as
> > update
> > scenario.
> >
> > Regards
> > Liang
> >
> >
> > --
> > Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>
>
>
>

Jacky Li

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

Hi Lu Cao,

In my previous experience on “cube” engine, no matter it is ROLAP or MOLAP, it is something above SQL layer, because it not only need user to establish cube schema by transform metadata from datawarehouse star schema but also the engine defines its own query language like MDX, and many times these languages are not standardized so that different vendor need to provide different BI tools or adaptors for it.
So, although some vendor provides easy-to-use cube management tool, but it at least has two problems: vendor locking and the rigid of the cube mode once it defines. I think these problems are similar as in other vendor specific solution.

Currently one of the strength that carbon store provides is that it complies to standard SQL support by integrating with SparkSQL, Hive, etc. The intention of providing pre-aggregate table support is, it can enable carbon improve OLAP query performance but still stick with standard SQL support, it means all users still can use the same BI/JDBC application/tool which can connect to SparkSQL, Hive, etc.

If carbon should support “cube”, not only need to defines its configuration which may be very complex and non-standard, but also will force user to use vendor specific tools for management and visualization. So, I think before going to this complexity, it is better to provide pre-agg table as the first step.

Although we do not want the full complexity of “cube” on arbitrary data schema, but one special case is for timeseries data. Because time dimension hierarchy (year/month/day/hour/minute/second) is naturally understandable and it is consistent in all scenarios, so we can provide native support for pre-aggregate table on time dimension. Actually it is a cube on time and we can do automatic rollup for all levels in time.

Finally, please note that, by using CTAS syntax, we are not restricting carbon to support pre-aggreagate table only, but also arbitrary materialized view, if we want in the future.

Hope this make things more clear.

Regards,
Jacky

like mandarin provides, Actually, as you can see in the document, I am avoiding to call this “cube”.

> 在 2017年10月15日，下午9:18，Lu Cao <[hidden email]> 写道：
>
> Hi Jacky,
> If user want to create a cube on main table, does he/she have to create
> multiple pre-aggregate tables? It will be a heavy workload to write so many
> CTAS commands. If user only need create a few pre-agg tables, current
> carbon already can support this requirement, user can create table first
> and then use insert into select statement. The only different is user need
> to query the pre-agg table instead of main table.
>
> So maybe we can enable user to create a cube model( in schema or metafile?)
> which contains multiple pre-aggregation definition and carbon can create
> those pre-agg tables automatically according to the model. That would be
> more easy for using and maintenance.
>
> Regards,
> Lionel
>
> On Sun, Oct 15, 2017 at 3:56 PM, Jacky Li <[hidden email]> wrote:
>
>> Hi Liang,
>>
>> For alter table, data update/delete, and delete segment, they are the same.
>> So I write in document “ User can manually perform this operation and
>> rebuild pre-aggregate table as
>> update scenario”
>> User need to drop the associated aggregate table and perform alter table,
>> or data update/delete, or delete segment operation, then he can create the
>> pre-agg table using CTAS command again, and the pre-aggregate table will be
>> rebuilt.
>>
>> Regards,
>> Jacky
>>
>>> 在 2017年10月15日，下午2:50，Liang Chen <[hidden email]> 写道：
>>>
>>> Hi Jacky
>>>
>>> Thanks for you started this discussion, this is a great feature in
>>> carbondata.
>>>
>>> One question:
>>> For sub_jar "Handle alter table scenarios for aggregation table", please
>>> give more detail info.
>>> Just i viewed the pdf attachment as below, looks no need to do any
>> handles
>>> for agg table if users do alter for main table. so can you provide more
>>> detail, which scenarios need to be handled?
>>> ------------------------------------------------------------
>> ------------------------------
>>> Adding of new column will not impact agg table.
>>> Deleting or renaming existing column may invalidate agg tables, if it
>>> invalidate, the operation
>>> will be rejected.
>>> User can manually perform this operation and rebuild pre-aggregate table
>> as
>>> update
>>> scenario.
>>>
>>> Regards
>>> Liang
>>>
>>>
>>> --
>>> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
>> n5.nabble.com/
>>
>>
>>
>>

Liang Chen

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

Administrator

+1 , i agree with Jacky points.
As we know, carbondata already be able to get very good performance for
filter query scenarios through MDK index. supports pre-aggregate in 1.3.0
would improve aggregated query scenarios. so users can use one carbondata
to support all query cases(both filter and agg).

To Lu cao, you mentioned this solution to build cube schema, it is too
complex and there are many limitations, for example: the CUBE data can't
support query detail data etc.

Regards
Liang

Jacky Li wrote

> Hi Lu Cao,
>
> In my previous experience on “cube” engine, no matter it is ROLAP or
> MOLAP, it is something above SQL layer, because it not only need user to
> establish cube schema by transform metadata from datawarehouse star schema
> but also the engine defines its own query language like MDX, and many
> times these languages are not standardized so that different vendor need
> to provide different BI tools or adaptors for it.
> So, although some vendor provides easy-to-use cube management tool, but it
> at least has two problems: vendor locking and the rigid of the cube mode
> once it defines. I think these problems are similar as in other vendor
> specific solution.
>
> Currently one of the strength that carbon store provides is that it
> complies to standard SQL support by integrating with SparkSQL, Hive, etc.
> The intention of providing pre-aggregate table support is, it can enable
> carbon improve OLAP query performance but still stick with standard SQL
> support, it means all users still can use the same BI/JDBC
> application/tool which can connect to SparkSQL, Hive, etc.
>
> If carbon should support “cube”, not only need to defines its
> configuration which may be very complex and non-standard, but also will
> force user to use vendor specific tools for management and visualization.
> So, I think before going to this complexity, it is better to provide
> pre-agg table as the first step.
>
> Although we do not want the full complexity of “cube” on arbitrary data
> schema, but one special case is for timeseries data. Because time
> dimension hierarchy (year/month/day/hour/minute/second) is naturally
> understandable and it is consistent in all scenarios, so we can provide
> native support for pre-aggregate table on time dimension. Actually it is a
> cube on time and we can do automatic rollup for all levels in time.
>
> Finally, please note that, by using CTAS syntax, we are not restricting
> carbon to support pre-aggreagate table only, but also arbitrary
> materialized view, if we want in the future.
>
> Hope this make things more clear.
>
> Regards,
> Jacky
>
>
>
> like mandarin provides, Actually, as you can see in the document, I am
> avoiding to call this “cube”.
>
>
>> 在 2017年10月15日，下午9:18，Lu Cao <

> whucaolu@

> > 写道：
>>
>> Hi Jacky,
>> If user want to create a cube on main table, does he/she have to create
>> multiple pre-aggregate tables? It will be a heavy workload to write so
>> many
>> CTAS commands. If user only need create a few pre-agg tables, current
>> carbon already can support this requirement, user can create table first
>> and then use insert into select statement. The only different is user
>> need
>> to query the pre-agg table instead of main table.
>>
>> So maybe we can enable user to create a cube model( in schema or
>> metafile?)
>> which contains multiple pre-aggregation definition and carbon can create
>> those pre-agg tables automatically according to the model. That would be
>> more easy for using and maintenance.
>>
>> Regards,
>> Lionel
>>
>> On Sun, Oct 15, 2017 at 3:56 PM, Jacky Li <

> jacky.likun@

> > wrote:
>>
>>> Hi Liang,
>>>
>>> For alter table, data update/delete, and delete segment, they are the
>>> same.
>>> So I write in document “ User can manually perform this operation and
>>> rebuild pre-aggregate table as
>>> update scenario”
>>> User need to drop the associated aggregate table and perform alter
>>> table,
>>> or data update/delete, or delete segment operation, then he can create
>>> the
>>> pre-agg table using CTAS command again, and the pre-aggregate table will
>>> be
>>> rebuilt.
>>>
>>> Regards,
>>> Jacky
>>>
>>>> 在 2017年10月15日，下午2:50，Liang Chen <

> chenliang6136@

> > 写道：
>>>>
>>>> Hi Jacky
>>>>
>>>> Thanks for you started this discussion, this is a great feature in
>>>> carbondata.
>>>>
>>>> One question:
>>>> For sub_jar "Handle alter table scenarios for aggregation table",
>>>> please
>>>> give more detail info.
>>>> Just i viewed the pdf attachment as below, looks no need to do any
>>> handles
>>>> for agg table if users do alter for main table. so can you provide more
>>>> detail, which scenarios need to be handled?
>>>> ------------------------------------------------------------
>>> ------------------------------
>>>> Adding of new column will not impact agg table.
>>>> Deleting or renaming existing column may invalidate agg tables, if it
>>>> invalidate, the operation
>>>> will be rejected.
>>>> User can manually perform this operation and rebuild pre-aggregate
>>>> table
>>> as
>>>> update
>>>> scenario.
>>>>
>>>> Regards
>>>> Liang
>>>>
>>>>
>>>> --
>>>> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
>>> n5.nabble.com/
>>>
>>>
>>>
>>>

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

bhavya411

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

Hi Dev,

For the Pre Aggregate tables how will we handle subsequent loads, will we
be running the query on the whole table and calculating the aggregations
again and then deleting the existing segment and creating the new segments
for whole data. With the above approach as the data increases in the main
table the Loading time will also be increasing substantially. Other way is
to intelligently determine the new values by querying the latest segment
and using them in collaboration with the existing pre-aggregated tables.
Please share your thoughts about it in this discussion.

Regards
Bhavya

On Mon, Oct 16, 2017 at 4:53 PM, Liang Chen <[hidden email]> wrote:

> +1 , i agree with Jacky points.
> As we know, carbondata already be able to get very good performance for
> filter query scenarios through MDK index. supports pre-aggregate in 1.3.0
> would improve aggregated query scenarios. so users can use one carbondata
> to support all query cases(both filter and agg).
>
> To Lu cao, you mentioned this solution to build cube schema, it is too
> complex and there are many limitations, for example: the CUBE data can't
> support query detail data etc.
>
> Regards
> Liang
>
>
> Jacky Li wrote
> > Hi Lu Cao,
> >
> > In my previous experience on “cube” engine, no matter it is ROLAP or
> > MOLAP, it is something above SQL layer, because it not only need user to
> > establish cube schema by transform metadata from datawarehouse star
> schema
> > but also the engine defines its own query language like MDX, and many
> > times these languages are not standardized so that different vendor need
> > to provide different BI tools or adaptors for it.
> > So, although some vendor provides easy-to-use cube management tool, but
> it
> > at least has two problems: vendor locking and the rigid of the cube mode
> > once it defines. I think these problems are similar as in other vendor
> > specific solution.
> >
> > Currently one of the strength that carbon store provides is that it
> > complies to standard SQL support by integrating with SparkSQL, Hive, etc.
> > The intention of providing pre-aggregate table support is, it can enable
> > carbon improve OLAP query performance but still stick with standard SQL
> > support, it means all users still can use the same BI/JDBC
> > application/tool which can connect to SparkSQL, Hive, etc.
> >
> > If carbon should support “cube”, not only need to defines its
> > configuration which may be very complex and non-standard, but also will
> > force user to use vendor specific tools for management and visualization.
> > So, I think before going to this complexity, it is better to provide
> > pre-agg table as the first step.
> >
> > Although we do not want the full complexity of “cube” on arbitrary data
> > schema, but one special case is for timeseries data. Because time
> > dimension hierarchy (year/month/day/hour/minute/second) is naturally
> > understandable and it is consistent in all scenarios, so we can provide
> > native support for pre-aggregate table on time dimension. Actually it is
> a
> > cube on time and we can do automatic rollup for all levels in time.
> >
> > Finally, please note that, by using CTAS syntax, we are not restricting
> > carbon to support pre-aggreagate table only, but also arbitrary
> > materialized view, if we want in the future.
> >
> > Hope this make things more clear.
> >
> > Regards,
> > Jacky
> >
> >
> >
> > like mandarin provides, Actually, as you can see in the document, I am
> > avoiding to call this “cube”.
> >
> >
> >> 在 2017年10月15日，下午9:18，Lu Cao <
>
> > whucaolu@
>
> > > 写道：
> >>
> >> Hi Jacky,
> >> If user want to create a cube on main table, does he/she have to create
> >> multiple pre-aggregate tables? It will be a heavy workload to write so
> >> many
> >> CTAS commands. If user only need create a few pre-agg tables, current
> >> carbon already can support this requirement, user can create table first
> >> and then use insert into select statement. The only different is user
> >> need
> >> to query the pre-agg table instead of main table.
> >>
> >> So maybe we can enable user to create a cube model( in schema or
> >> metafile?)
> >> which contains multiple pre-aggregation definition and carbon can create
> >> those pre-agg tables automatically according to the model. That would be
> >> more easy for using and maintenance.
> >>
> >> Regards,
> >> Lionel
> >>
> >> On Sun, Oct 15, 2017 at 3:56 PM, Jacky Li <
>
> > jacky.likun@
>
> > > wrote:
> >>
> >>> Hi Liang,
> >>>
> >>> For alter table, data update/delete, and delete segment, they are the
> >>> same.
> >>> So I write in document “ User can manually perform this operation and
> >>> rebuild pre-aggregate table as
> >>> update scenario”
> >>> User need to drop the associated aggregate table and perform alter
> >>> table,
> >>> or data update/delete, or delete segment operation, then he can create
> >>> the
> >>> pre-agg table using CTAS command again, and the pre-aggregate table
> will
> >>> be
> >>> rebuilt.
> >>>
> >>> Regards,
> >>> Jacky
> >>>
> >>>> 在 2017年10月15日，下午2:50，Liang Chen <
>
> > chenliang6136@
>
> > > 写道：
> >>>>
> >>>> Hi Jacky
> >>>>
> >>>> Thanks for you started this discussion, this is a great feature in
> >>>> carbondata.
> >>>>
> >>>> One question:
> >>>> For sub_jar "Handle alter table scenarios for aggregation table",
> >>>> please
> >>>> give more detail info.
> >>>> Just i viewed the pdf attachment as below, looks no need to do any
> >>> handles
> >>>> for agg table if users do alter for main table. so can you provide
> more
> >>>> detail, which scenarios need to be handled?
> >>>> ------------------------------------------------------------
> >>> ------------------------------
> >>>> Adding of new column will not impact agg table.
> >>>> Deleting or renaming existing column may invalidate agg tables, if it
> >>>> invalidate, the operation
> >>>> will be rejected.
> >>>> User can manually perform this operation and rebuild pre-aggregate
> >>>> table
> >>> as
> >>>> update
> >>>> scenario.
> >>>>
> >>>> Regards
> >>>> Liang
> >>>>
> >>>>
> >>>> --
> >>>> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> >>> n5.nabble.com/
> >>>
> >>>
> >>>
> >>>
>
>
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>

ravipesala

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

Hi Bhavya,

For pre-aggregate table load, we will not delete old data and calculate
aggregation every time. We load aggregation tables also incrementally along
with the main table. For suppose if we create an aggregation table on the
main table then aggregation table is calculated and loaded with the
existing data of the main table. For subsequent loads on the main load,
aggregation table also calculated incrementally only for new data and
loaded as a new segment.

Regards,
Ravindra.

On 17 October 2017 at 13:34, Bhavya Aggarwal <[hidden email]> wrote:

> Hi Dev,
>
> For the Pre Aggregate tables how will we handle subsequent loads, will we
> be running the query on the whole table and calculating the aggregations
> again and then deleting the existing segment and creating the new segments
> for whole data. With the above approach as the data increases in the main
> table the Loading time will also be increasing substantially. Other way is
> to intelligently determine the new values by querying the latest segment
> and using them in collaboration with the existing pre-aggregated tables.
> Please share your thoughts about it in this discussion.
>
> Regards
> Bhavya
>
> On Mon, Oct 16, 2017 at 4:53 PM, Liang Chen <[hidden email]>
> wrote:
>
> > +1 , i agree with Jacky points.
> > As we know, carbondata already be able to get very good performance for
> > filter query scenarios through MDK index. supports pre-aggregate in
> 1.3.0
> > would improve aggregated query scenarios. so users can use one
> carbondata
> > to support all query cases(both filter and agg).
> >
> > To Lu cao, you mentioned this solution to build cube schema, it is too
> > complex and there are many limitations, for example: the CUBE data can't
> > support query detail data etc.
> >
> > Regards
> > Liang
> >
> >
> > Jacky Li wrote
> > > Hi Lu Cao,
> > >
> > > In my previous experience on “cube” engine, no matter it is ROLAP or
> > > MOLAP, it is something above SQL layer, because it not only need user
> to
> > > establish cube schema by transform metadata from datawarehouse star
> > schema
> > > but also the engine defines its own query language like MDX, and many
> > > times these languages are not standardized so that different vendor
> need
> > > to provide different BI tools or adaptors for it.
> > > So, although some vendor provides easy-to-use cube management tool, but
> > it
> > > at least has two problems: vendor locking and the rigid of the cube
> mode
> > > once it defines. I think these problems are similar as in other vendor
> > > specific solution.
> > >
> > > Currently one of the strength that carbon store provides is that it
> > > complies to standard SQL support by integrating with SparkSQL, Hive,
> etc.
> > > The intention of providing pre-aggregate table support is, it can
> enable
> > > carbon improve OLAP query performance but still stick with standard SQL
> > > support, it means all users still can use the same BI/JDBC
> > > application/tool which can connect to SparkSQL, Hive, etc.
> > >
> > > If carbon should support “cube”, not only need to defines its
> > > configuration which may be very complex and non-standard, but also will
> > > force user to use vendor specific tools for management and
> visualization.
> > > So, I think before going to this complexity, it is better to provide
> > > pre-agg table as the first step.
> > >
> > > Although we do not want the full complexity of “cube” on arbitrary data
> > > schema, but one special case is for timeseries data. Because time
> > > dimension hierarchy (year/month/day/hour/minute/second) is naturally
> > > understandable and it is consistent in all scenarios, so we can provide
> > > native support for pre-aggregate table on time dimension. Actually it
> is
> > a
> > > cube on time and we can do automatic rollup for all levels in time.
> > >
> > > Finally, please note that, by using CTAS syntax, we are not restricting
> > > carbon to support pre-aggreagate table only, but also arbitrary
> > > materialized view, if we want in the future.
> > >
> > > Hope this make things more clear.
> > >
> > > Regards,
> > > Jacky
> > >
> > >
> > >
> > > like mandarin provides, Actually, as you can see in the document, I am
> > > avoiding to call this “cube”.
> > >
> > >
> > >> 在 2017年10月15日，下午9:18，Lu Cao <
> >
> > > whucaolu@
> >
> > > > 写道：
> > >>
> > >> Hi Jacky,
> > >> If user want to create a cube on main table, does he/she have to
> create
> > >> multiple pre-aggregate tables? It will be a heavy workload to write so
> > >> many
> > >> CTAS commands. If user only need create a few pre-agg tables, current
> > >> carbon already can support this requirement, user can create table
> first
> > >> and then use insert into select statement. The only different is user
> > >> need
> > >> to query the pre-agg table instead of main table.
> > >>
> > >> So maybe we can enable user to create a cube model( in schema or
> > >> metafile?)
> > >> which contains multiple pre-aggregation definition and carbon can
> create
> > >> those pre-agg tables automatically according to the model. That would
> be
> > >> more easy for using and maintenance.
> > >>
> > >> Regards,
> > >> Lionel
> > >>
> > >> On Sun, Oct 15, 2017 at 3:56 PM, Jacky Li <
> >
> > > jacky.likun@
> >
> > > > wrote:
> > >>
> > >>> Hi Liang,
> > >>>
> > >>> For alter table, data update/delete, and delete segment, they are the
> > >>> same.
> > >>> So I write in document “ User can manually perform this operation and
> > >>> rebuild pre-aggregate table as
> > >>> update scenario”
> > >>> User need to drop the associated aggregate table and perform alter
> > >>> table,
> > >>> or data update/delete, or delete segment operation, then he can
> create
> > >>> the
> > >>> pre-agg table using CTAS command again, and the pre-aggregate table
> > will
> > >>> be
> > >>> rebuilt.
> > >>>
> > >>> Regards,
> > >>> Jacky
> > >>>
> > >>>> 在 2017年10月15日，下午2:50，Liang Chen <
> >
> > > chenliang6136@
> >
> > > > 写道：
> > >>>>
> > >>>> Hi Jacky
> > >>>>
> > >>>> Thanks for you started this discussion, this is a great feature in
> > >>>> carbondata.
> > >>>>
> > >>>> One question:
> > >>>> For sub_jar "Handle alter table scenarios for aggregation table",
> > >>>> please
> > >>>> give more detail info.
> > >>>> Just i viewed the pdf attachment as below, looks no need to do any
> > >>> handles
> > >>>> for agg table if users do alter for main table. so can you provide
> > more
> > >>>> detail, which scenarios need to be handled?
> > >>>> ------------------------------------------------------------
> > >>> ------------------------------
> > >>>> Adding of new column will not impact agg table.
> > >>>> Deleting or renaming existing column may invalidate agg tables, if
> it
> > >>>> invalidate, the operation
> > >>>> will be rejected.
> > >>>> User can manually perform this operation and rebuild pre-aggregate
> > >>>> table
> > >>> as
> > >>>> update
> > >>>> scenario.
> > >>>>
> > >>>> Regards
> > >>>> Liang
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Sent from: http://apache-carbondata-dev-
> mailing-list-archive.1130556.
> > >>> n5.nabble.com/
> > >>>
> > >>>
> > >>>
> > >>>
> >
> >
> >
> >
> >
> > --
> > Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> > n5.nabble.com/
> >
>

--
Thanks & Regards,
Ravi

bill.zhou

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

In reply to this post by Jacky Li

hi Jacky & Ravindra, I have little more query about this design, thank you
very much can clarify my query.

1. if we support create aggreagation tables from two or more tabels join,
how to set the aggretate.parent?, whether can be like
'aggretate.parent'='fact1,dim1,dim1'
2. what's the agg table colum name ? for following create command it will be
as: user_id,name,c2, price ?
CREATE TABLE agg_sales
STORED BY 'carbondata'
TBLPROPERTIES ('aggregate.parent'='sales')
AS SELECT user_id,user_name as name, sum(quantity) as c2, avg(price) FROM
sales GROUP BY user_id.
3. if we create the dictioanry column in agg table, whether the dictionary
file will use the same one main table?

4. for rollup table main table creation: what's the mean for
timeseries.eventtime, granualarity? what's column can belong to this?
5. for rollup table main table creation: what's the mean for
‘timeseries.aggtype’ =’quantity:sum, max', it means the column quantity only
support sum, max ?

6. In both the above cases carbon generates the 4 pre-aggregation tables
automatically for
year, month, day and hour. (their table name will be prefixed with
agg_sales). -- in about cause only see the column hour, how to generate the
year, month and day ?

7.In internal implementation, carbon will create these table with
SORT_COLUMNS=’group by
column defined above’, so that filter group by query on main table will be
faster because it
can leverage the index in pre-aggregate tables. -- I suggstion user can
control the sort columns order
8. whether support merge index to agg table ? -- it is usefull.

Jacky Li wrote

> Hi community,
>
> In traditional data warehouse, pre-aggregate table or cube is a common
> technology to improve OLAP query performance. To take carbondata support
> for OLAP to next level, I’d like to propose pre-aggregate table support in
> carbondata.
>
> Please refer to CARBONDATA-1516
> <https://issues.apache.org/jira/browse/CARBONDATA-1516> and the
> design document attached in the JIRA ticket
> (https://issues.apache.org/jira/browse/CARBONDATA-1516
> <https://issues.apache.org/jira/browse/CARBONDATA-1516>)
>
> This design is still in initial phase, proposed usage and SQL syntax are
> subject to change. Please provide your comment to improve this feature.
> Any suggestion on the design from community is welcomed.
>
> Regards,
> Jacky Li

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Naresh P R

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

In reply to this post by Jacky Li

Hi,

I was going through the design document & need some clarification.

1) Can we support creating agg tables while creating main table ?

2) Need some help in understanding "Support rollup table for timeseries
data",

i) Do we have any provision to specify timezone of timestamp column ?
ii) Assuming granularity is till hour level, below 4 agg table will be
created
*agg table 1 : *order_time, year, month, day, hour, country, sex,
sum(quantity), max(quantity), count(user_id),
sum(price), avg(price) group by order_time, year, month, day, hour,
country, sex -- here year, month, day & hour are derived from order_time
timestamp column
*agg table 2 : *order_time, year, month, day, country, sex, sum(quantity),
max(quantity), count(user_id),
sum(price), avg(price) group by order_time, year, month, day, country, sex
*agg table 3 :* order_time, year, month, country, sex, sum(quantity),
max(quantity), count(user_id),
sum(price), avg(price) group by order_time, year, month, country, sex
*agg table 4 : *order_time, year, country, sex, sum(quantity),
max(quantity), count(user_id),
sum(price), avg(price) group by order_time, year, country, sex

Please correct me if my understanding is wrong & provide more insight if
possible.

3) All operations like load, Segment LCM, IUD on Agg tables should be
restricted except select & compaction, right?

4) For new loads in parent table, we are creating new segments in agg table.
Can we have segment to segment mapping with parent & agg table, so that
operations like delete segments, update, delete will not be restricted in
parent table.
Or Can we have a table level flag whether to restrict delete segments,
update, delete operations on parent table if agg tables exist

5) I assume filters also will be pushed to agg tables if all conditions
match agg table columns
eg., CREATE TABLE agg_sales TBLPROPERTIES (parent_table=”xx”) AS SELECT c1,
c2, count(distinct c3) FROM source GROUP BY c1, c2
SELECT c1, c2, count(distinct c3) FROM source where c1 = 'a' and c2 = 'x'
GROUP BY c1, c2 -- This will use agg table
SELECT c1, c2, count(distinct c3) FROM source where c1 = 'a' and c2 = 'x'
and c3 = 'y' GROUP BY c1, c2 -- This will not use agg table

Whether below query use agg table ?
SELECT c1, c2, count(distinct c3) as x FROM source GROUP BY c1, c2 having
x > 10
Whether having clause will be converted to filter on count(distinct c3)
column on agg table or whether spark handles having clause ?

6) Instead of user identifying & creating agg tables, can we try to use any
ML or stats to understand queries/tables & suggest aggregates ?
eg., based on column stats
----
Regards,
Naresh P R

On Fri, Oct 13, 2017 at 8:03 PM, Jacky Li <[hidden email]> wrote:

> Hi community,
>
> In traditional data warehouse, pre-aggregate table or cube is a common
> technology to improve OLAP query performance. To take carbondata support
> for OLAP to next level, I’d like to propose pre-aggregate table support in
> carbondata.
>
> Please refer to CARBONDATA-1516 <https://issues.apache.org/
> jira/browse/CARBONDATA-1516> and the design document attached in the JIRA
> ticket (https://issues.apache.org/jira/browse/CARBONDATA-1516 <
> https://issues.apache.org/jira/browse/CARBONDATA-1516>)
>
> This design is still in initial phase, proposed usage and SQL syntax are
> subject to change. Please provide your comment to improve this feature.
> Any suggestion on the design from community is welcomed.
>
> Regards,
> Jacky Li

ravipesala

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

In reply to this post by bill.zhou

Hi Bill,

Please find my comments.

1. We are not supporting join queries in this design so it will be always
one parent table for an aggregate table. We may consider the join queries
for creating aggregation queries in future.

2. Aggregation column name will be created internally and it would be line
agg_parentcolumnname.

3. Yes if we create aggtable on dictionary column of parent table then it
uses same parent dictionary. Aggregation table does not generate any
dictionary files.

4. time-series.eventtime is the time column of the main table, there should
be at least one timestamp column on the main table to create
timeseries tables. In design, the granularity is replaced with hierarchy it
means the user can give the time hierarchy like a minute, hour, day so
three aggregation tables of a minute , hour and day aggregation tables will
be created automatically and loaded the data to them for every load.

5. In new design v1.1 it is now changed please check the same.

6. As I mentioned above in new V1.1 design it got changed to hierarchy so
user can define his own time hierarchy.

7. Ok, we will discuss and check whether we can expose this SORT_COLUMNS
configuration on aggregation table. Even if we don't support now we can
expose in future.

8. Yes, merge index s applicable for aggregation table as well.

Regards,
Ravindra.

On 3 November 2017 at 09:05, bill.zhou <[hidden email]> wrote:

> hi Jacky & Ravindra, I have little more query about this design, thank you
> very much can clarify my query.
>
>
> 1. if we support create aggreagation tables from two or more tabels join,
> how to set the aggretate.parent?, whether can be like
> 'aggretate.parent'='fact1,dim1,dim1'
> 2. what's the agg table colum name ? for following create command it will
> be
> as: user_id,name,c2, price ?
> CREATE TABLE agg_sales
> STORED BY 'carbondata'
> TBLPROPERTIES ('aggregate.parent'='sales')
> AS SELECT user_id,user_name as name, sum(quantity) as c2, avg(price) FROM
> sales GROUP BY user_id.
> 3. if we create the dictioanry column in agg table, whether the dictionary
> file will use the same one main table?
>
> 4. for rollup table main table creation: what's the mean for
> timeseries.eventtime, granualarity? what's column can belong to this?
> 5. for rollup table main table creation: what's the mean for
> ‘timeseries.aggtype’ =’quantity:sum, max', it means the column quantity
> only
> support sum, max ?
>
> 6. In both the above cases carbon generates the 4 pre-aggregation tables
> automatically for
> year, month, day and hour. (their table name will be prefixed with
> agg_sales). -- in about cause only see the column hour, how to generate the
> year, month and day ?
>
> 7.In internal implementation, carbon will create these table with
> SORT_COLUMNS=’group by
> column defined above’, so that filter group by query on main table will be
> faster because it
> can leverage the index in pre-aggregate tables. -- I suggstion user can
> control the sort columns order
> 8. whether support merge index to agg table ? -- it is usefull.
>
>
> Jacky Li wrote
> > Hi community,
> >
> > In traditional data warehouse, pre-aggregate table or cube is a common
> > technology to improve OLAP query performance. To take carbondata support
> > for OLAP to next level, I’d like to propose pre-aggregate table support
> in
> > carbondata.
> >
> > Please refer to CARBONDATA-1516
> > <https://issues.apache.org/jira/browse/CARBONDATA-1516> and the
> > design document attached in the JIRA ticket
> > (https://issues.apache.org/jira/browse/CARBONDATA-1516
> > <https://issues.apache.org/jira/browse/CARBONDATA-1516>)
> >
> > This design is still in initial phase, proposed usage and SQL syntax are
> > subject to change. Please provide your comment to improve this feature.
> > Any suggestion on the design from community is welcomed.
> >
> > Regards,
> > Jacky Li
>
>
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>

--
Thanks & Regards,
Ravi

bill.zhou

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

hi Ravindra

as design mention is not support drop segment if create pre-agg table. so
if like this how to support the table data retention ?
Regards

ravipesala wrote

> Hi Bill,
>
> Please find my comments.
>
> 1. We are not supporting join queries in this design so it will be always
> one parent table for an aggregate table. We may consider the join queries
> for creating aggregation queries in future.
>
> 2. Aggregation column name will be created internally and it would be line
> agg_parentcolumnname.
>
> 3. Yes if we create aggtable on dictionary column of parent table then it
> uses same parent dictionary. Aggregation table does not generate any
> dictionary files.
>
> 4. time-series.eventtime is the time column of the main table, there
> should
> be at least one timestamp column on the main table to create
> timeseries tables. In design, the granularity is replaced with hierarchy
> it
> means the user can give the time hierarchy like a minute, hour, day so
> three aggregation tables of a minute , hour and day aggregation tables
> will
> be created automatically and loaded the data to them for every load.
>
> 5. In new design v1.1 it is now changed please check the same.
>
> 6. As I mentioned above in new V1.1 design it got changed to hierarchy so
> user can define his own time hierarchy.
>
> 7. Ok, we will discuss and check whether we can expose this SORT_COLUMNS
> configuration on aggregation table. Even if we don't support now we can
> expose in future.
>
> 8. Yes, merge index s applicable for aggregation table as well.
>
> Regards,
> Ravindra.
>
> On 3 November 2017 at 09:05, bill.zhou <

> zgcsky08@

> > wrote:
>
>> hi Jacky & Ravindra, I have little more query about this design, thank
>> you
>> very much can clarify my query.
>>
>>
>> 1. if we support create aggreagation tables from two or more tabels join,
>> how to set the aggretate.parent?, whether can be like
>> 'aggretate.parent'='fact1,dim1,dim1'
>> 2. what's the agg table colum name ? for following create command it will
>> be
>> as: user_id,name,c2, price ?
>> CREATE TABLE agg_sales
>> STORED BY 'carbondata'
>> TBLPROPERTIES ('aggregate.parent'='sales')
>> AS SELECT user_id,user_name as name, sum(quantity) as c2, avg(price) FROM
>> sales GROUP BY user_id.
>> 3. if we create the dictioanry column in agg table, whether the
>> dictionary
>> file will use the same one main table?
>>
>> 4. for rollup table main table creation: what's the mean for
>> timeseries.eventtime, granualarity? what's column can belong to this?
>> 5. for rollup table main table creation: what's the mean for
>> ‘timeseries.aggtype’ =’quantity:sum, max', it means the column quantity
>> only
>> support sum, max ?
>>
>> 6. In both the above cases carbon generates the 4 pre-aggregation tables
>> automatically for
>> year, month, day and hour. (their table name will be prefixed with
>> agg_sales). -- in about cause only see the column hour, how to generate
>> the
>> year, month and day ?
>>
>> 7.In internal implementation, carbon will create these table with
>> SORT_COLUMNS=’group by
>> column defined above’, so that filter group by query on main table will
>> be
>> faster because it
>> can leverage the index in pre-aggregate tables. -- I suggstion user can
>> control the sort columns order
>> 8. whether support merge index to agg table ? -- it is usefull.
>>
>>
>> Jacky Li wrote
>> > Hi community,
>> >
>> > In traditional data warehouse, pre-aggregate table or cube is a common
>> > technology to improve OLAP query performance. To take carbondata
>> support
>> > for OLAP to next level, I’d like to propose pre-aggregate table support
>> in
>> > carbondata.
>> >
>> > Please refer to CARBONDATA-1516
>> > <https://issues.apache.org/jira/browse/CARBONDATA-1516> and the
>> > design document attached in the JIRA ticket
>> > (https://issues.apache.org/jira/browse/CARBONDATA-1516
>> > <https://issues.apache.org/jira/browse/CARBONDATA-1516>)
>> >
>> > This design is still in initial phase, proposed usage and SQL syntax
>> are
>> > subject to change. Please provide your comment to improve this feature.
>> > Any suggestion on the design from community is welcomed.
>> >
>> > Regards,
>> > Jacky Li
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
>> n5.nabble.com/
>>
>
>
>
> --
> Thanks & Regards,
> Ravi

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/