Thanks for the inputs. Please find my comments below.
1. I will change Union to Union All in the query plan (a small Spark SQL sketch follows this list).
2. For auto datamap loading, once the data is loaded into the lower granularity datamap, the higher granularity datamap is loaded from the lower one. If I understand your point correctly, you are suggesting to load it from the main table instead.
3. Similar to point 2, we can decide whether a configuration is needed.
4. a. I think the max of the datamap is required to decide the range for the load; we may need it in failure scenarios.
b. This point will be taken care of.
5. Yes, data load is synchronous in the current design; since it is non-lazy, it happens along with the main table load.
6. Yes, this will be handled.
7. A task has already been added in JIRA.
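
A minimal Spark SQL sketch of the plan rewrite behind point 1 (the table names sales and sales_hour_dm and the columns event_time and sales_amt are illustrative, and the tables are assumed to already exist): plain UNION deduplicates rows, so an aggregate row that happens to be identical in the datamap branch and the main-table branch would be silently dropped, while UNION ALL keeps both.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("timeseries-union-all-sketch")
  .master("local[*]")
  .getOrCreate()

// Day-granularity result rolled up from the hour-level datamap table.
val fromDatamap = spark.sql(
  """SELECT date_trunc('DAY', event_time) AS event_day, SUM(sales_amt) AS total
    |FROM sales_hour_dm
    |GROUP BY date_trunc('DAY', event_time)""".stripMargin)

// Rows not yet covered by the datamap are aggregated from the main table.
val fromMainTable = spark.sql(
  """SELECT date_trunc('DAY', event_time) AS event_day, SUM(sales_amt) AS total
    |FROM sales
    |WHERE event_time > (SELECT MAX(event_time) FROM sales_hour_dm)
    |GROUP BY date_trunc('DAY', event_time)""".stripMargin)

// Dataset.union has UNION ALL semantics and keeps every row from both
// branches; appending .distinct() would mimic SQL UNION and could drop an
// aggregate row that is identical in both branches.
fromDatamap.union(fromMainTable).show()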
> Hi Akash, Thanks for Time Series DataMap proposal.
> Please check below Points.
>
> 1. During query planning, change Union to Union All; otherwise rows will be
> lost if the same value appears.
> 2. Does the system start loading the next granularity level table as soon as
> the data condition is met, or does the next granularity level table have to
> wait until the current granularity level table is finished? Please handle
> this if possible.
> 3. Add a configuration to load multiple ranges at a time (across granularity
> tables).
> 4. Please check whether the current data load's min and max are enough to
> determine the current load range. There is no need to refer to the datamap's
> min/max, because range preparation can go wrong if loading happens from
> multiple drivers. I think the rules below are enough for loading.
> 4.a. Create MV should sync data. On any failure, Rebuild should sync
> again; until then the MV will be disabled.
> 4.b. Each load has independent ranges and should load only those
> ranges. On failure the MV may go into a disabled state (only if an
> intermediate range load fails; failure of the last load will NOT disable
> the MV).
> 5. We can make data loading sync, because queries can anyway be served from
> the fact table if any segment is in-progress in the datamap.
> 6. In the data loading pipeline, on a failure in an intermediate time series
> datamap we can still continue loading the next level's data (ignore if
> already handled; a range sketch follows these points).
> For example:
> DataMaps: Hour, Day, Month level
> Load data (10 days): 2018-01-01 01:00:00 to 2018-01-10 01:00:00
> Failure in the hour level during the range
> 2018-01-06 01:00:00 to 2018-01-06 01:00:00
> At this point the hour level has 5 days of data, so start loading at the day
> level.
> 7. Add a sub-task to support loading of in-between missing time (incremental
> but old records, e.g. if the timeseries device stopped working for some time).
>
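
A rough Scala sketch of the range bookkeeping behind points 4 and 6, under the assumption that a higher granularity (day) is rolled up only for fully covered periods; the function name completedDayRanges and the boundary handling are illustrative, not part of the design document.

import java.time.Instant
import java.time.temporal.ChronoUnit

// Given the min/max timestamp of the current load, return the day ranges that
// are fully covered and can therefore be rolled up into the day-level datamap.
// Partially covered days at either edge are left for a later load.
def completedDayRanges(loadMin: Instant, loadMax: Instant): Seq[(Instant, Instant)] = {
  val truncated = loadMin.truncatedTo(ChronoUnit.DAYS)
  val firstDay =
    if (truncated == loadMin) truncated else truncated.plus(1, ChronoUnit.DAYS)
  Iterator.iterate(firstDay)(_.plus(1, ChronoUnit.DAYS))
    .takeWhile(dayStart => !dayStart.plus(1, ChronoUnit.DAYS).isAfter(loadMax))
    .map(dayStart => (dayStart, dayStart.plus(1, ChronoUnit.DAYS)))
    .toSeq
}

// The 10-day load from point 6 (2018-01-01 01:00 to 2018-01-10 01:00) yields
// the eight fully covered days 2018-01-02 .. 2018-01-09; the partial first and
// last days stay at hour level until later loads complete them.
completedDayRanges(
  Instant.parse("2018-01-01T01:00:00Z"),
  Instant.parse("2018-01-10T01:00:00Z")).foreach(println)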
> On Tue, Oct 1, 2019 at 10:41 AM Akash Nilugal <[hidden email]> wrote:
>
>
> > Hi Vishal,
> >
> > In the design document, in the impact analysis section, there is a topic on
> > compatibility/legacy stores: basically, for old tables, when the datamap is
> > created we load all the timeseries datamaps with the different granularities.
> > I think this should work fine; please let me know if you have further
> > suggestions/comments.
> >
> > Regards,
> > Akash R Nilugal
> >
> > On 2019/09/30 17:09:44, Kumar Vishal <[hidden email]> wrote:
> > > Hi Akash,
> > >
> > > In this design document you haven't mentioned how to handle data loading
> > > for the timeseries datamap for older segments [existing tables].
> > > If the customer's main table data is also stored based on time [increasing
> > > time] in different segments, they can use this feature as well.
> > >
> > > We can discuss and finalize the solution.
> > >
> > > -Regards
> > > Kumar Vishal
> > >
> > > On Mon, Sep 30, 2019 at 2:42 PM Akash Nilugal <[hidden email]> wrote:
> > >
> > > > Hi Ajantha,
> > > >
> > > > Thanks for the queries and suggestions
> > > >
> > > > 1. Yes, this is a good suggestion, I will include this change. Both date
> > > > and timestamp columns are supported; the document will be updated.
> > > > 2. Yes, you are right.
> > > > 3. You are right: if the day level is not available, then we will try to
> > > > get the whole day's data from the hour level; if that is not available,
> > > > as explained in the design document, we will get the data from the
> > > > datamap UNION data from the main table, based on the user query.
> > > >
> > > > Regards,
> > > > Akash R Nilugal
> > > >
> > > >
> > > > On 2019/09/30 06:56:45, Ajantha Bhat <[hidden email]> wrote:
> > > > > + 1 ,
> > > > >
> > > > > I have some suggestions and questions.
> > > > >
> > > > > 1. In DMPROPERTIES, instead of 'timestamp_column' I suggest using
> > > > > 'timeseries_column', so that it does not give the impression that only
> > > > > the timestamp datatype is supported; also update the document with all
> > > > > the supported datatypes (a DDL sketch follows these questions).
> > > > >
> > > > > 2. Querying the datamap table directly is also supported, right? Is
> > > > > rewriting the main table's plan to refer to the datamap table meant to
> > > > > let the user avoid changing their query, or is there another reason?
> > > > >
> > > > > 3. If the user has not created a day granularity datamap but only an
> > > > > hour granularity datamap, and a query is at day granularity, will data
> > > > > be fetched from the hour granularity datamap and aggregated, or will it
> > > > > be fetched from the main table?
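
A hypothetical DDL sketch for question 1, showing what a 'timeseries_column' property could look like. The property names, the mv provider syntax and the table/column names here are illustrative only, and a SparkSession with the CarbonData extensions enabled is assumed; this is not the finalized grammar.

import org.apache.spark.sql.SparkSession

// Assumes the session is created with CarbonData extensions so that the
// CREATE DATAMAP DDL below is recognized; otherwise this statement fails.
val spark = SparkSession.builder()
  .appName("timeseries-dm-ddl-sketch")
  .master("local[*]")
  .getOrCreate()

spark.sql(
  """CREATE DATAMAP sales_hour_dm ON TABLE sales
    |USING 'mv'
    |DMPROPERTIES (
    |  'timeseries_column' = 'event_time',
    |  'granularity' = 'hour')
    |AS SELECT timeseries(event_time, 'hour') AS event_hour,
    |          SUM(sales_amt) AS total_sales
    |   FROM sales
    |   GROUP BY timeseries(event_time, 'hour')""".stripMargin)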
> > > > >
> > > > > Thanks,
> > > > > Ajantha
> > > > >
> > > > > On Mon, Sep 30, 2019 at 11:46 AM Akash Nilugal <[hidden email]> wrote:
> > > > >
> > > > > > Hi xuchuanyin,
> > > > > >
> > > > > > Thanks for the comments/Suggestions
> > > > > >
> > > > > > 1. Preaggregate is productized, but not timeseries with preaggregate;
> > > > > > I think that is where the confusion is, if I understand you correctly.
> > > > > > 2. Limitations such as auto sampling/rollup, which we will be
> > > > > > supporting now, retention policies, etc.
> > > > > > 3. segmentTimestampMin: I will consider this in the design.
> > > > > > 4. RP is added as a separate task. I thought that instead of
> > > > > > maintaining two variables it is better to maintain one and parse it,
> > > > > > but I will consider your point based on feasibility during
> > > > > > implementation.
> > > > > > 5. We use an accumulator which takes a list, so before writing the
> > > > > > index files we take the min/max of the timestamp column, fill the
> > > > > > accumulator, and then access accumulator.value in the driver after the
> > > > > > load is finished (a rough sketch follows this list).
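
A rough sketch of answer 5, using Spark's CollectionAccumulator: each task adds the (min, max) timestamp of the rows it wrote, and the driver reads the accumulator after the job to get the overall range of the load. The names and the simplified load loop are illustrative, not CarbonData's actual loading code.

import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.CollectionAccumulator

val spark = SparkSession.builder()
  .appName("load-minmax-accumulator-sketch")
  .master("local[*]")
  .getOrCreate()

// Registered on the driver, filled on the executors.
val minMaxAcc: CollectionAccumulator[(Long, Long)] =
  spark.sparkContext.collectionAccumulator[(Long, Long)]("loadMinMax")

// Stand-in for the rows of one load (epoch-millis timestamps), 3 partitions.
val timestamps = spark.sparkContext
  .parallelize(Seq(1514768400000L, 1514854800000L, 1515546000000L), 3)

timestamps.foreachPartition { rows =>
  val ts = rows.toSeq
  if (ts.nonEmpty) {
    // Executor side: record this partition's min/max just before the index
    // files for the partition would be written.
    minMaxAcc.add((ts.min, ts.max))
  }
}

// Driver side, after the load job has finished: overall min/max of the load.
val perPartition = minMaxAcc.value.asScala
println(s"load range: ${perPartition.map(_._1).min} .. ${perPartition.map(_._2).max}")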
> > > > > >
> > > > > > Regards,
> > > > > > Akash R Nilugal
> > > > > >
> > > > > > On 2019/09/28 10:46:31, xuchuanyin <[hidden email]> wrote:
> > > > > > > Hi Akash, glad to see the feature proposed, and I have some
> > > > > > > questions about it. Please note that some of the following
> > > > > > > descriptions are quotes from the design document attached in the
> > > > > > > corresponding JIRA, with my comments following the '===' markers.
> > > > > > >
> > > > > > > 1.
> > > > > > > "Currently carbondata supports timeseries on preaggregate
> > datamap,
> > > > but
> > > > > > its
> > > > > > > an alpha feature"
> > > > > > > ===
> > > > > > > It has been some time since the preaggregate datamap was introduced
> > > > > > > and it is still **alpha**; why is it still not product-ready? Will
> > > > > > > the new feature end up in a similar situation?
> > > > > > >
> > > > > > > 2.
> > > > > > > "there are so many limitations when we compare and analyze the
> > > > existing
> > > > > > > timeseries database or projects which supports time series like
> > > > apache
> > > > > > druid
> > > > > > > or influxdb"
> > > > > > > ===
> > > > > > > What are the actual limitations? Besides, please give an example of
> > > > > > > this.
> > > > > > >
> > > > > > > 3.
> > > > > > > "Segment_Timestamp_Min"
> > > > > > > ===
> > > > > > > Suggest using camel-case style like 'segmentTimestampMin'
> > > > > > >
> > > > > > > 4.
> > > > > > > "RP is way of telling the system, for how long the data should be
> > > > kept"
> > > > > > > ===
> > > > > > > Since the function is simple, I'd suggest using 'retentionTime'=15
> > > > > > > and 'timeUnit'='day' instead of 'RP'='15_days' (a small parsing
> > > > > > > sketch follows these points).
> > > > > > >
> > > > > > > 5.
> > > > > > > "When the data load is called for main table, use an spark
> > > > accumulator to
> > > > > > > get the maximum value of timestamp in that load and return to the
> > > > load."
> > > > > > > ===
> > > > > > > How can you get the Spark accumulator? The load is launched using
> > > > > > > loading-by-dataframe, not global-sort-by-spark.
> > > > > > >
> > > > > > > 6.
> > > > > > > For the rest of the content, still reading.
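
An illustrative comparison of the two property styles from point 4; the '15_days' format is taken from this discussion, and the parsing helper below is hypothetical.

// Combined property: one value that must be parsed.
case class RetentionPolicy(value: Int, unit: String)

def parseRp(rp: String): RetentionPolicy = rp.split("_", 2) match {
  case Array(num, unit) => RetentionPolicy(num.toInt, unit)
  case _ => throw new IllegalArgumentException(s"Invalid RP value: $rp")
}

println(parseRp("15_days"))                      // RetentionPolicy(15,days)

// Separate properties as suggested in point 4: no parsing step needed.
val fromTwoProps = RetentionPolicy("15".toInt, "day")
println(fromTwoProps)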
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Sent from:
> > > > > > > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > > > > >
> > > > >
> > > >
> > >
> >
>