Hi,
Currently in CarbonData we have datamaps like preaggregate, Lucene, bloom and MV, and we have lazy and non-lazy ways to load data into them. Lazy load is not allowed for datamaps like preaggregate, Lucene and bloom, but it is allowed for the MV datamap. In the lazy load of the MV datamap, for every rebuild (load to the datamap) we load the complete data of the main table and overwrite the existing segment in the datamap based on the datamap query. This is very costly in terms of performance, and we also need to support lazy and non-lazy load for all the datamaps.

This can help reduce the actual data load time of the main table, and whenever the user wants, they can do the lazy load for the datamaps present on that table. Basically, we need not overwrite the existing data every time we load to the datamap; instead we should add the new data as new segments, similar to the main table. This will give better performance. A rough sketch of the commands involved and the proposed behaviour is shown after this mail.

Please give your inputs or get back for any clarifications.

JIRA created to track this: https://issues.apache.org/jira/browse/CARBONDATA-3296
Design document: https://docs.google.com/document/d/13XgEBUIqaAKdrlQftebr5BNOplL3u9qxuFe-IJUB3cM/edit#heading=h.h311u6t3pve9

Regards,
Akash
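For readers following along, a minimal sketch of the flow being discussed, written against CarbonData's documented datamap DDL; the table/datamap names and file paths are illustrative, and the incremental behaviour in the final comment is the proposal in this thread, not the current behaviour:

// Minimal sketch of the MV datamap flow under discussion (names/paths illustrative).
// Today every rebuild reloads the full main table and overwrites the datamap data;
// the proposal is to load only the delta into new datamap segments.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mv-lazy-load-sketch").getOrCreate()

// MV datamaps are lazy: data reaches the datamap only when REBUILD is fired.
spark.sql(
  """CREATE DATAMAP sales_agg USING 'mv' WITH DEFERRED REBUILD
    |AS SELECT country, sum(amount) FROM sales GROUP BY country""".stripMargin)

spark.sql("LOAD DATA INPATH '/data/sales_1.csv' INTO TABLE sales") // segment 0
spark.sql("LOAD DATA INPATH '/data/sales_2.csv' INTO TABLE sales") // segment 1

// Current behaviour: this rebuild scans ALL main table segments and overwrites
// the existing MV data. Proposed behaviour: pick up only the segments loaded
// since the last rebuild and append them as a new datamap segment.
spark.sql("REBUILD DATAMAP sales_agg")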
Hi Akash, please note that if the index datamap supports lazy build, there could be a situation where some segments have corresponding index data while others do not, which means CarbonData has to handle this during query.
We cannot simply discard the existing index, otherwise query performance will be affected. I hope your design handles this properly. Thanks.
Sent from my Huawei phone
In reply to this post by akashrn5
Hi Akash,
It is good to have this feature, and I expect to get a good understanding of the design and solution from the design document. Please clarify the points below.

(1) How are we planning to support lazy loads? Is there a one-to-one mapping between the segments of the main table and the datamaps?
(2) If a one-to-one mapping is to be maintained, let's take pre-aggregate datamaps: even though 'n' segments are loaded to the main table, if the user creates a pre-aggregate datamap after those 'n' loads, the current pre-aggregate implementation creates only one segment, so the one-to-one mapping is broken. The old store is in this form; what is the plan to handle the old store?
(3) If the datamap is not updated/in line with the main table loads, will the system fall back to main table pruning? If so, is checking the status of the datamaps for every query (to decide whether to hit the datamaps) a time-consuming process?

Hope all these points will be covered in the design document.

Thanks,
Dhatchayani
Hi dhatchayani,
Please find my comments below.

1. Yes, you are right; the design document covers this. In the datamap status file we will add the mapping that tracks synchronization between the main table and the datamap, and based on that the incremental load is done. An illustrative sketch of such a mapping is shown after this mail.
2. In general: if the main table has 10 segments and you then do a lazy load, the first segment of the datamap is created with the data of those 10 segments. If 2 more loads are then done to the main table, the next lazy load puts only those two main table segments' data into the second segment of the datamap. For preaggregate we will keep the same behaviour as before.
3. Currently also, if you consider the MV datamap, it supports only lazy load. When the main table data and the datamap data are not synced, the datamap is not selected for pruning, and to decide this we read the datamap status file every time and make the plan accordingly. So there is no issue with reading the status.

I hope I cleared your doubts; if you have any more doubts or suggestions, please get back.

Thank you
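To illustrate point 1, a purely hypothetical sketch of what a segment-mapping entry could look like; the class and field names are invented for illustration and do not reflect the actual datamap status file format:

// Hypothetical model of the synchronization info kept per datamap segment.
// Names are illustrative only; the real status file layout may differ.
case class DataMapSegmentEntry(
    dataMapSegmentId: String,           // segment created in the datamap table
    mainTableSegmentIds: Seq[String])   // main table segments it was built from

// Scenario from point 2: 10 main table segments -> datamap segment "0",
// then 2 more main table loads -> datamap segment "1" on the next rebuild.
val mapping = Seq(
  DataMapSegmentEntry("0", (0 to 9).map(_.toString)),
  DataMapSegmentEntry("1", Seq("10", "11")))

// A later rebuild only needs the main table segments not yet covered.
val covered = mapping.flatMap(_.mainTableSegmentIds).toSet
val allMainSegments = (0 to 11).map(_.toString).toSet
val pendingSegments = allMainSegments.diff(covered) // empty => datamap in sync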
In reply to this post by xuchuanyin
Hi xuchuanyin,
For index datamaps we can have the same behaviour as the MV datamap, but it might behave differently in the case of Lucene; we can decide whether to enable lazy load for it or not. The current MV behaviour is that it supports only lazy load. So when the main table data and the datamap data are not synced, the datamap is not selected for pruning. We read the datamap status file, and if the datamap is disabled it is not considered during query, as we change the plan accordingly.

Please get back for any more clarifications or suggestions.

Thank you
“For index datamaps we can have the same behaviour as the MV datamap”
=== Emm, that is exactly what I am concerned about. In your plan, if the index datamap is lazy, then each time after a data load completes, this datamap will be ignored until a rebuild is fired for that segment, even though all the index data for the historical segments could still be used in the meantime. In a word, I think this implementation is *UNACCEPTABLE*.

Actually I also came across the 'lazy index datamap' feature months ago and gave up the idea, because I thought such an implementation would make the feature useless and no one would try it in the real world.

I strongly *recommend* the following implementation. Each time after a data load is finished and before the index data for this segment is generated, if a query is fired:
1. For the historical segments which already have index data generated, CarbonData will prune using the corresponding index datamap data and return pruning result A;
2. For the historical segments and the newly generated segment which do not have index data generated, CarbonData will skip pruning with the index datamap and return pruning result B;
3. CarbonData will use A union B as the pruning result on the driver side.

The above implementation means that CarbonData should *support pruning by segment*. (A rough sketch of this per-segment pruning follows after this mail.)
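To make the suggested behaviour concrete, a rough sketch of per-segment pruning with the A union B result; all types and helper names here are hypothetical placeholders and do not correspond to actual CarbonData APIs:

// Sketch of segment-wise pruning: use the index datamap where index data
// exists, fall back to the default (blocklet) datamap elsewhere, then union.
case class Segment(id: String, hasIndexData: Boolean)
case class Blocklet(segmentId: String, path: String)

// Placeholder pruners; real implementations would read index/blocklet files.
def pruneWithIndexDataMap(seg: Segment, filter: String): Seq[Blocklet] = Seq.empty
def pruneWithDefaultDataMap(seg: Segment, filter: String): Seq[Blocklet] = Seq.empty

def prune(segments: Seq[Segment], filter: String): Seq[Blocklet] = {
  val (indexed, notIndexed) = segments.partition(_.hasIndexData)
  val a = indexed.flatMap(pruneWithIndexDataMap(_, filter))      // result A
  val b = notIndexed.flatMap(pruneWithDefaultDataMap(_, filter)) // result B
  a ++ b                                                         // A union B
}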
I agree with you that the index created for old segments will be of no use if rebuild has not happened, and that those segments are then not considered for pruning during query. But we go for datamap (index) pruning based on the datamap status, and that status is just enabled or disabled: we cannot maintain a status for each segment, there is only one status, and based on it we decide whether to prune. Currently we have deferred build, which is blocked for index datamaps and preaggregate, but for the MV datamap it is always considered true: every time you need to run a rebuild after a main table load to enable the MV datamap. So basically, deferred build is blocked for index datamaps and does not matter for MV, as we always rebuild.

So are you suggesting we make everything non-lazy, including MV? That may hit data load performance. To implement what you suggested, we would need to maintain a status for each segment indicating whether the datamap is enabled for it, which I doubt we should do.
I think there is still a misunderstanding between us.
Here I am only concerned about the lazy build for index datamaps. I think each segment should have its own datamap status, and based on this we can support pruning by index datamap for each segment. With this, even if the datamap is lazy, during query we can still make use of the index data of the historical segments (which already have index data generated).
I got your point. If each segment has its own status, as I said, we can do pruning without rebuild as well. But we need to get others' suggestions on this point, so maybe we can take this up in another JIRA and track it there. In this JIRA we can just support incremental data load.
Hi Akash,
There is a difference between index datamaps (like bloom) and OLAP datamaps (like MV). Index datamaps are used only for pruning the data, while OLAP datamaps are used as pre-computed data which can be fetched directly as per the query. In the OLAP datamap case, lazy build or deferred build makes sense, as the data always needs to be synchronized with the master data, otherwise we will get stale data; so any difference in synchronization makes the datamap disabled. On the other hand, an index datamap is used only for faster pruning, so synchronization with the master data is not mandatory, unless we have a mechanism to prune synchronized data using the index datamap and non-synchronized data using the default datamap. This is the same point @xuchuanyin mentioned.

I feel this design is about OLAP datamap incremental loading, so it is better not to change the behaviour of index datamaps. We can consider improvements to index datamaps in the future, but they should not be part of this work. Please update the design document if it mentions anything related to index datamaps.

Regards,
Ravindra.
+1 for ravin's advice.
We only support lazy/incremental load/rebuild for OLAP datamaps (MV/preaggregate), not for index datamaps currently.
In reply to this post by ravipesala
Hi Ravindra,
Got your point. As I replied to xuchuanyin, we can take up these index datamap enhancements separately.

Thank you
In reply to this post by akashrn5
Hi akashrn5,
Can this feature support the following points for the MV datamap?
1) lazy or non-lazy mode, selected via an option
2) reuse the existing MV data and generate a new segment as you describe, or overwrite the MV table directly, depending on an option
3) merge the MV table segments like the main table, both automatically and manually

Regards,
qiuchenjian
Hi qiuchenjian,
*1) lazy or non-lazy mode when using different option*
Currently MV supports data load only through lazy mode. Incremental data loading will be supported for the same.

*2) reuse the existing MV data and generate new segment like you describe or overwrite MV table directly when using different option*
As per the first point, incremental data loading for MV will follow the explanation given in the posts above. Even if we support non-lazy data load mode for MV, overwrite will not happen.

*3) merge the MV table segments like main table, automatically and manually*
Yes. Both auto compaction and manual compaction will be supported for the MV datamap (a sketch of the proposed command is shown after this mail).
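For point 3, the command shape under discussion would look roughly like this (reusing the SparkSession from the earlier sketch); note this is the syntax proposed in this thread and the design document, not a command that exists today, and the datamap name is illustrative:

// Proposed manual compaction of MV datamap segments, mirroring table compaction.
// Syntax is as proposed in this thread / the design doc, not yet released.
spark.sql("ALTER DATAMAP sales_agg COMPACT 'MINOR'")
spark.sql("ALTER DATAMAP sales_agg COMPACT 'MAJOR'")

// Auto compaction is assumed to reuse the existing table-level switch, e.g.
// carbon.enable.auto.load.merge=true (assumption for the MV case).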
In reply to this post by akashrn5
1. Will data loading to MV be supported for all the table properties and load options provided by Carbon?
2. Will the newly proposed "alter datamap datamapname compact" command support all compaction types (auto/major/minor/custom)?
3. Will load to a partition table having an MV also be supported?
Hello all,
Please find the updated design document for incremental data loading at the link below:
https://docs.google.com/document/d/1AACOYmBpwwNdHjJLOub0utSc6JCBMZn8VL5CvZ9hygA/edit?usp=sharing

Regards,
Akash