Apache CarbonData Dev Mailing List archive

Re: [Discussion] Optimize the Update Performance

Posted by akashrn5 on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Optimize-the-Update-Performance-tp96001p96039.html

Hi march,

Thanks for suggesting an improvement to update.
I have gone through the paper for the highlights, and here are a few points
based on my understanding; we can work on them and discuss further.

1. Since we are talking about updating the existing file instead of writing
a new carbondata file, which is the current logic, can you please explain
how this would work? I think we can't update an existing block.

2. When you say tail page, you said it will be appended to the base column
page. That would be like updating the file.

I have one more suggestion: instead of appending the tail page to the
existing column page, how about writing it as a separate page outside the
block file?

Do you already have points to explain these, or any doc?
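
To make the "separate page outside the block file" suggestion concrete, here
is a minimal sketch, not CarbonData code. All names (BasePage, TailFile,
read_row) are hypothetical, and an in-memory list stands in for the side file.
The base page is never modified; updates are only appended to the tail store
and merged at read time.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class BasePage:
    """Immutable base data: column name -> values indexed by row id."""
    columns: Dict[str, Tuple[Any, ...]]


@dataclass
class TailFile:
    """Append-only side store; stands in for a file written outside the block."""
    records: List[Tuple[int, str, Any]] = field(default_factory=list)

    def append(self, row_id: int, column: str, value: Any) -> None:
        # Only the changed column value is written, never the whole row.
        self.records.append((row_id, column, value))


def read_row(base: BasePage, tail: TailFile, row_id: int) -> Dict[str, Any]:
    """Rebuild one row: start from the base page, then replay tail records."""
    row = {name: values[row_id] for name, values in base.columns.items()}
    for rid, column, value in tail.records:
        if rid == row_id:
            row[column] = value  # later records override earlier ones
    return row


base = BasePage({"c1": (10, 11), "c2": ("a", "b")})
tail = TailFile()
tail.append(0, "c1", 99)  # update column c1 of row 0
```

With this layout the existing block file stays untouched, at the cost of one
extra merge step per scanned row.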

3. If the operation is just a delete, do we need to write a tail page, or
can we just mark the row id in the base page as invisible?
If we write the tail page, we should store some default value, like the one
we store for null, to indicate deleted data.
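
The two delete options can be compared with a small sketch. This is only an
illustration under assumed names (TOMBSTONE, deleted_row_ids, tail_records are
all hypothetical), using a single column held in memory.

```python
from typing import Any, List, Set, Tuple

# Hypothetical marker playing the role of the "default value" for deleted data.
TOMBSTONE = object()

base_values: List[Any] = [10, 11, 12]  # one base column, indexed by row id

# Option A: only mark the row id as invisible; no tail page is written.
deleted_row_ids: Set[int] = set()


def delete_by_marking(row_id: int) -> None:
    deleted_row_ids.add(row_id)


def scan_with_marks() -> List[Any]:
    return [v for rid, v in enumerate(base_values) if rid not in deleted_row_ids]


# Option B: append a tail record that carries the tombstone value.
tail_records: List[Tuple[int, Any]] = []


def delete_by_tombstone(row_id: int) -> None:
    tail_records.append((row_id, TOMBSTONE))


def scan_with_tombstones() -> List[Any]:
    latest = dict(enumerate(base_values))
    for rid, value in tail_records:
        latest[rid] = value  # the newest record for a row id wins
    return [v for v in latest.values() if v is not TOMBSTONE]
```

Both scans return the same rows after a delete; option A writes less, while
option B keeps deletes in the same tail structure as updates.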

5. We can't update an existing file.
I haven't got a clear picture of this; can you please explain it in more
detail?
I find a bit of a dilemma here with respect to carbondata.

I think many doubts will be cleared if we have a low-level design for it
and do a POC. Also, is it really going to increase the update speed, or are
we targeting just the scan speed?
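
As a possible starting point for such a POC, the two tail page layouts from
the proposal (2.1 and 2.2) can be sketched as below. This is a toy model for
a single row, with hypothetical function names; a tail page is modelled as a
dict of column name to value.

```python
from typing import Any, Dict, List

TailPage = Dict[str, Any]


def append_non_incremental(tail_pages: List[TailPage], update: TailPage) -> None:
    """2.1: each tail page holds only the columns changed by this update."""
    tail_pages.append(dict(update))


def append_incremental(tail_pages: List[TailPage], update: TailPage) -> None:
    """2.2: each tail page also carries the columns changed by earlier updates."""
    merged = dict(tail_pages[-1]) if tail_pages else {}
    merged.update(update)
    tail_pages.append(merged)


def read_non_incremental(base: TailPage, tail_pages: List[TailPage]) -> TailPage:
    row = dict(base)
    for page in tail_pages:  # every tail page must be visited
        row.update(page)
    return row


def read_incremental(base: TailPage, tail_pages: List[TailPage]) -> TailPage:
    row = dict(base)
    if tail_pages:
        row.update(tail_pages[-1])  # the last tail page alone is enough
    return row


base = {"c1": "v1", "c2": "v2", "c3": "v3"}
updates = [{"c1": "v1'"}, {"c2": "v2'"}]

non_inc: List[TailPage] = []
inc: List[TailPage] = []
for u in updates:
    append_non_incremental(non_inc, u)
    append_incremental(inc, u)

values_written_non_inc = sum(len(p) for p in non_inc)
values_written_inc = sum(len(p) for p in inc)
```

In this toy case the incremental layout writes more values but needs only
the newest tail page at read time, which is the write versus scan trade-off
asked about above.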

Correct me if I'm wrong in my understanding of any of the above points.

Thanks

Regards,
Akash

On Wed, May 13, 2020 at 7:31 AM haomarch <marchpure@126.com> wrote:

> There is an interesting paper "L-Store: A Real-time OLTP and OLAP System",
> which uses a creative way to improve update performance.
>
> The idea is:
> *1. Store the updated column value in the tail page.*
> When any column of a record is updated, a new tail page is created and
> appended to the page dictionary.
> In the tail page, only the updated column value is stored. Compared with
> the current implementation of carbondata, in which we write the whole row
> even if only a few columns are updated, L-Store's way can avoid write
> amplification effectively.
> In the tail page, the rowid and updatedcolumnid are also stored together
> with the updated column value;
> based on the updatedcolumnid, the row data can be obtained by reading the
> base page and tail pages during query processing.
> *2. Incremental update in the tail page.*
> Assume that we update 2 columns, 1 column per update. There are two ways to
> store the updated columns in the tail page:
>
> 2.1: Non-incremental Update:
>   basepage  <updatecolumn1, v1> <updatecolumn2, v2> <updatecolumn3, v3>
>   tailpage1 <updatecolumn1, v1'>
>   tailpage2 <updatecolumn2, v2'>
>
> 2.2: Incremental Update:
>   basepage  <updatecolumn1, v1> <updatecolumn2, v2> <updatecolumn3, v3>
>   tailpage1 <updatecolumn1, v1'>
>   tailpage2 <updatecolumn1, v1'> <updatecolumn2, v2'>
>
> Non-incremental Update stores only the column value updated by this
> update, which gives lower write amplification but worse query performance.
> Incremental Update stores the column value updated by this update together
> with the updated column values of previous updates, which gives higher
> write amplification but better query performance.
>
> We should study the work of L-Store and optimize the update performance;
> it will improve carbondata's competitiveness.
>