[Discussion] Optimize the Update Performance

[Discussion] Optimize the Update Performance

haomarch
There is an interesting paper, "L-Store: A Real-time OLTP and OLAP System",
which uses a creative way to improve update performance.

The idea is:
*1. Store the updated column values in a tail page.*
When any column of a record is updated, a new tail page is created and
appended to the page directory.
The tail page stores only the updated column values. Compared with the
current CarbonData implementation, which writes the whole row even when
only a few columns are updated, L-Store's approach effectively avoids
write amplification.
The tail page also stores the rowid and the updated column id together
with the updated column value; based on these, the full row can be
reconstructed during query processing by reading the base page together
with its tail pages.
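A minimal sketch of the reconstruction step (illustrative Python, not CarbonData internals; page layouts here are simplified to dicts and tuples):

```python
# Sketch (not CarbonData code): a tail page stores only (rowid, column id, value)
# triples for the columns touched by an update; the full row is rebuilt at query
# time by overlaying tail pages, oldest first, on the base page.

def reconstruct_row(base_page, tail_pages, rowid):
    """Rebuild the current version of a row from its base page and tail pages."""
    row = dict(base_page[rowid])          # start from the original values
    for tail in tail_pages:               # tail pages in append (oldest-first) order
        for (rid, column_id, value) in tail:
            if rid == rowid:
                row[column_id] = value    # newer update wins
    return row

base = {0: {"c1": "v1", "c2": "v2", "c3": "v3"}}
tails = [[(0, "c1", "v1'")], [(0, "c2", "v2'")]]
print(reconstruct_row(base, tails, 0))    # {'c1': "v1'", 'c2': "v2'", 'c3': 'v3'}
```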
*2. Incremental update in the tail page.*
Assume we update 2 columns, one column per update. There are two ways to
store the updated columns in tail pages:

 2.1: Non-incremental update:
        basepage  <updatecolumn1, v1> <updatecolumn2, v2> <updatecolumn3, v3>
        tailpage1 <updatecolumn1, v1'>
        tailpage2 <updatecolumn2, v2'>

 2.2: Incremental update:
        basepage  <updatecolumn1, v1> <updatecolumn2, v2> <updatecolumn3, v3>
        tailpage1 <updatecolumn1, v1'>
        tailpage2 <updatecolumn1, v1'> <updatecolumn2, v2'>

Non-incremental update stores only the column values changed by the
current update, which gives lower write amplification but worse query
performance.
Incremental update stores the column values changed by the current update
together with the updated column values of previous updates, which gives
higher write amplification but better query performance.
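The trade-off can be made concrete with the same example (illustrative sketch only, not a storage format):

```python
# Sketch contrasting the two tail-page layouts for the example above
# (update c1 first, then c2 in a later update).

# Non-incremental: each tail page holds only the columns of that update.
non_incremental = [[("c1", "v1'")], [("c2", "v2'")]]
# Incremental: each tail page also carries the columns of earlier updates.
incremental = [[("c1", "v1'")], [("c1", "v1'"), ("c2", "v2'")]]

def cell_writes(tails):
    """Total column values written across all tail pages (write amplification)."""
    return sum(len(page) for page in tails)

print(cell_writes(non_incremental))  # 2 cell writes: minimal write amplification
print(cell_writes(incremental))      # 3 cell writes: c1 is rewritten in tailpage2

# Read side: with incremental layout the latest tail page already carries
# every updated column, so a query reads base page + 1 tail page; with the
# non-incremental layout it may have to walk all tail pages to collect the
# latest value of each updated column.
latest = dict(incremental[-1])
print(latest)  # {'c1': "v1'", 'c2': "v2'"}
```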

We should study the work of L-Store and optimize the update performance;
it will improve CarbonData's competitiveness.




--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Discussion] Optimize the Update Performance

haomarch
I have several ideas to optimize the update performance:
1. Reduce the storage size of the tupleId:
   The tupleId is too long, leading to heavy shuffle IO overhead when
joining the change table with the target table.
2. Avoid converting String to UTF8String during row processing.
   Before rows are written into delta files, the conversion from String to
UTF8String hampers performance.
   Code: "UTF8String.fromString(row.getString(tupleId))"
3. For DELETE operations in MergeDataCommand, we shouldn't let all the
columns of the change table take part in the JOIN; only the "key" column is
needed.
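Point 3 can be sketched with made-up data (illustrative Python, not CarbonData code): only the change table's keys are needed to decide which target rows to delete, so every other column can be projected away before the join:

```python
# Hypothetical sketch of point 3: for a DELETE merge, project the change
# table down to its key column before joining with the target table.

target = [
    {"key": 1, "name": "a", "payload": "x" * 8},
    {"key": 2, "name": "b", "payload": "y" * 8},
    {"key": 3, "name": "c", "payload": "z" * 8},
]
change = [
    {"key": 2, "name": "b'", "payload": "wide column that the join never needs"},
]

# Only the keys are shuffled and joined; the other change-table columns
# never move across the network.
delete_keys = {row["key"] for row in change}
survivors = [row for row in target if row["key"] not in delete_keys]
print([row["key"] for row in survivors])  # [1, 3]
```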




Re: [Discussion] Optimize the Update Performance

Ajantha Bhat
Hi!
Update still goes through the converter step with bad record handling.

In the update-by-dataframe scenario there is no need for bad record
handling; we only need to keep it for the update-by-value case.

This can give a significant improvement, as we already observed in the
insert flow.

I once tried to route update through the new insert-into flow, but the
plan rearrangement failed because of the implicit column.
I didn't continue because of other work; maybe I should look into it again
and see whether it can be made to work.
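The idea can be sketched like this (function and flag names are illustrative, not CarbonData APIs):

```python
# Illustrative sketch only: run the bad-record converter only when the
# update values come from user-typed literals ("by value"), and bypass it
# when they come from an already-validated dataframe.

def convert_rows(rows, source):
    """Validate rows only for the update-by-value path."""
    if source == "dataframe":
        return rows  # already typed/validated upstream: skip bad-record handling
    checked = []
    for row in rows:
        if all(v is not None for v in row):  # stand-in for real bad-record checks
            checked.append(row)
    return checked

print(convert_rows([(1, None)], "dataframe"))           # bypassed, kept as-is
print(convert_rows([(1, None), (2, 3)], "by_value"))    # bad record filtered
```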

Thanks,
Ajantha

On Thu, May 14, 2020 at 9:51 AM haomarch <[hidden email]> wrote:


Re: [Discussion] Optimize the Update Performance

akashrn5
Hi march,

Thanks for suggesting an improvement to update.
I have gone through the paper for the highlights, and here are a few
points from my understanding; we can work on and discuss them further.

1. This talks about updating the existing file, instead of writing a new
carbondata file as the current logic does. Can you please explain? I think
we can't update an existing block.

2. When you say tail page, you said it will be appended to the base column
page; that would amount to updating the file in place.

I have one more suggestion: instead of appending the tail page to the
existing column page, how about writing it as a separate page outside the
block file?

Do you already have points to explain these, or any doc?

3. If the operation is just a delete, do we need to write a tail page at
all, or can we just mark the row id in the base page as invisible?
If we do write a tail page, we should store some default value, like the
one we store for null values, to indicate deleted data.

4. About updating an existing file, which we can't do: I haven't got a
clear picture of it, can you please explain at a more detailed level?
I find a bit of a dilemma here with respect to CarbonData.

I think many doubts will be cleared once we have a low-level design and do
a POC. Also, is this really going to increase the update speed, or are we
targeting just the scan speed?

Correct me if I'm wrong in my understanding of any of the above points.
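The invisible-row-id alternative in point 3 could look like a per-page delete bitmap; a hypothetical sketch (not CarbonData code):

```python
# Hypothetical sketch of point 3: a delete needs no tail page if the base
# page keeps a delete bitmap and scans skip invisible row ids.

base_page = {0: "row0", 1: "row1", 2: "row2"}
delete_bitmap = set()

def delete(rowid):
    delete_bitmap.add(rowid)      # O(1) metadata change, no tail page written

def scan():
    return [v for rid, v in base_page.items() if rid not in delete_bitmap]

delete(1)
print(scan())  # ['row0', 'row2']
```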

Thanks

Regards,
Akash

On Wed, May 13, 2020 at 7:31 AM haomarch <marchpure@126.com> wrote:
