There is an interesting paper, "L-Store: A Real-time OLTP and OLAP System", which uses a creative way to improve update performance. The idea is:

*1. Store the updated column values in a tail page.*
When any column of a record is updated, a new tail page is created and appended to the page directory. The tail page stores only the updated column values. Compared with the current implementation of carbondata, where we write the whole row even if only a few columns are updated, L-Store's approach avoids write amplification effectively.
In the tail page, the rowid and the updated column id are stored together with the updated column value. Based on the updated column id, the full row can be reconstructed during query processing by reading the base page and the tail pages.

*2. Incremental update in the tail page.*
Assume we update 2 columns, 1 column per update. There are two ways to store the updated columns in tail pages:

2.1: Non-incremental update:
    basepage  <updatecolumn1, v1>  <updatecolumn2, v2>  <updatecolumn3, v3>
    tailpage1 <updatecolumn1, v1'>
    tailpage2 <updatecolumn2, v2'>

2.2: Incremental update:
    basepage  <updatecolumn1, v1>  <updatecolumn2, v2>  <updatecolumn3, v3>
    tailpage1 <updatecolumn1, v1'>
    tailpage2 <updatecolumn1, v1'> <updatecolumn2, v2'>

A non-incremental update stores only the column values changed by this update, which gives lower write amplification but worse query performance. An incremental update stores the column values of this update together with the updated column values of all previous updates, which gives higher write amplification but better query performance.

We should study the work of L-Store and optimize the update performance; it will improve carbondata's competitiveness.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
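The tail-page layout described above can be sketched roughly as follows. This is a minimal illustration in plain Java, not CarbonData or L-Store code; the page representation (a map of column name to value) and the merge order are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of tail-page reconstruction (names are invented,
// not CarbonData APIs). Each tail page holds only the columns touched by
// one update; reading a row merges the base page with its tail pages.
public class TailPageSketch {
    static Map<String, String> reconstruct(Map<String, String> basePage,
                                           List<Map<String, String>> tailPages) {
        Map<String, String> row = new HashMap<>(basePage);
        // Later tail pages override earlier values. In the non-incremental
        // layout every tail page must be visited to rebuild the row.
        for (Map<String, String> tail : tailPages) {
            row.putAll(tail);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("c1", "v1"); base.put("c2", "v2"); base.put("c3", "v3");

        List<Map<String, String>> tails = new ArrayList<>();
        tails.add(Map.of("c1", "v1'"));   // update 1 touched c1 only
        tails.add(Map.of("c2", "v2'"));   // update 2 touched c2 only

        Map<String, String> row = reconstruct(base, tails);
        System.out.println(row.get("c1") + " " + row.get("c2") + " " + row.get("c3"));
    }
}
```

With the incremental layout, reading the most recent tail page alone would be enough to answer a query; with the non-incremental layout shown here, every tail page must be scanned, which is exactly the query-time cost the paper trades against write amplification.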
I have several ideas to optimize the update performance:

1. Reduce the storage size of the tupleId.
The tupleId is too long, causing heavy shuffle IO overhead when joining the change table with the target table.

2. Avoid converting String to UTF8String during row processing.
Before writing rows into delta files, the conversion from String to UTF8String hurts performance.
Code: "UTF8String.fromString(row.getString(tupleId))"

3. For DELETE ops in MergeDataCommand, we shouldn't let all columns of the change table take part in the JOIN; only the "key" column is needed.
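Point 1 can be illustrated with a compact encoding. A sketch, assuming the tupleId decomposes into a few small integer parts that can be packed into one 64-bit value; the part names and bit widths below are made up for illustration and are not CarbonData's actual tupleId format:

```java
// Hypothetical sketch: instead of shuffling a long path-like tupleId
// string, pack the component ids into a single long. Bit widths here
// (16/16/12/20) are assumptions, not CarbonData limits.
public class CompactTupleId {
    static long encode(int segment, int block, int page, int row) {
        return ((long) segment << 48) | ((long) block << 32)
             | ((long) page << 20) | (long) row;
    }
    static int segment(long id) { return (int) (id >>> 48); }
    static int block(long id)   { return (int) ((id >>> 32) & 0xFFFF); }
    static int page(long id)    { return (int) ((id >>> 20) & 0xFFF); }
    static int row(long id)     { return (int) (id & 0xFFFFF); }

    public static void main(String[] args) {
        long id = encode(3, 42, 7, 123456);
        // 8 bytes per id instead of a multi-part string -> less shuffle IO
        System.out.println(segment(id) + "/" + block(id) + "/" + page(id) + "/" + row(id));
    }
}
```

A fixed 8-byte key also sidesteps the String-to-UTF8String conversion in point 2 for this column, since a primitive long needs no such conversion.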
Hi,
Update still uses the converter step with bad record handling. In the update-by-dataframe scenario there is no need for bad record handling; we only need to keep it for the update-by-value case. This can give a significant improvement, as we already observed in the insert flow. I once tried to route update through the new insert-into flow, but the plan rearrangement failed because of the implicit column. I didn't continue because of other work. Maybe I have to look into it again and see if it can work.

Thanks,
Ajantha

On Thu, May 14, 2020 at 9:51 AM haomarch <[hidden email]> wrote:
> I have serveral ideas to optimize the update performance:
> [quoted text trimmed]
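The fast path described above can be sketched as a simple branch on the update source. The class, enum, and method names below are invented for illustration; this is not the actual CarbonData update flow:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: skip bad-record validation when the update source
// is a dataframe (its rows were already validated when it was built), and
// run it only for update-by-value.
public class UpdateFlowSketch {
    enum Source { DATAFRAME, LITERAL }

    static List<String> prepare(List<String> rows, Source src) {
        if (src == Source.DATAFRAME) {
            return rows;  // fast path: no converter / bad-record step
        }
        // Stand-in for real bad-record handling: replace nulls.
        return rows.stream()
                   .map(r -> r == null ? "NULL_REPLACEMENT" : r)
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(prepare(List.of("a", "b"), Source.DATAFRAME));
    }
}
```

The design point is that the converter step costs per-row work, so removing it entirely on the dataframe path gives the same kind of win already measured in the insert flow.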
In reply to this post by haomarch
Hi march,
Thanks for suggesting an improvement on update. I have gone through the paper for the highlights, and here are a few points with my understanding; we can work on and discuss them further.

1. We are talking about updating the existing file instead of writing a new carbondata file, which is the current logic. Can you please explain? I think we can't update an existing block.

2. When you say tail page, you said it will be appended to the base column page, which would be like updating the file. I have one more suggestion: how about, instead of appending the tail page to the existing column page, we write it as a separate page outside the block file? Do you already have points to explain these, or any doc?

3. If the operation is just a delete, do we need to write a tail page, or can we just mark the row id of the base page as invisible? If we write a tail page, we should store some default value, like we store for null values, to indicate deleted data.

4. I haven't got a clear picture of how we can update an existing file; can you please explain it in more detail? I find a bit of a dilemma here with respect to carbondata.

I think many doubts will be cleared if we have a low-level design for it and do a POC. And is it really going to increase the update speed, or are we targeting just the scan speed? Correct me if I'm wrong in my understanding of any of the above points.

Thanks
Regards,
Akash

On Wed, May 13, 2020 at 7:31 AM haomarch <marchpure@126.com> wrote:
> There is an interesting paper "L-Store: A Real-time OLTP and OLAP System",
> which uses an creative way to improve update performance.
> [quoted text trimmed]
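Point 3 above (handling a delete as row invisibility rather than as a tail page) can be sketched with a per-page bitmap. The class and method names are illustrative, not CarbonData APIs:

```java
import java.util.BitSet;

// Sketch: a delete writes no tail page at all; an invalidity bitmap kept
// alongside the base page marks the deleted row ids as invisible, and the
// scan simply skips them.
public class DeleteBitmapSketch {
    private final BitSet deleted = new BitSet();

    void delete(int rowId) { deleted.set(rowId); }

    boolean isVisible(int rowId) { return !deleted.get(rowId); }

    public static void main(String[] args) {
        DeleteBitmapSketch page = new DeleteBitmapSketch();
        page.delete(5);
        System.out.println(page.isVisible(5) + " " + page.isVisible(6)); // false true
    }
}
```

Compared with writing a tail page full of delete markers, a bitmap costs one bit per row and needs no merge work at query time beyond the visibility check.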