Posted by
haomarch on
May 13, 2020; 2:01am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Optimize-the-Update-Performance-tp96001.html
There is an interesting paper, "L-Store: A Real-time OLTP and OLAP System",
which uses a creative way to improve update performance.
The idea is:
*1. Store the updated column value in the tail page*.
When any column of a record is updated, a new tail page is created and
appended to the page dictionary.
The tail page stores only the updated column value. The current CarbonData
implementation rewrites the whole row even when only a few columns are
updated, so L-Store's approach effectively avoids write amplification.
The tail page also stores the rowid and updatedcolumnid together with the
updated column value.
Based on the updatedcolumnid, the row data can be reconstructed during query
processing by reading the base page and the tail pages.
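To make the read path concrete, here is a minimal sketch of the base page plus tail page idea. It is not L-Store's or CarbonData's actual layout: the names `Table`, `TailRecord`, `update` and `read_row` are hypothetical, and the tail pages are modeled as one flat in-memory list of (rowid, updatedcolumnid, value) records.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class TailRecord:
    """One tail entry: which row, which column, and the new value."""
    rowid: int
    column_id: int
    value: Any


class Table:
    """Hypothetical table with an immutable base page and an append-only tail."""

    def __init__(self, base_rows: Dict[int, List[Any]]) -> None:
        # Base page: rowid -> full row. It is never rewritten by updates.
        self.base = {rowid: list(values) for rowid, values in base_rows.items()}
        # Tail: append-only list of single-column updates.
        self.tail: List[TailRecord] = []

    def update(self, rowid: int, column_id: int, value: Any) -> None:
        # Only the changed column is written, not the whole row.
        self.tail.append(TailRecord(rowid, column_id, value))

    def read_row(self, rowid: int) -> List[Any]:
        # Start from the base row, then replay tail records in write order
        # so that the most recent update of each column wins.
        row = list(self.base[rowid])
        for rec in self.tail:
            if rec.rowid == rowid:
                row[rec.column_id] = rec.value
        return row
```

For example, after `t = Table({0: ["a", 1, 2.0]})` and `t.update(0, 1, 42)`, the base row is unchanged and `t.read_row(0)` merges the single tail record into it. The cost of this design shows up on the read side: every read has to scan the tail records for the row.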
*2. Incremental update in the tail page.*
Assume that we update 2 columns, 1 column per update. There are two ways to
store the updated columns in the tail page:
2.1: Non-incremental Update:
/ basepage  <updatecolumn1, v1> <updatecolumn2, v2> <updatecolumn3, v3>
tailpage1 <updatecolumn1, v1'>
tailpage2 <updatecolumn2, v2'>/
2.2: Incremental Update:
/ basepage  <updatecolumn1, v1> <updatecolumn2, v2> <updatecolumn3, v3>
tailpage1 <updatecolumn1, v1'>
tailpage2 <updatecolumn1, v1'> <updatecolumn2, v2'>/
Non-incremental update stores only the column value changed by the current
update, which gives lower write amplification but worse query performance,
since a query may have to read several tail pages to rebuild a row.
Incremental update stores the column value changed by the current update
together with the column values of previous updates, which gives higher
write amplification but better query performance, since the latest tail
page already holds all updated values.
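The trade-off can be illustrated with a small sketch for a single record, using the same two updates as in the layouts above. This is only an illustration under simplifying assumptions: `build_tail_pages` and `read_row` are hypothetical helper names, each tail page is a plain dict from column name to value, and "cost" is counted as values written and tail pages read.

```python
def build_tail_pages(updates, incremental):
    """Return one tail page (dict) per update.

    Non-incremental: each page holds only the column changed by that update.
    Incremental: each page also carries over the values of earlier updates.
    """
    tail_pages = []
    for column, value in updates:
        page = dict(tail_pages[-1]) if (incremental and tail_pages) else {}
        page[column] = value
        tail_pages.append(page)
    return tail_pages


def read_row(base, tail_pages, incremental):
    """Rebuild the row and report how many tail pages had to be read."""
    # Incremental: the latest page is cumulative, so it is enough on its own.
    # Non-incremental: every page must be merged, oldest first.
    pages_to_read = tail_pages[-1:] if incremental else tail_pages
    row = dict(base)
    for page in pages_to_read:
        row.update(page)
    return row, len(pages_to_read)


base = {"updatecolumn1": "v1", "updatecolumn2": "v2", "updatecolumn3": "v3"}
updates = [("updatecolumn1", "v1'"), ("updatecolumn2", "v2'")]

for incremental in (False, True):
    pages = build_tail_pages(updates, incremental)
    values_written = sum(len(page) for page in pages)
    row, pages_read = read_row(base, pages, incremental)
    mode = "incremental" if incremental else "non-incremental"
    print(mode, "values written:", values_written, "pages read:", pages_read)
```

With these two updates, the non-incremental layout writes 2 values and reads 2 tail pages, while the incremental layout writes 3 values and reads 1 tail page; both rebuild the same row.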
We should study the L-Store work and optimize the update performance; it
will improve CarbonData's competitiveness.