[Background]
Currently, the update feature inserts the updated rows back into the old segments where the data being updated lives. At the end, it needs to reload the indexes of all the affected segments.

[Motivation]
If many segments are updated, reloading their indexes takes a long time. So I suggest writing the updated rows into a new segment instead. That will not touch the indexes of the old segments, so no index reload is needed for them.

Best Regards
David Cai
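[Illustration]
To make the pain point concrete, here is a minimal Spark-shell sketch. The table and columns ("sales", "status", "order_date") are hypothetical; the UPDATE and SHOW SEGMENTS statements are standard CarbonData SQL.

import org.apache.spark.sql.SparkSession

// Assumes a SparkSession already configured with the CarbonData extensions.
val spark = SparkSession.builder().appName("update-demo").getOrCreate()

// An update whose matched rows are spread across many old segments...
spark.sql("UPDATE sales SET (status) = ('closed') WHERE order_date < '2020-01-01'")

// ...currently rewrites those rows inside each touched segment, so the
// indexes of every one of those segments must be reloaded afterwards.
spark.sql("SHOW SEGMENTS FOR TABLE sales").show(false)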
Hi David,
Please check the points below. One advantage we get here is that when we insert as a new segment, the data takes the new insert flow without the converter step, which is faster. But there are some concerns:

1. If you write a new segment for each update, horizontal compaction no longer makes sense for updates, as it won't happen with this idea. With this solution, horizontal compaction makes sense only in the delete case.

2. You said we avoid reloading the indexes. We do avoid reloading the indexes of the complete original segment on which the update happened, but we still need to load the index of the newly added segment that holds the updated rows.

3. If you keep adding new segments, we end up with many segments. If we don't compact them, that is one problem; and the number of entries in, and size of, the metadata (table status) grows so much, which is another problem.

So how are you going to handle these cases? Correct me if I'm wrong in my understanding.

Regards,
Akash
Hi Akash,
1. The update operation still writes "delete delta" files, the same as before, so horizontal compaction is still needed.

2. Loading one merged carbon index file is fast and does not impact query performance (a customer has already faced this issue with the current behavior).

3. As with insert/load, compaction can be triggered to avoid small segments.

Best Regards
David Cai
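[Illustration]
A sketch of how point 3 could look, continuing the example above (the "sales" table is still hypothetical). ALTER TABLE ... COMPACT and CLEAN FILES are standard CarbonData commands; I am assuming here that the auto-merge properties can be set dynamically at the session level, otherwise they belong in carbon.properties.

// Merge the small segments produced by repeated updates into one.
spark.sql("ALTER TABLE sales COMPACT 'MINOR'")

// Physically remove the segments that compaction marked as merged.
spark.sql("CLEAN FILES FOR TABLE sales")

// Optionally let each load/update trigger a merge automatically once
// enough small segments accumulate (assumption: these two properties
// are dynamically configurable in this deployment).
spark.sql("SET carbon.enable.auto.load.merge=true")
spark.sql("SET carbon.compaction.level.threshold=4,3")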
Hi David,
1. Yes, as I already said, it will only come into the picture in the delete case, since an update is (delete + insert).

2. Yes, we will be loading a single merge file into the cache, which can be a little better than the existing approach.

3. I didn't get a complete answer here, actually: when exactly do you plan to compact those segments, and how will you take care of the growing number of entries in the table status file?

Thanks
Hi Akash,
3. An update operation contains an insert operation, so update will handle this issue the same way insert/load already does (by triggering compaction).

Best Regards
David Cai
Hi David. Thanks for proposing this.
*+1 from my side.*

I have seen users with a 200K-segment table stored in the cloud. It would be really slow to reload all the segments where an update happened for indexes like SI, min-max, and MV.

So, it is good to write the updated rows as a new segment and load only the new segment's indexes (try to reuse the existing flow: UpdateTableModel.loadAsNewSegment = true).

The user can compact segments to avoid the many new segments created by updates, and I guess we can also move the compacted segments to the table status history to avoid more entries in the table status.

Thanks,
Ajantha
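[Illustration]
A rough sketch of what reusing that flag could look like. Only the name UpdateTableModel.loadAsNewSegment comes from this thread; every other field and value below is hypothetical and not the actual CarbonData source.

// Hypothetical, simplified shape of the model -- illustrative only.
case class UpdateTableModel(
  isUpdate: Boolean,         // true when the load comes from an UPDATE
  updatedTimeStamp: Long,    // timestamp stamped on the update's delta files
  loadAsNewSegment: Boolean  // write the updated rows as a brand-new segment
)

// The proposal amounts to running the update's insert half with the flag
// on, so only the new segment's index needs loading afterwards.
val model = UpdateTableModel(
  isUpdate = true,
  updatedTimeStamp = System.currentTimeMillis(),
  loadAsNewSegment = true
)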
Hi David,
+1

Initially, when the segments concept was introduced, a segment was viewed as a folder added incrementally over time, so that data-retention use-cases like "delete segments before a given date" could be supported. In that model, if updated records were written into a new segment, old records would become new records and retention would no longer work on that data. So updated records were written into the same segment folder.

But later the partition concept was introduced, which is a cleaner way to implement retention; even deleting by a time column is a better method. So inserting the updated records into a new segment makes sense.

The only disadvantage may come later, when supporting the single-column data update/replace feature that Likun mentioned previously.

So, to generalize: the update feature can support inserting the updated records into a new segment. The logic to reload indexes when segments are updated can still remain; however, when no data is inserted into the old segments, the index reload for them needs to be avoided.

The growing number of segments need not block this change, as the increasing segment count is a problem anyway and needs to be solved using compaction, either horizontal or vertical. Likewise, optimizing segment-file storage, whether file-based or DB-based (embedded or external), for very large deployments needs to be solved independently.

Regards,
Ramana
+1
Partition table loading already uses a new segment to write the update delta data. It is better to make this consistent across all flows; creating a new segment simplifies the design.

--
Ravi
PR #3999 already implements this enhancement; please take a look.
PR URL: https://github.com/apache/carbondata/pull/3999

Best Regards
David Cai