Hi dev:
Currently CarbonData supports using the compaction command to compact delta data into carbondata files, but it requires two or more segments to be compacted together. If these segments are large, users may not want to compact them (it takes a lot of time); they may just want to compact the delta data files into carbondata files for each segment separately. After discussing with Jacky and David offline, there is a way to do this: add a new compaction type for compacting the delta data files of each segment, for example: alter table table_name compact 'iud_delta' where segment.id in (0.2). This command would compact all delta data files of segment 0.2 into carbondata files as a new segment 0.3. Any suggestions for this? Thanks.

-- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
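As a sketch, the proposed DDL would sit alongside the existing compaction syntax like this (the table name `sales` is made up, and the 'iud_delta' type and the 0.2/0.3 segment numbering are proposals, not implemented syntax):

```sql
-- Existing compaction types, which always span whole segments:
ALTER TABLE sales COMPACT 'MAJOR';
ALTER TABLE sales COMPACT 'CUSTOM' WHERE SEGMENT.ID IN (0, 1);

-- Proposed: rewrite only segment 0.2, merging its base carbondata
-- files with its update/delete delta files into a new segment 0.3.
ALTER TABLE sales COMPACT 'IUD_DELTA' WHERE SEGMENT.ID IN (0.2);
```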
Hi,
I have some doubts: are you talking about the delete delta files or the delta data files? Is this specific to update and delete scenarios? If the compaction happens just within a segment, it is similar to the horizontal compaction done for update and delete, so is a new segment really required? Thanks.

Regards, Akash R
Hi, Akash R:
Thanks for your reply. I am talking about the delta data files, including both the update and the delete delta files. Horizontal compaction just compacts all the delta files within one segment into one, right? But if the segment is big and its update and delete delta files are big too, I think that will affect query performance, because queries need to filter both the carbondata files and the delta files, right?
I guess your intention is to rewrite a single segment by merging its base file and delta files, to improve the query performance of that segment, right? I think this is doable; note that the operation may be time-consuming since it rewrites the whole segment.
Regards, Jacky
Hi Jacky:
Yes, my purpose is what you said. If there is one big segment with big delete and update delta files, users who want to eliminate the delta files currently need to find another segment to compact with that big one, and that operation is even more time-consuming than the one I am proposing.
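To make the contrast concrete, here is what the current workaround looks like next to the proposal (table name and segment ids are hypothetical, and the 'iud_delta' syntax is still only a proposal):

```sql
-- Today: folding segment 1's delta files into base files requires
-- pulling in at least one more segment and rewriting all of them.
ALTER TABLE big_table COMPACT 'CUSTOM' WHERE SEGMENT.ID IN (1, 2);

-- Proposed: rewrite segment 1 alone.
ALTER TABLE big_table COMPACT 'IUD_DELTA' WHERE SEGMENT.ID IN (1);
```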
Hi,
Thanks for clearing up the doubt. So, according to my understanding, you basically want to merge all the delete delta files with the base carbondata files and write a new segment; essentially this helps to reduce IO, right? I have some questions regarding that:
1. Are you planning a new DDL for this operation? If so, what is the DDL structure?
2. How will concurrency be handled? For example, what happens to update, delete, and compaction operations on the table while this compaction is in progress? If concurrent operations are blocked, well and good; otherwise, how will the segment mapping be maintained?
3. As Jacky said (and I agree with him), this will be a costly, time-consuming operation, since the whole segment is written again. How will this be handled so that users are not blocked from queries or other operations? Or is it recommended to run this operation in off-peak hours?
I suggest you add a design document and create a JIRA for this; it would be helpful. Thanks.

Regards, Akash R
Hi:
Just as I said before, we can add a new compaction type called 'iud_delta_compact' to the 'alter table table_name compact' command to support this feature. Concurrency for this feature will be handled the same way as for the other compaction types, and it is recommended to run this operation in off-peak hours. I will create a JIRA for this feature later.
Hi,
Thanks for the reply. Once you create the JIRA and the design document is ready, we can further assess the impact and anything else that needs to be handled. Thank you.

Regards, Akash R
Emm, eliminating delta files to enhance query performance is quite reasonable, and compaction is a candidate for it. However, I have some questions about this; maybe they will help with your design.
Q1: A segment with delta files means there have been some UD (update/delete) operations on this segment before, which suggests there will still be some UD operations in the future. So, is it worth compacting this segment? Also, please keep in mind that UD operations will be blocked while the compaction is going on.
Q2: I feel there may be too many kinds of compaction in CarbonData... What if, in the future, I want another compaction that merges smaller carbondata files into larger ones? Will we add yet another kind of compaction? I think it's time for us to consider extensibility for the future while proposing this feature.
Q3: Currently, all kinds of compaction use the query procedure to rewrite all the records of the related segments. Suppose we have a segment with 100 carbondata files and we delete only one record in that segment: the penalty of rewriting all the records of the segment is heavy.