Hi,
Currently in CarbonData we have datamaps like preaggregate, Lucene, bloom and MV, and we have lazy and non-lazy ways to load data into them. Lazy load is not allowed for datamaps like preaggregate, Lucene and bloom, but it is allowed for the MV datamap. In the lazy load of the MV datamap, for every rebuild (load to the datamap) we load the complete data of the main table and overwrite the existing segment in the datamap based on the datamap query. This is very costly in terms of performance, and we also need to support lazy and non-lazy load for all the datamaps.

This can help reduce the actual data load time of the main table, and whenever the user wants, they can do the lazy load for the datamaps present on that table. Basically, we need not overwrite the existing data every time we load to the datamap; instead we should add the new data as new segments, similar to the main table. This will give better performance. A rough sketch of the commands involved and the proposed behaviour is shown after this mail.

Please give your inputs or get back for any clarifications.

JIRA created to track this: https://issues.apache.org/jira/browse/CARBONDATA-3296
Design document: https://docs.google.com/document/d/13XgEBUIqaAKdrlQftebr5BNOplL3u9qxuFe-IJUB3cM/edit#heading=h.h311u6t3pve9

Regards,
Akash
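For readers following along, a minimal sketch of the flow being discussed, written against CarbonData's documented datamap DDL; the table/datamap names and file paths are illustrative, and the incremental behaviour in the final comment is the proposal in this thread, not the current behaviour:

// Minimal sketch of the MV datamap flow under discussion (names/paths illustrative).
// Today every rebuild reloads the full main table and overwrites the datamap data;
// the proposal is to load only the delta into new datamap segments.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mv-lazy-load-sketch").getOrCreate()

// MV datamaps are lazy: data reaches the datamap only when REBUILD is fired.
spark.sql(
  """CREATE DATAMAP sales_agg USING 'mv' WITH DEFERRED REBUILD
    |AS SELECT country, sum(amount) FROM sales GROUP BY country""".stripMargin)

spark.sql("LOAD DATA INPATH '/data/sales_1.csv' INTO TABLE sales") // segment 0
spark.sql("LOAD DATA INPATH '/data/sales_2.csv' INTO TABLE sales") // segment 1

// Current behaviour: this rebuild scans ALL main table segments and overwrites
// the existing MV data. Proposed behaviour: pick up only the segments loaded
// since the last rebuild and append them as a new datamap segment.
spark.sql("REBUILD DATAMAP sales_agg")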
Hi Akash, please note that if the index datamap supports lazy build, there could be a situation where some segments have corresponding index data while others do not, which means CarbonData has to handle this during query.
We cannot simply discard the existing index, otherwise query performance will be affected. I hope your design handles this properly. Thanks.
Sent from my Huawei phone
In reply to this post by akashrn5
Hi Akash,
It is good to have this feature, and I expect to get a good understanding of the design and solution from the design document. Please clarify the points below.

(1) How are we planning to support lazy loads? Is there a one-to-one mapping between the segments of the main table and the datamaps?
(2) If a one-to-one mapping is to be maintained, let's take pre-aggregate datamaps: even though 'n' segments are loaded to the main table, if the user creates a pre-aggregate datamap after those 'n' loads, the current pre-aggregate implementation creates only one segment, so the one-to-one mapping is broken. The old store is in this form; what is the plan to handle the old store?
(3) If the datamap is not updated/in line with the main table loads, will the system fall back to main table pruning? If so, is checking the status of the datamaps for every query (to decide whether to hit the datamaps) a time-consuming process?

Hope all these points will be covered in the design document.

Thanks,
Dhatchayani
Hi dhatchayani,
Please find my comments below.

1. Yes, you are right; the design document covers this. In the datamap status file we will add the mapping that tracks synchronization between the main table and the datamap, and based on that the incremental load is done. An illustrative sketch of such a mapping is shown after this mail.
2. In general: if the main table has 10 segments and you then do a lazy load, the first segment of the datamap is created with the data of those 10 segments. If 2 more loads are then done to the main table, the next lazy load puts only those two main table segments' data into the second segment of the datamap. For preaggregate we will keep the same behaviour as before.
3. Currently also, if you consider the MV datamap, it supports only lazy load. When the main table data and the datamap data are not synced, the datamap is not selected for pruning, and to decide this we read the datamap status file every time and make the plan accordingly. So there is no issue with reading the status.

I hope I cleared your doubts; if you have any more doubts or suggestions, please get back.

Thank you
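To illustrate point 1, a purely hypothetical sketch of what a segment-mapping entry could look like; the class and field names are invented for illustration and do not reflect the actual datamap status file format:

// Hypothetical model of the synchronization info kept per datamap segment.
// Names are illustrative only; the real status file layout may differ.
case class DataMapSegmentEntry(
    dataMapSegmentId: String,           // segment created in the datamap table
    mainTableSegmentIds: Seq[String])   // main table segments it was built from

// Scenario from point 2: 10 main table segments -> datamap segment "0",
// then 2 more main table loads -> datamap segment "1" on the next rebuild.
val mapping = Seq(
  DataMapSegmentEntry("0", (0 to 9).map(_.toString)),
  DataMapSegmentEntry("1", Seq("10", "11")))

// A later rebuild only needs the main table segments not yet covered.
val covered = mapping.flatMap(_.mainTableSegmentIds).toSet
val allMainSegments = (0 to 11).map(_.toString).toSet
val pendingSegments = allMainSegments.diff(covered) // empty => datamap in sync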
In reply to this post by xuchuanyin
Hi xuchuanyin,
For index datamaps we can have the same behaviour as the MV datamap, but it might behave differently in the case of Lucene; we can decide whether to enable lazy load for it or not. The current MV behaviour is that it supports only lazy load. So when the main table data and the datamap data are not synced, the datamap is not selected for pruning. We read the datamap status file, and if the datamap is disabled it is not considered during query, as we change the plan accordingly.

Please get back for any more clarifications or suggestions.

Thank you
“For index datamaps we can have the same behaviour as the MV datamap”
=== Emm, that is exactly what I am concerned about. In your plan, if the index datamap is lazy, then each time after a data load completes, this datamap will be ignored until a rebuild is fired for that segment, even though all the index data for the historical segments could still be used in the meantime. In a word, I think this implementation is *UNACCEPTABLE*.

Actually I also came across the 'lazy index datamap' feature months ago and gave up the idea, because I thought such an implementation would make the feature useless and no one would try it in the real world.

I strongly *recommend* the following implementation. Each time after a data load is finished and before the index data for this segment is generated, if a query is fired:
1. For the historical segments which already have index data generated, CarbonData will prune using the corresponding index datamap data and return pruning result A;
2. For the historical segments and the newly generated segment which do not have index data generated, CarbonData will skip pruning with the index datamap and return pruning result B;
3. CarbonData will use A union B as the pruning result on the driver side.

The above implementation means that CarbonData should *support pruning by segment*. (A rough sketch of this per-segment pruning follows after this mail.)
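To make the suggested behaviour concrete, a rough sketch of per-segment pruning with the A union B result; all types and helper names here are hypothetical placeholders and do not correspond to actual CarbonData APIs:

// Sketch of segment-wise pruning: use the index datamap where index data
// exists, fall back to the default (blocklet) datamap elsewhere, then union.
case class Segment(id: String, hasIndexData: Boolean)
case class Blocklet(segmentId: String, path: String)

// Placeholder pruners; real implementations would read index/blocklet files.
def pruneWithIndexDataMap(seg: Segment, filter: String): Seq[Blocklet] = Seq.empty
def pruneWithDefaultDataMap(seg: Segment, filter: String): Seq[Blocklet] = Seq.empty

def prune(segments: Seq[Segment], filter: String): Seq[Blocklet] = {
  val (indexed, notIndexed) = segments.partition(_.hasIndexData)
  val a = indexed.flatMap(pruneWithIndexDataMap(_, filter))      // result A
  val b = notIndexed.flatMap(pruneWithDefaultDataMap(_, filter)) // result B
  a ++ b                                                         // A union B
}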
I agree with you that the index created for old segments will be of no use if rebuild has not happened, and that those segments are then not considered for pruning during query. But we go for datamap (index) pruning based on the datamap status, and that status is just enabled or disabled: we cannot maintain a status for each segment, there is only one status, and based on it we decide whether to prune. Currently we have deferred build, which is blocked for index datamaps and preaggregate, but for the MV datamap it is always considered true: every time you need to run a rebuild after a main table load to enable the MV datamap. So basically, deferred build is blocked for index datamaps and does not matter for MV, as we always rebuild.

So are you suggesting we make everything non-lazy, including MV? That may hit data load performance. To implement what you suggested, we would need to maintain a status for each segment indicating whether the datamap is enabled for it, which I doubt we should do.
I think there is still a misunderstanding between us.
Here I am only concerned about the lazy build for index datamaps. I think each segment should have its own datamap status, and based on this we can support pruning by index datamap for each segment. With this, even if the datamap is lazy, during query we can still make use of the index data of the historical segments (which already have index data generated).
I got your point. If each segment has its own status, as I said, we can do pruning without rebuild as well. But we need to get others' suggestions on this point, so maybe we can take this up in another JIRA and track it there. In this JIRA we can just support incremental data load.
Hi Akash,
There is a difference between index datamaps (like bloom) and OLAP datamaps (like MV). Index datamaps are used only for pruning the data, while OLAP datamaps are used as pre-computed data which can be fetched directly as per the query. In the OLAP datamap case, lazy build or deferred build makes sense, as the data always needs to be synchronized with the master data, otherwise we will get stale data; so any difference in synchronization makes the datamap disabled. On the other hand, an index datamap is used only for faster pruning, so synchronization with the master data is not mandatory, unless we have a mechanism to prune synchronized data using the index datamap and non-synchronized data using the default datamap. This is the same point @xuchuanyin mentioned.

I feel this design is about OLAP datamap incremental loading, so it is better not to change the behaviour of index datamaps. We can consider improvements to index datamaps in the future, but they should not be part of this work. Please update the design document if it mentions anything related to index datamaps.

Regards,
Ravindra.
+1 for ravin's advice.
We only support lazy/incremental load/rebuild for OLAP datamaps (MV/preaggregate), not for index datamaps currently.
In reply to this post by ravipesala
Hi Ravindra,
Got your point. As I replied to xuchuanyin, we can take up these index datamap enhancements separately.

Thank you
In reply to this post by akashrn5
Hi akashrn5,
Can this feature support the following points for the MV datamap?
1) lazy or non-lazy mode, selected via an option
2) reuse the existing MV data and generate a new segment as you describe, or overwrite the MV table directly, depending on an option
3) merge the MV table segments like the main table, both automatically and manually

Regards,
qiuchenjian
Hi qiuchenjian,
*1) lazy or non-lazy mode when using different option*
Currently MV supports data load only through lazy mode. Incremental data loading will be supported for the same.

*2) reuse the existing MV data and generate new segment like you describe or overwrite MV table directly when using different option*
As per the first point, incremental data loading for MV will follow the explanation given in the posts above. Even if we support non-lazy data load mode for MV, overwrite will not happen.

*3) merge the MV table segments like main table, automatically and manually*
Yes. Both auto compaction and manual compaction will be supported for the MV datamap (a sketch of the proposed command is shown after this mail).
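For point 3, the command shape under discussion would look roughly like this (reusing the SparkSession from the earlier sketch); note this is the syntax proposed in this thread and the design document, not a command that exists today, and the datamap name is illustrative:

// Proposed manual compaction of MV datamap segments, mirroring table compaction.
// Syntax is as proposed in this thread / the design doc, not yet released.
spark.sql("ALTER DATAMAP sales_agg COMPACT 'MINOR'")
spark.sql("ALTER DATAMAP sales_agg COMPACT 'MAJOR'")

// Auto compaction is assumed to reuse the existing table-level switch, e.g.
// carbon.enable.auto.load.merge=true (assumption for the MV case).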
In reply to this post by akashrn5
1. Will data loading to MV be supported for all the table properties and load options provided by Carbon?
2. Will the newly proposed "alter datamap datamapname compact" command support all compaction types (auto/major/minor/custom)?
3. Will load to a partition table having an MV also be supported?
Hello all,
Please find the updated design document for incremental data loading at the link below:
https://docs.google.com/document/d/1AACOYmBpwwNdHjJLOub0utSc6JCBMZn8VL5CvZ9hygA/edit?usp=sharing

Regards,
Akash