Hi all,
Currently, if the parent (main) table and the SI table don’t have the same valid segments, we disable the SI table. From the next query onwards, we scan and prune only the parent table until the next load or REINDEX command is triggered (these commands bring the parent and SI table segments back in sync). Because of this, queries take longer to return results while SI is disabled.

To solve this problem, we are planning to support SI at the segment level. This means we will not disable SI when the parent and SI tables don’t have the same segments. Instead, we will prune on SI for all segments that are valid in the SI table, and prune on the main/parent table for the remaining segments.

At the time of pruning with the main table in TableIndex.prune, if SI exists for the corresponding filter, then all segments which are not present in the SI table will be pruned on the corresponding parent table segment.

Please let me know your thoughts and inputs.

Regards
Nihal kumar ojha

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
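The segment split proposed above can be illustrated with a small sketch. This is not CarbonData code: the function name `split_segments_for_pruning` and the use of plain segment-id strings are assumptions made only for illustration. It shows which segments would be pruned through SI and which would fall back to the main table.

```python
from typing import Iterable, Set, Tuple


def split_segments_for_pruning(
    main_valid_segments: Iterable[str],
    si_valid_segments: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Hypothetical helper: decide per segment where pruning happens.

    Returns (segments pruned via SI, segments pruned on the main table).
    """
    main_valid = set(main_valid_segments)
    si_valid = set(si_valid_segments)
    # Segments valid in both tables can be pruned through the SI table.
    prune_with_si = main_valid & si_valid
    # Segments missing from the SI table fall back to main table pruning.
    prune_with_main = main_valid - si_valid
    return prune_with_si, prune_with_main


# Example: the main table has segments 0..3, SI has only loaded 0 and 1.
si_segments, main_segments = split_segments_for_pruning(
    ["0", "1", "2", "3"], ["0", "1"]
)
print(sorted(si_segments))
print(sorted(main_segments))
```

With this split, SI stays usable for the segments it covers, and only the uncovered segments pay the cost of main table pruning.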
Hi Nihal,
My thoughts are as follows.

1. Differences between segment level and table level
   a) push SI down into CarbonDataSourceScan/Relation and avoid rewriting the SQL plan
   b) different segments will have different SI, so different segments may choose different SI

2. Data loading/compaction/update/delete/merge
   a) the main table can update the tablestatus metadata entry to success status before SI loading
   b) if SI is disabled, there is no need to do SI loading; if SI is enabled, it can do SI loading

3. Query
   a) reading the data of the SI table could happen on the executor side; reading the index of the SI table could happen on the driver side
   b) performance: currently the system uses a distributed job (groupBy and Join query) to collect the positionIDs of the result rows; if TableIndex.prune uses a single thread, it will have a performance issue
   c) when the table has multiple SI tables, the table-level positionId join should be converted to a segment-level join

Best Regards
David Cai
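Points 1b and 3c can be pictured with a rough sketch, assuming each SI table simply exposes the set of segment ids it has loaded. The function `si_tables_per_segment` and the SI table names in the example are hypothetical and do not come from CarbonData.

```python
from typing import Dict, List, Set


def si_tables_per_segment(
    main_valid_segments: Set[str],
    si_valid_segments: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Hypothetical helper: list the SI tables usable for each main segment.

    An empty list means the segment has no SI coverage and would be
    pruned on the main table instead.
    """
    plan: Dict[str, List[str]] = {}
    for segment in sorted(main_valid_segments):
        plan[segment] = sorted(
            si_name
            for si_name, segments in si_valid_segments.items()
            if segment in segments
        )
    return plan


# Example with two made-up SI tables that are at different load stages.
plan = si_tables_per_segment(
    {"0", "1", "2"},
    {"si_city": {"0", "1"}, "si_name": {"0"}},
)
for segment, si_names in plan.items():
    print(segment, si_names)
```

Under this view, a join across SI tables is only meaningful inside a segment where all of the participating SI tables are present, which is why the join would move from table level to segment level.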
In reply to this post by Nihal
Hi Nihal,
Thanks for bringing this up. It's an important feature, as it lets us leverage SI at the segment level as well. Work is already in progress on making SI prune through the data map interface, so your design should be aligned with that. It would be better to check the SI as data map design first and then prepare the design for this; that will give a clear picture for review and for starting the work. Otherwise, the two designs may contradict each other.

Thanks,
Regards,
Akash R Nilugal
In reply to this post by Nihal
Hi,
Thanks for the input. Work is already going on to support SI pruning through the data map interface (without SQL plan rewrite). That will be controlled by a carbon property, and we are not going to remove the current design (SI support with SQL plan rewrite). So first we are focusing on leveraging SI at the segment level with SQL plan rewrite. Please go through this design document and give your inputs or suggestions:
https://docs.google.com/document/d/1q1UIrMO4KGZuBICrixrv4JsbrblATSQVuYY0IAKxWn0/edit

Regards
Nihal kumar ojha
In reply to this post by akashrn5
Hi,
+1 for the feature. This is very important for improving query performance instead of waiting for the SI and main table to always be in sync. I have reviewed the doc and given comments; please handle them, and please discuss the SI as datamap feature with @venu so that this stays in line with it, as mentioned earlier.

P.S.: This design should later be handled for the SI as datamap flow as well; for now it's being handled only for the existing flow.

Thanks,
Regards,
Akash R
In reply to this post by Nihal
Hi,
+1 for the feature. It will make queries faster.

1) As per the design discussion, the feature (SI to prune as a data frame) has one property to set. If the data engine wants to use SI as datamap, then the property needs to be set; if it is not set, the plan rewrite flow is used.

So we have to handle this feature in both cases. Can you please check and update the design accordingly?

References:
SI to prune as a data frame
https://docs.google.com/document/d/1VZlRYqydjzBXmZcFLQ4Ty-lK8RQlYVDoEfIId7vOaxk/edit?usp=sharing

Thanks & Regards
Mahesh Raju Somalaraju

On Wed, Feb 17, 2021 at 4:05 PM Nihal <[hidden email]> wrote:
> Hi all,
>
> Currently, if the parent(main) table and SI table don’t have the same valid segments then we disable the SI table. And then from the next query onwards, we scan and prune only the parent table until we trigger the next load or REINDEX command (as these commands will make the parent and SI table segments in sync). Because of this, queries take more time to give the result when SI is disabled.
>
> To solve this problem we are planning to support SI at the segment level. It means we will not disable SI if the parent and SI table don’t have the same segments, while we will do the pruning on Si for all valid segments, and for the rest of the segments, we will do the pruning on main/parent table.
>
> At the time of pruning with the main table in TableIndex.prune, if SI exists for the corresponding filter then all segments which are not present in the SI table will be pruned on the corresponding parent table segment.
>
> Please let me know your thought and input about the same.
>
> Regards
> Nihal kumar ojha
+1 for this proposal.
But the other ongoing requirement (http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Presto-Queries-leveraging-Secondary-Index-td105291.html) depends on *isSITableEnabled*, so it is better to wait for it to finish and redesign on top of it.

Thanks,
Ajantha

On Tue, Mar 23, 2021 at 1:03 PM Mahesh Raju Somalaraju <[hidden email]> wrote:
> Hi,
>
> +1 for the feature.
> It will make the query faster.
>
> 1) With design discussion about the feature(SI to prune as a data frame) has one property to set.
> If the data engine wants to use SI as datamap then need to set. if not set then it will use plan re-write flow.
>
> So we have to handle this feature in two cases. Can you please check and update the design as per this?
>
> References:
> SI to prune as a data frame
> https://docs.google.com/document/d/1VZlRYqydjzBXmZcFLQ4Ty-lK8RQlYVDoEfIId7vOaxk/edit?usp=sharing
>
> Thanks & Regards
> Mahesh Raju Somalaraju
+1 on this.
Agree with Ajantha on this.
Vikram Ahuja
Hi All,
Thanks for your inputs and suggestions. For now, we will support leveraging SI at the segment level only with SQL plan rewrite (as already mentioned in this thread and in the design document). Work is going on in parallel to support SI as datamap (without plan rewrite), which will be at the table level. This work is independent of the existing property "isSITableEnabled", as mentioned in the design doc or PR 4110 <https://github.com/apache/carbondata/pull/4110>. Also, there is no other major conflict or dependency between the two designs, so we can safely handle both pieces of work in parallel. We are planning to extend the datamap SI to the segment level later (once the PR is merged). I will create a separate JIRA ticket to track this work.

Regards
Nihal kumar ojha
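The two flows discussed in this thread can be summarised in a small sketch. The property name below is purely hypothetical (the thread only says "some carbon property"), and the enum and function are illustrative, not CarbonData APIs. The only behaviour taken from the thread is that the plan rewrite flow is used when the property is not set.

```python
from enum import Enum
from typing import Mapping


class SiFlow(Enum):
    # SI used through the datamap interface, at table level for now.
    DATAMAP_TABLE_LEVEL = "datamap, table level"
    # SI used through SQL plan rewrite, with segment-level support.
    PLAN_REWRITE_SEGMENT_LEVEL = "plan rewrite, segment level"


# Hypothetical property name, chosen only for this example.
SI_AS_DATAMAP_PROPERTY = "example.si.as.datamap.enabled"


def choose_si_flow(properties: Mapping[str, str]) -> SiFlow:
    """Pick the SI flow; plan rewrite is the default when the property is unset."""
    value = str(properties.get(SI_AS_DATAMAP_PROPERTY, "false")).strip().lower()
    if value == "true":
        return SiFlow.DATAMAP_TABLE_LEVEL
    return SiFlow.PLAN_REWRITE_SEGMENT_LEVEL


print(choose_si_flow({}).value)
print(choose_si_flow({SI_AS_DATAMAP_PROPERTY: "true"}).value)
```

Keeping the choice in one place like this is one way the later work (segment-level support for the datamap flow) could be added without touching the plan rewrite path.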