Apache CarbonData Dev Mailing List archive

Support SI at Segment level

Nihal

Feb 17, 2021; 10:35am

Support SI at Segment level

Hi all,

Currently, if the parent (main) table and the SI table do not have the same
valid segments, we disable the SI table. From the next query onwards, we scan
and prune only the parent table until the next load or REINDEX command is
triggered (these commands bring the parent and SI table segments back in
sync). As a result, queries take longer to return results while SI is
disabled.

To solve this problem, we are planning to support SI at the segment level.
That is, we will not disable SI when the parent and SI tables do not have the
same segments. Instead, we will prune on SI for all segments that are valid
in the SI table, and prune the remaining segments on the main/parent table.

At the time of pruning with the main table in TableIndex.prune, if SI exists
for the corresponding filter then all segments which are not present in the
SI table will be pruned on the corresponding parent table segment.
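
The split described above can be sketched as follows. This is only an illustration of the idea, not CarbonData code: the function name, the `PruningPlan` type and the use of plain segment-id strings are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PruningPlan:
    """Segments pruned through the SI table versus directly on the main table."""
    si_segments: List[str]
    main_table_segments: List[str]


def plan_segment_level_pruning(main_valid: List[str], si_valid: List[str]) -> PruningPlan:
    """Split the main table's valid segments by whether the SI table also has them.

    Hypothetical helper: segments present in both tables are pruned via SI,
    the rest fall back to pruning on the main table. SI segments that the
    main table no longer lists as valid are ignored.
    """
    si_set = set(si_valid)
    via_si = [seg for seg in main_valid if seg in si_set]
    via_main = [seg for seg in main_valid if seg not in si_set]
    return PruningPlan(via_si, via_main)


# Main table has segments 0..3, SI table has only caught up to segment 1.
plan = plan_segment_level_pruning(["0", "1", "2", "3"], ["0", "1"])
print(plan.si_segments)          # ['0', '1']
print(plan.main_table_segments)  # ['2', '3']
```

With this split, SI stays usable for the segments it covers instead of being disabled for the whole table.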

Please let me know your thoughts and input on this.

Regards
Nihal kumar ojha

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

David CaiQiang

Feb 19, 2021; 2:45am

Re: Support SI at Segment level

Hi Nihal,
My thoughts are as follows.
1. differences between segment level and table level
a) push down SI into CarbonDataSourceScan/Relation and avoid rewriting the
SQL plan
b) different segments will have different SI, so different segments may
choose different SIs

2. data loading/compaction/update/delete/merge
a) the main table can update tablestatus metadata entry to success status
before SI loading
b) if SI is disabled, no need to do SI loading; if SI is enabled, it can
do SI loading.

3. query
a) reading the data of SI table could be on the executor side; reading the
index of SI table could be on the driver side.
b) performance: the system currently uses a distributed job (groupBy and Join
query) to collect the positionIDs of the result rows; if TableIndex.prune
uses a single thread, it will have a performance issue.
c) when the table has multiple SI tables, the table-level positionId join
should be converted to a segment-level join.
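
Point 3c can be illustrated with a small sketch. It assumes that each SI table returns, per segment, the set of position IDs matching its part of the filter, and that the filter parts are combined with AND; the data shapes, names and opaque ID strings below are hypothetical and do not reflect CarbonData's actual positionId format.

```python
from typing import Dict, List, Optional, Set

# Hypothetical shape: for one SI table, segment id -> position IDs that match.
SiHits = Dict[str, Set[str]]


def join_position_ids_per_segment(
    segments: List[str], si_hits: List[SiHits]
) -> Dict[str, Optional[Set[str]]]:
    """Join position IDs segment by segment instead of once per table.

    For each segment, only the SI tables that actually cover it take part in
    the join (intersection, assuming AND-ed filters). A value of None means
    no SI covers the segment, so it has to be pruned on the main table.
    """
    result: Dict[str, Optional[Set[str]]] = {}
    for seg in segments:
        covering = [hits[seg] for hits in si_hits if seg in hits]
        result[seg] = set.intersection(*covering) if covering else None
    return result


si_a: SiHits = {"0": {"p1", "p2"}, "1": {"p5"}}
si_b: SiHits = {"0": {"p2", "p3"}}  # si_b has not loaded segment 1 yet
joined = join_position_ids_per_segment(["0", "1", "2"], [si_a, si_b])
print(joined["0"])  # {'p2'}
print(joined["1"])  # {'p5'}
print(joined["2"])  # None
```

A table-level join would have to give up as soon as one SI table is missing a segment; joining per segment lets each segment use whichever SI tables are available for it.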

-----
Best Regards
David Cai

akashrn5

Feb 24, 2021; 7:43am

Re: Support SI at Segment level

In reply to this post by Nihal

Hi Nihal,

Thanks for bringing this up. It is an important feature for leveraging SI at
the finer, segment level as well.

Work is already under way on making SI prune through the data map interface,
so your design should be aligned with that.
It would be better to check the SI as a data map design first and then
prepare the design for this feature. That will give a clear picture for
review before the work starts; otherwise the two designs may contradict each
other.

Thanks,

Regards,
Akash R Nilugal

Nihal

Mar 04, 2021; 7:07am

Re: Support SI at Segment level

In reply to this post by Nihal

Hi,
Thanks for the input.

Work is already going on to support SI pruning through the data map
interface (without SQL plan rewrite). This will be handled with the help of a
carbon property, and we are not going to remove the current design (SI
support with SQL plan rewrite).

So first we are focusing on leveraging SI at the segment level with SQL plan
rewrite. Please go through this design document
<https://docs.google.com/document/d/1q1UIrMO4KGZuBICrixrv4JsbrblATSQVuYY0IAKxWn0/edit>
and give your input or suggestions.

Regards
Nihal kumar ojha

akashrn5

Mar 23, 2021; 5:28am

Re: Support SI at Segment level

In reply to this post by akashrn5

Hi,

+1 for the feature. This is very important for improving query performance,
instead of waiting for the SI and main table to always be in sync.

I have reviewed the doc and left comments; please address them. Please also
discuss the SI as datamap feature with @venu so that the two designs stay in
line, as mentioned earlier.

P.S.: This design should later be handled for the SI as datamap flow as well;
for now it is being handled only for the existing flow.

Thanks,

Regards,
Akash R

maheshrajus

Mar 23, 2021; 7:32am

Re: Support SI at Segment level

In reply to this post by Nihal

Hi,

+1 for the feature.
It will make the query faster.

1) As per the design discussion for the feature (SI to prune as a data frame),
there is one property to set.
If the data engine wants to use SI as a datamap, it needs to set this
property; if it is not set, the plan rewrite flow is used.

So we have to handle this feature in both cases. Could you please check and
update the design accordingly?
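
The two cases can be pictured with a minimal sketch. The property name below is made up for the example (the thread does not name the real one), and the flow labels are only placeholders.

```python
from typing import Mapping

# Hypothetical property name, used only for illustration.
SI_AS_DATAMAP_PROPERTY = "example.si.as.datamap"


def choose_si_flow(properties: Mapping[str, str]) -> str:
    """Return which SI flow a query would take under the assumed property.

    If the property is set to "true", SI is used as a datamap; if it is unset
    or set to anything else, the existing plan rewrite flow is used.
    """
    value = properties.get(SI_AS_DATAMAP_PROPERTY, "false")
    return "datamap" if value.strip().lower() == "true" else "plan_rewrite"


print(choose_si_flow({}))                                # plan_rewrite
print(choose_si_flow({SI_AS_DATAMAP_PROPERTY: "true"}))  # datamap
```

Segment-level SI support would then need to behave correctly on both branches.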

References:
SI to prune as a data frame
https://docs.google.com/document/d/1VZlRYqydjzBXmZcFLQ4Ty-lK8RQlYVDoEfIId7vOaxk/edit?usp=sharing

Thanks & Regards
Mahesh Raju Somalaraju

Ajantha Bhat

Mar 31, 2021; 4:47am

Re: Support SI at Segment level

+1 for this proposal.

But the other ongoing requirement (
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Presto-Queries-leveraging-Secondary-Index-td105291.html)
depends on *isSITableEnabled*,
so it would be better to wait for it to finish and redesign on top of it.

Thanks,
Ajantha

vikramahuja1001

Mar 31, 2021; 7:50am

Re: Support SI at Segment level

+1 on this.
Agree with Ajantha on this.

Vikram Ahuja

Nihal

Apr 05, 2021; 8:45am

Re: Support SI at Segment level

Hi All,
Thanks for your input and suggestions.

For now, we will support leveraging SI at the segment level only with SQL
plan rewrite (as already mentioned in this thread and the design document).

Parallel work is going on to support SI as datamap (without plan rewrite),
which will be at the table level.
This work is independent of the existing property "isSITableEnabled",
as mentioned in the design doc or PR 4110
<https://github.com/apache/carbondata/pull/4110>.
Also, there is no other major conflict or dependency between the two designs,
so we can safely carry out both pieces of work in parallel.

We are planning to extend the datamap SI to the segment level later (once
the PR is merged). I will create a separate JIRA ticket to track this work.

Regards
Nihal kumar ojha

Pepperon92

Jun 28, 2025; 8:39pm

Re: Support SI at Segment level

In reply to this post by Nihal

Hi there,

That’s a great observation. When the SI (secondary index) table is out of sync with the parent table, disabling it is a safety measure—but yes, it definitely impacts query performance due to the lack of indexing benefits. Until a `REINDEX` or reload happens to bring both tables back into alignment, the system defaults to scanning the full parent table, which slows things down considerably. One option to mitigate the impact is to monitor segment validity more proactively and schedule reindexing after known changes.
