Apache CarbonData Dev Mailing List archive

[Discussion]Do we still need to support carbon.merge.index.in.segment property ?

Classic

List

Threaded

10 messages Options

VenuReddy

[Discussion]Do we still need to support carbon.merge.index.in.segment property ?

Dear Community.!

Have recently encountered a problem of Segment directory and segment file in
metadata directory are not created for partitioned table when
'carbon.merge.index.in.segment' property is set to 'false'. And actual index
files which were present in respective partition's '.tmp' directory are also
deleted without moving them out to respective partition directory where its
'.carbondata' file exist. Thus queries throw exception while reading index
files. Please refer jira issue -
https://issues.apache.org/jira/browse/CARBONDATA-3834
<https://issues.apache.org/jira/browse/CARBONDATA-3834>

To address this issue, we have 2 options to go with -
1. Either fix it to work for 'carbon.merge.index.in.segment' set to false
case. There is an open PR https://github.com/apache/carbondata/pull/3776
<https://github.com/apache/carbondata/pull/3776> for it.

2. Or Deprecate the 'carbon.merge.index.in.segment' property itself. As the
query performance is better when merge index files are in use when compared
to normal index files, and default behavior is to generate merge index
files, probably it is not necessary to support
'carbon.merge.index.in.segment' anymore.

What do you think about it ? Please let me know your opinion.

Thanks,
Venu

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

David CaiQiang

Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

Better to always merge index.

-1 for 1,

+1 for 2,

-----
Best Regards
David Cai
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Best Regards
David Cai

Ajantha Bhat

Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

Hi, What if too many index files in a segment and user want to finish load
fast and don't want to wait for merge index?

That time setting merge index = false can help to save load time and in off
peak time user can create merge index.

So I still feel we need to fix issue exist when merge index = false.

Thanks,
Ajantha

On Thu, 9 Jul, 2020, 3:05 pm David CaiQiang, <[hidden email]> wrote:

> Better to always merge index.
>
> -1 for 1,
>
> +1 for 2,
>
>
>
> -----
> Best Regards
> David Cai
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>

kunalkapoor

Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

Agree with david on always using merge index.

+1 for deprecating the property.

On Thu, Jul 9, 2020 at 3:32 PM Ajantha Bhat <[hidden email]> wrote:

> Hi, What if too many index files in a segment and user want to finish load
> fast and don't want to wait for merge index?
>
> That time setting merge index = false can help to save load time and in off
> peak time user can create merge index.
>
> So I still feel we need to fix issue exist when merge index = false.
>
> Thanks,
> Ajantha
>
> On Thu, 9 Jul, 2020, 3:05 pm David CaiQiang, <[hidden email]> wrote:
>
> > Better to always merge index.
> >
> > -1 for 1,
> >
> > +1 for 2,
> >
> >
> >
> > -----
> > Best Regards
> > David Cai
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>

kumarvishal09

Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

+1 for deprecating this property. We can have it as an internal property
for dev, not expose to user.

+1 for fixing the issue. If merge index creation is failing, then also
query should work fine with index file.

-Regards
Kumar Vishal

On Thu, 9 Jul 2020 at 10:58 PM, Kunal Kapoor <[hidden email]>
wrote:

> Agree with david on always using merge index.
>
> +1 for deprecating the property.
>
> On Thu, Jul 9, 2020 at 3:32 PM Ajantha Bhat <[hidden email]> wrote:
>
> > Hi, What if too many index files in a segment and user want to finish
> load
> > fast and don't want to wait for merge index?
> >
> > That time setting merge index = false can help to save load time and in
> off
> > peak time user can create merge index.
> >
> > So I still feel we need to fix issue exist when merge index = false.
> >
> > Thanks,
> > Ajantha
> >
> > On Thu, 9 Jul, 2020, 3:05 pm David CaiQiang, <[hidden email]>
> wrote:
> >
> > > Better to always merge index.
> > >
> > > -1 for 1,
> > >
> > > +1 for 2,
> > >
> > >
> > >
> > > -----
> > > Best Regards
> > > David Cai
> > > --
> > > Sent from:
> > >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >
> >
>

kumar vishal

akashrn5

Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

In reply to this post by VenuReddy

Hi,

+1, we can deprecate it and as Vishal suggested we can keep as internal
property for developer purpose.

Regards,
Akash R Nilugal

On Thu, Jul 9, 2020, 2:46 PM VenuReddy <[hidden email]> wrote:

> Dear Community.!
>
> Have recently encountered a problem of Segment directory and segment file
> in
> metadata directory are not created for partitioned table when
> 'carbon.merge.index.in.segment' property is set to 'false'. And actual
> index
> files which were present in respective partition's '.tmp' directory are
> also
> deleted without moving them out to respective partition directory where its
> '.carbondata' file exist. Thus queries throw exception while reading index
> files. Please refer jira issue -
> https://issues.apache.org/jira/browse/CARBONDATA-3834
> <https://issues.apache.org/jira/browse/CARBONDATA-3834>
>
> To address this issue, we have 2 options to go with -
> 1. Either fix it to work for 'carbon.merge.index.in.segment' set to false
> case. There is an open PR https://github.com/apache/carbondata/pull/3776
> <https://github.com/apache/carbondata/pull/3776> for it.
>
> 2. Or Deprecate the 'carbon.merge.index.in.segment' property itself. As
> the
> query performance is better when merge index files are in use when compared
> to normal index files, and default behavior is to generate merge index
> files, probably it is not necessary to support
> 'carbon.merge.index.in.segment' anymore.
>
> What do you think about it ? Please let me know your opinion.
>
> Thanks,
> Venu
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>

Ajantha Bhat

Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

Hi,
I didn't reply to deprecation. *+1 for deprecating it*.

*And +1 for issue fix also.*
Issue fix, I didn't mean when *carbon.merge.index.in
<http://carbon.merge.index.in/>.segment = false.*

but when when *carbon.merge.index.in
<http://carbon.merge.index.in/>.segment = true and merge index creation
failed for some reason.*
code needs to take care of
a. Moving index files from temp folder to final folder in case of partition
table load.
b. Not failing the current partition load. (same as normal load behavior)
I think these two are not handled after partition optimization, you can
check and handle it.

Thanks,
Ajantha

On Thu, 9 Jul, 2020, 9:29 pm Akash r, <[hidden email]> wrote:

> Hi,
>
> +1, we can deprecate it and as Vishal suggested we can keep as internal
> property for developer purpose.
>
> Regards,
> Akash R Nilugal
>
> On Thu, Jul 9, 2020, 2:46 PM VenuReddy <[hidden email]> wrote:
>
> > Dear Community.!
> >
> > Have recently encountered a problem of Segment directory and segment file
> > in
> > metadata directory are not created for partitioned table when
> > 'carbon.merge.index.in.segment' property is set to 'false'. And actual
> > index
> > files which were present in respective partition's '.tmp' directory are
> > also
> > deleted without moving them out to respective partition directory where
> its
> > '.carbondata' file exist. Thus queries throw exception while reading
> index
> > files. Please refer jira issue -
> > https://issues.apache.org/jira/browse/CARBONDATA-3834
> > <https://issues.apache.org/jira/browse/CARBONDATA-3834>
> >
> > To address this issue, we have 2 options to go with -
> > 1. Either fix it to work for 'carbon.merge.index.in.segment' set to
> false
> > case. There is an open PR
> https://github.com/apache/carbondata/pull/3776
> > <https://github.com/apache/carbondata/pull/3776> for it.
> >
> > 2. Or Deprecate the 'carbon.merge.index.in.segment' property itself. As
> > the
> > query performance is better when merge index files are in use when
> compared
> > to normal index files, and default behavior is to generate merge index
> > files, probably it is not necessary to support
> > 'carbon.merge.index.in.segment' anymore.
> >
> > What do you think about it ? Please let me know your opinion.
> >
> > Thanks,
> > Venu
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>
>

ravipesala

Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

Hi,

+1
I agree with Vishal, let's deprecate the configuration and keep it as
internal.

Regards,
Ravindra.

On Fri, 10 Jul 2020 at 01:54, Ajantha Bhat <[hidden email]> wrote:

> Hi,
> I didn't reply to deprecation. *+1 for deprecating it*.
>
> *And +1 for issue fix also.*
> Issue fix, I didn't mean when *carbon.merge.index.in
> <http://carbon.merge.index.in/>.segment = false.*
>
> but when when *carbon.merge.index.in
> <http://carbon.merge.index.in/>.segment = true and merge index creation
> failed for some reason.*
> code needs to take care of
> a. Moving index files from temp folder to final folder in case of partition
> table load.
> b. Not failing the current partition load. (same as normal load behavior)
> I think these two are not handled after partition optimization, you can
> check and handle it.
>
>
> Thanks,
> Ajantha
>
> On Thu, 9 Jul, 2020, 9:29 pm Akash r, <[hidden email]> wrote:
>
> > Hi,
> >
> > +1, we can deprecate it and as Vishal suggested we can keep as internal
> > property for developer purpose.
> >
> > Regards,
> > Akash R Nilugal
> >
> > On Thu, Jul 9, 2020, 2:46 PM VenuReddy <[hidden email]>
> wrote:
> >
> > > Dear Community.!
> > >
> > > Have recently encountered a problem of Segment directory and segment
> file
> > > in
> > > metadata directory are not created for partitioned table when
> > > 'carbon.merge.index.in.segment' property is set to 'false'. And actual
> > > index
> > > files which were present in respective partition's '.tmp' directory are
> > > also
> > > deleted without moving them out to respective partition directory where
> > its
> > > '.carbondata' file exist. Thus queries throw exception while reading
> > index
> > > files. Please refer jira issue -
> > > https://issues.apache.org/jira/browse/CARBONDATA-3834
> > > <https://issues.apache.org/jira/browse/CARBONDATA-3834>
> > >
> > > To address this issue, we have 2 options to go with -
> > > 1. Either fix it to work for 'carbon.merge.index.in.segment' set to
> > false
> > > case. There is an open PR
> > https://github.com/apache/carbondata/pull/3776
> > > <https://github.com/apache/carbondata/pull/3776> for it.
> > >
> > > 2. Or Deprecate the 'carbon.merge.index.in.segment' property itself.
> As
> > > the
> > > query performance is better when merge index files are in use when
> > compared
> > > to normal index files, and default behavior is to generate merge index
> > > files, probably it is not necessary to support
> > > 'carbon.merge.index.in.segment' anymore.
> > >
> > > What do you think about it ? Please let me know your opinion.
> > >
> > > Thanks,
> > > Venu
> > >
> > >
> > >
> > > --
> > > Sent from:
> > >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >
> >
> >
>

--
Thanks & Regards,
Ravi

David CaiQiang

Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

In reply to this post by kumarvishal09

The merging index should be a part of loading. It is not good to extract the
merging index to an independent process, it brought the query issue (the
system can't find the index files when/after merging).

In my opinion, during loading, new .carbonindex files should be temporary,
we should merge them to a .carbonindexmerge file in a segment before
updating the segment status to success in tablestatus file.
When the merging index failed, loading should be failed.

for query:
1. support reading .carbonindex files and .carbonindexmerge files

for loading: (also include the loading part of compaction/create
index/create mv/merge operations)
better to do like this.
step 1. update tablestatus file to add an in-progress segment
step 2. generate carbondata file and temporary .carbonindex files, for a
partitioned table, it also generates a temporary segment file for each
related partition.
step 3. merge .carbonindex files to a .carbonindexmerge file.
step 4. write a segment file. for a partitioned table, merge all temporary
segment files to one segment file.
step 5. update tablestatus file with final status, segment file name and
some statistics.

So in total,
update tablestatus file twice,
write segment file once,
write .carbonindex files once,
write and delete .carbonindex files once.

for updating:
1. Now only updating operation can keep .carbonindex file
in the future, maybe we can change updating operations to the same with
merge operation to generate new files into a new segment.

-----
Best Regards
David Cai
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Best Regards
David Cai

David CaiQiang

Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

update reply:

The merging index should be a part of loading. It is not good to extract the
merging index to an independent process, it brought the query issue (the
system can't find the index files when/after merging).

In my opinion, during loading, new .carbonindex files should be temporary,
we should merge them to a .carbonindexmerge file in a segment before
updating the segment status to success in tablestatus file.
When the merging index failed, loading should be failed.

for query:
1. support reading .carbonindex files and .carbonindexmerge files

for loading: (also include the loading part of compaction/create
index/create mv/merge operations)
better to do like this.
step 1. update tablestatus file to add an in-progress segment
step 2. generate carbondata files and temporary .carbonindex files.
step 3. merge .carbonindex files to a .carbonindexmerge file.
step 4. write a segment file.
step 5. update tablestatus file with final status, segment file name and
some statistics.

So in total,
update tablestatus file twice,
write segment file once,
write .carbonindexmerge files once,
write and delete .carbonindex files once.

for updating:
1. Now only updating operation can keep .carbonindex file
in the future, maybe we can change updating operations to the same with
merge operations to generate new files into a new segment.

-----
Best Regards
David Cai
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Best Regards
David Cai