Apache CarbonData Dev Mailing List archive

[Discussion] Merging carbonindex files for each segments and across segments

ravipesala

[Discussion] Merging carbonindex files for each segments and across segments

Hi,

Problem:
The first query on carbon is very slow because the driver has to read
many small carbonindex files and cache them the first time.
Many carbonindex files are created in two cases:
Case 1: Loading data in a large cluster
For example, if the cluster has 100 nodes, then each load creates 100
index files per segment. So after 100 loads, the number of
carbonindex files reaches 10000.
Case 2: Frequent loads
For example, if a load happens every 5 minutes in a 4-node cluster,
there will be more than 10000 index files after 10 days, even with only
4 nodes.

Reading all these files from the driver is slow because it involves a lot
of namenode calls and IO operations.
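
To make the arithmetic behind the two cases concrete, here is a minimal
Python sketch. It assumes one carbonindex file per node per load, as in the
examples above; the function names are illustrative only and are not part
of CarbonData.

```python
def loads_in(days: int, interval_minutes: int) -> int:
    """Number of loads in `days` days if one load runs every `interval_minutes`."""
    return days * 24 * 60 // interval_minutes


def index_file_count(nodes: int, loads: int) -> int:
    """Assumption: every load writes one carbonindex file per node."""
    return nodes * loads


# Case 1: 100-node cluster, 100 loads.
case1 = index_file_count(nodes=100, loads=100)

# Case 2: 4-node cluster, one load every 5 minutes for 10 days.
case2 = index_file_count(nodes=4, loads=loads_in(days=10, interval_minutes=5))

print(case1)  # 10000
print(case2)  # 11520
```

Under this assumption, case 2 produces 2880 loads in 10 days, which is where
the "more than 10000" figure comes from.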

Solution:
Merge the carbonindex files at two levels, so that we can reduce the IO
calls to the namenode and improve read performance.

Level 1: Merge within a segment.
Merge the carbonindex files within a segment into a single file immediately
after the load completes. It would be named as a .carbonindexmerge file.
This is not a true data merge but a simple file merge, so the current
structure of the carbonindex files does not change. While reading, we
read just one file instead of many carbonindex files within the segment.
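
As an illustration of what a "simple file merge" could look like, the
following Python sketch packs whole index files, unchanged, into one blob
and reads them back. The length-prefixed layout and the function names are
assumptions made for this example; the proposal does not specify the actual
.carbonindexmerge format.

```python
import struct
from typing import Dict


def merge_index_files(files: Dict[str, bytes]) -> bytes:
    """Pack files as repeated [name_len][name][data_len][data] records.

    The content of each index file is copied as-is, so the structure of the
    individual carbonindex files does not change (hypothetical layout).
    """
    out = bytearray()
    for name in sorted(files):
        encoded_name = name.encode("utf-8")
        data = files[name]
        out += struct.pack(">I", len(encoded_name)) + encoded_name
        out += struct.pack(">Q", len(data)) + data
    return bytes(out)


def read_merged(blob: bytes) -> Dict[str, bytes]:
    """Recover the original index files from one merged blob."""
    files: Dict[str, bytes] = {}
    pos = 0
    while pos < len(blob):
        (name_len,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (data_len,) = struct.unpack_from(">Q", blob, pos)
        pos += 8
        files[name] = blob[pos:pos + data_len]
        pos += data_len
    return files


# Stand-in contents for two index files written by two nodes.
segment_files = {"node1.carbonindex": b"AAA", "node2.carbonindex": b"BB"}
merged = merge_index_files(segment_files)
```

The reader then opens one file per segment and still sees every original
index file byte for byte.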

Level 2: Merge across segments.
The already merged carbonindex files of each segment would be merged again
after a configurable number of segments is reached. These files are placed
under the metadata folder of the table, and the information about these
merged carbonindex files will be updated in the table status file. While
reading the carbonindex files, we first check the tablestatus for the
availability of the merged file and read using the information available
in it.
For example, if the configurable number of segments for merging index files
across segments is 100, then for every 100 segments one new merged index
file will be created under the metadata folder, and the tablestatus entries
of these 100 segments are updated with the information about this file.
This file is not updatable, and it would be removed only when all the
segments of this merged index file are removed. This file is also a simple
file merge, not an actual data merge. By default this is disabled, and the
user can enable it from the carbon properties.
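
The bookkeeping described for level 2 can be sketched in Python as follows.
This is only an illustration under stated assumptions: the table status is
modelled as a plain dictionary from segment id to merged file name, and the
file naming scheme and function names are hypothetical, not CarbonData APIs.

```python
from typing import Dict, List, Optional, Set


def assign_merged_files(segment_ids: List[str],
                        batch_size: int) -> Dict[str, Optional[str]]:
    """Point every full batch of `batch_size` segments at one merged file."""
    status: Dict[str, Optional[str]] = {seg: None for seg in segment_ids}
    for start in range(0, len(segment_ids) - batch_size + 1, batch_size):
        batch = segment_ids[start:start + batch_size]
        merged_name = f"{batch[0]}_{batch[-1]}.carbonindexmerge"  # hypothetical name
        for seg in batch:
            status[seg] = merged_name
    return status


def index_files_to_read(status: Dict[str, Optional[str]]) -> List[str]:
    """Check the status first; fall back to the per-segment (level 1) file."""
    files: List[str] = []
    for seg, merged in status.items():
        target = merged if merged is not None else f"{seg}.carbonindexmerge"
        if target not in files:
            files.append(target)
    return files


def can_remove_merged_file(status: Dict[str, Optional[str]],
                           merged_name: str,
                           removed_segments: Set[str]) -> bool:
    """A merged file may go only when all of its segments are removed."""
    return all(seg in removed_segments
               for seg, merged in status.items() if merged == merged_name)


# 250 segments with a threshold of 100: two level 2 files plus 50 level 1 files.
status = assign_merged_files([str(i) for i in range(250)], batch_size=100)
to_read = index_files_to_read(status)
```

With these numbers the driver reads 52 files instead of 250, and the merged
file covering segments 0 to 99 stays in place until all 100 of those
segments have been removed.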

There is also an issue in the driver cache for old segments. It is not
necessary to cache the old segments if the queries are not interested in
them. I will start another discussion for this cache issue.

--
Thanks & Regards
Ravindra

Liang Chen-2

Re: [Discussion] Merging carbonindex files for each segments and across segments

+1 for this proposal and solution, thanks, Ravi

Regards
Liang

2017-10-20 19:13 GMT+05:30 Ravindra Pesala <[hidden email]>:

Jacky Li

Re: [Discussion] Merging carbonindex files for each segments and across segments

In reply to this post by ravipesala

Hi Ravindra,

I doubt whether the Level 2 merge is required. If the intention is to solve the problem of case 2, the user can perform data compaction, so that both data and index will be merged using the level 1 merge. That would avoid both small data files and small index files, right?

Regards,
Jacky Li

yaojinguo

Re: [Discussion] Merging carbonindex files for each segments and across segments

In reply to this post by ravipesala

If we already have many carbonindex files in the cluster, how do we merge
them? Will any tool or command be available, or do we need to reload the
data?

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

cenyuhai11

Re: [Discussion] Merging carbonindex files for each segments and across segments

In reply to this post by Jacky Li

A very good feature! I think case 1 and case 2 can both be handled.
We can merge data files and index files automatically after we insert into HDFS.
In case 1:
If the data files are not small, there will be 100 data files and 1 index file.
If the data files are small, there will be (dataSize/sizePerFile) data files and 1 index file.

When you start to develop this feature, I need this...

Best regards!
Yuhai Cen

On 21 October 2017 at 13:02, Jacky Li <[hidden email]> wrote:

ravipesala

Re: [Discussion] Merging carbonindex files for each segments and across segments

Hi,

@Jacky I feel level 2 merging is also required, as level 1 does not resolve
the problem completely. And yes, compaction might solve the issue, but in
some use cases users do not compact at all.

@yaojinguo If the table already has many index files, then a new load after
the upgrade will generate level 2 files across segments.

@cenyuhai11 We will start developing this feature very soon, and it will be
delivered in the next carbon version.

Regards,
Ravindra.

On 21 October 2017 at 17:39, 岑玉海 <[hidden email]> wrote:

--
Thanks & Regards,
Ravi

Jin Zhou

Re: [Discussion] Merging carbonindex files for each segments and across segments

In reply to this post by ravipesala

Hi, ravipesala

Thank you for your proposal. Merging index files is a very useful feature,
as we have already met a serious performance issue caused by too many index
files (case 1).

But I think the core problem of case 2 is too many loads, which should
mainly be addressed in the segment compaction part, as the "one segment,
one index file" design seems clearer and simpler.

Regards,
Jin Zhou

Liang Chen

Re: [Discussion] Merging carbonindex files for each segments and across segments

Administrator

Yes, Jin Zhou.
Merging all index files into one within a segment would be a useful feature.
It would significantly improve query performance.

Regards
Liang

dhatchayani

Re: [Discussion] Merging carbonindex files for each segments and across segments

Hi Dev,

Currently, the merge index feature is not complete and stable. It also has
some gaps: for some features, like pre-aggregate and streaming, merge index
was not supported when it was implemented. We were not able to stabilize
and use this feature then.

With this discussion, we will work again on the merge index feature, as it
can improve performance to a great extent.

xm_zzc

Re: [Discussion] Merging carbonindex files for each segments and across segments

Sounds good. Any plan for this feature? Will this feature be released with
Carbon 1.5 or 1.4.1?

dhatchayani

Re: [Discussion] Merging carbonindex files for each segments and across segments

This feature will be released in 1.4.1.
