Hi,
Problem:
The first query on a carbon table is very slow, because the driver has to read many small carbonindex files and cache them on first access. Many carbonindex files are created in two cases.

Case 1: Loading data in a large cluster
For example, if the cluster has 100 nodes, each load creates 100 index files per segment. After 100 loads, the number of carbonindex files reaches 10000.

Case 2: Frequent loads
For example, if a load happens every 5 minutes in a 4-node cluster, there will be more than 10000 index files after 10 days, even with only 4 nodes.

Reading all these files from the driver is slow because it requires many namenode calls and IO operations.

Solution:
Merge the carbonindex files at two levels, so that we reduce the IO calls to the namenode and improve read performance.

Level 1: Merge within a segment.
Merge the carbonindex files of a segment into a single file immediately after the load completes. It would be named as a .carbonindexmerge file. This is not a true data merge but a simple file merge, so the current structure of the carbonindex files does not change. While reading, we read just one file instead of many carbonindex files within the segment.

Level 2: Merge across segments.
The already merged carbonindex files of each segment are merged again once a configurable number of segments is reached. These files are placed under the metadata folder of the table, and the information about these merged carbonindex files is updated in the table status file. While reading the carbonindex files, we first check the tablestatus for the availability of a merged file and read using the information available in it.
For example, if the configured number of segments for merging index files across segments is 100, then for every 100 segments one new merged index file is created under the metadata folder, and the tablestatus entries of these 100 segments are updated with the information of this file.
This file is not updatable, and it would be removed only if all the segments of this merged index file are removed. This file is also a simple file merge, not an actual data merge. By default this is disabled, and the user can enable it from the carbon properties.

There is also an issue in the driver cache for old segments. It is not necessary to cache the old segments if the queries are not interested in them. I will start another discussion for this cache issue.

--
Thanks & Regards
Ravindra
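To make the "simple file merge" of Level 1 concrete, here is a minimal sketch. The thread does not describe the actual .carbonindexmerge layout, so the framing used below (name length, name, content length, raw bytes) and the function names are assumptions for illustration only, not the CarbonData format.

```python
import os
import struct
import tempfile


def merge_index_files(segment_dir, merged_name="segment.carbonindexmerge"):
    """Concatenate all .carbonindex files of one segment into a single file.

    Hypothetical layout per entry: 4-byte name length, UTF-8 file name,
    8-byte content length, then the untouched bytes of the index file.
    """
    names = sorted(n for n in os.listdir(segment_dir) if n.endswith(".carbonindex"))
    merged_path = os.path.join(segment_dir, merged_name)
    with open(merged_path, "wb") as out:
        for name in names:
            with open(os.path.join(segment_dir, name), "rb") as f:
                payload = f.read()
            encoded = name.encode("utf-8")
            out.write(struct.pack(">I", len(encoded)))
            out.write(encoded)
            out.write(struct.pack(">Q", len(payload)))
            out.write(payload)  # content is copied as-is, not re-encoded
    return merged_path


def read_merged_index(merged_path):
    """Read the merged file back into {original file name: original bytes}."""
    entries = {}
    with open(merged_path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (name_len,) = struct.unpack(">I", header)
            name = f.read(name_len).decode("utf-8")
            (size,) = struct.unpack(">Q", f.read(8))
            entries[name] = f.read(size)
    return entries


def demo():
    """Write three small fake index files, merge them, and read them back."""
    segment_dir = tempfile.mkdtemp()
    for i in range(3):
        with open(os.path.join(segment_dir, "part-%d.carbonindex" % i), "wb") as f:
            f.write(bytes([i]) * (i + 1))
    return read_merged_index(merge_index_files(segment_dir))
```

Because each original file is stored unchanged, a reader opens one file per segment and still sees the same per-file index content, which is the point of keeping it a file merge rather than a data merge.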
+1 for this proposal and solution, thanks, Ravi
Regards
Liang
In reply to this post by ravipesala
Hi Ravindra,
I doubt whether the Level 2 merge is required. If the intention is to solve the problem in case 2, the user can perform data compaction, so that both data and index files are merged using the Level 1 merge. That avoids both small data files and small index files, right?

Regards,
Jacky Li
In reply to this post by ravipesala
If we already have many carbonindex files in the cluster, how do we merge them? Will a tool or command be available, or do we need to reload the data?

-- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
In reply to this post by Jacky Li
A very good feature! I think case 1 and case 2 can both be handled.
We can merge data files and index files automatically after we insert into HDFS.

In case 1:
If the data files are not small, there will be 100 data files and 1 index file.
If the data files are small, there will be (dataSize/sizePerFile) data files and 1 index file.

When you start to develop this feature, I need this...

Best regards!
Yuhai Cen
Hi,

@Jacky I feel Level 2 merging is also required, as Level 1 does not resolve the problem completely. Yes, compaction might solve the issue, but in some use cases users do not compact at all.

@yaojinguo If the table already has many index files, then a new load after the upgrade will generate Level 2 files across segments.

@cenyuhai11 We will start developing this feature very soon, and it will be delivered in the next carbon version.

Regards,
Ravindra
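The Level 2 read path described in the proposal (check the tablestatus first, fall back to the per-segment file) can be sketched roughly as follows. The entry fields `segment_id` and `merged_index_file`, the file naming, and both function names are hypothetical; the thread does not specify how the tablestatus records this information.

```python
def mark_merged_segments(table_status, threshold):
    """Assign one merged index file to every complete group of `threshold`
    segments that are not yet covered by a Level 2 file (hypothetical naming)."""
    unmerged = [e for e in table_status if not e.get("merged_index_file")]
    for start in range(0, len(unmerged) - threshold + 1, threshold):
        group = unmerged[start:start + threshold]
        name = "Metadata/{}_{}.carbonindexmerge".format(
            group[0]["segment_id"], group[-1]["segment_id"])
        for entry in group:
            entry["merged_index_file"] = name
    return table_status


def index_files_to_read(table_status, segment_index_files):
    """Return the files the driver would read: the Level 2 file when the
    tablestatus entry names one, otherwise the segment's own Level 1 file."""
    files = []
    seen = set()
    for entry in table_status:
        merged = entry.get("merged_index_file")
        if merged:
            if merged not in seen:  # many segments share one Level 2 file
                seen.add(merged)
                files.append(merged)
        else:
            files.append(segment_index_files[entry["segment_id"]])
    return files


# 250 segments with a threshold of 100: two Level 2 files plus 50 Level 1 files.
status = [{"segment_id": str(i)} for i in range(250)]
level1 = {str(i): "Segment_{0}/{0}.carbonindexmerge".format(i) for i in range(250)}
to_read = index_files_to_read(mark_merged_segments(status, 100), level1)
print(len(to_read))
```

Under these assumptions, segments that have not yet reached the threshold keep working through their Level 1 file, so enabling Level 2 does not change how recent loads are read.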
In reply to this post by ravipesala
Hi, ravipesala
Thank you for your proposal. Merging index files is a very useful feature, as we have already hit a serious performance issue caused by too many index files (case 1).

But I think the core problem of case 2 is too many loads, which should mainly be addressed by segment compaction, as a "one segment, one index file" design seems clearer and simpler.

Regards,
Jin Zhou
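For reference, the file counts from the proposal can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch, assuming one index file per node per load and one segment per load; the function names are made up for illustration.

```python
def index_files_case1(nodes, loads):
    """Case 1: every load writes one index file per node."""
    return nodes * loads


def loads_in_period(load_interval_minutes, days):
    """Number of loads (and therefore segments) in the given period."""
    return (24 * 60 // load_interval_minutes) * days


def index_files_case2(nodes, load_interval_minutes, days):
    """Case 2: frequent loads on a small cluster."""
    return nodes * loads_in_period(load_interval_minutes, days)


print(index_files_case1(100, 100))   # 10000
print(index_files_case2(4, 5, 10))   # 11520
print(loads_in_period(5, 10))        # 2880
```

With one index file per segment, case 2 still leaves 2880 index files after 10 days unless segments are compacted or merged across segments.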
Yes, Jin Zhou.
Merging all index files into one per segment would be a useful feature. It would significantly improve query performance.

Regards
Liang
Hi Dev,
Currently, the merge index feature is not complete or stable. It also has some gaps: features such as pre-aggregate and streaming were not supported by merge index when it was implemented, and we were not able to stabilize and use the feature at that time. With this discussion, we will work on the merge index feature again, as it can improve performance to a great extent.
Sounds good. Is there any plan for this feature? Will it be released with Carbon 1.5 or 1.4.1?
This feature will be released in 1.4.1