jack86596 opened a new pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105

### Why is this PR needed?
Reindex fails with a FileNotFoundException when the SI has a stale carbonindexmerge file. SegmentFileStore.getIndexFiles stores a mapping from index file to index merge file, so when a stale carbon index merge file exists, the merge file name is not null. During index file merging, a new index merge file is created with the same name as before, in the same location. At the end of CarbonIndexFileMergeWriter.writeMergeIndexFileBasedOnSegmentFile, the carbon index files are deleted. Since the index merge file name is stored in the indexFiles map, the newly created index merge file is deleted as well, which leads to the FileNotFoundException.

### What changes were proposed in this PR?
1. SegmentFileStore.getIndexFiles no longer stores the mapping of index file to index merge file, which is redundant.
2. SegmentFileStore.getIndexOrMergeFiles returns both index files and index merge files, so the function name is incorrect; rename it to getIndexAndMergeFiles.
3. CarbonLoaderUtil.getActiveExecutor actually gets active nodes, so the function name is incorrect; rename it to getActiveNode, and replace all occurrences of "executor" with "node" in assignBlocksByDataLocality.

### Does this PR introduce any user interface change?
- No

### Is any new testcase added?
- Yes

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [hidden email]
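The deletion bug described above can be simulated in a few lines. This is a hedged, language-neutral sketch in Python, not the actual CarbonData code; the file names and the in-memory "filesystem" set are invented for illustration. It shows why carrying the merge file name in the index-file map causes the freshly written merge index file to be deleted along with the old index files.

```python
# Simulated segment directory: an in-memory "filesystem" of file names.
fs = {"part0.carbonindex", "part1.carbonindex", "0.carbonindexmerge"}  # stale merge file

# Old behaviour of getIndexFiles: index file -> merge file name (non-null when stale).
index_files = {"part0.carbonindex": "0.carbonindexmerge",
               "part1.carbonindex": "0.carbonindexmerge"}

def write_merge_index(fs):
    # The new merge index file is written with the SAME name, in the same location.
    fs.add("0.carbonindexmerge")

def delete_old_files(fs, files):
    # Cleanup step at the end of merging: deletes every name in the collection.
    for f in files:
        fs.remove(f)

# Buggy flow: the to-delete set includes the merge file name from the map values,
# so the newly created merge index file is deleted too.
write_merge_index(fs)
to_delete = set(index_files) | {m for m in index_files.values() if m}
delete_old_files(fs, to_delete)
assert "0.carbonindexmerge" not in fs  # later reads raise FileNotFoundException

# Fixed flow: the map no longer carries merge file names; only index files are deleted.
fs = {"part0.carbonindex", "part1.carbonindex"}
write_merge_index(fs)
delete_old_files(fs, set(index_files))  # keys only
assert "0.carbonindexmerge" in fs       # the new merge index file survives
```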
CarbonDataQA2 commented on pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#issuecomment-797370576

Build Success with Spark 2.3.4, please check CI: http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5562/
CarbonDataQA2 commented on pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#issuecomment-797371226

Build Success with Spark 2.4.5, please check CI: http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3796/
vikramahuja1001 commented on pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#issuecomment-799208880

@jack86596, won't running clean files before doing the reindex operation solve this issue?
vikramahuja1001 commented on a change in pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#discussion_r594123727

########## File path: index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/secondaryindex/TestIndexRepair.scala ##########
@@ -119,6 +119,19 @@ class TestIndexRepair extends QueryTest with BeforeAndAfterAll {
     sql("drop table if exists maintable")
   }

+  test("reindex command with stale files") {
+    sql("drop table if exists maintable")
+    sql("CREATE TABLE maintable(a INT, b STRING, c STRING) stored as carbondata")
+    sql("CREATE INDEX indextable1 on table maintable(c) as 'carbondata'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("DELETE FROM TABLE INDEXTABLE1 WHERE SEGMENT.ID IN(0,1,2)")
+    sql("REINDEX INDEX TABLE indextable1 ON MAINTABLE WHERE SEGMENT.ID IN (0,1)")

Review comment: Technically speaking, reindex should not be allowed at this point, because the segment still exists (even if it is MFD); the purpose of the reindex command is to reload missing segments, which is not the case here. Also, will the same behaviour happen for other segment statuses, say success or partial success? @akashrn5 can give some input maybe.
jack86596 edited a comment on pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#issuecomment-799227509

> @jack86596 , won't running clean files before doing the reindex operation solve this issue?

Why would clean files need to be run before the reindex operation? Is there any dependency between reindex and clean files?
jack86596 commented on a change in pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#discussion_r594143885

########## File path: index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/secondaryindex/TestIndexRepair.scala ##########
@@ -119,6 +119,19 @@ class TestIndexRepair extends QueryTest with BeforeAndAfterAll {
     sql("drop table if exists maintable")
   }

+  test("reindex command with stale files") {
+    sql("drop table if exists maintable")
+    sql("CREATE TABLE maintable(a INT, b STRING, c STRING) stored as carbondata")
+    sql("CREATE INDEX indextable1 on table maintable(c) as 'carbondata'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("DELETE FROM TABLE INDEXTABLE1 WHERE SEGMENT.ID IN(0,1,2)")
+    sql("REINDEX INDEX TABLE indextable1 ON MAINTABLE WHERE SEGMENT.ID IN (0,1)")

Review comment: So please provide a better solution to this issue: main table segment SUCCESS, SI table segment marked for delete. The solution should not be "run clean files for the SI table first", because that is no better than the current one.
vikramahuja1001 commented on pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#issuecomment-799320505

> > @jack86596 , won't running clean files before doing the reindex operation solve this issue?
>
> why before reindex operation, need to run clean files? any dependency relation between reindex and clean files?

No, there is no need to do clean files before reindex and there is no dependency, but we only do reindex when segments are missing, which is not the case here. What I meant was that running clean files on the SI table would solve your problem.
jack86596 edited a comment on pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#issuecomment-799346641

> > > @jack86596 , won't running clean files before doing the reindex operation solve this issue?
> >
> > why before reindex operation, need to run clean files? any dependency relation between reindex and clean files?
>
> No, there is no need to do clean files before reindex there is no dependency, but we only do reindex when the segments are missing, it is not the case here though. What i meant was that running clean files on si table would solve your problem

Why do I need to run one extra command to solve the problem? Why not just solve the problem directly, so that no extra command is needed? If clean files has to be run in order to solve the problem, then that is a dependency (the success of the reindex command depends on the clean files command).
Indhumathi27 commented on a change in pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#discussion_r595110340

########## File path: index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/secondaryindex/TestIndexRepair.scala ##########
@@ -119,6 +119,19 @@ class TestIndexRepair extends QueryTest with BeforeAndAfterAll {
     sql("drop table if exists maintable")
   }

+  test("reindex command with stale files") {
+    sql("drop table if exists maintable")
+    sql("CREATE TABLE maintable(a INT, b STRING, c STRING) stored as carbondata")
+    sql("CREATE INDEX indextable1 on table maintable(c) as 'carbondata'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("DELETE FROM TABLE INDEXTABLE1 WHERE SEGMENT.ID IN(0,1,2)")
+    sql("REINDEX INDEX TABLE indextable1 ON MAINTABLE WHERE SEGMENT.ID IN (0,1)")

Review comment: I think, first of all, deleting a segment directly from the index table should not be allowed. For the second case, where the main table contains SUCCESS segments and the SI has MFD ones, we can delete the old stale files in that segment during reindex, before loading data again.
jack86596 commented on a change in pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#discussion_r595136721

########## File path: index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/secondaryindex/TestIndexRepair.scala ##########
@@ -119,6 +119,19 @@ class TestIndexRepair extends QueryTest with BeforeAndAfterAll {
     sql("drop table if exists maintable")
   }

+  test("reindex command with stale files") {
+    sql("drop table if exists maintable")
+    sql("CREATE TABLE maintable(a INT, b STRING, c STRING) stored as carbondata")
+    sql("CREATE INDEX indextable1 on table maintable(c) as 'carbondata'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("DELETE FROM TABLE INDEXTABLE1 WHERE SEGMENT.ID IN(0,1,2)")
+    sql("REINDEX INDEX TABLE indextable1 ON MAINTABLE WHERE SEGMENT.ID IN (0,1)")

Review comment:
1. On "deleting a segment directly from the index table should not be allowed": this is a unit test, and we are simulating the issue (main table segment SUCCESS, SI table segment MFD). Real users do not need to manually delete an SI segment to hit this issue; they face it directly in their daily runs without any manual operation.
2. On "during reindex, before loading data again, we can delete old stale files in that segment": yes, you provided a solution and I agree it will work, but again, it is not as good as the current one, because with the current one you don't need to delete the stale files and reindex still works.
Indhumathi27 commented on a change in pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#discussion_r595146990

########## File path: index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/secondaryindex/TestIndexRepair.scala ##########
@@ -119,6 +119,19 @@ class TestIndexRepair extends QueryTest with BeforeAndAfterAll {
     sql("drop table if exists maintable")
   }

+  test("reindex command with stale files") {
+    sql("drop table if exists maintable")
+    sql("CREATE TABLE maintable(a INT, b STRING, c STRING) stored as carbondata")
+    sql("CREATE INDEX indextable1 on table maintable(c) as 'carbondata'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("DELETE FROM TABLE INDEXTABLE1 WHERE SEGMENT.ID IN(0,1,2)")
+    sql("REINDEX INDEX TABLE indextable1 ON MAINTABLE WHERE SEGMENT.ID IN (0,1)")

Review comment:
1. A user can face this scenario; what I meant is that allowing delete directly on the index table is itself a new issue. When a user does it by mistake, the main table and SI will be out of sync, and the user then unnecessarily has to perform reindex. So you can handle this scenario in this PR.
2. If you are going to return only index files, the merge file name will always be null. So you can return a List of index files alone instead of a Map from the getIndexFiles method.
jack86596 commented on a change in pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#discussion_r595161142

########## File path: index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/secondaryindex/TestIndexRepair.scala ##########
@@ -119,6 +119,19 @@ class TestIndexRepair extends QueryTest with BeforeAndAfterAll {
     sql("drop table if exists maintable")
   }

+  test("reindex command with stale files") {
+    sql("drop table if exists maintable")
+    sql("CREATE TABLE maintable(a INT, b STRING, c STRING) stored as carbondata")
+    sql("CREATE INDEX indextable1 on table maintable(c) as 'carbondata'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("DELETE FROM TABLE INDEXTABLE1 WHERE SEGMENT.ID IN(0,1,2)")
+    sql("REINDEX INDEX TABLE indextable1 ON MAINTABLE WHERE SEGMENT.ID IN (0,1)")

Review comment:
1. Users will not delete SI segments directly. The issue (main table segment SUCCESS, SI segment MFD) arises because carbon has many bugs that can lead to this situation even when the user only does loading and compaction. And in production we (the maintainers) actually need to delete directly on the index table, precisely to recover a carbon table from a broken state to a normal state and then trigger reindex to repair the broken SI segment. So until all the bugs that can put the main table and SI table out of sync are solved, we cannot block this.
2. On returning a List of index files alone: if you say so, I can change it.
VenuReddy2103 commented on a change in pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#discussion_r595165241

########## File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java ##########
@@ -841,8 +841,7 @@ public SegmentFile getSegmentFile() {
       }
       if (entry.getValue().status.equals(SegmentStatus.SUCCESS.getMessage())) {
         for (String indexFile : entry.getValue().getFiles()) {
-          indexFiles.put(location + CarbonCommonConstants.FILE_SEPARATOR + indexFile,
-              entry.getValue().mergeFileName);
+          indexFiles.put(location + CarbonCommonConstants.FILE_SEPARATOR + indexFile, null);

Review comment: Its caller `SegmentFileStore.getIndexCarbonFiles()` has code that handles a non-null value to add the merge index file to the index file list as well (at line 906). That becomes dead code now; you would want to remove it too.
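To illustrate the dead-code point: once the map values are always null, any caller branch keyed on a non-null merge file name can never execute. The following is a hedged Python sketch, not the actual Java code; the function and field names loosely mirror `getIndexFiles`/`getIndexCarbonFiles`, but the data shapes are invented for illustration.

```python
def get_index_files(entries, location):
    """After this PR: the map value is always None (merge file name no longer stored)."""
    index_files = {}
    for entry in entries:
        for index_file in entry["files"]:
            index_files[location + "/" + index_file] = None
    return index_files

def get_index_carbon_files(index_files):
    """Caller logic resembling getIndexCarbonFiles: collects the index files and,
    previously, also appended the merge index file when the map value was non-null."""
    result = list(index_files)
    for merge_file in index_files.values():
        if merge_file is not None:  # dead branch: the value is always None now
            result.append(merge_file)
    return result

entries = [{"files": ["part0.carbonindex", "part1.carbonindex"]}]
files = get_index_carbon_files(get_index_files(entries, "/seg_0"))
assert files == ["/seg_0/part0.carbonindex", "/seg_0/part1.carbonindex"]
```

Since the non-null branch can no longer fire, removing it (and, as suggested earlier in the thread, returning a plain list instead of a map) keeps the API honest about what it actually carries.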
jack86596 commented on a change in pull request #4105:
URL: https://github.com/apache/carbondata/pull/4105#discussion_r595167863

########## File path: index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/secondaryindex/TestIndexRepair.scala ##########
@@ -119,6 +119,19 @@ class TestIndexRepair extends QueryTest with BeforeAndAfterAll {
     sql("drop table if exists maintable")
  }

+  test("reindex command with stale files") {
+    sql("drop table if exists maintable")
+    sql("CREATE TABLE maintable(a INT, b STRING, c STRING) stored as carbondata")
+    sql("CREATE INDEX indextable1 on table maintable(c) as 'carbondata'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("INSERT INTO maintable SELECT 1,'string1', 'string2'")
+    sql("DELETE FROM TABLE INDEXTABLE1 WHERE SEGMENT.ID IN(0,1,2)")
+    sql("REINDEX INDEX TABLE indextable1 ON MAINTABLE WHERE SEGMENT.ID IN (0,1)")

Review comment: For example, if both the main table and SI table segment statuses are SUCCESS, but one of the carbondata or index files in the SI segment is missing or broken, how do you fix that if you cannot manually delete the SI segment?