akashrn5 opened a new pull request #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677

### Why is this PR needed?

### What changes were proposed in this PR?

### Does this PR introduce any user interface change?
- No
- Yes. (please explain the change and update document)

### Is any new testcase added?
- No
- Yes
CarbonDataQA1 commented on issue #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#issuecomment-603170267

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/840/
CarbonDataQA1 commented on issue #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#issuecomment-603170366

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2547/
CarbonDataQA1 commented on issue #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#issuecomment-603255390

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/841/
CarbonDataQA1 commented on issue #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#issuecomment-603288336

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2548/
ajantha-bhat commented on a change in pull request #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#discussion_r397299238

########## File path: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java ##########

@@ -369,6 +379,33 @@ private void modifyColumnSchemaForSortColumn(ColumnSchema columnSchema, boolean
     return tableBlockIndexUniqueIdentifiers;
   }

+  /**
+   * This case is added for a case where, there are two applications running, and in one application
+   * operations happened like SI rebuild, update or delete case, then the cache should be updated as
+   * well. The cache updation happens for same application, but other application may fail to query
+   * or may give wrong result. Since we overwrite the segment file in these scenarios, check the
+   * timestamp, and if modified, clear from the cache.
+   */
+  private void clearSegmentMapIfSegmentUpdated(String latestSegmentFilePath, Segment segment) {
+    SegmentBlockIndexInfo segmentBlockIndexInfo = segmentMap.get(segment.getSegmentNo());

Review comment: For a cleaner approach, use `SegmentRefreshInfo`. For transactional tables only the update scenario uses it today, but it can be used for this normal scenario as well; for non-transactional carbon tables it is already used for this non-update scenario. Refer to `LatestFilesReadCommittedScope#takeCarbonIndexFileSnapShot` to see how it is filled.
akashrn5 commented on a change in pull request #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#discussion_r397305548

########## File path: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java ##########

@@ -369,6 +379,33 @@ private void modifyColumnSchemaForSortColumn(ColumnSchema columnSchema, boolean
     return tableBlockIndexUniqueIdentifiers;
   }

+  /**
+   * This case is added for a case where, there are two applications running, and in one application
+   * operations happened like SI rebuild, update or delete case, then the cache should be updated as
+   * well. The cache updation happens for same application, but other application may fail to query
+   * or may give wrong result. Since we overwrite the segment file in these scenarios, check the
+   * timestamp, and if modified, clear from the cache.
+   */
+  private void clearSegmentMapIfSegmentUpdated(String latestSegmentFilePath, Segment segment) {
+    SegmentBlockIndexInfo segmentBlockIndexInfo = segmentMap.get(segment.getSegmentNo());

Review comment: But in that case we would be listing files, getting all the files inside the segment and doing timestamp comparisons, which I don't think is feasible. Here I can just go by the segment file: checking the timestamp of that single file is feasible, and in the future any update that touches the segment file will also be taken care of.
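The single-file check argued for here could look roughly like the sketch below. It is a hypothetical, self-contained illustration of the technique (compare the segment-file timestamp recorded at cache-load time against the file's current modification time and evict on mismatch), written with plain `java.io.File` and an in-memory map; it is not the actual CarbonData `BlockletDataMapFactory`/`SegmentBlockIndexInfo` code, and all names in it are invented for the example.

```java
import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of timestamp-based invalidation of a per-segment cache.
// Not the actual CarbonData implementation; names and types are simplified.
public class SegmentCacheSketch {

  // segment number -> cached entry carrying the segment-file timestamp seen at load time
  private final Map<String, CachedSegment> segmentMap = new ConcurrentHashMap<>();

  private static final class CachedSegment {
    final long segmentFileTimestamp;

    CachedSegment(long segmentFileTimestamp) {
      this.segmentFileTimestamp = segmentFileTimestamp;
    }
  }

  /**
   * Clears the cached entry for a segment if its segment file was overwritten
   * (for example by an update, delete or SI rebuild in another application).
   */
  void clearIfSegmentUpdated(String segmentNo, String latestSegmentFilePath) {
    CachedSegment cached = segmentMap.get(segmentNo);
    if (cached == null) {
      return; // nothing cached yet, nothing to invalidate
    }
    long onDiskTimestamp = new File(latestSegmentFilePath).lastModified();
    if (onDiskTimestamp > cached.segmentFileTimestamp) {
      // segment file was rewritten by another application: drop the stale entry
      segmentMap.remove(segmentNo);
    }
  }
}
```

The appeal of this variant, as argued above, is that it stats exactly one file per segment instead of listing every index file inside the segment folder.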
ajantha-bhat commented on a change in pull request #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#discussion_r397307717

########## File path: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java ##########

@@ -369,6 +379,33 @@ private void modifyColumnSchemaForSortColumn(ColumnSchema columnSchema, boolean
     return tableBlockIndexUniqueIdentifiers;
   }

+  /**
+   * This case is added for a case where, there are two applications running, and in one application
+   * operations happened like SI rebuild, update or delete case, then the cache should be updated as
+   * well. The cache updation happens for same application, but other application may fail to query
+   * or may give wrong result. Since we overwrite the segment file in these scenarios, check the
+   * timestamp, and if modified, clear from the cache.
+   */
+  private void clearSegmentMapIfSegmentUpdated(String latestSegmentFilePath, Segment segment) {
+    SegmentBlockIndexInfo segmentBlockIndexInfo = segmentMap.get(segment.getSegmentNo());

Review comment: Those changes were done well before the segment file existed; now there is no need to list the index files. You can keep the current changes, but fill that timestamp into `SegmentRefreshInfo`; it will take care of the cache clean-up part.
CarbonDataQA1 commented on issue #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#issuecomment-603391649

Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/842/
CarbonDataQA1 commented on issue #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#issuecomment-603396560

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2549/
akashrn5 commented on a change in pull request #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#discussion_r397343014

########## File path: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java ##########

@@ -369,6 +379,33 @@ private void modifyColumnSchemaForSortColumn(ColumnSchema columnSchema, boolean
     return tableBlockIndexUniqueIdentifiers;
   }

+  /**
+   * This case is added for a case where, there are two applications running, and in one application
+   * operations happened like SI rebuild, update or delete case, then the cache should be updated as
+   * well. The cache updation happens for same application, but other application may fail to query
+   * or may give wrong result. Since we overwrite the segment file in these scenarios, check the
+   * timestamp, and if modified, clear from the cache.
+   */
+  private void clearSegmentMapIfSegmentUpdated(String latestSegmentFilePath, Segment segment) {
+    SegmentBlockIndexInfo segmentBlockIndexInfo = segmentMap.get(segment.getSegmentNo());

Review comment: That was done for non-transactional tables. In the transactional case we have scenarios with millions of files inside a segment, so I think this approach is better with respect to performance, and since in the future all operations will be based on the segment file, it should remain feasible. What do you think?
ajantha-bhat commented on a change in pull request #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#discussion_r398315705

########## File path: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java ##########

@@ -369,6 +379,33 @@ private void modifyColumnSchemaForSortColumn(ColumnSchema columnSchema, boolean
     return tableBlockIndexUniqueIdentifiers;
   }

+  /**
+   * This case is added for a case where, there are two applications running, and in one application
+   * operations happened like SI rebuild, update or delete case, then the cache should be updated as
+   * well. The cache updation happens for same application, but other application may fail to query
+   * or may give wrong result. Since we overwrite the segment file in these scenarios, check the
+   * timestamp, and if modified, clear from the cache.
+   */
+  private void clearSegmentMapIfSegmentUpdated(String latestSegmentFilePath, Segment segment) {
+    SegmentBlockIndexInfo segmentBlockIndexInfo = segmentMap.get(segment.getSegmentNo());

Review comment: As discussed, we can use the `SegmentRefreshInfo` interface in this scenario (no need to list the index files; the info can be obtained from the segment file).
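As a rough illustration of the direction agreed here (record the segment file's timestamp in a refresh-info style holder and let one common check decide when a cached segment must be evicted), a minimal sketch follows. The `SegmentRefreshInfo` below is a hypothetical stand-in for the real CarbonData class of the same name; the map, methods and file handling are invented for the example and only mirror the idea, not the project's actual API.

```java
import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the refresh-info approach discussed above:
// the timestamp comes from the single segment file, not from listing index files.
public class SegmentRefreshSketch {

  static final class SegmentRefreshInfo {
    private final long segmentUpdatedTimestamp;

    SegmentRefreshInfo(long segmentUpdatedTimestamp) {
      this.segmentUpdatedTimestamp = segmentUpdatedTimestamp;
    }

    /** True when the on-disk segment file is newer than what was cached. */
    boolean needsRefresh(long latestTimestamp) {
      return latestTimestamp > segmentUpdatedTimestamp;
    }
  }

  private final Map<String, SegmentRefreshInfo> refreshInfoBySegment = new ConcurrentHashMap<>();

  /** Snapshot the segment-file timestamp when the segment is first cached. */
  void recordSnapshot(String segmentNo, String segmentFilePath) {
    long timestamp = new File(segmentFilePath).lastModified();
    refreshInfoBySegment.put(segmentNo, new SegmentRefreshInfo(timestamp));
  }

  /** Called before a query: returns true if the caller should evict and reload this segment. */
  boolean shouldEvict(String segmentNo, String segmentFilePath) {
    SegmentRefreshInfo info = refreshInfoBySegment.get(segmentNo);
    if (info == null) {
      return false; // never cached, nothing to evict
    }
    return info.needsRefresh(new File(segmentFilePath).lastModified());
  }
}
```

Keeping the comparison behind one holder is the reuse the reviewer is pointing at: the same eviction decision works whether the timestamp comes from the segment file (transactional case) or from an index-file snapshot (non-transactional case).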
akashrn5 commented on a change in pull request #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#discussion_r399658587

########## File path: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java ##########

@@ -369,6 +379,33 @@ private void modifyColumnSchemaForSortColumn(ColumnSchema columnSchema, boolean
     return tableBlockIndexUniqueIdentifiers;
   }

+  /**
+   * This case is added for a case where, there are two applications running, and in one application
+   * operations happened like SI rebuild, update or delete case, then the cache should be updated as
+   * well. The cache updation happens for same application, but other application may fail to query
+   * or may give wrong result. Since we overwrite the segment file in these scenarios, check the
+   * timestamp, and if modified, clear from the cache.
+   */
+  private void clearSegmentMapIfSegmentUpdated(String latestSegmentFilePath, Segment segment) {
+    SegmentBlockIndexInfo segmentBlockIndexInfo = segmentMap.get(segment.getSegmentNo());

Review comment: Please refer to #3686; I have created a new PR.
akashrn5 closed pull request #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677
ajantha-bhat commented on a change in pull request #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#discussion_r399987268

########## File path: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java ##########

@@ -369,6 +379,33 @@ private void modifyColumnSchemaForSortColumn(ColumnSchema columnSchema, boolean
     return tableBlockIndexUniqueIdentifiers;
   }

+  /**
+   * This case is added for a case where, there are two applications running, and in one application
+   * operations happened like SI rebuild, update or delete case, then the cache should be updated as
+   * well. The cache updation happens for same application, but other application may fail to query
+   * or may give wrong result. Since we overwrite the segment file in these scenarios, check the
+   * timestamp, and if modified, clear from the cache.
+   */
+  private void clearSegmentMapIfSegmentUpdated(String latestSegmentFilePath, Segment segment) {
+    SegmentBlockIndexInfo segmentBlockIndexInfo = segmentMap.get(segment.getSegmentNo());

Review comment: It was WIP; it would have been handled in the same PR.
akashrn5 commented on a change in pull request #3677: [wip]Fix segment cache issue with parallel spark applications on same store
URL: https://github.com/apache/carbondata/pull/3677#discussion_r400000598

########## File path: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java ##########

@@ -369,6 +379,33 @@ private void modifyColumnSchemaForSortColumn(ColumnSchema columnSchema, boolean
     return tableBlockIndexUniqueIdentifiers;
   }

+  /**
+   * This case is added for a case where, there are two applications running, and in one application
+   * operations happened like SI rebuild, update or delete case, then the cache should be updated as
+   * well. The cache updation happens for same application, but other application may fail to query
+   * or may give wrong result. Since we overwrite the segment file in these scenarios, check the
+   * timestamp, and if modified, clear from the cache.
+   */
+  private void clearSegmentMapIfSegmentUpdated(String latestSegmentFilePath, Segment segment) {
+    SegmentBlockIndexInfo segmentBlockIndexInfo = segmentMap.get(segment.getSegmentNo());

Review comment: It's ok, no issues.