marchpure opened a new pull request #3621: [HOTFIX] Support both listfile() and listfile(maxCount) in InsertStag…
URL: https://github.com/apache/carbondata/pull/3621 …e flow ### Why is this PR needed? Flink writes files to obs with setting the filename in order of time, when loading stages files to carbondata, files are list and loaded with the parameter "maxcount" which will return "maxcount" files with smaller filename, in the other words, "maxcount" files with earliest generation time. but in same senario, it is possible that the filename of stage files is not in order of time, a switch is used here to judge whether to list files with specify batch size, or just list all files in the stage dir. ### What changes were proposed in this PR? A swith "CARBON_STAGE_FILENAME_IS_IN_ORDER_OF_TIME" is added. ### Does this PR introduce any user interface change? - No ### Is any new testcase added? - No ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
CarbonDataQA1 commented on issue #3621: [HOTFIX] Support both listfile() and listfile(maxCount) in InsertStag…
URL: https://github.com/apache/carbondata/pull/3621#issuecomment-586395416 Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/296/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3621: [HOTFIX] Support both listfile() and listfile(maxCount) in InsertStag…
URL: https://github.com/apache/carbondata/pull/3621#issuecomment-586432483 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2000/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3621: [HOTFIX] Support both listfile() and listfile(maxCount) in InsertStag…
URL: https://github.com/apache/carbondata/pull/3621#discussion_r379985373 ########## File path: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonInsertFromStageCommand.scala ########## @@ -502,13 +503,23 @@ case class CarbonInsertFromStageCommand( ): Array[(CarbonFile, CarbonFile)] = { val dir = FileFactory.getCarbonFile(loadDetailsDir, hadoopConf) if (dir.exists()) { - // Only HDFS/OBS/S3 server side can guarantee the files got from iterator are sorted - // based on file name so that we can use iterator to get the A and A.success together - // without loop all files which can improve performance compared with list all files. - // One file and another with '.success', so we need *2 as total and this value is just - // an approximate value. For local files, as can it can we not guarantee the order, we - // just list all. - val allFiles = dir.listFiles(false, batchSize * 2) + // It is possible that the filename of stage files is not in order of time, Review comment: Can you explain in which scenario it won't be ordered of time ? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3621: [HOTFIX] Support both listfile() and listfile(maxCount) in InsertStag…
URL: https://github.com/apache/carbondata/pull/3621#discussion_r379985673 ########## File path: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonInsertFromStageCommand.scala ########## @@ -502,13 +503,23 @@ case class CarbonInsertFromStageCommand( ): Array[(CarbonFile, CarbonFile)] = { val dir = FileFactory.getCarbonFile(loadDetailsDir, hadoopConf) if (dir.exists()) { - // Only HDFS/OBS/S3 server side can guarantee the files got from iterator are sorted - // based on file name so that we can use iterator to get the A and A.success together - // without loop all files which can improve performance compared with list all files. - // One file and another with '.success', so we need *2 as total and this value is just - // an approximate value. For local files, as can it can we not guarantee the order, we - // just list all. - val allFiles = dir.listFiles(false, batchSize * 2) + // It is possible that the filename of stage files is not in order of time, + // A switch is used here to judge whether to list files with specify batch size + val CARBON_STAGE_FILENAME_IS_IN_ORDER_OF_TIME = CarbonProperties.getInstance().getProperty( Review comment: Also need validation, toBoolean may fail if user sets a non-boolean value say 123 by mistake. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3621: [HOTFIX] Support both listfile() and listfile(maxCount) in InsertStag…
URL: https://github.com/apache/carbondata/pull/3621#discussion_r379985844 ########## File path: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonInsertFromStageCommand.scala ########## @@ -502,13 +503,23 @@ case class CarbonInsertFromStageCommand( ): Array[(CarbonFile, CarbonFile)] = { val dir = FileFactory.getCarbonFile(loadDetailsDir, hadoopConf) if (dir.exists()) { - // Only HDFS/OBS/S3 server side can guarantee the files got from iterator are sorted - // based on file name so that we can use iterator to get the A and A.success together - // without loop all files which can improve performance compared with list all files. - // One file and another with '.success', so we need *2 as total and this value is just - // an approximate value. For local files, as can it can we not guarantee the order, we - // just list all. - val allFiles = dir.listFiles(false, batchSize * 2) + // It is possible that the filename of stage files is not in order of time, + // A switch is used here to judge whether to list files with specify batch size + val CARBON_STAGE_FILENAME_IS_IN_ORDER_OF_TIME = CarbonProperties.getInstance().getProperty( Review comment: change variable name to `isSortedFileNames` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3621: [HOTFIX] Support both listfile() and listfile(maxCount) in InsertStag…
URL: https://github.com/apache/carbondata/pull/3621#discussion_r379986679 ########## File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ########## @@ -1593,6 +1593,14 @@ private CarbonCommonConstants() { public static final String SUPPORT_DIRECT_QUERY_ON_DATAMAP_DEFAULTVALUE = "false"; + // Whether the filename of stage files is in order of time + @CarbonProperty Review comment: Need to update document also for any new carbon property added ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3621: [HOTFIX] Support both listfile() and listfile(maxCount) in InsertStag…
URL: https://github.com/apache/carbondata/pull/3621#discussion_r380417112 ########## File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ########## @@ -1593,6 +1593,14 @@ private CarbonCommonConstants() { public static final String SUPPORT_DIRECT_QUERY_ON_DATAMAP_DEFAULTVALUE = "false"; + // Whether the filename of stage files is in order of time + @CarbonProperty Review comment: make it dynamic configurable ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
kunal642 commented on issue #3621: [HOTFIX] Support both listfile() and listfile(maxCount) in InsertStag…
URL: https://github.com/apache/carbondata/pull/3621#issuecomment-605785576 @marchpure Please rebase and fix the comments ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
Free forum by Nabble | Edit this page |