Zhangshunyu opened a new pull request #3600: insert stages list files using iterator
URL: https://github.com/apache/carbondata/pull/3600 Author: Zhangshunyu <[hidden email]> Date: Mon Feb 3 14:49:39 2020 +0800 ### Why is this PR needed? To optimize the insert stage files using iterator which can avoid listing all files under dir in case file_count is set in command, especially for s3/obs cases and this can reduce the time cost of driver. ### What changes were proposed in this PR? Use iterator to get limited num of stage files. ### Does this PR introduce any user interface change? - No - Yes. (please explain the change and update document) ### Is any new testcase added? - No - Yes ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
CarbonDataQA1 commented on issue #3600: insert stages list files using iterator
URL: https://github.com/apache/carbondata/pull/3600#issuecomment-581470314 Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/134/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3600: insert stages list files using iterator
URL: https://github.com/apache/carbondata/pull/3600#issuecomment-581484751 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1838/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r374674044 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/CarbonFile.java ########## @@ -33,6 +33,9 @@ CarbonFile[] listFiles(); + CarbonFile[] listCarbonFilesByCountIterator(boolean recursive, String path, int fileCount) Review comment: CarbonFile already has a 'path', what is the parameter 'path'? It is confusing if there is no function description in comment ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r374674429 ########## File path: integration/flink/src/main/java/org/apache/carbon/flink/CarbonLocalWriter.java ########## @@ -168,7 +168,7 @@ public void commit() throws IOException { try { String stageInputPath = CarbonTablePath.getStageDir( table.getAbsoluteTableIdentifier().getTablePath()) + - CarbonCommonConstants.FILE_SEPARATOR + UUID.randomUUID(); + CarbonCommonConstants.FILE_SEPARATOR + System.currentTimeMillis() + UUID.randomUUID(); Review comment: What is this required? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r374674429 ########## File path: integration/flink/src/main/java/org/apache/carbon/flink/CarbonLocalWriter.java ########## @@ -168,7 +168,7 @@ public void commit() throws IOException { try { String stageInputPath = CarbonTablePath.getStageDir( table.getAbsoluteTableIdentifier().getTablePath()) + - CarbonCommonConstants.FILE_SEPARATOR + UUID.randomUUID(); + CarbonCommonConstants.FILE_SEPARATOR + System.currentTimeMillis() + UUID.randomUUID(); Review comment: What is this required? Please add in comment ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375024487 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/CarbonFile.java ########## @@ -33,6 +33,9 @@ CarbonFile[] listFiles(); + CarbonFile[] listCarbonFilesByCountIterator(boolean recursive, String path, int fileCount) Review comment: @jackylk handled ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375024561 ########## File path: integration/flink/src/main/java/org/apache/carbon/flink/CarbonLocalWriter.java ########## @@ -168,7 +168,7 @@ public void commit() throws IOException { try { String stageInputPath = CarbonTablePath.getStageDir( table.getAbsoluteTableIdentifier().getTablePath()) + - CarbonCommonConstants.FILE_SEPARATOR + UUID.randomUUID(); + CarbonCommonConstants.FILE_SEPARATOR + System.currentTimeMillis() + UUID.randomUUID(); Review comment: @jackylk added ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#issuecomment-582211222 Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/151/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#issuecomment-582222893 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1854/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375332249 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/CarbonFile.java ########## @@ -33,6 +33,8 @@ CarbonFile[] listFiles(); + CarbonFile[] listCarbonFilesByCountIterator(boolean recursive, int fileCount) throws IOException; Review comment: ```suggestion CarbonFile[] listFiles(boolean recursive, int maxCount) throws IOException; ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375333043 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java ########## @@ -513,6 +513,24 @@ public boolean createNewLockFile() throws IOException { return carbonFiles; } + @Override + public CarbonFile[] listCarbonFilesByCountIterator(boolean recursive, int fileCount) + throws IOException { + List<CarbonFile> carbonFiles = new ArrayList<>(); + FileStatus fileStatus = fileSystem.getFileStatus(path); + if (null != fileStatus && fileStatus.isDirectory()) { + RemoteIterator<LocatedFileStatus> listStatus = fileSystem.listFiles(path, recursive); + int i = 0; Review comment: ```suggestion int counter = 0; ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375333489 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java ########## @@ -513,6 +513,24 @@ public boolean createNewLockFile() throws IOException { return carbonFiles; } + @Override + public CarbonFile[] listCarbonFilesByCountIterator(boolean recursive, int fileCount) + throws IOException { + List<CarbonFile> carbonFiles = new ArrayList<>(); + FileStatus fileStatus = fileSystem.getFileStatus(path); + if (null != fileStatus && fileStatus.isDirectory()) { + RemoteIterator<LocatedFileStatus> listStatus = fileSystem.listFiles(path, recursive); + int i = 0; + while (i < fileCount && listStatus.hasNext()) { + LocatedFileStatus locatedFileStatus = listStatus.next(); + CarbonFile carbonFile = FileFactory.getCarbonFile(locatedFileStatus.getPath().toString()); + carbonFiles.add(carbonFile); + i++; + } + } + return carbonFiles.toArray(new CarbonFile[carbonFiles.size()]); Review comment: ```suggestion return carbonFiles.toArray(new CarbonFile[0]); ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375333489 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java ########## @@ -513,6 +513,24 @@ public boolean createNewLockFile() throws IOException { return carbonFiles; } + @Override + public CarbonFile[] listCarbonFilesByCountIterator(boolean recursive, int fileCount) + throws IOException { + List<CarbonFile> carbonFiles = new ArrayList<>(); + FileStatus fileStatus = fileSystem.getFileStatus(path); + if (null != fileStatus && fileStatus.isDirectory()) { + RemoteIterator<LocatedFileStatus> listStatus = fileSystem.listFiles(path, recursive); + int i = 0; + while (i < fileCount && listStatus.hasNext()) { + LocatedFileStatus locatedFileStatus = listStatus.next(); + CarbonFile carbonFile = FileFactory.getCarbonFile(locatedFileStatus.getPath().toString()); + carbonFiles.add(carbonFile); + i++; + } + } + return carbonFiles.toArray(new CarbonFile[carbonFiles.size()]); Review comment: ```suggestion return carbonFiles.toArray(new CarbonFile[0]); ``` please check https://stackoverflow.com/questions/174093/toarraynew-myclass0-or-toarraynew-myclassmylist-size ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#issuecomment-582690755 Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/162/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
niuge01 commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375597005 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java ########## @@ -513,6 +513,24 @@ public boolean createNewLockFile() throws IOException { return carbonFiles; } + @Override + public CarbonFile[] listFiles(boolean recursive, int maxCount) + throws IOException { + List<CarbonFile> carbonFiles = new ArrayList<>(); + FileStatus fileStatus = fileSystem.getFileStatus(path); + if (null != fileStatus && fileStatus.isDirectory()) { + RemoteIterator<LocatedFileStatus> listStatus = fileSystem.listFiles(path, recursive); + int counter = 0; + while (counter < maxCount && listStatus.hasNext()) { + LocatedFileStatus locatedFileStatus = listStatus.next(); + CarbonFile carbonFile = FileFactory.getCarbonFile(locatedFileStatus.getPath().toString()); + carbonFiles.add(carbonFile); + counter++; + } + } + return carbonFiles.toArray(new CarbonFile[0]); Review comment: Use a constant to replace new CarbonFile[0] ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
niuge01 commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375597232 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/LocalCarbonFile.java ########## @@ -162,6 +163,20 @@ public boolean delete() { return carbonFiles; } + @Override + public CarbonFile[] listFiles(boolean recursive, int maxCount) + throws IOException { + List<CarbonFile> carbonFiles = new ArrayList<>(); + int counter = 0; + Iterator it = FileUtils.iterateFiles(file, null, recursive); + while (it.hasNext() && counter < maxCount) { + CarbonFile carbonFile = new LocalCarbonFile((File) it.next()); + carbonFiles.add(carbonFile); + counter++; + } + return carbonFiles.toArray(new CarbonFile[0]); Review comment: Use a constant to replace new CarbonFile[0] ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#issuecomment-582707435 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1865/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375632842 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/LocalCarbonFile.java ########## @@ -162,6 +163,20 @@ public boolean delete() { return carbonFiles; } + @Override + public CarbonFile[] listFiles(boolean recursive, int maxCount) + throws IOException { + List<CarbonFile> carbonFiles = new ArrayList<>(); + int counter = 0; + Iterator it = FileUtils.iterateFiles(file, null, recursive); + while (it.hasNext() && counter < maxCount) { + CarbonFile carbonFile = new LocalCarbonFile((File) it.next()); + carbonFiles.add(carbonFile); + counter++; + } + return carbonFiles.toArray(new CarbonFile[0]); Review comment: From java 8 on, use `CarbonFile[0]` is better, please check please check https://stackoverflow.com/questions/174093/toarraynew-myclass0-or-toarraynew-myclassmylist-size ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
kunal642 commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375653673 ########## File path: integration/flink/src/main/java/org/apache/carbon/flink/CarbonLocalWriter.java ########## @@ -168,7 +168,7 @@ public void commit() throws IOException { try { String stageInputPath = CarbonTablePath.getStageDir( table.getAbsoluteTableIdentifier().getTablePath()) + - CarbonCommonConstants.FILE_SEPARATOR + UUID.randomUUID(); + CarbonCommonConstants.FILE_SEPARATOR + System.currentTimeMillis() + UUID.randomUUID(); Review comment: @Zhangshunyu Can you explain why you need to make the file "ordered by time" ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
Free forum by Nabble | Edit this page |