[GitHub] [carbondata] Zhangshunyu opened a new pull request #3600: insert stages list files using iterator

classic Classic list List threaded Threaded
29 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu opened a new pull request #3600: insert stages list files using iterator

GitBox
Zhangshunyu opened a new pull request #3600: insert stages list files using iterator
URL: https://github.com/apache/carbondata/pull/3600
 
 
   Author:    Zhangshunyu <[hidden email]>
   Date:      Mon Feb 3 14:49:39 2020 +0800
   
    ### Why is this PR needed?
    To optimize the insert stage files using iterator which can avoid listing all files under dir in case file_count is set in command, especially for s3/obs cases and this can reduce the time cost of driver.
   
    ### What changes were proposed in this PR?
   Use iterator to get limited num of stage files.
       
    ### Does this PR introduce any user interface change?
    - No
    - Yes. (please explain the change and update document)
   
    ### Is any new testcase added?
    - No
    - Yes
   
       
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3600: insert stages list files using iterator

GitBox
CarbonDataQA1 commented on issue #3600: insert stages list files using iterator
URL: https://github.com/apache/carbondata/pull/3600#issuecomment-581470314
 
 
   Build Failed  with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/134/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3600: insert stages list files using iterator

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3600: insert stages list files using iterator
URL: https://github.com/apache/carbondata/pull/3600#issuecomment-581484751
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1838/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r374674044
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/CarbonFile.java
 ##########
 @@ -33,6 +33,9 @@
 
   CarbonFile[] listFiles();
 
+  CarbonFile[] listCarbonFilesByCountIterator(boolean recursive, String path, int fileCount)
 
 Review comment:
   CarbonFile already has a 'path', what is the parameter 'path'? It is confusing if there is no function description in comment

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r374674429
 
 

 ##########
 File path: integration/flink/src/main/java/org/apache/carbon/flink/CarbonLocalWriter.java
 ##########
 @@ -168,7 +168,7 @@ public void commit() throws IOException {
       try {
         String stageInputPath = CarbonTablePath.getStageDir(
             table.getAbsoluteTableIdentifier().getTablePath()) +
-            CarbonCommonConstants.FILE_SEPARATOR + UUID.randomUUID();
+            CarbonCommonConstants.FILE_SEPARATOR + System.currentTimeMillis() + UUID.randomUUID();
 
 Review comment:
   What is this required?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r374674429
 
 

 ##########
 File path: integration/flink/src/main/java/org/apache/carbon/flink/CarbonLocalWriter.java
 ##########
 @@ -168,7 +168,7 @@ public void commit() throws IOException {
       try {
         String stageInputPath = CarbonTablePath.getStageDir(
             table.getAbsoluteTableIdentifier().getTablePath()) +
-            CarbonCommonConstants.FILE_SEPARATOR + UUID.randomUUID();
+            CarbonCommonConstants.FILE_SEPARATOR + System.currentTimeMillis() + UUID.randomUUID();
 
 Review comment:
   What is this required? Please add in comment

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375024487
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/CarbonFile.java
 ##########
 @@ -33,6 +33,9 @@
 
   CarbonFile[] listFiles();
 
+  CarbonFile[] listCarbonFilesByCountIterator(boolean recursive, String path, int fileCount)
 
 Review comment:
   @jackylk handled

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375024561
 
 

 ##########
 File path: integration/flink/src/main/java/org/apache/carbon/flink/CarbonLocalWriter.java
 ##########
 @@ -168,7 +168,7 @@ public void commit() throws IOException {
       try {
         String stageInputPath = CarbonTablePath.getStageDir(
             table.getAbsoluteTableIdentifier().getTablePath()) +
-            CarbonCommonConstants.FILE_SEPARATOR + UUID.randomUUID();
+            CarbonCommonConstants.FILE_SEPARATOR + System.currentTimeMillis() + UUID.randomUUID();
 
 Review comment:
   @jackylk added

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#issuecomment-582211222
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/151/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#issuecomment-582222893
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1854/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375332249
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/CarbonFile.java
 ##########
 @@ -33,6 +33,8 @@
 
   CarbonFile[] listFiles();
 
+  CarbonFile[] listCarbonFilesByCountIterator(boolean recursive, int fileCount) throws IOException;
 
 Review comment:
   ```suggestion
     CarbonFile[] listFiles(boolean recursive, int maxCount) throws IOException;
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375333043
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java
 ##########
 @@ -513,6 +513,24 @@ public boolean createNewLockFile() throws IOException {
     return carbonFiles;
   }
 
+  @Override
+  public CarbonFile[] listCarbonFilesByCountIterator(boolean recursive, int fileCount)
+      throws IOException {
+    List<CarbonFile> carbonFiles = new ArrayList<>();
+    FileStatus fileStatus = fileSystem.getFileStatus(path);
+    if (null != fileStatus && fileStatus.isDirectory()) {
+      RemoteIterator<LocatedFileStatus> listStatus = fileSystem.listFiles(path, recursive);
+      int i = 0;
 
 Review comment:
   ```suggestion
         int counter = 0;
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375333489
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java
 ##########
 @@ -513,6 +513,24 @@ public boolean createNewLockFile() throws IOException {
     return carbonFiles;
   }
 
+  @Override
+  public CarbonFile[] listCarbonFilesByCountIterator(boolean recursive, int fileCount)
+      throws IOException {
+    List<CarbonFile> carbonFiles = new ArrayList<>();
+    FileStatus fileStatus = fileSystem.getFileStatus(path);
+    if (null != fileStatus && fileStatus.isDirectory()) {
+      RemoteIterator<LocatedFileStatus> listStatus = fileSystem.listFiles(path, recursive);
+      int i = 0;
+      while (i < fileCount && listStatus.hasNext()) {
+        LocatedFileStatus locatedFileStatus = listStatus.next();
+        CarbonFile carbonFile = FileFactory.getCarbonFile(locatedFileStatus.getPath().toString());
+        carbonFiles.add(carbonFile);
+        i++;
+      }
+    }
+    return carbonFiles.toArray(new CarbonFile[carbonFiles.size()]);
 
 Review comment:
   ```suggestion
       return carbonFiles.toArray(new CarbonFile[0]);
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375333489
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java
 ##########
 @@ -513,6 +513,24 @@ public boolean createNewLockFile() throws IOException {
     return carbonFiles;
   }
 
+  @Override
+  public CarbonFile[] listCarbonFilesByCountIterator(boolean recursive, int fileCount)
+      throws IOException {
+    List<CarbonFile> carbonFiles = new ArrayList<>();
+    FileStatus fileStatus = fileSystem.getFileStatus(path);
+    if (null != fileStatus && fileStatus.isDirectory()) {
+      RemoteIterator<LocatedFileStatus> listStatus = fileSystem.listFiles(path, recursive);
+      int i = 0;
+      while (i < fileCount && listStatus.hasNext()) {
+        LocatedFileStatus locatedFileStatus = listStatus.next();
+        CarbonFile carbonFile = FileFactory.getCarbonFile(locatedFileStatus.getPath().toString());
+        carbonFiles.add(carbonFile);
+        i++;
+      }
+    }
+    return carbonFiles.toArray(new CarbonFile[carbonFiles.size()]);
 
 Review comment:
   ```suggestion
       return carbonFiles.toArray(new CarbonFile[0]);
   ```
   
   please check https://stackoverflow.com/questions/174093/toarraynew-myclass0-or-toarraynew-myclassmylist-size

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#issuecomment-582690755
 
 
   Build Failed  with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/162/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] niuge01 commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
niuge01 commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375597005
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java
 ##########
 @@ -513,6 +513,24 @@ public boolean createNewLockFile() throws IOException {
     return carbonFiles;
   }
 
+  @Override
+  public CarbonFile[] listFiles(boolean recursive, int maxCount)
+      throws IOException {
+    List<CarbonFile> carbonFiles = new ArrayList<>();
+    FileStatus fileStatus = fileSystem.getFileStatus(path);
+    if (null != fileStatus && fileStatus.isDirectory()) {
+      RemoteIterator<LocatedFileStatus> listStatus = fileSystem.listFiles(path, recursive);
+      int counter = 0;
+      while (counter < maxCount && listStatus.hasNext()) {
+        LocatedFileStatus locatedFileStatus = listStatus.next();
+        CarbonFile carbonFile = FileFactory.getCarbonFile(locatedFileStatus.getPath().toString());
+        carbonFiles.add(carbonFile);
+        counter++;
+      }
+    }
+    return carbonFiles.toArray(new CarbonFile[0]);
 
 Review comment:
    Use a constant to replace new CarbonFile[0]

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] niuge01 commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
niuge01 commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375597232
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/LocalCarbonFile.java
 ##########
 @@ -162,6 +163,20 @@ public boolean delete() {
     return carbonFiles;
   }
 
+  @Override
+  public CarbonFile[] listFiles(boolean recursive, int maxCount)
+      throws IOException {
+    List<CarbonFile> carbonFiles = new ArrayList<>();
+    int counter = 0;
+    Iterator it = FileUtils.iterateFiles(file, null, recursive);
+    while (it.hasNext() && counter < maxCount) {
+      CarbonFile carbonFile = new LocalCarbonFile((File) it.next());
+      carbonFiles.add(carbonFile);
+      counter++;
+    }
+    return carbonFiles.toArray(new CarbonFile[0]);
 
 Review comment:
   Use a constant to replace new CarbonFile[0]

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#issuecomment-582707435
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1865/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375632842
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/LocalCarbonFile.java
 ##########
 @@ -162,6 +163,20 @@ public boolean delete() {
     return carbonFiles;
   }
 
+  @Override
+  public CarbonFile[] listFiles(boolean recursive, int maxCount)
+      throws IOException {
+    List<CarbonFile> carbonFiles = new ArrayList<>();
+    int counter = 0;
+    Iterator it = FileUtils.iterateFiles(file, null, recursive);
+    while (it.hasNext() && counter < maxCount) {
+      CarbonFile carbonFile = new LocalCarbonFile((File) it.next());
+      carbonFiles.add(carbonFile);
+      counter++;
+    }
+    return carbonFiles.toArray(new CarbonFile[0]);
 
 Review comment:
   From java 8 on, use `CarbonFile[0]` is better, please check please check https://stackoverflow.com/questions/174093/toarraynew-myclass0-or-toarraynew-myclassmylist-size

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] kunal642 commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage

GitBox
In reply to this post by GitBox
kunal642 commented on a change in pull request #3600: [CARBONDATA-3678] optimize list files in insert stage
URL: https://github.com/apache/carbondata/pull/3600#discussion_r375653673
 
 

 ##########
 File path: integration/flink/src/main/java/org/apache/carbon/flink/CarbonLocalWriter.java
 ##########
 @@ -168,7 +168,7 @@ public void commit() throws IOException {
       try {
         String stageInputPath = CarbonTablePath.getStageDir(
             table.getAbsoluteTableIdentifier().getTablePath()) +
-            CarbonCommonConstants.FILE_SEPARATOR + UUID.randomUUID();
+            CarbonCommonConstants.FILE_SEPARATOR + System.currentTimeMillis() + UUID.randomUUID();
 
 Review comment:
   @Zhangshunyu Can you explain why you need to make the file "ordered by time"

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
12