[GitHub] carbondata pull request #2661: [CARBONDATA-2888] Support multi level subfold...

classic Classic list List threaded Threaded
21 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2661: [CARBONDATA-2888] Support multi level subfold...

qiuchenjian-2
GitHub user ravipesala opened a pull request:

    https://github.com/apache/carbondata/pull/2661

    [CARBONDATA-2888] Support multi level subfolder for SDK read and fileformat read

    This PR supports multi-level subfolders read for SDK Reader and spark's carbon fileformat reader.
   
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
   
     - [ ] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/incubator-carbondata sdk-multi-folder-support

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2661.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2661
   
----
commit a71189330b223a76c0b40ee8abf28ce2fd9733c1
Author: ravipesala <ravi.pesala@...>
Date:   2018-08-26T17:17:14Z

    Support multi level subfolder for SDK read and fileformat read

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2661: [CARBONDATA-2888] Support multi level subfolder for ...

qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2661
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6413/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2661: [CARBONDATA-2888] Support multi level subfolder for ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2661
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8072/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2661: [CARBONDATA-2888] Support multi level subfolder for ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2661
 
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/14/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2661: [CARBONDATA-2888] Support multi level subfold...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2661#discussion_r212897009
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/SegmentIndexFileStore.java ---
    @@ -338,6 +338,24 @@ private MergedBlockIndex readMergeBlockIndex(ThriftReader thriftReader) throws I
         });
       }
     
    +  /**
    +   * List all the index files of the segment.
    +   *
    +   * @param carbonFile directory
    +   */
    +  public static void getCarbonIndexFilesRecursively(CarbonFile carbonFile,
    --- End diff --
   
    Why not return the list instead of passing it as a parameter?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2661: [CARBONDATA-2888] Support multi level subfold...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2661#discussion_r212897508
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/table/CarbonTable.java ---
    @@ -268,6 +257,18 @@ public boolean accept(CarbonFile file) {
         return CarbonTable.buildFromTableInfo(tableInfoInfer);
       }
     
    +  private static CarbonFile getFirstIndexFile(CarbonFile tablePath) {
    +    CarbonFile[] carbonFiles = tablePath.listFiles();
    +    for (CarbonFile carbonFile : carbonFiles) {
    +      if (carbonFile.isDirectory()) {
    +        return getFirstIndexFile(carbonFile);
    +      } else if (carbonFile.getName().endsWith(CarbonTablePath.INDEX_FILE_EXT)) {
    +        return carbonFile;
    +      }
    +    }
    +    return null;
    --- End diff --
   
    why not just throw exception if the file does not exist?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2661: [CARBONDATA-2888] Support multi level subfold...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2661#discussion_r213570439
 
    --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonFileIndexReplaceRule.scala ---
    @@ -82,4 +82,23 @@ class CarbonFileIndexReplaceRule extends Rule[LogicalPlan] {
           fileIndex
         }
       }
    +
    +  /**
    +   * Get datafolders recursively
    +   */
    +  private def getDataFolders(carbonFile: CarbonFile): Seq[CarbonFile] = {
    --- End diff --
   
    Is the input CarbonFile a table folder? If so, please rename tableFolder


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2661: [CARBONDATA-2888] Support multi level subfold...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2661#discussion_r213574978
 
    --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonFileIndexReplaceRule.scala ---
    @@ -82,4 +82,23 @@ class CarbonFileIndexReplaceRule extends Rule[LogicalPlan] {
           fileIndex
         }
       }
    +
    +  /**
    +   * Get datafolders recursively
    +   */
    +  private def getDataFolders(carbonFile: CarbonFile): Seq[CarbonFile] = {
    +    val files = carbonFile.listFiles()
    +    var folders: Seq[CarbonFile] = Seq()
    +    files.foreach { f =>
    +      if (f.isDirectory) {
    +        val files = f.listFiles()
    +        if (files.nonEmpty && !files(0).isDirectory) {
    +          folders = Seq(f) ++ folders
    --- End diff --
   
    Will this create too many immutable object? Would ArrayBuffer better?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2661: [CARBONDATA-2888] Support multi level subfold...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user KanakaKumar commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2661#discussion_r214273342
 
    --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonFileIndexReplaceRule.scala ---
    @@ -82,4 +82,23 @@ class CarbonFileIndexReplaceRule extends Rule[LogicalPlan] {
           fileIndex
         }
       }
    +
    +  /**
    +   * Get datafolders recursively
    +   */
    +  private def getDataFolders(carbonFile: CarbonFile): Seq[CarbonFile] = {
    +    val files = carbonFile.listFiles()
    +    var folders: Seq[CarbonFile] = Seq()
    +    files.foreach { f =>
    +      if (f.isDirectory) {
    +        val files = f.listFiles()
    +        if (files.nonEmpty && !files(0).isDirectory) {
    +          folders = Seq(f) ++ folders
    +        } else {
    +          folders = getDataFolders(f) ++ folders
    --- End diff --
   
    This statement can be moved under files.nonEmpty check


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2661: [CARBONDATA-2888] Support multi level subfold...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2661#discussion_r214737914
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/SegmentIndexFileStore.java ---
    @@ -338,6 +338,24 @@ private MergedBlockIndex readMergeBlockIndex(ThriftReader thriftReader) throws I
         });
       }
     
    +  /**
    +   * List all the index files of the segment.
    +   *
    +   * @param carbonFile directory
    +   */
    +  public static void getCarbonIndexFilesRecursively(CarbonFile carbonFile,
    --- End diff --
   
    Its a recursive call, and it adds file to the passed list if it finds any.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2661: [CARBONDATA-2888] Support multi level subfold...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2661#discussion_r214737987
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/table/CarbonTable.java ---
    @@ -268,6 +257,18 @@ public boolean accept(CarbonFile file) {
         return CarbonTable.buildFromTableInfo(tableInfoInfer);
       }
     
    +  private static CarbonFile getFirstIndexFile(CarbonFile tablePath) {
    +    CarbonFile[] carbonFiles = tablePath.listFiles();
    +    for (CarbonFile carbonFile : carbonFiles) {
    +      if (carbonFile.isDirectory()) {
    +        return getFirstIndexFile(carbonFile);
    +      } else if (carbonFile.getName().endsWith(CarbonTablePath.INDEX_FILE_EXT)) {
    +        return carbonFile;
    +      }
    +    }
    +    return null;
    --- End diff --
   
    It is handled in caller method


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2661: [CARBONDATA-2888] Support multi level subfold...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2661#discussion_r214738056
 
    --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonFileIndexReplaceRule.scala ---
    @@ -82,4 +82,23 @@ class CarbonFileIndexReplaceRule extends Rule[LogicalPlan] {
           fileIndex
         }
       }
    +
    +  /**
    +   * Get datafolders recursively
    +   */
    +  private def getDataFolders(carbonFile: CarbonFile): Seq[CarbonFile] = {
    --- End diff --
   
    ok


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2661: [CARBONDATA-2888] Support multi level subfold...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2661#discussion_r214738731
 
    --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonFileIndexReplaceRule.scala ---
    @@ -82,4 +82,23 @@ class CarbonFileIndexReplaceRule extends Rule[LogicalPlan] {
           fileIndex
         }
       }
    +
    +  /**
    +   * Get datafolders recursively
    +   */
    +  private def getDataFolders(carbonFile: CarbonFile): Seq[CarbonFile] = {
    +    val files = carbonFile.listFiles()
    +    var folders: Seq[CarbonFile] = Seq()
    +    files.foreach { f =>
    +      if (f.isDirectory) {
    +        val files = f.listFiles()
    +        if (files.nonEmpty && !files(0).isDirectory) {
    +          folders = Seq(f) ++ folders
    --- End diff --
   
    ok


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2661: [CARBONDATA-2888] Support multi level subfolder for ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2661
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8273/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2661: [CARBONDATA-2888] Support multi level subfolder for ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2661
 
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/202/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2661: [CARBONDATA-2888] Support multi level subfolder for ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2661
 
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/207/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2661: [CARBONDATA-2888] Support multi level subfolder for ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2661
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8278/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2661: [CARBONDATA-2888] Support multi level subfold...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user KanakaKumar commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2661#discussion_r214899510
 
    --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonFileIndexReplaceRule.scala ---
    @@ -82,4 +82,23 @@ class CarbonFileIndexReplaceRule extends Rule[LogicalPlan] {
           fileIndex
         }
       }
    +
    +  /**
    +   * Get datafolders recursively
    +   */
    +  private def getDataFolders(carbonFile: CarbonFile): Seq[CarbonFile] = {
    +    val files = carbonFile.listFiles()
    +    var folders: Seq[CarbonFile] = Seq()
    +    files.foreach { f =>
    +      if (f.isDirectory) {
    +        val files = f.listFiles()
    +        if (files.nonEmpty && !files(0).isDirectory) {
    +          folders = Seq(f) ++ folders
    +        } else {
    +          folders = getDataFolders(f) ++ folders
    --- End diff --
   
    In the next call to getDataFolders() for empty folder it skips the iteration. So, no problem with the logic. you can ignore the comment.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2661: [CARBONDATA-2888] Support multi level subfolder for ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user KanakaKumar commented on the issue:

    https://github.com/apache/carbondata/pull/2661
 
    LGTM


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2661: [CARBONDATA-2888] Support multi level subfolder for ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user jackylk commented on the issue:

    https://github.com/apache/carbondata/pull/2661
 
    LGTM


---
12