[GitHub] carbondata pull request #2345: [wip] Improve Carbon Reader Schema reading pe...

classic Classic list List threaded Threaded
66 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2345: [wip] Improve Carbon Reader Schema reading pe...

qiuchenjian-2
GitHub user ajantha-bhat opened a pull request:

    https://github.com/apache/carbondata/pull/2345

    [wip] Improve Carbon Reader Schema reading performance on S3

    Problem : Currently carbon reader is reading schema from carbondata file. On s3 multiple IO happens as buffer size is small and data file size is big.
   
    Solution: Read schema from index file and do once IO of index file with a buffer size equal to index file size.
   
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed? NA
     
     - [ ] Any backward compatibility impacted? NA
     
     - [ ] Document update required? NA
   
     - [ ] Testing done
            Added UT
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.  NA
   


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajantha-bhat/carbondata master_new

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2345.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2345
   
----
commit 59b248bbef97010dc2f5dc697400bb2f85799425
Author: ajantha-bhat <ajanthabhat@...>
Date:   2018-05-27T17:19:23Z

    Improve Carbon Reader Schema reading on S3

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2345: [WIP] Improve Carbon Reader Schema reading performan...

qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2345
 
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5106/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2345: [WIP] Improve Carbon Reader Schema reading performan...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2345
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6113/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2345: [WIP] Improve Carbon Reader Schema reading performan...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2345
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4951/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xubo245 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2345#discussion_r191100739
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
    @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
         return reader.readSchema();
       }
     
    +  /**
    +   * Read carbonindex file and return the schema
    +   * @param indexFilePath complete path including index file name
    +   * @return null, if the index file is not present in the path.
    --- End diff --
   
    Can you optimize the expression? and can you add test case for returning null if the index file is not present in the path?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xubo245 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2345#discussion_r191100925
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
    @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
         return reader.readSchema();
       }
     
    +  /**
    +   * Read carbonindex file and return the schema
    --- End diff --
   
    Can you add a empty line?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xubo245 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2345#discussion_r191101114
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
    @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
         return reader.readSchema();
       }
     
    +  /**
    +   * Read carbonindex file and return the schema
    +   * @param indexFilePath complete path including index file name
    +   * @return null, if the index file is not present in the path.
    +   *         List<ColumnSchema> from the index file.
    +   * @throws IOException
    +   */
    +  public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException {
    --- End diff --
   
     Can you add test case for non  Transactional Table for this API?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xubo245 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2345#discussion_r191103497
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
    @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
         return reader.readSchema();
       }
     
    +  /**
    +   * Read carbonindex file and return the schema
    +   * @param indexFilePath complete path including index file name
    +   * @return null, if the index file is not present in the path.
    +   *         List<ColumnSchema> from the index file.
    +   * @throws IOException
    +   */
    +  public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException {
    +    CarbonFile indexFile =
    +        FileFactory.getCarbonFile(indexFilePath, FileFactory.getFileType(indexFilePath));
    +    if (!indexFile.getName().endsWith(CarbonTablePath.INDEX_FILE_EXT)) {
    +      throw new IOException("Not an index file name");
    +    }
    +    // read schema from the first index file
    +    DataInputStream dataInputStream =
    +        FileFactory.getDataInputStream(indexFilePath, FileFactory.getFileType(indexFilePath));
    +    byte[] bytes = new byte[(int) indexFile.getSize()];
    +    try {
    +      //get the file in byte buffer
    +      dataInputStream.readFully(bytes);
    +      CarbonIndexFileReader indexReader = new CarbonIndexFileReader();
    +      // read from byte buffer.
    +      indexReader.openThriftReader(bytes);
    +      // get the index header
    +      org.apache.carbondata.format.IndexHeader readIndexHeader = indexReader.readIndexHeader();
    +      List<ColumnSchema> columnSchemaList = new ArrayList<ColumnSchema>();
    +      List<org.apache.carbondata.format.ColumnSchema> table_columns =
    +          readIndexHeader.getTable_columns();
    +      for (org.apache.carbondata.format.ColumnSchema columnSchema: table_columns) {
    --- End diff --
   
    small code style: should add space before ':',like columnSchema : table_columns


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2345#discussion_r191110093
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
    @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
         return reader.readSchema();
       }
     
    +  /**
    +   * Read carbonindex file and return the schema
    +   * @param indexFilePath complete path including index file name
    +   * @return null, if the index file is not present in the path.
    --- End diff --
   
    Actually now it will not return null, It will throw exception. That why it is WIP. I will remove this from the method header.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2345#discussion_r191110113
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
    @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
         return reader.readSchema();
       }
     
    +  /**
    +   * Read carbonindex file and return the schema
    --- End diff --
   
    OK


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2345#discussion_r191110214
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
    @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
         return reader.readSchema();
       }
     
    +  /**
    +   * Read carbonindex file and return the schema
    +   * @param indexFilePath complete path including index file name
    +   * @return null, if the index file is not present in the path.
    +   *         List<ColumnSchema> from the index file.
    +   * @throws IOException
    +   */
    +  public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException {
    --- End diff --
   
    This test case is redundant. As new API in this PR, just need path. It doesn't matter whether reader is transactional or not. Hence not added


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2345#discussion_r191110226
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
    @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
         return reader.readSchema();
       }
     
    +  /**
    +   * Read carbonindex file and return the schema
    +   * @param indexFilePath complete path including index file name
    +   * @return null, if the index file is not present in the path.
    +   *         List<ColumnSchema> from the index file.
    +   * @throws IOException
    +   */
    +  public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException {
    +    CarbonFile indexFile =
    +        FileFactory.getCarbonFile(indexFilePath, FileFactory.getFileType(indexFilePath));
    +    if (!indexFile.getName().endsWith(CarbonTablePath.INDEX_FILE_EXT)) {
    +      throw new IOException("Not an index file name");
    +    }
    +    // read schema from the first index file
    +    DataInputStream dataInputStream =
    +        FileFactory.getDataInputStream(indexFilePath, FileFactory.getFileType(indexFilePath));
    +    byte[] bytes = new byte[(int) indexFile.getSize()];
    +    try {
    +      //get the file in byte buffer
    +      dataInputStream.readFully(bytes);
    +      CarbonIndexFileReader indexReader = new CarbonIndexFileReader();
    +      // read from byte buffer.
    +      indexReader.openThriftReader(bytes);
    +      // get the index header
    +      org.apache.carbondata.format.IndexHeader readIndexHeader = indexReader.readIndexHeader();
    +      List<ColumnSchema> columnSchemaList = new ArrayList<ColumnSchema>();
    +      List<org.apache.carbondata.format.ColumnSchema> table_columns =
    +          readIndexHeader.getTable_columns();
    +      for (org.apache.carbondata.format.ColumnSchema columnSchema: table_columns) {
    --- End diff --
   
    OK.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2345: [WIP] Improve Carbon Reader Schema reading performan...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2345
 
    It seems the method you added is dead code since no one calls it except from test.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2345: [WIP] Improve Carbon Reader Schema reading performan...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on the issue:

    https://github.com/apache/carbondata/pull/2345
 
    @xuchuanyin : This is one of the exposed method in SDK reader now. External client can make use of it to get schema by passing index file path.
   
    You can refer the other methods in this file. This is not a dead code


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2345: [CARBONDATA-2557] Improve Carbon Reader Schema readi...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2345
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6149/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2345: [CARBONDATA-2557] Improve Carbon Reader Schema readi...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2345
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4987/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2345: [CARBONDATA-2557] Improve Carbon Reader Schema readi...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2345
 
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5132/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2345: [CARBONDATA-2557] Improve Carbon Reader Schem...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user sounakr commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2345#discussion_r191328391
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
    @@ -101,6 +108,45 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
         return reader.readSchema();
       }
     
    +  /**
    +   * Read carbonindex file and return the schema.
    +   * This API is better use when index file is present,
    +   * as It is faster compared to readSchemaInDataFile when carbondata file is big in size
    +   *
    +   * @param indexFilePath complete path including index file name
    +   * @return List<ColumnSchema> from the index file.
    +   * @throws IOException
    +   */
    +  public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException {
    --- End diff --
   
    In case we have a performance result for comparison between reading from CarbonIndexFile and CarbonDataFile, and if reading from CarbonIndexFile Shows better output, then better to read from Index File in all places, i.e. whenever internally we infer schema.



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2345: [WIP][CARBONDATA-2557] Improve Carbon Reader Schema ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2345
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5025/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2345: [WIP][CARBONDATA-2557] Improve Carbon Reader Schema ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2345
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6186/



---
1234