GitHub user ajantha-bhat opened a pull request:
https://github.com/apache/carbondata/pull/2345 [wip] Improve Carbon Reader Schema reading performance on S3 Problem : Currently carbon reader is reading schema from carbondata file. On s3 multiple IO happens as buffer size is small and data file size is big. Solution: Read schema from index file and do once IO of index file with a buffer size equal to index file size. Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? NA - [ ] Any backward compatibility impacted? NA - [ ] Document update required? NA - [ ] Testing done Added UT - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA You can merge this pull request into a Git repository by running: $ git pull https://github.com/ajantha-bhat/carbondata master_new Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2345.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2345 ---- commit 59b248bbef97010dc2f5dc697400bb2f85799425 Author: ajantha-bhat <ajanthabhat@...> Date: 2018-05-27T17:19:23Z Improve Carbon Reader Schema reading on S3 ---- --- |
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2345 SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5106/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2345 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6113/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2345 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4951/ --- |
In reply to this post by qiuchenjian-2
Github user xubo245 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2345#discussion_r191100739 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java --- @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) { return reader.readSchema(); } + /** + * Read carbonindex file and return the schema + * @param indexFilePath complete path including index file name + * @return null, if the index file is not present in the path. --- End diff -- Can you optimize the expression? and can you add test case for returning null if the index file is not present in the path? --- |
In reply to this post by qiuchenjian-2
Github user xubo245 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2345#discussion_r191100925 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java --- @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) { return reader.readSchema(); } + /** + * Read carbonindex file and return the schema --- End diff -- Can you add a empty line? --- |
In reply to this post by qiuchenjian-2
Github user xubo245 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2345#discussion_r191101114 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java --- @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) { return reader.readSchema(); } + /** + * Read carbonindex file and return the schema + * @param indexFilePath complete path including index file name + * @return null, if the index file is not present in the path. + * List<ColumnSchema> from the index file. + * @throws IOException + */ + public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException { --- End diff -- Can you add test case for non Transactional Table for this API? --- |
In reply to this post by qiuchenjian-2
Github user xubo245 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2345#discussion_r191103497 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java --- @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) { return reader.readSchema(); } + /** + * Read carbonindex file and return the schema + * @param indexFilePath complete path including index file name + * @return null, if the index file is not present in the path. + * List<ColumnSchema> from the index file. + * @throws IOException + */ + public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException { + CarbonFile indexFile = + FileFactory.getCarbonFile(indexFilePath, FileFactory.getFileType(indexFilePath)); + if (!indexFile.getName().endsWith(CarbonTablePath.INDEX_FILE_EXT)) { + throw new IOException("Not an index file name"); + } + // read schema from the first index file + DataInputStream dataInputStream = + FileFactory.getDataInputStream(indexFilePath, FileFactory.getFileType(indexFilePath)); + byte[] bytes = new byte[(int) indexFile.getSize()]; + try { + //get the file in byte buffer + dataInputStream.readFully(bytes); + CarbonIndexFileReader indexReader = new CarbonIndexFileReader(); + // read from byte buffer. + indexReader.openThriftReader(bytes); + // get the index header + org.apache.carbondata.format.IndexHeader readIndexHeader = indexReader.readIndexHeader(); + List<ColumnSchema> columnSchemaList = new ArrayList<ColumnSchema>(); + List<org.apache.carbondata.format.ColumnSchema> table_columns = + readIndexHeader.getTable_columns(); + for (org.apache.carbondata.format.ColumnSchema columnSchema: table_columns) { --- End diff -- small code style: should add space before ':',like columnSchema : table_columns --- |
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2345#discussion_r191110093 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java --- @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) { return reader.readSchema(); } + /** + * Read carbonindex file and return the schema + * @param indexFilePath complete path including index file name + * @return null, if the index file is not present in the path. --- End diff -- Actually now it will not return null, It will throw exception. That why it is WIP. I will remove this from the method header. --- |
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2345#discussion_r191110113 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java --- @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) { return reader.readSchema(); } + /** + * Read carbonindex file and return the schema --- End diff -- OK --- |
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2345#discussion_r191110214 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java --- @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) { return reader.readSchema(); } + /** + * Read carbonindex file and return the schema + * @param indexFilePath complete path including index file name + * @return null, if the index file is not present in the path. + * List<ColumnSchema> from the index file. + * @throws IOException + */ + public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException { --- End diff -- This test case is redundant. As new API in this PR, just need path. It doesn't matter whether reader is transactional or not. Hence not added --- |
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2345#discussion_r191110226 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java --- @@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) { return reader.readSchema(); } + /** + * Read carbonindex file and return the schema + * @param indexFilePath complete path including index file name + * @return null, if the index file is not present in the path. + * List<ColumnSchema> from the index file. + * @throws IOException + */ + public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException { + CarbonFile indexFile = + FileFactory.getCarbonFile(indexFilePath, FileFactory.getFileType(indexFilePath)); + if (!indexFile.getName().endsWith(CarbonTablePath.INDEX_FILE_EXT)) { + throw new IOException("Not an index file name"); + } + // read schema from the first index file + DataInputStream dataInputStream = + FileFactory.getDataInputStream(indexFilePath, FileFactory.getFileType(indexFilePath)); + byte[] bytes = new byte[(int) indexFile.getSize()]; + try { + //get the file in byte buffer + dataInputStream.readFully(bytes); + CarbonIndexFileReader indexReader = new CarbonIndexFileReader(); + // read from byte buffer. + indexReader.openThriftReader(bytes); + // get the index header + org.apache.carbondata.format.IndexHeader readIndexHeader = indexReader.readIndexHeader(); + List<ColumnSchema> columnSchemaList = new ArrayList<ColumnSchema>(); + List<org.apache.carbondata.format.ColumnSchema> table_columns = + readIndexHeader.getTable_columns(); + for (org.apache.carbondata.format.ColumnSchema columnSchema: table_columns) { --- End diff -- OK. --- |
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2345 It seems the method you added is dead code since no one calls it except from test. --- |
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on the issue:
https://github.com/apache/carbondata/pull/2345 @xuchuanyin : This is one of the exposed method in SDK reader now. External client can make use of it to get schema by passing index file path. You can refer the other methods in this file. This is not a dead code --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2345 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6149/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2345 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4987/ --- |
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2345 SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5132/ --- |
In reply to this post by qiuchenjian-2
Github user sounakr commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2345#discussion_r191328391 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java --- @@ -101,6 +108,45 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) { return reader.readSchema(); } + /** + * Read carbonindex file and return the schema. + * This API is better use when index file is present, + * as It is faster compared to readSchemaInDataFile when carbondata file is big in size + * + * @param indexFilePath complete path including index file name + * @return List<ColumnSchema> from the index file. + * @throws IOException + */ + public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException { --- End diff -- In case we have a performance result for comparison between reading from CarbonIndexFile and CarbonDataFile, and if reading from CarbonIndexFile Shows better output, then better to read from Index File in all places, i.e. whenever internally we infer schema. --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2345 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5025/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2345 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6186/ --- |
Free forum by Nabble | Edit this page |