Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] carbondata pull request #2345: [wip] Improve Carbon Reader Schema reading pe...

Classic

List

66 messages Options

Options

1234

[GitHub] carbondata pull request #2345: [wip] Improve Carbon Reader Schema reading pe...

GitHub user ajantha-bhat opened a pull request:

https://github.com/apache/carbondata/pull/2345

[wip] Improve Carbon Reader Schema reading performance on S3

Problem : Currently carbon reader is reading schema from carbondata file. On s3 multiple IO happens as buffer size is small and data file size is big.

Solution: Read schema from index file and do once IO of index file with a buffer size equal to index file size.

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

- [ ] Any interfaces changed? NA

- [ ] Any backward compatibility impacted? NA

- [ ] Document update required? NA

- [ ] Testing done
Added UT

- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ajantha-bhat/carbondata master_new

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/carbondata/pull/2345.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2345

----
commit 59b248bbef97010dc2f5dc697400bb2f85799425
Author: ajantha-bhat <ajanthabhat@...>
Date: 2018-05-27T17:19:23Z

Improve Carbon Reader Schema reading on S3

----

---

[GitHub] carbondata issue #2345: [WIP] Improve Carbon Reader Schema reading performan...

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2345

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5106/

---

[GitHub] carbondata issue #2345: [WIP] Improve Carbon Reader Schema reading performan...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2345

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6113/

---

[GitHub] carbondata issue #2345: [WIP] Improve Carbon Reader Schema reading performan...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2345

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4951/

---

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

In reply to this post by qiuchenjian-2

Github user xubo245 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2345#discussion_r191100739

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
@@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
return reader.readSchema();
}

+ /**
+ * Read carbonindex file and return the schema
+ * @param indexFilePath complete path including index file name
+ * @return null, if the index file is not present in the path.
--- End diff --

Can you optimize the expression? and can you add test case for returning null if the index file is not present in the path?

---

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

In reply to this post by qiuchenjian-2

Github user xubo245 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2345#discussion_r191100925

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
@@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
return reader.readSchema();
}

+ /**
+ * Read carbonindex file and return the schema
--- End diff --

Can you add a empty line?

---

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

In reply to this post by qiuchenjian-2

Github user xubo245 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2345#discussion_r191101114

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
@@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
return reader.readSchema();
}

+ /**
+ * Read carbonindex file and return the schema
+ * @param indexFilePath complete path including index file name
+ * @return null, if the index file is not present in the path.
+ * List<ColumnSchema> from the index file.
+ * @throws IOException
+ */
+ public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException {
--- End diff --

Can you add test case for non Transactional Table for this API?

---

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

In reply to this post by qiuchenjian-2

Github user xubo245 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2345#discussion_r191103497

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
@@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
return reader.readSchema();
}

+ /**
+ * Read carbonindex file and return the schema
+ * @param indexFilePath complete path including index file name
+ * @return null, if the index file is not present in the path.
+ * List<ColumnSchema> from the index file.
+ * @throws IOException
+ */
+ public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException {
+ CarbonFile indexFile =
+ FileFactory.getCarbonFile(indexFilePath, FileFactory.getFileType(indexFilePath));
+ if (!indexFile.getName().endsWith(CarbonTablePath.INDEX_FILE_EXT)) {
+ throw new IOException("Not an index file name");
+ }
+ // read schema from the first index file
+ DataInputStream dataInputStream =
+ FileFactory.getDataInputStream(indexFilePath, FileFactory.getFileType(indexFilePath));
+ byte[] bytes = new byte[(int) indexFile.getSize()];
+ try {
+ //get the file in byte buffer
+ dataInputStream.readFully(bytes);
+ CarbonIndexFileReader indexReader = new CarbonIndexFileReader();
+ // read from byte buffer.
+ indexReader.openThriftReader(bytes);
+ // get the index header
+ org.apache.carbondata.format.IndexHeader readIndexHeader = indexReader.readIndexHeader();
+ List<ColumnSchema> columnSchemaList = new ArrayList<ColumnSchema>();
+ List<org.apache.carbondata.format.ColumnSchema> table_columns =
+ readIndexHeader.getTable_columns();
+ for (org.apache.carbondata.format.ColumnSchema columnSchema: table_columns) {
--- End diff --

small code style: should add space before ':',like columnSchema : table_columns

---

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

In reply to this post by qiuchenjian-2

Github user ajantha-bhat commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2345#discussion_r191110093

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
@@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
return reader.readSchema();
}

+ /**
+ * Read carbonindex file and return the schema
+ * @param indexFilePath complete path including index file name
+ * @return null, if the index file is not present in the path.
--- End diff --

Actually now it will not return null, It will throw exception. That why it is WIP. I will remove this from the method header.

---

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

In reply to this post by qiuchenjian-2

Github user ajantha-bhat commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2345#discussion_r191110113

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
@@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
return reader.readSchema();
}

+ /**
+ * Read carbonindex file and return the schema
--- End diff --

OK

---

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

In reply to this post by qiuchenjian-2

Github user ajantha-bhat commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2345#discussion_r191110214

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
@@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
return reader.readSchema();
}

+ /**
+ * Read carbonindex file and return the schema
+ * @param indexFilePath complete path including index file name
+ * @return null, if the index file is not present in the path.
+ * List<ColumnSchema> from the index file.
+ * @throws IOException
+ */
+ public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException {
--- End diff --

This test case is redundant. As new API in this PR, just need path. It doesn't matter whether reader is transactional or not. Hence not added

---

[GitHub] carbondata pull request #2345: [WIP] Improve Carbon Reader Schema reading pe...

In reply to this post by qiuchenjian-2

Github user ajantha-bhat commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2345#discussion_r191110226

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
@@ -101,6 +108,43 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
return reader.readSchema();
}

+ /**
+ * Read carbonindex file and return the schema
+ * @param indexFilePath complete path including index file name
+ * @return null, if the index file is not present in the path.
+ * List<ColumnSchema> from the index file.
+ * @throws IOException
+ */
+ public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException {
+ CarbonFile indexFile =
+ FileFactory.getCarbonFile(indexFilePath, FileFactory.getFileType(indexFilePath));
+ if (!indexFile.getName().endsWith(CarbonTablePath.INDEX_FILE_EXT)) {
+ throw new IOException("Not an index file name");
+ }
+ // read schema from the first index file
+ DataInputStream dataInputStream =
+ FileFactory.getDataInputStream(indexFilePath, FileFactory.getFileType(indexFilePath));
+ byte[] bytes = new byte[(int) indexFile.getSize()];
+ try {
+ //get the file in byte buffer
+ dataInputStream.readFully(bytes);
+ CarbonIndexFileReader indexReader = new CarbonIndexFileReader();
+ // read from byte buffer.
+ indexReader.openThriftReader(bytes);
+ // get the index header
+ org.apache.carbondata.format.IndexHeader readIndexHeader = indexReader.readIndexHeader();
+ List<ColumnSchema> columnSchemaList = new ArrayList<ColumnSchema>();
+ List<org.apache.carbondata.format.ColumnSchema> table_columns =
+ readIndexHeader.getTable_columns();
+ for (org.apache.carbondata.format.ColumnSchema columnSchema: table_columns) {
--- End diff --

OK.

---

[GitHub] carbondata issue #2345: [WIP] Improve Carbon Reader Schema reading performan...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/2345

It seems the method you added is dead code since no one calls it except from test.

---

[GitHub] carbondata issue #2345: [WIP] Improve Carbon Reader Schema reading performan...

In reply to this post by qiuchenjian-2

Github user ajantha-bhat commented on the issue:

https://github.com/apache/carbondata/pull/2345

@xuchuanyin : This is one of the exposed method in SDK reader now. External client can make use of it to get schema by passing index file path.

You can refer the other methods in this file. This is not a dead code

---

[GitHub] carbondata issue #2345: [CARBONDATA-2557] Improve Carbon Reader Schema readi...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2345

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6149/

---

[GitHub] carbondata issue #2345: [CARBONDATA-2557] Improve Carbon Reader Schema readi...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2345

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4987/

---

[GitHub] carbondata issue #2345: [CARBONDATA-2557] Improve Carbon Reader Schema readi...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2345

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5132/

---

[GitHub] carbondata pull request #2345: [CARBONDATA-2557] Improve Carbon Reader Schem...

In reply to this post by qiuchenjian-2

Github user sounakr commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2345#discussion_r191328391

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReader.java ---
@@ -101,6 +108,45 @@ public static CarbonReaderBuilder builder(String tablePath, String tableName) {
return reader.readSchema();
}

+ /**
+ * Read carbonindex file and return the schema.
+ * This API is better use when index file is present,
+ * as It is faster compared to readSchemaInDataFile when carbondata file is big in size
+ *
+ * @param indexFilePath complete path including index file name
+ * @return List<ColumnSchema> from the index file.
+ * @throws IOException
+ */
+ public static List<ColumnSchema> readSchemaInIndexFile(String indexFilePath) throws IOException {
--- End diff --

In case we have a performance result for comparison between reading from CarbonIndexFile and CarbonDataFile, and if reading from CarbonIndexFile Shows better output, then better to read from Index File in all places, i.e. whenever internally we infer schema.

---

[GitHub] carbondata issue #2345: [WIP][CARBONDATA-2557] Improve Carbon Reader Schema ...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2345

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5025/

---

[GitHub] carbondata issue #2345: [WIP][CARBONDATA-2557] Improve Carbon Reader Schema ...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2345

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6186/

---

1234