Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] carbondata pull request #2273: [CARBONDATA-2442] Reading two sdk writer outp...

Classic

List

45 messages Options

Options

123

[GitHub] carbondata pull request #2273: [CARBONDATA-2442] Reading two sdk writer outp...

GitHub user ajantha-bhat opened a pull request:

https://github.com/apache/carbondata/pull/2273

[CARBONDATA-2442] Reading two sdk writer output with differnt schema should prompt exception

[CARBONDATA-2442] Reading two sdk writer output with differnt schema should prompt exception

**problem** : when two sdk writer output with differnt schema is placed in
same folder for reading, output is not as expected. It has many null
output.

**root cause:** when multiple carbondata and indexx files is placed in same
folder. table schema is inferred by first file.
comparing table schema with all other index file schema validation is
not present

**solution:** comapre table schema with all other index file schema, if
there is a mismatch throw exception

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

- [ ] Any interfaces changed? NA

- [ ] Any backward compatibility impacted? NA

- [ ] Document update required? NA

- [ ] Testing done
added the test case in TestNonTransactionalCarbonTable.scala
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ajantha-bhat/carbondata schema_mismatch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/carbondata/pull/2273.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2273

----
commit 03ab5f30909fe971e6221fdab83bdeb2f81e385c
Author: ajantha-bhat <ajanthabhat@...>
Date: 2018-05-05T11:29:44Z

[CARBONDATA-2442] Reading two sdk writer output with differnt schema should prompt exception

problem : when two sdk writer output with differnt schema is placed in
same folder for reading, output is not as expected. It has many null
output.

root cause: when multiple carbondata and indexx files is placed in same
folder. table schema is inferred by first file.
comparing table schema with all other index file schema validation is
not present

solution: comapre table schema with all other index file schema, if
there is a mismatch throw exception

----

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Reading two sdk writer output with...

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2273

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4736/

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Reading two sdk writer output with...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2273

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4499/

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Reading two sdk writer output with...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2273

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5659/

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

In reply to this post by qiuchenjian-2

Github user ajantha-bhat commented on the issue:

https://github.com/apache/carbondata/pull/2273

retest this please

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2273

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5665/

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2273

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4505/

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

In reply to this post by qiuchenjian-2

Github user ajantha-bhat commented on the issue:

https://github.com/apache/carbondata/pull/2273

retest this please

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2273

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4518/

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2273

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5678/

---

[GitHub] carbondata pull request #2273: [CARBONDATA-2442] Fixed: Reading two sdk writ...

In reply to this post by qiuchenjian-2

Github user kumarvishal09 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2273#discussion_r186696437

--- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/table/column/ColumnSchema.java ---
@@ -342,6 +342,30 @@ public void setParentColumnTableRelations(
return true;
}

+ /**
+ * method to compare columnSchema,
+ * other parameters along with just column name and column data type
+ * @param obj
+ * @return
+ */
+ public boolean equalsWithStrictCheck(Object obj) {
+ if (!this.equals(obj)) {
+ return false;
+ }
+ ColumnSchema other = (ColumnSchema) obj;
+ if (!columnUniqueId.equals(other.columnUniqueId) ||
+ (isDimensionColumn != other.isDimensionColumn) ||
+ (scale != other.scale) ||
+ (precision != other.precision) ||
+ (isSortColumn != other.isSortColumn)) {
+ return false;
+ }
+ if (encodingList.size() != other.encodingList.size()) {
--- End diff --

Better to check the encoding values also...this is a generic method and can be useful for other schenario

---

[GitHub] carbondata pull request #2273: [CARBONDATA-2442] Fixed: Reading two sdk writ...

In reply to this post by qiuchenjian-2

Github user ajantha-bhat commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2273#discussion_r186725643

--- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/table/column/ColumnSchema.java ---
@@ -342,6 +342,30 @@ public void setParentColumnTableRelations(
return true;
}

+ /**
+ * method to compare columnSchema,
+ * other parameters along with just column name and column data type
+ * @param obj
+ * @return
+ */
+ public boolean equalsWithStrictCheck(Object obj) {
+ if (!this.equals(obj)) {
+ return false;
+ }
+ ColumnSchema other = (ColumnSchema) obj;
+ if (!columnUniqueId.equals(other.columnUniqueId) ||
+ (isDimensionColumn != other.isDimensionColumn) ||
+ (scale != other.scale) ||
+ (precision != other.precision) ||
+ (isSortColumn != other.isSortColumn)) {
+ return false;
+ }
+ if (encodingList.size() != other.encodingList.size()) {
--- End diff --

done. added

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

In reply to this post by qiuchenjian-2

Github user ajantha-bhat commented on the issue:

https://github.com/apache/carbondata/pull/2273

retest this please

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2273

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5741/

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2273

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5745/

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2273

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4805/

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2273

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4585/

---

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2273

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4808/

---

[GitHub] carbondata pull request #2273: [CARBONDATA-2442] Fixed: Reading two sdk writ...

In reply to this post by qiuchenjian-2

Github user kunal642 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2273#discussion_r186992068

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ---
@@ -151,6 +154,33 @@ public CarbonTable getOrCreateCarbonTable(Configuration configuration) throws IO
SegmentStatusManager segmentStatusManager = new SegmentStatusManager(identifier);
SegmentStatusManager.ValidAndInvalidSegmentsInfo segments = segmentStatusManager
.getValidAndInvalidSegments(loadMetadataDetails, this.readCommittedScope);
+
+ // For NonTransactional table, compare the schema of all index files with inferred schema.
+ // If there is a mismatch throw exception. As all files must be of same schema.
+ if (!carbonTable.getTableInfo().isTransactionalTable()) {
+ SchemaConverter schemaConverter = new ThriftWrapperSchemaConverterImpl();
+ for (Segment segment : segments.getValidSegments()) {
+ Map<String, String> indexFiles = segment.getCommittedIndexFile();
+ for (Map.Entry<String, String> indexFileEntry : indexFiles.entrySet()) {
+ Path indexFile = new Path(indexFileEntry.getKey());
+ org.apache.carbondata.format.TableInfo tableInfo = CarbonUtil.inferSchemaFromIndexFile(
+ indexFile.toString(), carbonTable.getTableName());
+ TableInfo wrapperTableInfo = schemaConverter.fromExternalToWrapperTableInfo(
+ tableInfo, identifier.getDatabaseName(),
+ identifier.getTableName(),
+ identifier.getTablePath());
+ List<ColumnSchema> indexFileColumnList =
+ wrapperTableInfo.getFactTable().getListOfColumns();
+ List<ColumnSchema> tableColumnList =
+ carbonTable.getTableInfo().getFactTable().getListOfColumns();
+ if (!compareColumnSchemaList(indexFileColumnList, tableColumnList)) {
+ throw new IOException("All the files schema doesn't match. "
--- End diff --

I dont think throwing exception is the correct approach. Either we should not let the user write data with different schema or we should infer the schema from the first index file that was written and skip any index file with different schema while preparing splits.

---

[GitHub] carbondata pull request #2273: [CARBONDATA-2442] Fixed: Reading two sdk writ...

In reply to this post by qiuchenjian-2

Github user kunal642 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2273#discussion_r186992244

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ---
@@ -151,6 +154,33 @@ public CarbonTable getOrCreateCarbonTable(Configuration configuration) throws IO
SegmentStatusManager segmentStatusManager = new SegmentStatusManager(identifier);
SegmentStatusManager.ValidAndInvalidSegmentsInfo segments = segmentStatusManager
.getValidAndInvalidSegments(loadMetadataDetails, this.readCommittedScope);
+
+ // For NonTransactional table, compare the schema of all index files with inferred schema.
+ // If there is a mismatch throw exception. As all files must be of same schema.
+ if (!carbonTable.getTableInfo().isTransactionalTable()) {
+ SchemaConverter schemaConverter = new ThriftWrapperSchemaConverterImpl();
+ for (Segment segment : segments.getValidSegments()) {
+ Map<String, String> indexFiles = segment.getCommittedIndexFile();
+ for (Map.Entry<String, String> indexFileEntry : indexFiles.entrySet()) {
+ Path indexFile = new Path(indexFileEntry.getKey());
+ org.apache.carbondata.format.TableInfo tableInfo = CarbonUtil.inferSchemaFromIndexFile(
+ indexFile.toString(), carbonTable.getTableName());
+ TableInfo wrapperTableInfo = schemaConverter.fromExternalToWrapperTableInfo(
+ tableInfo, identifier.getDatabaseName(),
+ identifier.getTableName(),
+ identifier.getTablePath());
+ List<ColumnSchema> indexFileColumnList =
+ wrapperTableInfo.getFactTable().getListOfColumns();
+ List<ColumnSchema> tableColumnList =
+ carbonTable.getTableInfo().getFactTable().getListOfColumns();
+ if (!compareColumnSchemaList(indexFileColumnList, tableColumnList)) {
+ throw new IOException("All the files schema doesn't match. "
--- End diff --

I dont think throwing exception is the correct approach. Either we should not let the user write data with different schema or we should infer the schema from the first index file that was written and skip any index file with different schema while preparing splits.

@sounakr @kumarvishal09 what do you think??

---

123