[GitHub] carbondata pull request #2273: [CARBONDATA-2442] Reading two sdk writer outp...

classic Classic list List threaded Threaded
45 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2273: [CARBONDATA-2442] Reading two sdk writer outp...

qiuchenjian-2
GitHub user ajantha-bhat opened a pull request:

    https://github.com/apache/carbondata/pull/2273

    [CARBONDATA-2442] Reading two sdk writer output with differnt schema should prompt exception

    [CARBONDATA-2442] Reading two sdk writer output with differnt schema should prompt exception
   
    **problem** : when two sdk writer output with differnt schema is placed in
    same folder for reading, output is not as expected. It has many null
    output.
   
    **root cause:** when multiple carbondata and indexx files is placed in same
    folder. table schema is inferred by first file.
    comparing table schema with all other index file schema validation is
    not present
   
    **solution:** comapre table schema with all other index file schema, if
    there is a mismatch throw exception
   
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed? NA
     
     - [ ] Any backward compatibility impacted? NA
     
     - [ ] Document update required? NA
   
     - [ ] Testing done
       added the test case in  TestNonTransactionalCarbonTable.scala
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.  NA
   


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajantha-bhat/carbondata schema_mismatch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2273.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2273
   
----
commit 03ab5f30909fe971e6221fdab83bdeb2f81e385c
Author: ajantha-bhat <ajanthabhat@...>
Date:   2018-05-05T11:29:44Z

    [CARBONDATA-2442] Reading two sdk writer output with differnt schema should prompt exception
   
    problem : when two sdk writer output with differnt schema is placed in
    same folder for reading, output is not as expected. It has many null
    output.
   
    root cause: when multiple carbondata and indexx files is placed in same
    folder. table schema is inferred by first file.
    comparing table schema with all other index file schema validation is
    not present
   
    solution: comapre table schema with all other index file schema, if
    there is a mismatch throw exception

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Reading two sdk writer output with...

qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4736/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Reading two sdk writer output with...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4499/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Reading two sdk writer output with...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5659/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    retest this please


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5665/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4505/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    retest this please


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4518/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5678/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2273: [CARBONDATA-2442] Fixed: Reading two sdk writ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2273#discussion_r186696437
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/table/column/ColumnSchema.java ---
    @@ -342,6 +342,30 @@ public void setParentColumnTableRelations(
         return true;
       }
     
    +  /**
    +   * method to compare columnSchema,
    +   * other parameters along with just column name and column data type
    +   * @param obj
    +   * @return
    +   */
    +  public boolean equalsWithStrictCheck(Object obj) {
    +    if (!this.equals(obj)) {
    +      return false;
    +    }
    +    ColumnSchema other = (ColumnSchema) obj;
    +    if (!columnUniqueId.equals(other.columnUniqueId) ||
    +        (isDimensionColumn != other.isDimensionColumn) ||
    +        (scale != other.scale) ||
    +        (precision != other.precision) ||
    +        (isSortColumn != other.isSortColumn)) {
    +      return false;
    +    }
    +    if (encodingList.size() != other.encodingList.size()) {
    --- End diff --
   
    Better to check the encoding values also...this is a generic method and can be useful for other schenario


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2273: [CARBONDATA-2442] Fixed: Reading two sdk writ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2273#discussion_r186725643
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/table/column/ColumnSchema.java ---
    @@ -342,6 +342,30 @@ public void setParentColumnTableRelations(
         return true;
       }
     
    +  /**
    +   * method to compare columnSchema,
    +   * other parameters along with just column name and column data type
    +   * @param obj
    +   * @return
    +   */
    +  public boolean equalsWithStrictCheck(Object obj) {
    +    if (!this.equals(obj)) {
    +      return false;
    +    }
    +    ColumnSchema other = (ColumnSchema) obj;
    +    if (!columnUniqueId.equals(other.columnUniqueId) ||
    +        (isDimensionColumn != other.isDimensionColumn) ||
    +        (scale != other.scale) ||
    +        (precision != other.precision) ||
    +        (isSortColumn != other.isSortColumn)) {
    +      return false;
    +    }
    +    if (encodingList.size() != other.encodingList.size()) {
    --- End diff --
   
    done. added


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    retest this please


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5741/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5745/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4805/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4585/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2273: [CARBONDATA-2442] Fixed: Reading two sdk writer outp...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2273
 
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4808/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2273: [CARBONDATA-2442] Fixed: Reading two sdk writ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kunal642 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2273#discussion_r186992068
 
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ---
    @@ -151,6 +154,33 @@ public CarbonTable getOrCreateCarbonTable(Configuration configuration) throws IO
         SegmentStatusManager segmentStatusManager = new SegmentStatusManager(identifier);
         SegmentStatusManager.ValidAndInvalidSegmentsInfo segments = segmentStatusManager
             .getValidAndInvalidSegments(loadMetadataDetails, this.readCommittedScope);
    +
    +    // For NonTransactional table, compare the schema of all index files with inferred schema.
    +    // If there is a mismatch throw exception. As all files must be of same schema.
    +    if (!carbonTable.getTableInfo().isTransactionalTable()) {
    +      SchemaConverter schemaConverter = new ThriftWrapperSchemaConverterImpl();
    +      for (Segment segment : segments.getValidSegments()) {
    +        Map<String, String> indexFiles = segment.getCommittedIndexFile();
    +        for (Map.Entry<String, String> indexFileEntry : indexFiles.entrySet()) {
    +          Path indexFile = new Path(indexFileEntry.getKey());
    +          org.apache.carbondata.format.TableInfo tableInfo = CarbonUtil.inferSchemaFromIndexFile(
    +              indexFile.toString(), carbonTable.getTableName());
    +          TableInfo wrapperTableInfo = schemaConverter.fromExternalToWrapperTableInfo(
    +              tableInfo, identifier.getDatabaseName(),
    +              identifier.getTableName(),
    +              identifier.getTablePath());
    +          List<ColumnSchema> indexFileColumnList =
    +              wrapperTableInfo.getFactTable().getListOfColumns();
    +          List<ColumnSchema> tableColumnList =
    +              carbonTable.getTableInfo().getFactTable().getListOfColumns();
    +          if (!compareColumnSchemaList(indexFileColumnList, tableColumnList)) {
    +            throw new IOException("All the files schema doesn't match. "
    --- End diff --
   
    I dont think throwing exception is the correct approach. Either we should not let the user write data with different schema or we should infer the schema from the first index file that was written and skip any index file with different schema while preparing splits.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2273: [CARBONDATA-2442] Fixed: Reading two sdk writ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kunal642 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2273#discussion_r186992244
 
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ---
    @@ -151,6 +154,33 @@ public CarbonTable getOrCreateCarbonTable(Configuration configuration) throws IO
         SegmentStatusManager segmentStatusManager = new SegmentStatusManager(identifier);
         SegmentStatusManager.ValidAndInvalidSegmentsInfo segments = segmentStatusManager
             .getValidAndInvalidSegments(loadMetadataDetails, this.readCommittedScope);
    +
    +    // For NonTransactional table, compare the schema of all index files with inferred schema.
    +    // If there is a mismatch throw exception. As all files must be of same schema.
    +    if (!carbonTable.getTableInfo().isTransactionalTable()) {
    +      SchemaConverter schemaConverter = new ThriftWrapperSchemaConverterImpl();
    +      for (Segment segment : segments.getValidSegments()) {
    +        Map<String, String> indexFiles = segment.getCommittedIndexFile();
    +        for (Map.Entry<String, String> indexFileEntry : indexFiles.entrySet()) {
    +          Path indexFile = new Path(indexFileEntry.getKey());
    +          org.apache.carbondata.format.TableInfo tableInfo = CarbonUtil.inferSchemaFromIndexFile(
    +              indexFile.toString(), carbonTable.getTableName());
    +          TableInfo wrapperTableInfo = schemaConverter.fromExternalToWrapperTableInfo(
    +              tableInfo, identifier.getDatabaseName(),
    +              identifier.getTableName(),
    +              identifier.getTablePath());
    +          List<ColumnSchema> indexFileColumnList =
    +              wrapperTableInfo.getFactTable().getListOfColumns();
    +          List<ColumnSchema> tableColumnList =
    +              carbonTable.getTableInfo().getFactTable().getListOfColumns();
    +          if (!compareColumnSchemaList(indexFileColumnList, tableColumnList)) {
    +            throw new IOException("All the files schema doesn't match. "
    --- End diff --
   
    I dont think throwing exception is the correct approach. Either we should not let the user write data with different schema or we should infer the schema from the first index file that was written and skip any index file with different schema while preparing splits.
   
    @sounakr @kumarvishal09 what do you think??


---
123