Hello All
We are observing incorrect query results with carbondata 1.1.1. Please find the details below.

*Datasets used -*
TPC-H star schema based datasets (http://www.cs.umb.edu/~poneil/StarSchemaB.PDF)

*Query -*
select cCustKey,loCustKey from customer, lineorder where loCustkey = cCustKey

*How we load data -*
We validated loading data through both dataframes and "INSERT" statements, and both ways produce incorrect results. Here is one way:

-- CREATE CUSTOMER TABLE
carbon.sql("CREATE TABLE IF NOT EXISTS customer(cCustKey Int, cName string, cAddress string, cCity string, cNation string, cRegion string, cPhone string, cMktSegment string, dummy string) STORED BY 'carbondata'")
carbon.sql("LOAD DATA INPATH '/xxxx/yyyy/tmp/ssb_raw/customer' INTO TABLE customer OPTIONS('DELIMITER'='\t','FILEHEADER'='cCustKey,cName,cAddress,cCity,cNation,cRegion,cPhone,cMktsegment,dummy')")

-- CREATE LINEORDER TABLE
carbon.sql("CREATE TABLE IF NOT EXISTS lineorder(loOrderkey bigint,loLinenumber Int,loCustkey Int,loPartkey Int,loSuppkey Int,loOrderdate Int,loOrderpriority String,loShippriority Int,loQuantity Int,loExtendedprice Int,loOrdtotalprice Int,loDiscount Int,loRevenue Int,loSupplycost Int,loTax Int,loCommitdate Int,loShipmode String,dummy String) STORED BY 'carbondata'")
carbon.sql("LOAD DATA INPATH '/xxxx/yyyy/tmp/ssb_raw/lineorder' INTO TABLE lineorder OPTIONS('DELIMITER'='\t','FILEHEADER'='loOrderkey,loLinenumber,loCustkey,loPartkey,loSuppkey,loOrderdate,loOrderpriority,loShippriority,loQuantity,loExtendedprice,loOrdtotalprice,loDiscount,loRevenue,loSupplycost,loTax,loCommitdate,loShipmode,dummy')")

*Results with different versions -*
* 1.1.0 - Provides correct results for the above query. Validated against results from parquet.
* 1.1.1 - Built from this tag <https://github.com/apache/carbondata/tree/apache-carbondata-1.1.1-rc1>. The join is missing lots of rows compared to parquet.
* 1.1.1 - Built from the source code available for download <https://dist.apache.org/repos/dist/release/carbondata/1.1.1/apache-carbondata-1.1.1-source-release.zip>. The join is missing lots of rows compared to parquet.
* 1.2 - Built from the master branch. Generated correct results matching parquet.

*Debugging further -*
1. Row counts for both the lineorder and customer tables are the same in carbondata and parquet.
2. Comparing the key columns in carbondata vs parquet, they match as well -

   val cd = carbon.sql("select cCustKey from customer") //.distinct.count -- 30,000,000
   val sp = spark.sql("select cCustKey from pcustomer") //.distinct.count -- 30,000,000
   cd.intersect(sp) // 30,000,000 rows (carbondata has the same keys as parquet)

   val cd = carbon.sql("select loCustKey from lineorder") //.distinct.count -- 13,365,986
   val sp = spark.sql("select loCustKey from plineorder") //.distinct.count -- 13,365,986
   cd.intersect(sp) // 13,365,986 rows (carbondata has the same keys as parquet)

The queries above show that the carbondata customer and lineorder tables have the same key values as parquet. However, when we run the join query above, carbondata returns only a very small subset of the expected rows. A filter query on any specific key also returns no results.

We are not sure why v1.1.1 is producing incorrect results. My guess is that carbondata is skipping rows that it shouldn't in v1.1.1.

Any help and suggestions are very much appreciated. Thanks in advance.

Thanks
Swapnil Shinde
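One way to narrow down the debugging steps above is to list exactly which join keys parquet returns and carbondata drops, so that a single missing key can be used in a filter query. The following is only a minimal sketch, not something taken from carbondata itself: the object name `JoinKeyDiff` and the helper `missingKeys` are hypothetical, and the toy DataFrames in `main` stand in for the real join results. It relies only on the standard Spark `Dataset.except` call.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, named here for illustration only.
object JoinKeyDiff {

  /** Distinct values of `key` that occur in `reference` but not in `candidate`. */
  def missingKeys(reference: DataFrame, candidate: DataFrame, key: String): DataFrame =
    reference.select(key).distinct().except(candidate.select(key).distinct())

  def main(args: Array[String]): Unit = {
    // Local session with toy data, only so the sketch can run on its own.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("join-key-diff")
      .getOrCreate()
    import spark.implicits._

    // Assumption: stand-ins for the parquet join result and the carbondata join result.
    val parquetJoin = Seq((1, 1), (2, 2), (3, 3)).toDF("cCustKey", "loCustKey")
    val carbonJoin  = Seq((1, 1), (3, 3)).toDF("cCustKey", "loCustKey")

    // Keys that the reference join produced but the candidate join did not.
    val missing = missingKeys(parquetJoin, carbonJoin, "cCustKey")
    println(s"keys missing from candidate join: ${missing.count()}")
    missing.show()

    spark.stop()
  }
}
```

In the setup described above, `reference` would presumably be the join run through `spark.sql` on `pcustomer` and `plineorder`, and `candidate` the same join run through `carbon.sql` on `customer` and `lineorder`. Any key returned by `missingKeys` is a candidate for a single-key filter query against the carbondata tables.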
Hi,
I verified this on 1.1.1 using TPC-H tables with 1 GB of generated data, but I got the result below. I don't have the exact schema you mentioned, so I verified with the original TPC-H schema.

0: jdbc:hive2://localhost:10000> select count(c_CustKey),count(o_CustKey) from customer, orders where o_Custkey = c_CustKey;
+-------------------+-------------------+--+
| count(c_CustKey)  | count(o_CustKey)  |
+-------------------+-------------------+--+
| 1500000           | 1500000           |
+-------------------+-------------------+--+

On parquet with the same data:

0: jdbc:hive2://localhost:10000> select count(c_CustKey),count(o_CustKey) from customer, orders where o_Custkey = c_CustKey;
+-------------------+-------------------+--+
| count(c_CustKey)  | count(o_CustKey)  |
+-------------------+-------------------+--+
| 1500000           | 1500000           |
+-------------------+-------------------+--+

Regards,
Ravindra.

On 23 August 2017 at 19:40, Swapnil Shinde <[hidden email]> wrote:
> Hello All
> We are observing incorrect query results with carbondata 1.1.1. [...]

--
Thanks & Regards,
Ravi