[POSSIBLE BUG] Carbondata 1.1.1 inaccurate results

Swapnil Shinde
Hello All,
    We are observing incorrect query results with carbondata 1.1.1. Please
find the details below -

*Datasets used -*
     Star Schema Benchmark (SSB) datasets, which are derived from TPC-H (
http://www.cs.umb.edu/~poneil/StarSchemaB.PDF)
*Query - *
*     select cCustKey,loCustKey from customer, lineorder where loCustkey =
cCustKey*
*How we load data -*
     We tried loading the data both through dataframes and through "INSERT"
statements, and both ways produce incorrect results. I am putting one way here -


*-- CREATE CUSTOMER TABLE*

*carbon.sql("CREATE TABLE IF NOT EXISTS customer(cCustKey Int, cName
string, cAddress string, cCity string, cNation string, cRegion string,
cPhone string, cMktSegment string, dummy string) STORED BY 'carbondata'")*

*carbon.sql("LOAD DATA INPATH '/xxxx/yyyy/tmp/ssb_raw/customer' INTO TABLE
customer
OPTIONS('DELIMITER'='\t','FILEHEADER'='cCustKey,cName,cAddress,cCity,cNation,cRegion,cPhone,cMktsegment,dummy')")*



*-- CREATE LINEORDER TABLE*

*carbon.sql("CREATE TABLE IF NOT EXISTS lineorder(loOrderkey
bigint,loLinenumber Int,loCustkey Int,loPartkey Int,loSuppkey
Int,loOrderdate Int,loOrderpriority String,loShippriority Int,loQuantity
Int,loExtendedprice Int,loOrdtotalprice Int,loDiscount Int,loRevenue
Int,loSupplycost Int,loTax Int,loCommitdate Int,loShipmode String,dummy
String) STORED BY 'carbondata'")*

*carbon.sql("LOAD DATA INPATH '/xxxx/yyyy/tmp/ssb_raw/lineorder' INTO TABLE
lineorder
OPTIONS('DELIMITER'='\t','FILEHEADER'='loOrderkey,loLinenumber,loCustkey,loPartkey,loSuppkey,loOrderdate,loOrderpriority,loShippriority,loQuantity,loExtendedprice,loOrdtotalprice,loDiscount,loRevenue,loSupplycost,loTax,loCommitdate,loShipmode,dummy')")*


*Results with different versions - *

*   1.1.0 - *Provides correct results for the above query. Validated against
the results from parquet.

*   1.1.1 - *Built from this
<https://github.com/apache/carbondata/tree/apache-carbondata-1.1.1-rc1>.
The join is missing many rows compared to parquet.

*   1.1.1 - *Built from the source code available for download
<https://dist.apache.org/repos/dist/release/carbondata/1.1.1/apache-carbondata-1.1.1-source-release.zip>.
The join is missing many rows compared to parquet.

*   1.2 - *Built from the master branch. Generates correct results, matching
parquet.


*Debugging further - *

1. Row counts for both the lineorder and customer tables are the same in
carbondata and parquet.

2. The key columns in carbondata and parquet match as well -

         // customer keys: carbondata vs parquet
         val cd = carbon.sql("select cCustKey from customer")
         cd.distinct.count            // 30,000,000
         val sp = spark.sql("select cCustKey from pcustomer")
         sp.distinct.count            // 30,000,000
         cd.intersect(sp).count       // 30,000,000 (carbondata has the same keys as parquet)

         // lineorder keys: carbondata vs parquet
         val cdLo = carbon.sql("select loCustKey from lineorder")
         cdLo.distinct.count          // 13,365,986
         val spLo = spark.sql("select loCustKey from plineorder")
         spLo.distinct.count          // 13,365,986
         cdLo.intersect(spLo).count   // 13,365,986 (carbondata has the same keys as parquet)


The queries above show that the carbondata customer and lineorder tables
have the same key values as their parquet counterparts.
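
One caveat on this check, offered as a sketch rather than a diagnosis: Spark's
`intersect` is a set operation and returns distinct rows, so matching counts
show that the distinct key sets agree, not that every key occurs the same
number of times in both formats. Comparing per-key row counts closes that gap.
`keyCountDiff` is a hypothetical helper name, and `plineorder` is assumed to be
the parquet copy of lineorder, as in the queries above.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not part of carbondata or Spark): returns every
// (key, count) pair that appears in one DataFrame but not in the other.
// An empty result means both inputs hold each key the same number of times.
def keyCountDiff(a: DataFrame, b: DataFrame, key: String): DataFrame = {
  val countsA = a.groupBy(key).count()
  val countsB = b.groupBy(key).count()
  // Symmetric difference of the two (key, count) sets.
  countsA.except(countsB).union(countsB.except(countsA))
}

// Example use in the shell, with the sessions and tables from this thread:
// keyCountDiff(
//   carbon.sql("select loCustKey from lineorder"),
//   spark.sql("select loCustKey from plineorder"),
//   "loCustKey").show()
```

If this comes back empty while the join is still short of rows, the stored
keys themselves are probably fine and the problem is more likely in how the
join or filter reads them.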

However, when we run the join query above, carbondata returns only a very
small subset of the expected rows. If we run a filter query for any specific
key, that also returns no results.
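
To narrow down which keys fail to match, one option is a left anti join, which
keeps only the left-side keys for which the engine finds no partner on the
right. This is a minimal sketch assuming Spark 2.x; `unmatchedKeys` is a
hypothetical helper name, and `pcustomer` is assumed to be the parquet copy of
customer.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper (not part of carbondata or Spark): distinct fact-side
// keys that have no matching dimension-side key according to the join itself.
def unmatchedKeys(fact: DataFrame, dim: DataFrame,
                  factKey: String, dimKey: String): DataFrame = {
  // Rename both key columns so the join condition is unambiguous.
  val f = fact.select(col(factKey).as("fk")).distinct()
  val d = dim.select(col(dimKey).as("dk")).distinct()
  f.join(d, col("fk") === col("dk"), "left_anti")
}

// Example use in the shell, with the sessions and tables from this thread:
// val lo = carbon.sql("select loCustKey from lineorder")
// unmatchedKeys(lo, carbon.sql("select cCustKey from customer"),
//               "loCustKey", "cCustKey").show()   // carbondata on both sides
// unmatchedKeys(lo, spark.sql("select cCustKey from pcustomer"),
//               "loCustKey", "cCustKey").show()   // parquet on the customer side
```

Since every lineorder key is expected to exist in customer, both calls should
return nothing. If only the first one returns keys, that would point at the
carbondata customer table as the side losing rows during the join.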


Not sure why v1.1.1 is producing incorrect results. My guess is that
carbondata is skipping rows that it shouldn't in v1.1.1.

Any help and suggestions are very much appreciated. Thanks in advance.



Thanks

Swapnil Shinde
Re: [POSSIBLE BUG] Carbondata 1.1.1 inaccurate results

ravipesala
Hi,

I tried to reproduce this on 1.1.1 using TPC-H tables with 1 GB of generated
data, but I got the result below. I don't have the exact schema you
mentioned, so I verified with the original TPC-H schema.

0: jdbc:hive2://localhost:10000> select count(c_CustKey),count(o_CustKey)
from customer, orders where o_Custkey = c_CustKey;
+-------------------+-------------------+--+
| count(c_CustKey)  | count(o_CustKey)  |
+-------------------+-------------------+--+
| 1500000           | 1500000           |
+-------------------+-------------------+--+


On parquet with the same data:

0: jdbc:hive2://localhost:10000> select count(c_CustKey),count(o_CustKey)
from customer, orders where o_Custkey = c_CustKey;
+-------------------+-------------------+--+
| count(c_CustKey)  | count(o_CustKey)  |
+-------------------+-------------------+--+
| 1500000           | 1500000           |
+-------------------+-------------------+--+


Regards,
Ravindra.

On 23 August 2017 at 19:40, Swapnil Shinde <[hidden email]> wrote:

> [original message quoted in full; identical to the first post above]


--
Thanks & Regards,
Ravi