[lucao]$ spark-shell --master local[*] --total-executor-cores 2 --executor-memory 1g --num-executors 2 --jars ~/MyDev/hive-1.1.1/lib/mysql-connector-java-5.1.40-bin.jar
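(For reference, `cc` below is a CarbonContext created in the shell. A minimal sketch of the setup I used, where the store path is only a placeholder for my actual location:)

scala> import org.apache.spark.sql.CarbonContext
scala> val cc = new CarbonContext(sc, "/home/lucao/carbon.store")  // placeholder store path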
In 0.2.0, I can successfully create a table and load data into a CarbonData table:
scala> cc.sql("create table if not exists default.mycarbon_00001(vin String, data_date String, work_model Double) stored by 'carbondata'")
scala> cc.sql("load data inpath'test2.csv' into table default.mycarbon_00001")
I can successfully run the query below:
scala> cc.sql("select vin, count(*) from default.mycarbon_00001 group by vin").show
INFO 13-12 17:13:42,215 - Job 5 finished: show at <console>:42, took 0.732793 s
+-----------------+---+
| vin|_c1|
+-----------------+---+
|LSJW26760ES065247|464|
|LSJW26760GS018559|135|
|LSJW26761ES064611|104|
|LSJW26761FS090787| 45|
|LSJW26762ES051513| 40|
|LSJW26762FS075036|434|
|LSJW26763ES052363| 32|
|LSJW26763FS088491|305|
|LSJW26764ES064859|186|
|LSJW26764FS078696| 40|
|LSJW26765ES058651|171|
|LSJW26765FS072633|191|
|LSJW26765GS056837|467|
|LSJW26766FS070308| 79|
|LSJW26766GS050853|300|
|LSJW26767FS069913| 8|
|LSJW26767GS053454|286|
|LSJW26768FS062811| 16|
|LSJW26768GS051146| 97|
|LSJW26769FS062722|424|
+-----------------+---+
only showing top 20 rows
The error occurs when I add the "vin" column to the WHERE clause:
scala> cc.sql("select vin, count(*) from default.mycarbon_00001 where vin='LSJW26760ES065247' group by vin")
+-----------------+---+
| vin|_c1|
+-----------------+---+
|LSJW26760ES065247|464|
+-----------------+---+
>>> This one is OK... Actually, as I tested, queries on the first two values in the top 20 rows usually succeed, but most of the others return an error. For example:
scala> cc.sql("select vin, count(*) from default.mycarbon_00001 where vin='LSJW26765GS056837' group by vin").show
>>> The log is attached:
<carbontest_lucao_20161213.log>
It is the same error I encountered on Dec. 6th. As I said in the WeChat group before:
When the data set is 1,000 rows, the error above did not occur.
When the data set is 1M rows, some queries returned the error and some did not.
When the data set is 1.9 billion rows, all tests returned the error.
A small probe loop to reproduce this per-value behavior is sketched below.
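To reproduce the per-value behavior, the sketch below probes each distinct vin with the same filtered query and reports which values fail (it assumes the `cc` context from above and has not been tested at the 1.9-billion-row scale):

scala> :paste
// Probe every distinct vin with the failing filter query and
// collect the values that throw, printing each error message.
val vins = cc.sql("select distinct vin from default.mycarbon_00001")
  .collect().map(_.getString(0))
val failed = vins.filter { v =>
  try {
    cc.sql(s"select vin, count(*) from default.mycarbon_00001 " +
           s"where vin = '$v' group by vin").collect()
    false
  } catch {
    case e: Exception =>
      println(s"vin=$v failed: ${e.getMessage}")
      true
  }
}
println(s"${failed.length} of ${vins.length} vin values failed")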
### Attached is the sample data set (1M rows) for your reference.
<<........ I sent this email yesterday afternoon, but it was rejected by the Apache mail server for exceeding 1,000,000 bytes, so I have removed the sample data file from the attachments. If you need it, please reply with your personal email address. ........>>
Looking forward to your response.
Thanks & Best Regards,
Lionel