[Carbondata-0.2.0-incubating][Issue Report] -- Select statement returns error when adding String column in where clause


[Carbondata-0.2.0-incubating][Issue Report] -- Select statement returns error when adding String column in where clause

lionel061201
Hi Dev team,
As discussed this afternoon, I've switched back to version 0.2.0 for this round of testing. Please ignore my earlier email about "error when save DF to carbondata file"; that one was on the master branch.

Spark version: 1.6.0
System: Mac OS X El Capitan (10.11.6)

[lucao]$ spark-shell --master local[*] --total-executor-cores 2 --executor-memory 1g --num-executors 2 --jars ~/MyDev/hive-1.1.1/lib/mysql-connector-java-5.1.40-bin.jar
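(For reference, the cc used in the commands below is a CarbonContext. A minimal sketch of how it is typically created on Spark 1.6 with carbondata 0.2.0 is shown here; the store path is only a placeholder, not my actual setting.)

    scala> import org.apache.spark.sql.CarbonContext
    scala> // the store path below is a placeholder; point it at your carbon store directory
    scala> val cc = new CarbonContext(sc, "/tmp/carbondata/store")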


In 0.2.0, I can successfully create the table and load data into the carbondata table:

    scala> cc.sql("create table if not exists default.mycarbon_00001(vin String, data_date String, work_model Double) stored by 'carbondata'")

    scala> cc.sql("load data inpath'test2.csv' into table default.mycarbon_00001")

I can successfully run the query below:

   scala> cc.sql("select vin, count(*) from default.mycarbon_00001 group by vin").show

INFO  13-12 17:13:42,215 - Job 5 finished: show at <console>:42, took 0.732793 s

+-----------------+---+
|              vin|_c1|
+-----------------+---+
|LSJW26760ES065247|464|
|LSJW26760GS018559|135|
|LSJW26761ES064611|104|
|LSJW26761FS090787| 45|
|LSJW26762ES051513| 40|
|LSJW26762FS075036|434|
|LSJW26763ES052363| 32|
|LSJW26763FS088491|305|
|LSJW26764ES064859|186|
|LSJW26764FS078696| 40|
|LSJW26765ES058651|171|
|LSJW26765FS072633|191|
|LSJW26765GS056837|467|
|LSJW26766FS070308| 79|
|LSJW26766GS050853|300|
|LSJW26767FS069913|  8|
|LSJW26767GS053454|286|
|LSJW26768FS062811| 16|
|LSJW26768GS051146| 97|
|LSJW26769FS062722|424|
+-----------------+---+
only showing top 20 rows

The error occurs when I add the "vin" column to the where clause:

scala> cc.sql("select vin, count(*) from default.mycarbon_00001 where vin='LSJW26760ES065247' group by vin")

+-----------------+---+
|              vin|_c1|
+-----------------+---+
|LSJW26760ES065247|464|
+-----------------+---+

>>> This one is OK... Actually, in my tests, the first two values in the top 20 rows usually succeed, but most of the others return an error.

For example :

scala> cc.sql("select vin, count(*) from default.mycarbon_00001 where vin='LSJW26765GS056837' group by vin").show

>>> The log is attached:

<carbontest_lucao_20161213.log>


It is the same error I met on Dec. 6th. As I said in the WeChat group before:

       When the data set is 1,000 rows, the error above did not occur.

       When the data set is 1M rows, some queries returned the error and some didn't.

       When the data set is 1.9 billion rows, all tests returned the error.
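
(To see which values fail on a given load, one can loop over the distinct vin values and try the filtered query for each. This is only a rough sketch for counting the failing values, not part of the original test run.)

    scala> import scala.util.Try
    scala> val vins = cc.sql("select distinct vin from default.mycarbon_00001").collect.map(_.getString(0))
    scala> // a value counts as failing if the filtered query throws instead of returning a result
    scala> val failing = vins.filter(v => Try(cc.sql(s"select count(*) from default.mycarbon_00001 where vin = '$v'").collect).isFailure)
    scala> println(s"${failing.length} of ${vins.length} values fail")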


### The sample data set (1M rows) is attached for your reference.

<< I sent this email yesterday afternoon, but it was rejected by the Apache mail server for being larger than 1,000,000 bytes, so I removed the sample data file from the attachment. If you need it, please reply with your personal email address. >>
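
(Since the data file is not attached, anyone who wants to try reproducing locally could generate a synthetic CSV with the same three-column layout: vin String, data_date String, work_model Double. The generator below is only a sketch with made-up value distributions; it is not the original sample data, and the header row may need to match your load options.)

    scala> import java.io.PrintWriter
    scala> import scala.util.Random
    scala> val writer = new PrintWriter("test2.csv")
    scala> writer.println("vin,data_date,work_model")  // header row; adjust to your load options
    scala> val sampleVins = (1 to 2000).map(i => f"LSJW2676${Random.nextInt(10)}%dXS${i}%06d")
    scala> (1 to 1000000).foreach { _ =>
         |   val vin = sampleVins(Random.nextInt(sampleVins.length))
         |   val date = f"2016-${1 + Random.nextInt(12)}%02d-${1 + Random.nextInt(28)}%02d"
         |   writer.println(s"$vin,$date,${Random.nextInt(3)}.0")
         | }
    scala> writer.close()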

Looking forward to your response.


Thanks & Best Regards,

Lionel


Re: [Carbondata-0.2.0-incubating][Issue Report] -- Select statement returns error when adding String column in where clause

lionel061201
Hi,
I just uploaded the data file to Baidu:
Link: https://pan.baidu.com/s/1slERWL3
Password: m7kj

Thanks,
Lionel


Re: [Carbondata-0.2.0-incubating][Issue Report] -- Select statement returns error when adding String column in where clause

lionel061201
Looks like the Apache mail server filtered the log attachment again, so I'm pasting it inline below:

>>>>>>>>>>>>>

INFO  13-12 17:16:39,940 - main Query [SELECT VIN, COUNT(*) FROM
DEFAULT.MYCARBON_00001 WHERE VIN='LSJW26765GS056837' GROUP BY VIN]
INFO  13-12 17:16:39,945 - Parsing command: select vin, count(*) from
default.mycarbon_00001 where vin='LSJW26765GS056837' group by vin
INFO  13-12 17:16:39,946 - Parse Completed
INFO  13-12 17:16:39,948 - Parsing command: select vin, count(*) from
default.mycarbon_00001 where vin='LSJW26765GS056837' group by vin
INFO  13-12 17:16:39,949 - Parse Completed
INFO  13-12 17:16:39,951 - 0: get_table : db=default tbl=mycarbon_00001
INFO  13-12 17:16:39,951 - ugi=lucao ip=unknown-ip-addr cmd=get_table :
db=default tbl=mycarbon_00001
res10: org.apache.spark.sql.DataFrame = [vin: string, _c1: bigint]

scala> res10.show
INFO  13-12 17:16:44,840 - main Starting to optimize plan
INFO  13-12 17:16:44,863 - Cleaned accumulator 20
INFO  13-12 17:16:44,864 - Removed broadcast_14_piece0 on localhost:59141
in memory (size: 10.2 KB, free: 143.2 MB)
INFO  13-12 17:16:44,865 - Cleaned accumulator 32
INFO  13-12 17:16:44,866 - Cleaned shuffle 2
INFO  13-12 17:16:44,866 - Cleaned accumulator 28
INFO  13-12 17:16:44,866 - Cleaned accumulator 27
INFO  13-12 17:16:44,866 - Cleaned accumulator 26
INFO  13-12 17:16:44,866 - Cleaned accumulator 25
INFO  13-12 17:16:44,866 - Cleaned accumulator 24
INFO  13-12 17:16:44,866 - Cleaned accumulator 23
INFO  13-12 17:16:44,866 - Cleaned accumulator 22
INFO  13-12 17:16:44,866 - Cleaned accumulator 21
INFO  13-12 17:16:44,910 - main ************************Total Number Rows
In BTREE: 1
INFO  13-12 17:16:44,911 - main Total Time in retrieving the data reference
nodeafter scanning the btree 0 Total number of data reference node for
executing filter(s) 1
INFO  13-12 17:16:44,912 - main Total Time taken to ensure the required
executors : 1
INFO  13-12 17:16:44,912 - main Time elapsed to allocate the required
executors : 0
INFO  13-12 17:16:44,912 - main No.Of Blocks before Blocklet distribution: 1
INFO  13-12 17:16:44,912 - main No.Of Blocks after Blocklet distribution: 1
INFO  13-12 17:16:45,030 - Identified  no.of.Blocks: 1,parallelism: 8 ,
no.of.nodes: 1, no.of.tasks: 1
INFO  13-12 17:16:45,030 - Node : localhost, No.Of Blocks : 1
INFO  13-12 17:16:45,048 - Starting job: show at <console>:42
INFO  13-12 17:16:45,048 - Registering RDD 44 (show at <console>:42)
INFO  13-12 17:16:45,049 - Got job 9 (show at <console>:42) with 1 output
partitions
INFO  13-12 17:16:45,049 - Final stage: ResultStage 15 (show at
<console>:42)
INFO  13-12 17:16:45,049 - Parents of final stage: List(ShuffleMapStage 14)
INFO  13-12 17:16:45,049 - Missing parents: List(ShuffleMapStage 14)
INFO  13-12 17:16:45,049 - Submitting ShuffleMapStage 14
(MapPartitionsRDD[44] at show at <console>:42), which has no missing parents
INFO  13-12 17:16:45,051 - Block broadcast_15 stored as values in memory
(estimated size 18.3 KB, free 55.3 KB)
INFO  13-12 17:16:45,052 - Block broadcast_15_piece0 stored as bytes in
memory (estimated size 8.8 KB, free 64.1 KB)
INFO  13-12 17:16:45,052 - Added broadcast_15_piece0 in memory on
localhost:59141 (size: 8.8 KB, free: 143.2 MB)
INFO  13-12 17:16:45,052 - Created broadcast 15 from broadcast at
DAGScheduler.scala:1006
INFO  13-12 17:16:45,052 - Submitting 1 missing tasks from ShuffleMapStage
14 (MapPartitionsRDD[44] at show at <console>:42)
INFO  13-12 17:16:45,052 - Adding task set 14.0 with 1 tasks
INFO  13-12 17:16:45,053 - Starting task 0.0 in stage 14.0 (TID 212,
localhost, partition 0,ANY, 4677 bytes)
INFO  13-12 17:16:45,054 - Running task 0.0 in stage 14.0 (TID 212)
INFO  13-12 17:16:45,056 -
*************************/Users/lucao/MyDev/spark-1.6.0-bin-hadoop2.6/conf/carbon.properties
INFO  13-12 17:16:45,056 - [Executor task launch
worker-11][partitionID:00001;queryID:340277307449972_0] Query will be
executed on table: mycarbon_00001
ERROR 13-12 17:16:45,059 - [Executor task launch
worker-11][partitionID:00001;queryID:340277307449972_0]
java.lang.NullPointerException
at
org.apache.carbondata.scan.result.iterator.AbstractDetailQueryResultIterator.intialiseInfos(AbstractDetailQueryResultIterator.java:117)
at
org.apache.carbondata.scan.result.iterator.AbstractDetailQueryResultIterator.<init>(AbstractDetailQueryResultIterator.java:107)
at
org.apache.carbondata.scan.result.iterator.DetailQueryResultIterator.<init>(DetailQueryResultIterator.java:43)
at
org.apache.carbondata.scan.executor.impl.DetailQueryExecutor.execute(DetailQueryExecutor.java:39)
at
org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.<init>(CarbonScanRDD.scala:216)
at
org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
ERROR 13-12 17:16:45,060 - Exception in task 0.0 in stage 14.0 (TID 212)
java.lang.RuntimeException: Exception occurred in query execution.Please
check logs.
at scala.sys.package$.error(package.scala:27)
at
org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.<init>(CarbonScanRDD.scala:226)
at
org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
WARN  13-12 17:16:45,062 - Lost task 0.0 in stage 14.0 (TID 212,
localhost): java.lang.RuntimeException: Exception occurred in query
execution.Please check logs.
at scala.sys.package$.error(package.scala:27)
at
org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.<init>(CarbonScanRDD.scala:226)
at
org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR 13-12 17:16:45,062 - Task 0 in stage 14.0 failed 1 times; aborting job
INFO  13-12 17:16:45,062 - Removed TaskSet 14.0, whose tasks have all
completed, from pool
INFO  13-12 17:16:45,063 - Cancelling stage 14
INFO  13-12 17:16:45,063 - ShuffleMapStage 14 (show at <console>:42) failed
in 0.010 s
INFO  13-12 17:16:45,063 - Job 9 failed: show at <console>:42, took
0.015582 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage
14.0 (TID 212, localhost): java.lang.RuntimeException: Exception occurred
in query execution.Please check logs.
at scala.sys.package$.error(package.scala:27)
at
org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.<init>(CarbonScanRDD.scala:226)
at
org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
at
org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1538)
at
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1538)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2125)
at org.apache.spark.sql.DataFrame.org
$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1537)
at org.apache.spark.sql.DataFrame.org
$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1544)
at
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1414)
at
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1413)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2138)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1413)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1495)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:171)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:394)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:355)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:363)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
at $iwC$$iwC$$iwC.<init>(<console>:59)
at $iwC$$iwC.<init>(<console>:61)
at $iwC.<init>(<console>:63)
at <init>(<console>:65)
at .<init>(<console>:69)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Exception occurred in query
execution.Please check logs.
at scala.sys.package$.error(package.scala:27)
at
org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.<init>(CarbonScanRDD.scala:226)
at
org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
