Hi Dev team,
As discussed this afternoon, I've changed back to the 0.2.0 version for the testing. Please ignore the former email about "error when save DF to carbondata file"; that was on the master branch.

Spark version: 1.6.0
System: Mac OS X El Capitan (10.11.6)

[lucao]$ spark-shell --master local[*] --total-executor-cores 2 --executor-memory 1g --num-executors 2 --jars ~/MyDev/hive-1.1.1/lib/mysql-connector-java-5.1.40-bin.jar

In 0.2.0, I can successfully create the table and load data into the carbondata table:

scala> cc.sql("create table if not exists default.mycarbon_00001(vin String, data_date String, work_model Double) stored by 'carbondata'")

scala> cc.sql("load data inpath 'test2.csv' into table default.mycarbon_00001")

I can successfully run the query below:

scala> cc.sql("select vin, count(*) from default.mycarbon_00001 group by vin").show

INFO 13-12 17:13:42,215 - Job 5 finished: show at <console>:42, took 0.732793 s
+-----------------+---+
|              vin|_c1|
+-----------------+---+
|LSJW26760ES065247|464|
|LSJW26760GS018559|135|
|LSJW26761ES064611|104|
|LSJW26761FS090787| 45|
|LSJW26762ES051513| 40|
|LSJW26762FS075036|434|
|LSJW26763ES052363| 32|
|LSJW26763FS088491|305|
|LSJW26764ES064859|186|
|LSJW26764FS078696| 40|
|LSJW26765ES058651|171|
|LSJW26765FS072633|191|
|LSJW26765GS056837|467|
|LSJW26766FS070308| 79|
|LSJW26766GS050853|300|
|LSJW26767FS069913|  8|
|LSJW26767GS053454|286|
|LSJW26768FS062811| 16|
|LSJW26768GS051146| 97|
|LSJW26769FS062722|424|
+-----------------+---+
only showing top 20 rows

The error occurs when I add the "vin" column to the where clause:

scala> cc.sql("select vin, count(*) from default.mycarbon_00001 where vin='LSJW26760ES065247' group by vin")
+-----------------+---+
|              vin|_c1|
+-----------------+---+
|LSJW26760ES065247|464|
+-----------------+---+

>>> This one is OK... Actually, as I tested, the first two values in the top 20 rows usually succeed, but most of the others return an error.

For example:

scala> cc.sql("select vin, count(*) from default.mycarbon_00001 where vin='LSJW26765GS056837' group by vin").show

>>> Log is coming: <carbontest_lucao_20161213.log>

It is the same error I met on Dec. 6th. As I said in the WeChat group before:

When the data set is 1,000 rows, the error above did not occur.
When the data set is 1M rows, some queries returned the error and some didn't.
When the data set is 1.9 billion rows, all tests returned the error.

### Attached the sample data set (1M rows) for your reference.

<<........I sent this email yesterday afternoon, but it was rejected by the Apache mail server because it was larger than 1,000,000 bytes, so I removed the sample data file from the attachment. If you need it, please reply with your personal email address........>>

Looking forward to your response.

Thanks & Best Regards,

Lionel
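P.S. In case it helps show the pattern of which vin values trigger the error, below is a rough sketch of the loop I can use on my side to check every distinct vin (it only reuses the same cc CarbonContext and table as above; nothing else is assumed). It runs the filtered group-by once per value and prints the ones whose query throws:

// Rough sketch: run the filtered group-by once per distinct vin and report failures.
// Uses the same `cc` CarbonContext and default.mycarbon_00001 table created above.
val vins = cc.sql("select distinct vin from default.mycarbon_00001")
  .collect()
  .map(_.getString(0))

val failed = vins.filter { v =>
  scala.util.Try {
    cc.sql(s"select vin, count(*) from default.mycarbon_00001 where vin='$v' group by vin").collect()
  }.isFailure
}

failed.foreach(v => println("FAILED: " + v))
println(failed.length + " of " + vins.length + " vin values failed")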
Hi,
I just uploaded the data file to Baidu:
Link: https://pan.baidu.com/s/1slERWL3
Password: m7kj

Thanks,
Lionel
Looks like the Apache mail server filtered out the log attachment again, so I'm pasting the log inline below.
>>>>>>>>>>>>>
INFO 13-12 17:16:39,940 - main Query [SELECT VIN, COUNT(*) FROM DEFAULT.MYCARBON_00001 WHERE VIN='LSJW26765GS056837' GROUP BY VIN]
INFO 13-12 17:16:39,945 - Parsing command: select vin, count(*) from default.mycarbon_00001 where vin='LSJW26765GS056837' group by vin
INFO 13-12 17:16:39,946 - Parse Completed
INFO 13-12 17:16:39,948 - Parsing command: select vin, count(*) from default.mycarbon_00001 where vin='LSJW26765GS056837' group by vin
INFO 13-12 17:16:39,949 - Parse Completed
INFO 13-12 17:16:39,951 - 0: get_table : db=default tbl=mycarbon_00001
INFO 13-12 17:16:39,951 - ugi=lucao ip=unknown-ip-addr cmd=get_table : db=default tbl=mycarbon_00001
res10: org.apache.spark.sql.DataFrame = [vin: string, _c1: bigint]

scala> res10.show
INFO 13-12 17:16:44,840 - main Starting to optimize plan
INFO 13-12 17:16:44,863 - Cleaned accumulator 20
INFO 13-12 17:16:44,864 - Removed broadcast_14_piece0 on localhost:59141 in memory (size: 10.2 KB, free: 143.2 MB)
INFO 13-12 17:16:44,865 - Cleaned accumulator 32
INFO 13-12 17:16:44,866 - Cleaned shuffle 2
INFO 13-12 17:16:44,866 - Cleaned accumulator 28
INFO 13-12 17:16:44,866 - Cleaned accumulator 27
INFO 13-12 17:16:44,866 - Cleaned accumulator 26
INFO 13-12 17:16:44,866 - Cleaned accumulator 25
INFO 13-12 17:16:44,866 - Cleaned accumulator 24
INFO 13-12 17:16:44,866 - Cleaned accumulator 23
INFO 13-12 17:16:44,866 - Cleaned accumulator 22
INFO 13-12 17:16:44,866 - Cleaned accumulator 21
INFO 13-12 17:16:44,910 - main ************************Total Number Rows In BTREE: 1
INFO 13-12 17:16:44,911 - main Total Time in retrieving the data reference nodeafter scanning the btree 0 Total number of data reference node for executing filter(s) 1
INFO 13-12 17:16:44,912 - main Total Time taken to ensure the required executors : 1
INFO 13-12 17:16:44,912 - main Time elapsed to allocate the required executors : 0
INFO 13-12 17:16:44,912 - main No.Of Blocks before Blocklet distribution: 1
INFO 13-12 17:16:44,912 - main No.Of Blocks after Blocklet distribution: 1
INFO 13-12 17:16:45,030 - Identified no.of.Blocks: 1,parallelism: 8 , no.of.nodes: 1, no.of.tasks: 1
INFO 13-12 17:16:45,030 - Node : localhost, No.Of Blocks : 1
INFO 13-12 17:16:45,048 - Starting job: show at <console>:42
INFO 13-12 17:16:45,048 - Registering RDD 44 (show at <console>:42)
INFO 13-12 17:16:45,049 - Got job 9 (show at <console>:42) with 1 output partitions
INFO 13-12 17:16:45,049 - Final stage: ResultStage 15 (show at <console>:42)
INFO 13-12 17:16:45,049 - Parents of final stage: List(ShuffleMapStage 14)
INFO 13-12 17:16:45,049 - Missing parents: List(ShuffleMapStage 14)
INFO 13-12 17:16:45,049 - Submitting ShuffleMapStage 14 (MapPartitionsRDD[44] at show at <console>:42), which has no missing parents
INFO 13-12 17:16:45,051 - Block broadcast_15 stored as values in memory (estimated size 18.3 KB, free 55.3 KB)
INFO 13-12 17:16:45,052 - Block broadcast_15_piece0 stored as bytes in memory (estimated size 8.8 KB, free 64.1 KB)
INFO 13-12 17:16:45,052 - Added broadcast_15_piece0 in memory on localhost:59141 (size: 8.8 KB, free: 143.2 MB)
INFO 13-12 17:16:45,052 - Created broadcast 15 from broadcast at DAGScheduler.scala:1006
INFO 13-12 17:16:45,052 - Submitting 1 missing tasks from ShuffleMapStage 14 (MapPartitionsRDD[44] at show at <console>:42)
INFO 13-12 17:16:45,052 - Adding task set 14.0 with 1 tasks
INFO 13-12 17:16:45,053 - Starting task 0.0 in stage 14.0 (TID 212, localhost, partition 0,ANY, 4677 bytes)
INFO 13-12 17:16:45,054 - Running task 0.0 in stage 14.0 (TID 212)
INFO 13-12 17:16:45,056 - *************************/Users/lucao/MyDev/spark-1.6.0-bin-hadoop2.6/conf/carbon.properties
INFO 13-12 17:16:45,056 - [Executor task launch worker-11][partitionID:00001;queryID:340277307449972_0] Query will be executed on table: mycarbon_00001
ERROR 13-12 17:16:45,059 - [Executor task launch worker-11][partitionID:00001;queryID:340277307449972_0]
java.lang.NullPointerException
  at org.apache.carbondata.scan.result.iterator.AbstractDetailQueryResultIterator.intialiseInfos(AbstractDetailQueryResultIterator.java:117)
  at org.apache.carbondata.scan.result.iterator.AbstractDetailQueryResultIterator.<init>(AbstractDetailQueryResultIterator.java:107)
  at org.apache.carbondata.scan.result.iterator.DetailQueryResultIterator.<init>(DetailQueryResultIterator.java:43)
  at org.apache.carbondata.scan.executor.impl.DetailQueryExecutor.execute(DetailQueryExecutor.java:39)
  at org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.<init>(CarbonScanRDD.scala:216)
  at org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:89)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
ERROR 13-12 17:16:45,060 - Exception in task 0.0 in stage 14.0 (TID 212)
java.lang.RuntimeException: Exception occurred in query execution.Please check logs.
  at scala.sys.package$.error(package.scala:27)
  at org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.<init>(CarbonScanRDD.scala:226)
  at org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:89)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
WARN 13-12 17:16:45,062 - Lost task 0.0 in stage 14.0 (TID 212, localhost): java.lang.RuntimeException: Exception occurred in query execution.Please check logs.
  at scala.sys.package$.error(package.scala:27)
  at org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.<init>(CarbonScanRDD.scala:226)
  at org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:89)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
ERROR 13-12 17:16:45,062 - Task 0 in stage 14.0 failed 1 times; aborting job
INFO 13-12 17:16:45,062 - Removed TaskSet 14.0, whose tasks have all completed, from pool
INFO 13-12 17:16:45,063 - Cancelling stage 14
INFO 13-12 17:16:45,063 - ShuffleMapStage 14 (show at <console>:42) failed in 0.010 s
INFO 13-12 17:16:45,063 - Job 9 failed: show at <console>:42, took 0.015582 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 212, localhost): java.lang.RuntimeException: Exception occurred in query execution.Please check logs.
  at scala.sys.package$.error(package.scala:27)
  at org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.<init>(CarbonScanRDD.scala:226)
  at org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:89)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
  at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
  at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
  at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1538)
  at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1538)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
  at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2125)
  at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1537)
  at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1544)
  at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1414)
  at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1413)
  at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2138)
  at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1413)
  at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1495)
  at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:171)
  at org.apache.spark.sql.DataFrame.show(DataFrame.scala:394)
  at org.apache.spark.sql.DataFrame.show(DataFrame.scala:355)
  at org.apache.spark.sql.DataFrame.show(DataFrame.scala:363)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
  at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
  at $iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
  at $iwC$$iwC$$iwC.<init>(<console>:59)
  at $iwC$$iwC.<init>(<console>:61)
  at $iwC.<init>(<console>:63)
  at <init>(<console>:65)
  at .<init>(<console>:69)
  at .<clinit>(<console>)
  at .<init>(<console>:7)
  at .<clinit>(<console>)
  at $print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
  at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
  at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
  at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
  at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
  at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
  at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
  at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
  at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
  at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
  at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
  at org.apache.spark.repl.Main$.main(Main.scala:31)
  at org.apache.spark.repl.Main.main(Main.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Exception occurred in query execution.Please check logs.
  at scala.sys.package$.error(package.scala:27)
  at org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.<init>(CarbonScanRDD.scala:226)
  at org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(CarbonScanRDD.scala:192)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:89)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
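P.S. As a sanity check on the data itself, outside CarbonData, the raw CSV count for the failing vin can be compared with the 467 reported by the group-by in my previous email. A minimal sketch (assuming test2.csv is comma-separated with vin as the first column, matching the table definition; adjust the path/delimiter if that's not the case):

// Sanity check outside CarbonData: count rows for the failing vin straight from the CSV.
// Assumes test2.csv is comma-separated with vin as the first column.
val target = "LSJW26765GS056837"
val rawCount = sc.textFile("test2.csv")
  .map(_.split(",", -1))
  .filter(cols => cols.nonEmpty && cols(0) == target)
  .count()
println(s"Raw CSV rows for $target = $rawCount")   // expect 467 if it matches the group-by result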