insert carbondata table failed




This job inserts records from a source Hive table (kc22_p1) into a target CarbonData table (kc22_ca).

kc22_p1: 102,200,946 records, 51.5 GB


spark-shell --master yarn-client --driver-memory 20G --executor-cores 1 --num-executors 12  --executor-memory 5G


val cc = new CarbonContext(sc, "hdfs://cluster1/opt/CarbonStore")


cc.sql("create table if not exists kc22_ca (akb020 String,akc190 String,aae072 String,akc220 String,ake005 String,bka135 String,bkc301 String,ake001 String,ake002 String,ake006 String,akc221 String,ake010 String,aka065 String,ake003 String,aka063 String,akc225 double,akc226 double,aae019 double,akc228 double,ake051 double,aka068 double,akc268 double,bkc228 double,bka635 double,aka069 double,bka107 double,bka108 double,bkc127 String,aka064 String,aae100 String,bkc126 String,bkc125 String,bka231 String,bae073 double,bka636 double,bka637 double,bka104 double,bka609 String,aka070 String,aka067 String,aka074 String,bkc378 String,bkc379 String,bkc380 String,bkc381 String,aae011 String,aae036 String,bkc319 double,bkf050 String,akc273 String,aka071 double,aka072 String,aka107 String,bka076 String,akf002 String,bkc241 double,bkc242 String,bkc243 String,bka205 String,bkb401 String,bka650 double,bka651 String,aka130 String,aka120 String,bae075 double,aae017 String,aae032 String,bkc060 double,bkc061 double,bkc062 double,bkc063 double,bkc064 double,bkc065 double,bkc066 String,bkc067 String,bkc068 String,bkc069 String,baz001 double,baz002 double,bze011 String,bze036 String,aaa027 String,aab034 String,aac001 double,bkb070 String,bkb071 String,bkc077 String,bkc078 String,bkc079 String,bkc081 double,bka610 String,bka971 double,bka972 double,bka973 String,bka974 String) STORED BY 'carbondata' TBLPROPERTIES('DICTIONARY_INCLUD'='akb020,  aae072, bka135, akc220, ake005, bkc301','DICTIONARY_EXCLUDE'='akc190,ake001,ake002,ake006,akc221,ake010,aka065,ake003,aka063,bkc127,aka064,aae100,bkc126,bkc125,bka231,bka609,aka070,aka067,aka074,bkc378,bkc379,bkc380,bkc381,aae011,aae036,bkf050,akc273,aka072,aka107,bka076,akf002,bkc242,bkc243,bka205,bkb401,bka651,aka130,aka120,aae017,aae032,bkc066,bkc067,bkc068,bkc069,bze011,bze036,aaa027,aab034,bkb070,bkb071,bkc077,bkc078,bkc079,bka610,bka973,bka974')")


Note: The amount of shuffle differs between using only DICTIONARY_INCLUDE and using DICTIONARY_INCLUDE together with DICTIONARY_EXCLUDE.

Log excerpt for reference:



17/09/19 09:29:51 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 1039) in 8523 ms on node2 (3/7)

17/09/19 09:30:13 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1036) in 30754 ms on node2 (4/7)

17/09/19 09:30:18 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 1037) in 35309 ms on node1 (5/7)

17/09/19 09:33:49 WARN HeartbeatReceiver: Removing executor 5 with no recent heartbeats: 135938 ms exceeds timeout 120000 ms

17/09/19 09:33:49 ERROR YarnScheduler: Lost executor 5 on node1: Executor heartbeat timed out after 135938 ms

17/09/19 09:33:49 WARN TaskSetManager: Lost task 6.0 in stage 1.0 (TID 1041, node1): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 135938 ms

17/09/19 09:33:49 INFO TaskSetManager: Starting task 6.1 in stage 1.0 (TID 1042, node3, partition 6,PROCESS_LOCAL, 1894 bytes)

17/09/19 09:33:49 INFO DAGScheduler: Executor lost: 5 (epoch 1)

17/09/19 09:33:49 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 5

17/09/19 09:33:49 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.

17/09/19 09:33:49 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, node1, 58006)

17/09/19 09:33:49 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor

17/09/19 09:33:49 INFO ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 5 (917/1035, false)

17/09/19 09:33:49 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on node3:57113 (size: 3.7 KB, free: 4.1 GB)

17/09/19 09:33:49 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to node3:33757

17/09/19 09:33:49 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 5830 bytes

17/09/19 09:33:49 WARN TaskSetManager: Lost task 6.1 in stage 1.0 (TID 1042, node3): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=6, message=

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:542)

        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:538)

        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

        at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:538)

        at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:155)


        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)

        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)

        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)

        at org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD$$anon$1.<init>(CarbonGlobalDictionaryRDD.scala:372)

        at org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD.compute(CarbonGlobalDictionaryRDD.scala:345)

        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)

        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)

        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)


        at org.apache.spark.executor.Executor$

        at java.util.concurrent.ThreadPoolExecutor.runWorker(

        at java.util.concurrent.ThreadPoolExecutor$




17/09/19 09:33:49 INFO DAGScheduler: Marking ResultStage 1 (collect at GlobalDictionaryUtil.scala:746) as failed due to a fetch failure from ShuffleMapStage 0 (RDD at CarbonGlobalDictionaryRDD.scala:271)

17/09/19 09:33:49 INFO DAGScheduler: ResultStage 1 (collect at GlobalDictionaryUtil.scala:746) failed in 247.083 s

17/09/19 09:33:49 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 (RDD at CarbonGlobalDictionaryRDD.scala:271) and ResultStage 1 (collect at GlobalDictionaryUtil.scala:746) due to fetch failure

17/09/19 09:33:50 INFO DAGScheduler: Resubmitting failed stages

17/09/19 09:33:50 INFO DAGScheduler: Submitting ShuffleMapStage 0 (CarbonBlockDistinctValuesCombineRDD[11] at RDD at CarbonGlobalDictionaryRDD.scala:271), which has no missing parents

17/09/19 09:33:50 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 15.0 KB, free 1291.9 KB)

17/09/19 09:33:50 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 7.1 KB, free 1299.0 KB)

17/09/19 09:33:50 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on (size: 7.1 KB, free: 14.2 GB)

17/09/19 09:33:50 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006

17/09/19 09:33:50 INFO DAGScheduler: Submitting 118 missing tasks from ShuffleMapStage 0 (CarbonBlockDistinctValuesCombineRDD[11] at RDD at CarbonGlobalDictionaryRDD.scala:271)










Confidentiality Notice: The information contained in this e-mail and any accompanying attachment(s)
is intended only for the use of the intended recipient and may be confidential and/or privileged of
Neusoft Corporation, its subsidiaries and/or its affiliates. If any reader of this communication is
not the intended recipient, unauthorized use, forwarding, printing,  storing, disclosure or copying
is strictly prohibited, and may be unlawful.If you have received this communication in error,please
immediately notify the sender by return e-mail, and delete the original message and all copies from
your system. Thank you.


Re: insert carbondata table failed


I can't tell much from the logs, but the error appears to be related to a
memory issue in Spark. From your earlier emails I understand that you are
using a 3-node cluster. Do all 3 nodes run a NodeManager and a DataNode?
It would be better to use fewer executors and give each one more memory,
as below. During data loading it is recommended to use one executor per

spark-shell --master yarn-client --driver-memory 10G --executor-cores 4
--num-executors 3  --executor-memory 25G
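
As a rough comparison of the two launch commands in this thread, the sketch below totals the task slots and executor heap for each layout. The `summarize` helper is hypothetical, and YARN memory overhead is ignored, so the figures are approximate.

```shell
#!/bin/sh
# Hypothetical helper: summarize an executor layout.
# Arguments: number of executors, cores per executor, heap per executor in GB.
# YARN memory overhead is deliberately left out of this rough comparison.
summarize() {
  num_executors=$1
  cores_per_executor=$2
  mem_gb_per_executor=$3
  echo "slots=$((num_executors * cores_per_executor)) heap_total_gb=$((num_executors * mem_gb_per_executor))"
}

summarize 12 1 5    # layout from the original post: slots=12 heap_total_gb=60
summarize 3 4 25    # layout suggested in this reply: slots=12 heap_total_gb=75
```

Both layouts run 12 tasks at a time, but the suggested one gives each executor JVM a 25G heap instead of 5G.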

Also, if any configuration still gives an error, please provide the executor

Thank you,


Re: insert carbondata table failed


There are 4 nodes in total, 3 of which are DataNodes; the SecondaryNameNode (SNN) runs on one of the DataNodes.

CarbonData 1.1.0
Spark 1.6.0
Hadoop 2.7.2

Thank you for your help. I'm trying again.
Liu feng

From: ravipesala [mailto:[hidden email]]
Sent: September 19, 2017 11:23
To: [hidden email]
Subject: Re: insert carbondata table failed



Re: insert carbondata table failed

In reply to this post by 刘feng

Re: insert carbondata table failed

Thank you.
  I tried to resolve this issue by changing the Spark configuration and
using two fields as DICTIONARY_INCLUDE.
  The test data (30 GB) was loaded 8 times, and each load took about
1.5 minutes to complete.

 I am now testing with a larger data set and hope it will succeed. Thank you
very much for the help!
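
The thread does not record which Spark settings were changed. As a sketch only, the log above shows an executor exceeding the default 120000 ms heartbeat timeout, so one commonly adjusted pair is `spark.network.timeout` and `spark.executor.heartbeatInterval`. The values below are assumptions for illustration, not the settings actually used here; the remaining flags are copied from the launch command suggested earlier in the thread.

```shell
#!/bin/sh
# Assumed values: the thread does not say which settings were changed.
NETWORK_TIMEOUT="600s"      # the log shows the default 120000 ms being exceeded
HEARTBEAT_INTERVAL="60s"    # should stay well below spark.network.timeout

# Executor layout taken from the reply earlier in the thread,
# plus the two timeout settings above passed via --conf.
SPARK_SHELL_ARGS="--master yarn-client --driver-memory 10G --executor-cores 4 --num-executors 3 --executor-memory 25G --conf spark.network.timeout=${NETWORK_TIMEOUT} --conf spark.executor.heartbeatInterval=${HEARTBEAT_INTERVAL}"

# Print the command for review instead of launching it directly.
echo "spark-shell ${SPARK_SHELL_ARGS}"
```

Raising the timeout only hides long pauses; if executors are stalling in garbage collection, the memory layout is the more likely place to look.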
Liu feng

From: manishgupta88 [mailto:[hidden email]]
Sent: September 19, 2017 13:27
To: [hidden email]
Subject: Re: insert carbondata table failed

Hi Feng,

You can also refer to the links below, in which Spark users have tried to
resolve this issue by changing the configuration. This might help

Manish Gupta
