Apache CarbonData Dev Mailing List archive

Re: insert into carbon table failed

Posted by ravipesala on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/insert-into-carbon-table-failed-tp9609p9615.html

Hi,

Carbodata launches one job per each node to sort the data at node level and
avoid shuffling. Internally it uses threads to use parallel load. Please
use carbon.number.of.cores.while.loading property in carbon.properties file
and set the number of cores it should use per machine while loading.
Carbondata sorts the data at each node level to maintain the Btree for
each node per segment. It improves the query performance by filtering
faster if we have Btree at node level instead of each block level.

1.Which version of Carbondata are you using?
2.There are memory issues in Carbondata-1.0 version and are fixed current
master.
3.And you can improve the performance by enabling enable.unsafe.sort=true in
carbon.properties file. But it is not supported if bucketing of columns are
enabled. We are planning to support unsafe sort load for bucketing also in
next version.

Please send the executor log to know about the error you are facing.

Regards,
Ravindra

On 25 March 2017 at 16:18, [hidden email] <[hidden email]> wrote:

> Hello!
>
> *0、The failure*
> When i insert into carbon table，i encounter failure。The failure is as
> follow:
> Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most
> recent failure: Lost task 0.3 in stage 2.0 (TID 1007, hd26):
> ExecutorLostFailure (executor 1 exited caused by one of the running tasks)
> Reason: Slave lost+details
>
> Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 1007, hd26): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Slave lost
> Driver stacktrace:
>
> the stage:
>
> *Step:*
> *1、start spark－shell*
> ./bin/spark-shell \
> --master yarn-client \
> --num-executors 5 \ (I tried to set this parameter range from 10 to
> 20,but the second job has only 5 tasks)
> --executor-cores 5 \
> --executor-memory 20G \
> --driver-memory 8G \
> --queue root.default \
> --jars /xxx.jar
>
> //spark-default.conf spark.default.parallelism=320
>
> import org.apache.spark.sql.CarbonContext
> val cc = new CarbonContext(sc, "hdfs://xxxx/carbonData/CarbonStore")
>
> *2、create table*
> cc.sql("CREATE TABLE IF NOT EXISTS xxxx_table (dt String,pt String,lst
> String,plat String,sty String,is_pay String,is_vip String,is_mpack
> String,scene String,status String,nw String,isc String,area String,spttag
> String,province String,isp String,city String,tv String,hwm String,pip
> String,fo String,sh String,mid String,user_id String,play_pv Int,spt_cnt
> Int,prg_spt_cnt Int) row format delimited fields terminated by '|' STORED
> BY 'carbondata' TBLPROPERTIES ('DICTIONARY_EXCLUDE'='pip,sh,
> mid,fo,user_id','DICTIONARY_INCLUDE'='dt,pt,lst,plat,sty,
> is_pay,is_vip,is_mpack,scene,status,nw,isc,area,spttag,
> province,isp,city,tv,hwm','NO_INVERTED_INDEX'='lst,plat,hwm,
> pip,sh,mid','BUCKETNUMBER'='10','BUCKETCOLUMNS'='fo')")
>
> //notes，set "fo" column BUCKETCOLUMNS is to join another table
> //the column distinct values are as follows:
>
>
> *3、insert into table*（xxxx_table_tmp is a hive extenal orc table，has 20
> 0000 0000 records）
> cc.sql("insert into xxxx_table select dt,pt,lst,plat,sty,is_pay,is_
> vip,is_mpack,scene,status,nw,isc,area,spttag,province,isp,
> city,tv,hwm,pip,fo,sh,mid,user_id ,play_pv,spt_cnt,prg_spt_cnt from
> xxxx_table_tmp where dt='2017-01-01'")
>
> *4、spark split sql into two jobs，the first finished succeeded, but the
> second failed:*
>
>
> *5、The second job stage:*
>
>
>
> *Question:*
> 1、Why the second job has only five jobs,but the first job has 994 jobs ?(
> note:My hadoop cluster has 5 datanode）
> I guess it caused the failure
> 2、In the sources,i find DataLoadPartitionCoalescer.class，is it means that
> "one datanode has only one partition ,and then the task is only one on the
> datanode"?
> 3、In the ExampleUtils class,"carbon.table.split.partition.enable" is set
> as follow,but i can not find "carbon.table.split.partition.enable" in
> other parts of the project。
> I set "carbon.table.split.partition.enable" to true, but the second
> job has only five jobs.How to use this property?
> ExampleUtils :
> // whether use table split partition
> // true -> use table split partition, support multiple partition
> loading
> // false -> use node split partition, support data load by host
> partition
> CarbonProperties.getInstance().addProperty("carbon.table.split.partition.enable",
> "false")
> 4、Insert into carbon table takes 3 hours ,but eventually failed 。How can
> i speed it.
> 5、in the spark-shell ,I tried to set this parameter range from 10 to
> 20,but the second job has only 5 tasks
> the other parameter executor-memory = 20G is enough?
>
> I need your help!Thank you very much!
>
> [hidden email]
>
> ------------------------------
> [hidden email]
>

--
Thanks & Regards,
Ravi