[DISCUSSION] Improve insert process
Posted by cenyuhai11 on Oct 27, 2017; 3:13am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-Improve-insert-process-tp24932.html
When I insert data into CarbonData from an existing table, I have to do the
following:
1. select count(1) from table1
and then
2. insert into table table1 select * from table1
Why do I have to execute "select count(1) from table1" first?
Because the number of tasks is computed by CarbonData based on how many
executor hosts are alive at that moment, so the count is only there to force
Spark to allocate executors before the insert starts.
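As a minimal sketch of the workaround (assuming a Spark session with Hive
support and dynamic executor allocation enabled):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("carbon-insert-workaround") // illustrative app name
      .enableHiveSupport()
      .getOrCreate()

    // Step 1: a throwaway action whose only purpose is to make Spark
    // allocate executors, so CarbonData sees all hosts when it plans tasks
    spark.sql("select count(1) from table1").collect()

    // Step 2: the actual insert
    spark.sql("insert into table table1 select * from table1")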
I don't think this is the right way. We should let Spark control the number
of tasks.
Setting the parameter "mapred.max.split.size" is a common way to adjust the
number of tasks.
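For example (a rough sketch, reusing the spark session from above; the 128 MB
value is only illustrative, and which property name applies depends on the
Hadoop version):

    // Cap the input split size so the number of tasks is derived from
    // the data volume instead of how many executors happen to be alive.
    val splitSize = (128L * 1024 * 1024).toString
    spark.sparkContext.hadoopConfiguration
      .set("mapred.max.split.size", splitSize) // old Hadoop name
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.input.fileinputformat.split.maxsize", splitSize) // new name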
Even when I do step 2 this way, some tasks still fail, which increases the
insert time.
So I suggest that we do not adjust the number of tasks at all and just use the
default behavior of Spark. Then, if small files are produced, add a fast merge
job afterwards (merging data at the blocklet level), as sketched below.
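CarbonData already exposes compaction DDL that merges segments, so the fast
merge could be triggered the same way (a sketch only; whether a blocklet-level
merge should reuse the existing COMPACT syntax is an open question):

    // Existing segment-level compaction; a blocklet-level fast merge
    // could be offered through a similar statement.
    spark.sql("alter table table1 compact 'minor'")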
Since Spark's defaults would then produce more, smaller loading tasks, we
would also need to change the default value of
"carbon.number.of.cores.while.loading" to 1.