[DISCUSSION] Improve insert process
Posted by cenyuhai11 on Oct 27, 2017; 3:13am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-Improve-insert-process-tp24932.html
When I insert data into CarbonData from an existing table, I have to do the
following:
1. select count(1) from table1
and then
2. insert into table table1 select * from table1
Why do I have to execute "select count(1) from table1" first?
Because the number of tasks is computed by CarbonData based on how many
executor hosts are alive at that moment, so the count is only there to force
Spark to allocate executors before the insert starts.
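As a minimal sketch of the workaround (assuming a Spark session with Hive
support and dynamic executor allocation enabled):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("carbon-insert-workaround") // illustrative app name
      .enableHiveSupport()
      .getOrCreate()

    // Step 1: a throwaway action whose only purpose is to make Spark
    // allocate executors, so CarbonData sees all hosts when it plans tasks
    spark.sql("select count(1) from table1").collect()

    // Step 2: the actual insert
    spark.sql("insert into table table1 select * from table1")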
I don't think this is the right way. We should let Spark control the number
of tasks.
Setting the parameter "mapred.max.split.size" is a common way to adjust the
number of tasks.
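For example (a rough sketch, reusing the spark session from above; the 128 MB
value is only illustrative, and which property name applies depends on the
Hadoop version):

    // Cap the input split size so the number of tasks is derived from
    // the data volume instead of how many executors happen to be alive.
    val splitSize = (128L * 1024 * 1024).toString
    spark.sparkContext.hadoopConfiguration
      .set("mapred.max.split.size", splitSize) // old Hadoop name
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.input.fileinputformat.split.maxsize", splitSize) // new name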
Even when I do step 2 this way, some tasks still fail, which increases the
insert time.
So I suggest that we do not adjust the number of tasks at all and just use the
default behavior of Spark. Then, if small files are produced, add a fast merge
job afterwards (merging data at the blocklet level), as sketched below.
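CarbonData already exposes compaction DDL that merges segments, so the fast
merge could be triggered the same way (a sketch only; whether a blocklet-level
merge should reuse the existing COMPACT syntax is an open question):

    // Existing segment-level compaction; a blocklet-level fast merge
    // could be offered through a similar statement.
    spark.sql("alter table table1 compact 'minor'")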
Since Spark's defaults would then produce more, smaller loading tasks, we
would also need to change the default value of
"carbon.number.of.cores.while.loading" to 1.