
Re: [DISCUSSION] Improve insert process

Posted by cenyuhai11 on Oct 30, 2017; 10:01am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-Improve-insert-process-tp24932p25209.html

This still does not resolve my problem. The number of tasks is not stable; sometimes there are only 10 tasks... Can 10 tasks do the work? Maybe they can, but it will take a long time...




Best regards!
Yuhai Cen


On Oct 30, 2017, 13:21, Jacky Li <[hidden email]> wrote:




On Oct 30, 2017, 9:47 AM, 岑玉海 (Yuhai Cen) <[hidden email]> wrote:


Why does the user need to set "carbon.number.of.cores.while.loading"? Because the loading process is too slow. The key point is that loading is slow!


Yes, I agree. What I am suggesting in the other mail is that Carbon should take care of it when the user does not set “carbon.number.of.cores.while.loading”:
1. Carbon should take the minimum of the “carbon.number.of.cores.while.loading” property and “spark.executor.cores” in the Spark conf, instead of taking “carbon.number.of.cores.while.loading” directly (see the sketch below).
2. The default value of “carbon.number.of.cores.while.loading” should be large, so that Carbon has a higher chance of taking “spark.executor.cores” as the loading cores. The current default value is 2, which I think is too small.
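For illustration, here is a minimal Scala sketch of that resolution rule. The helper name is made up, and reading the carbon property from the Spark session conf is only an assumption for the sketch, not how CarbonData actually resolves it:

    import org.apache.spark.sql.SparkSession

    // Hypothetical helper, not CarbonData's actual code: choose the loading cores as
    // min(carbon.number.of.cores.while.loading, spark.executor.cores).
    def resolveLoadingCores(spark: SparkSession): Int = {
      val carbonCores = spark.conf
        .getOption("carbon.number.of.cores.while.loading")  // assumed readable from the session conf
        .map(_.toInt)
        .getOrElse(32)                                       // a deliberately large default, as proposed above
      val executorCores = spark.conf.get("spark.executor.cores", "1").toInt
      math.min(carbonCores, executorCores)
    }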


I think this change can solve your problem.


I also have the problem that I need to increase the value of "spark.dynamicAllocation.maxExecutors" when the data becomes larger...
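For example, the cap can be raised when the SparkSession is built; the conf keys below are standard Spark settings, and the values are only placeholders to be tuned per cluster:

    import org.apache.spark.sql.SparkSession

    // Placeholder values; dynamic allocation on YARN typically also needs the
    // external shuffle service enabled.
    val spark = SparkSession.builder()
      .appName("carbon-load")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")
      .config("spark.dynamicAllocation.maxExecutors", "64")
      .getOrCreate()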


Best regards!
Yuhai Cen


On Oct 28, 2017, 16:11, Jacky Li <[hidden email]> wrote:

I thought about this issue and found that there is actually a usability problem Carbon can improve.

The problem is that currently the user is forced to set the “carbon.number.of.cores.while.loading” carbon property before loading; this creates overhead for the user and usability is not good.
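In practice that means something like the following before every load (a sketch assuming the CarbonProperties utility from carbondata-core; the value is a placeholder that has to be tuned per cluster):

    import org.apache.carbondata.core.util.CarbonProperties

    // The manual, per-cluster tuning step that this proposal wants to make unnecessary.
    CarbonProperties.getInstance()
      .addProperty("carbon.number.of.cores.while.loading", "8")  // placeholder value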

To solve this, the following changes should be made:
1. Carbon should take the minimum of the “carbon.number.of.cores.while.loading” property and “spark.executor.cores” in the Spark conf, instead of taking “carbon.number.of.cores.while.loading” directly.
2. The default value of “carbon.number.of.cores.while.loading” should be large, so that Carbon has a higher chance of taking “spark.executor.cores” as the loading cores. The current default value is 2, which I think is too small.

Regards,
Jacky


> On Oct 28, 2017, 9:51 AM, Jacky Li <[hidden email]> wrote:
>  
> Hi,
>  
> I am not getting the intention behind this proposal. Is it because of the loading failure? If yes, we should find out why the loading failed.
> If not, then what is the intention?
>  
> Actually I think the “carbon.number.of.cores.while.loading” property should be marked as obsolete:
> GLOBAL_SORT and NO_SORT should use Spark's default behavior
> LOCAL_SORT and BATCH_SORT should use “sparkSession.sparkContext.defaultParallelism” as the number of cores for local sorting
>  
>  
> Regards,
> Jacky  
>  
>> On Oct 27, 2017, 8:43 AM, cenyuhai11 <[hidden email]> wrote:
>>  
>> When I insert data into CarbonData from one table, I have to do the
>> following:
>> 1. select count(1) from table1
>> and then
>> 2. insert into table table1 select * from table1
>>  
>> Why do I have to execute "select count(1) from table1" first?
>> Because the number of tasks is computed by CarbonData, and it is related to how
>> many executor hosts we have at that moment!
>>  
>> I don't think this is the right way. We should let Spark control the number
>> of tasks.
>> Setting the parameter "mapred.max.split.size" is a common way to adjust the
>> number of tasks.
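For illustration, a minimal sketch of steering the split size from a Spark job; the property names are the usual Hadoop ones (old and new MapReduce API), and 134217728 (128 MB) is only a placeholder:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("adjust-split-size").getOrCreate()
    // old MapReduce API name, as mentioned above
    spark.sparkContext.hadoopConfiguration.set("mapred.max.split.size", "134217728")
    // newer MapReduce API name for the same limit
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.input.fileinputformat.split.maxsize", "134217728")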
>>  
>> Even when I do step 2, some tasks still fail, which increases the
>> insert time.
>>  
>> So I suggest that we do not adjust the number of tasks and just use the default
>> behavior of Spark.
>> Then, if there are small files, add a fast merge job (merge data at the
>> blocklet level, just as …)
>>  
>> So we also need to set the default value of
>> "carbon.number.of.cores.while.loading" to 1.
>>
>> --
>> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ 
>