Posted by VenuReddy on Sep 17, 2020; 3:10pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Query-Regarding-Task-launch-mechanism-for-data-load-operations-tp98711p100555.html
Hi Vishal,
Thank you for the response.
Configuring the load option `load_min_size_inmb` helped to control the number of tasks
launched in the CSV load case, and it eventually reduced the output
carbondata files from each executor when configured along with the `carbon.number.of.cores.while.loading` dynamic property.
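In case it helps, this is roughly what I ran (the table name, path and values are only placeholders; I'm assuming the usual SET syntax for the dynamic property and the load option in the LOAD DATA OPTIONS clause):

    // set the dynamic property for the session (value is only an example)
    spark.sql("SET carbon.number.of.cores.while.loading=4")

    // CSV load with the load_min_size_inmb option (256 MB is only an example)
    spark.sql(
      """LOAD DATA INPATH 'hdfs://namenode/path/to/csv_dir'
        |INTO TABLE target_table
        |OPTIONS('load_min_size_inmb'='256')""".stripMargin)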
But in the insert-into-table-select-from flow (`loadDataFrame()`), the problem
is not resolved, since the task launching approach there is completely
different (not the same as in `loadDataFile()`). Do you have suggestions on any
parameter to fine-tune in the insert flow?
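Just to be precise about which flow I mean, it is a plain insert-select like the one below (table names are placeholders):

    // goes through loadDataFrame() rather than the CSV load path (loadDataFile())
    spark.sql("INSERT INTO TABLE target_table SELECT * FROM source_table")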
1. Is there any way to launch more than 1 task per node?
2. Is there any way to control the number of output carbondata files for the
target table when there are too many small carbondata files to read/select
from the source table? Otherwise it generates as many output files as input files.
   -> I tried the carbon property
`carbon.task.distribution`=`merge_small_files`. It could reduce the number of
files generated for the target table. The scan RDD with
CARBON_TASK_DISTRIBUTION_MERGE_FILES uses a mechanism similar to the global
partition load (it considers filesMaxPartitionBytes, filesOpenCostInBytes and
defaultParallelism for the split size).
   But this property is not dynamically configurable, probably for some
reason? I'm not sure whether it is a good option to use in this
scenario.
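Since it is not in the dynamically configurable list, I had to set it statically, roughly like this (just a sketch; it could equally go into carbon.properties before starting the session):

    import org.apache.carbondata.core.util.CarbonProperties

    // carbon.task.distribution cannot be changed with SET, so set it
    // through CarbonProperties (or carbon.properties) before the query runs
    CarbonProperties.getInstance()
      .addProperty("carbon.task.distribution", "merge_small_files")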
Any suggestions would be very helpful.
regards,
Venu