Query About Carbon Write Process: why are 10 tasks always created when we write a dataframe or RDD in carbon format in a write or save job?

Query About Carbon Write Process: why are 10 tasks always created when we write a dataframe or RDD in carbon format in a write or save job?

Anshul Jain
Hi Dev team ,


I am running a test with CarbonData: loading a 600 GB CSV file and writing it in carbon format to an S3 location. During the write, I see only 10 tasks created in the final stage of the job (I am using 10 nodes), even though num-executors is set to 18, so this is degrading my job's performance. How can I make the number of tasks equal to the number of executors for best performance?
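
For reference, a minimal sketch of the kind of write job being described (paths are placeholders and the exact datasource name varies by CarbonData version; on 1.5+ the path-based "carbon" datasource can be used like any Spark file format):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("carbon-write-test")
  .getOrCreate()

// Read the ~600 GB CSV input (placeholder path)
val df = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/input/")

// Write it out in carbon format to S3 (placeholder path)
df.write
  .format("carbon")
  .mode("overwrite")
  .save("s3a://my-bucket/output/")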


Thanks & Regards,

Anshul Jain

Big Data Engineer

Impetus Infotech (India) Pvt. Ltd.

Tel: +91-0731-4743600/3662



Re: Query About Carbon Write Process: why are 10 tasks always created when we write a dataframe or RDD in carbon format in a write or save job?

Jacky Li
Hi Anshul Jain,

If you have specified the SORT_COLUMNS table property when creating the table,
by default carbon will sort the input data during data loading (to build the
index). The sorting is controlled by a table property called SORT_SCOPE, which
defaults to LOCAL_SORT: the data is sorted locally within each Spark executor,
without shuffling across executors. There are other options too; see
http://carbondata.apache.org/ddl-of-carbondata.html
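
For example, a table created like the following will be locally sorted during load (a hypothetical table, assuming an existing SparkSession named spark; syntax per the DDL page above):

// SORT_COLUMNS triggers index building during load; SORT_SCOPE defaults
// to LOCAL_SORT, shown here explicitly
spark.sql("""
  CREATE TABLE sales (
    id INT,
    country STRING,
    amount DOUBLE
  )
  STORED AS carbondata
  TBLPROPERTIES (
    'SORT_COLUMNS' = 'country, id',
    'SORT_SCOPE'   = 'LOCAL_SORT'
  )
""")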

In your case, I guess it is using LOCAL_SORT. This sorting is multi-threaded
inside the executor, controlled by a CarbonProperty called
"NUM_THREAD_WHILE_LOADING".

If you want Spark's default behavior, like when loading parquet, you can set
SORT_SCOPE to NO_SORT.
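
For example (a hypothetical table, for illustration only):

// NO_SORT skips the sort step entirely, so the write parallelism behaves
// like a plain parquet load
spark.sql("""
  CREATE TABLE test_table (id INT, name STRING, amount DOUBLE)
  STORED AS carbondata
  TBLPROPERTIES ('SORT_SCOPE' = 'NO_SORT')
""")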

Regards,
Jacky



Re: Query About Carbon Write Process: why are 10 tasks always created when we write a dataframe or RDD in carbon format in a write or save job?

Jacky Li
One correction to my last reply: the property that controls the number of
threads used for sorting during data load is
"carbon.number.of.cores.while.loading"

You can set it like this (note that the value is passed as a String):

import org.apache.carbondata.core.util.CarbonProperties

CarbonProperties.getInstance()
  .addProperty("carbon.number.of.cores.while.loading", "8")


Regards,
Jacky


