Re: Query About Carbon Write Process: why are 10 tasks always created when we write a DataFrame or RDD in Carbon format in a write or save job
Posted by Jacky Li on May 26, 2019; 4:17am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Query-About-Carbon-Write-Process-why-always-10-Task-get-created-when-we-write-dataframe-or-rdd-in-cab-tp79200p79444.html
Hi Anshul Jain,
If you have specified the SORT_COLUMNS table property when creating the table,
Carbon will by default sort the input data during data loading (to build the
index). The sorting is controlled by a table property called SORT_SCOPE; the
default is LOCAL_SORT, which means the data is sorted locally within each
Spark executor, without shuffling across executors. There are other options
too, see
http://carbondata.apache.org/ddl-of-carbondata.html
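
For illustration, a minimal sketch of such a table definition. The table and
column names are made up, it assumes a SparkSession with CarbonData support
available as `spark`, and the exact DDL clause (STORED AS carbondata here)
can differ between CarbonData versions:

  // Hypothetical table: SORT_COLUMNS chooses the index columns,
  // SORT_SCOPE chooses how they are sorted during load.
  spark.sql("""
    CREATE TABLE sales (id INT, city STRING, amount DOUBLE)
    STORED AS carbondata
    TBLPROPERTIES (
      'SORT_COLUMNS' = 'city, id',
      'SORT_SCOPE'   = 'LOCAL_SORT'
    )
  """)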
In your case, I guess it is using LOCAL_SORT. This sort is multi-threaded
inside the executor, controlled by a CarbonProperty called
"NUM_THREAD_WHILE_LOADING".
If you want the default Spark behavior, as when loading Parquet, you can set
SORT_SCOPE to NO_SORT.
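
For example, a sketch of a table that skips the sort step at load time
entirely (again with made-up names):

  // No sorting during load, so the write behaves like a plain
  // Parquet write in terms of task layout.
  spark.sql("""
    CREATE TABLE sales_nosort (id INT, city STRING, amount DOUBLE)
    STORED AS carbondata
    TBLPROPERTIES ('SORT_SCOPE' = 'NO_SORT')
  """)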
Regards,
Jacky