Apache CarbonData Dev Mailing List archive - Re: [Discussion]Query Regarding Task launch mechanism for data load operations

Posted by kumarvishal09 on Aug 17, 2020; 2:25pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Query-Regarding-Task-launch-mechanism-for-data-load-operations-tp98711p98798.html

Hi Venu,
As Ramana mentioned, most of the cases were optimized for local sort.
Yes, we can use a global-sort-like solution for the NO_SORT case, so that
the number of tasks is based on the number of executors and cores, and we
can run compaction after the load to handle small files.
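
To make the task-count idea concrete, here is a minimal sketch. It is not
CarbonData code: the class, method, and parameter names are hypothetical, and
the size-based cap (at most one task per target-size chunk of input) is my
assumption about how small files could be limited before compaction runs.

```java
// Hypothetical helper for illustration only; not part of CarbonData.
final class LoadTaskCountSketch {
    private LoadTaskCountSketch() {}

    /** One task per available core across all executors. */
    static int tasksFromCluster(int executors, int coresPerExecutor) {
        if (executors <= 0 || coresPerExecutor <= 0) {
            throw new IllegalArgumentException("executors and cores must be positive");
        }
        return Math.multiplyExact(executors, coresPerExecutor);
    }

    /**
     * Assumed refinement: never launch more tasks than there are
     * target-size chunks of input, so that small loads produce few files.
     */
    static int tasksForLoad(long inputBytes, long targetBytesPerTask,
                            int executors, int coresPerExecutor) {
        if (inputBytes < 0 || targetBytesPerTask <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        int clusterSlots = tasksFromCluster(executors, coresPerExecutor);
        long bySize = Math.max(1L, inputBytes / targetBytesPerTask);
        return (int) Math.min(clusterSlots, bySize);
    }
}
```

With 4 executors of 8 cores each, a large load would get 32 tasks, while a
load smaller than the target size would get a single task.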

I recall that we have a Block Assignment Strategy for loading based on
minimum size. Could you please check it? This feature may need
stabilization; I hope it works as expected.
*org.apache.carbondata.processing.util.CarbonLoaderUtil.BlockAssignmentStrategy*

/**
 * The node loads the smallest amount of data
 */
@CarbonProperty
public static final String CARBON_LOAD_MIN_SIZE_INMB = "load_min_size_inmb";

/**
 * the node minimum load data default value
 */
public static final String CARBON_LOAD_MIN_SIZE_INMB_DEFAULT = "0";
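
As a rough illustration of what a minimum-size-per-node strategy could look
like, here is a self-contained sketch. It is not the logic in
CarbonLoaderUtil: the class and method names are hypothetical, and treating a
minimum of 0 as "no minimum" is my assumption based on the default value
above. It uses fewer nodes when the input is small, so that the average load
per node is at least the minimum, and then assigns each block to the least
loaded of those nodes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for illustration only; not CarbonData's implementation.
final class MinSizeAssignmentSketch {
    private MinSizeAssignmentSketch() {}

    /** Returns node name -> sizes of the blocks assigned to that node. */
    static Map<String, List<Long>> assign(List<Long> blockSizes,
                                          List<String> nodes,
                                          long minBytesPerNode) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        long total = 0;
        for (long size : blockSizes) {
            total += size;
        }
        // Assumption: 0 (or negative) disables the minimum and uses every node.
        int nodesToUse = nodes.size();
        if (minBytesPerNode > 0) {
            long byMin = Math.max(1L, total / minBytesPerNode);
            nodesToUse = (int) Math.min(nodes.size(), byMin);
        }
        Map<String, List<Long>> result = new LinkedHashMap<>();
        for (int i = 0; i < nodesToUse; i++) {
            result.put(nodes.get(i), new ArrayList<>());
        }
        long[] load = new long[nodesToUse];
        for (long size : blockSizes) {
            // Greedy choice: the currently least-loaded node gets the block.
            int target = 0;
            for (int i = 1; i < nodesToUse; i++) {
                if (load[i] < load[target]) {
                    target = i;
                }
            }
            load[target] += size;
            result.get(nodes.get(target)).add(size);
        }
        return result;
    }
}
```

The greedy step balances load but does not strictly guarantee the minimum on
every node when block sizes are very uneven; a real strategy would also have
to consider data locality.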

-Regards

Kumar Vishal

On Mon, Aug 17, 2020 at 9:39 PM Venkata Gollamudi <[hidden email]>
wrote:

> Hi Varun,
>
> Yes, previously most cases were tuned for LOCAL_SORT, where merging
> happens automatically. But the data loading flow can certainly be improved
> to decide this based on data size rather than a fixed configuration.
> However, the old behaviour might still be required if the user has to
> control the maximum number of partitions when the data size is too big.
> This configuration was introduced because data loading cores are not
> transparent to Spark, mainly in the case of LOCAL_SORT.
>
> The same applies to the insert-into scenario; as you said, coalescing
> will reduce the load performance.
>
> Regards,
> Ramana
>
> On Fri, Aug 14, 2020 at 3:25 PM David CaiQiang <[hidden email]>
> wrote:
>
> > This mechanism will work fine for LOCAL_SORT loading of big data on a
> > small cluster with big executors.
> >
> > If the load doesn't match these conditions, it is better to consider a
> > new solution that adapts to the generic scenario.
> >
> > I suggest re-factoring NO_SORT, maybe we can check and improve the
> > global_sort solution.
> >
> > The solution should support both NO_SORT and GLOBAL_SORT, and
> > automatically determine the number of partitions to avoid the small
> > file issue.
> >
> >
> >
> >
> > -----
> > Best Regards
> > David Cai
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>
