[improvement] Support unsafe in-memory sort in carbondata

ravipesala
Hi All,

Data loading performance in CarbonData is currently not encouraging,
because the data must be sorted at the executor level during the load.
CarbonData collects a batch of rows, sorts it in memory, and dumps it to a
temporary file; once all batches are written, it merge-sorts those
temporary files to finish sorting. This causes two major issues: disk IO
and GC pressure. Even though the batches are dumped to files, CarbonData
still faces a lot of GC pressure, since each batch is sorted in memory
before it is written to a temporary file.
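
To make the current flow concrete, here is a minimal sketch of the
batch-sort-then-merge approach described above. It is not CarbonData's
actual code: the class name ExternalSortSketch, the use of plain long keys
instead of real rows, and the batchSize parameter are all assumptions made
for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; rows are reduced to a single long key.
class ExternalSortSketch {

    // Sort each batch on the heap and dump it to its own temporary file.
    static List<Path> writeSortedRuns(long[] input, int batchSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        for (int start = 0; start < input.length; start += batchSize) {
            int end = Math.min(start + batchSize, input.length);
            long[] batch = Arrays.copyOfRange(input, start, end);
            Arrays.sort(batch);
            Path run = Files.createTempFile("sort-run-", ".bin");
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(run)))) {
                for (long value : batch) {
                    out.writeLong(value);
                }
            }
            runs.add(run);
        }
        return runs;
    }

    // Read the next key of a run into the heap; do nothing once the run is exhausted.
    private static void pushNext(DataInputStream in, int runIndex,
                                 PriorityQueue<long[]> heap) throws IOException {
        try {
            heap.add(new long[] {in.readLong(), runIndex});
        } catch (EOFException endOfRun) {
            // This temporary file has been fully consumed.
        }
    }

    // K-way merge of the sorted temporary files; heap entries are {key, runIndex}.
    static long[] mergeRuns(List<Path> runs, int total) throws IOException {
        PriorityQueue<long[]> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        List<DataInputStream> readers = new ArrayList<>();
        try {
            for (int i = 0; i < runs.size(); i++) {
                DataInputStream in = new DataInputStream(
                        new BufferedInputStream(Files.newInputStream(runs.get(i))));
                readers.add(in);
                pushNext(in, i, heap);
            }
            long[] merged = new long[total];
            int n = 0;
            while (!heap.isEmpty()) {
                long[] head = heap.poll();
                merged[n++] = head[0];
                int runIndex = (int) head[1];
                pushNext(readers.get(runIndex), runIndex, heap);
            }
            return merged;
        } finally {
            for (DataInputStream in : readers) {
                in.close();
            }
            for (Path run : runs) {
                Files.deleteIfExists(run);
            }
        }
    }

    static long[] sort(long[] input, int batchSize) {
        try {
            return mergeRuns(writeSortedRuns(input, batchSize), input.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Every batch is copied and sorted on the Java heap and every run goes
through the disk, which is where the GC and IO costs mentioned above come
from.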

To solve the above problems, we can introduce Unsafe Storage and Unsafe
Sort.
Unsafe Storage: The user can configure a memory limit on the amount of
data kept in memory. All the data is kept in a contiguous memory region,
either off-heap or on-heap, using Unsafe. Once the configured limit is
exceeded, the remaining data is spilled to disk.
Unsafe Sort: The data stored in memory using Unsafe can be sorted in place
using Unsafe sort.
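
As a rough sketch of the idea, the example below keeps fixed-width rows in
one contiguous off-heap block with a configured limit and sorts an array of
row offsets instead of row objects. Note the assumptions: it swaps in
java.nio.ByteBuffer.allocateDirect for sun.misc.Unsafe so that it runs on
any JDK without reflection, and the class name OffHeapSortSketch, the
16-byte row layout (8-byte key, 8-byte value), and the method names are
hypothetical, not an existing CarbonData or Spark API.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; a direct ByteBuffer stands in for Unsafe.
class OffHeapSortSketch {
    private static final int ROW_BYTES = 16; // 8-byte key + 8-byte value

    private final ByteBuffer memory; // one contiguous off-heap block
    private int rowCount = 0;

    OffHeapSortSketch(int memoryLimitBytes) {
        this.memory = ByteBuffer.allocateDirect(memoryLimitBytes);
    }

    // Returns false once the configured limit is reached; a real
    // implementation would spill the remaining data to disk at that point.
    boolean add(long key, long value) {
        long offset = (long) rowCount * ROW_BYTES;
        if (offset + ROW_BYTES > memory.capacity()) {
            return false;
        }
        memory.putLong((int) offset, key);
        memory.putLong((int) offset + 8, value);
        rowCount++;
        return true;
    }

    long keyAt(int offset) {
        return memory.getLong(offset);
    }

    long valueAt(int offset) {
        return memory.getLong(offset + 8);
    }

    // Sorts row offsets by key. Only a primitive int[] lives on the heap,
    // so the garbage collector never sees per-row objects.
    int[] sortedOffsets() {
        int[] offsets = new int[rowCount];
        for (int i = 0; i < rowCount; i++) {
            offsets[i] = i * ROW_BYTES;
        }
        quickSort(offsets, 0, rowCount - 1);
        return offsets;
    }

    // Quicksort over offsets, comparing keys read straight from the block.
    private void quickSort(int[] a, int lo, int hi) {
        while (lo < hi) {
            long pivot = keyAt(a[lo + (hi - lo) / 2]);
            int i = lo;
            int j = hi;
            while (i <= j) {
                while (keyAt(a[i]) < pivot) i++;
                while (keyAt(a[j]) > pivot) j--;
                if (i <= j) {
                    int tmp = a[i];
                    a[i] = a[j];
                    a[j] = tmp;
                    i++;
                    j--;
                }
            }
            // Recurse into the smaller side, loop on the larger one.
            if (j - lo < hi - i) {
                quickSort(a, lo, j);
                lo = i;
            } else {
                quickSort(a, i, hi);
                hi = j;
            }
        }
    }
}
```

A caller would keep adding rows until add returns false, then either sort
and write out the block as one run or spill, depending on the final design.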

We can take inspiration from Spark to do Unsafe implementations effectively.

--
Thanks & Regards,
Ravindra

Re: [improvement] Support unsafe in-memory sort in carbondata

Venkata Gollamudi
This proposal looks good; it should improve performance and reduce GC
issues during data load. Please create an issue in Jira. We can put the
unsafe functions in the common module (just like Spark) so that they can
be used across modules/components. We can also check whether any of
Spark's unsafe code can be reused.

(In reply to Ravindra Pesala's message of Sun, Nov 27, 2016 at 11:40 PM.)