
Re: [DISCUSS] Data loading improvement

Posted by ravipesala on May 22, 2017; 9:23am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSS-Data-loading-improvement-tp11429p13063.html

Hi,

I think you are referring to Tungsten sort, where the pointer and a key
prefix are kept together so that comparisons are cache-aware. This is only
possible if the sort key always starts with fixed-length keys, such as
dictionary keys. So the first few dictionary columns can be stored alongside
the pointer and used for sorting; only if those are equal do we go and
retrieve the remaining key and compare it.
It should be simple to implement in our current design, as our current
unsafe sort implementation is also inspired by Tungsten sort.
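
To make the idea concrete, here is a minimal Java sketch. It is hypothetical
and is not CarbonData or Spark code. It assumes each row is an int[] of
dictionary keys, that all rows have the same number of sort columns (at
least three), that each of the first three keys fits in 16 bits, and that
the row id fits in 16 bits (the 2-byte RowId from the example below). The
names PrefixSortSketch, pack, prefixOf and sortedRowIds are made up for
illustration, and the insertion sort is used only for brevity; a real
implementation would use an in-place O(n log n) sort.

```java
/** Hypothetical sketch of prefix + pointer sorting; not CarbonData or Spark code. */
final class PrefixSortSketch {

    /** Packs a 48-bit key prefix (high bits) and a 16-bit row id (low bits) into one long. */
    static long pack(long prefix48, int rowId) {
        return (prefix48 << 16) | (rowId & 0xFFFFL);
    }

    /** Extracts the row id (the "pointer") from a packed entry. */
    static int rowId(long entry) {
        return (int) (entry & 0xFFFFL);
    }

    /** Builds the prefix from the first three dictionary keys (assumed to fit in 16 bits each). */
    static long prefixOf(int[] keys) {
        long prefix = 0;
        for (int i = 0; i < 3; i++) {
            prefix = (prefix << 16) | (keys[i] & 0xFFFFL);
        }
        return prefix;
    }

    /** Compares the in-entry prefixes first; touches the full rows only on a tie. */
    static int compare(long a, long b, int[][] rows) {
        int c = Long.compare(a >>> 16, b >>> 16);
        if (c != 0) {
            return c; // decided without leaving the packed array
        }
        int[] ra = rows[rowId(a)];
        int[] rb = rows[rowId(b)];
        for (int i = 3; i < ra.length; i++) { // remaining sort columns
            c = Integer.compare(ra[i], rb[i]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    /** In-place insertion sort over the packed entries (for brevity only). */
    static void sort(long[] entries, int[][] rows) {
        for (int i = 1; i < entries.length; i++) {
            long current = entries[i];
            int j = i - 1;
            while (j >= 0 && compare(entries[j], current, rows) > 0) {
                entries[j + 1] = entries[j];
                j--;
            }
            entries[j + 1] = current;
        }
    }

    /** Returns the row ids in sorted key order; the rows themselves are never moved. */
    static int[] sortedRowIds(int[][] rows) {
        long[] entries = new long[rows.length];
        for (int r = 0; r < rows.length; r++) {
            entries[r] = pack(prefixOf(rows[r]), r);
        }
        sort(entries, rows);
        int[] ids = new int[rows.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = rowId(entries[i]);
        }
        return ids;
    }
}
```

For example, with rows {2,1,1,5}, {1,9,9,9}, {2,1,1,3}, {1,9,9,1} and
{0,7,7,7}, sortedRowIds returns [4, 3, 1, 2, 0]: rows 3 and 1 share the
prefix (1,9,9) and are ordered by the fourth column, and likewise rows 2
and 0. The prefix sits in the high bits so that most comparisons are
settled inside the packed long[] without dereferencing the row.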

Regards,
Ravindra.

On 22 May 2017 at 09:31, Jacky Li <[hidden email]> wrote:

> For sorting, I think there are more optimizations we can do. I am
> currently thinking of these:
> 1. Do not sort the whole TablePage; only the KeyPage is required as the
> sort key.
>
> 2. We should find a more memory-efficient sorting approach than one that
> relies on System.arraycopy, which requires double the space.
>
> 3. We should try to hold the KeyPage together with the RowId in a compact
> data structure; ideally it fits in the CPU cache. The L3 cache of a modern
> CPU is often larger than 8 MB. For this, I am thinking of an 8-byte encoded
> format that includes the RowId and SORT_COLUMNS (partial or full), for
> example 2 bytes for the RowId and the remaining 6 bytes for 2 to 3 columns
> after dictionary encoding.
>      a) If we can hold the RowId + whole SORT_COLUMNS in 8 bytes, it will
> be most efficient: sorting can leverage the CPU cache and update in place,
> so no extra storage is needed, and the RowId + whole SORT_COLUMNS will be
> sorted.
>      b) If we can only hold the RowId + partial SORT_COLUMNS in 8 bytes,
> we can employ a strategy like the sorting in the Spark Tungsten project
> (first compare the 8 bytes in cache; if they are equal, then compare the
> remaining bytes in memory).
>
> Regards,
> Jacky
>
> > On 22 May 2017, at 10:19 AM, David CaiQiang <[hidden email]> wrote:
> >
> > As far as I know, System.arraycopy on an object array is a shallow copy,
> > so I think both KeyPage and TablePage may have the same performance with
> > Arrays.sort.
> >
> >
> > -----
> > Best Regards
> > David Cai
> > --
> > View this message in context:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-Data-loading-improvement-tp11429p13056.html
>
>


--
Thanks & Regards,
Ravi