Apache CarbonData Dev Mailing List archive

Re: [DISCUSS] Data loading improvement

Posted by Jacky Li on May 22, 2017; 4:01am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSS-Data-loading-improvement-tp11429p13058.html

For sorting, I think more optimization we can do, I am currently thinking these:
1. Do not sort the whole TablePage, only KeyPage is required as the sort key

2. Should find a more memory efficient sorting algorithm than System.arraycopy which requires doubling space.

3. Should try to hold the KeyPage as well as the RowId in a compact data structure, it is best if it fits in CPU cache. Modern L3 CPU is larger than 8MB. For this, I am thinking to have a 8 bytes encoded format that includes RowID and SORT_COLUMNS (partial or full), for example, 2 bytes for RowId, remaining 6 bytes for 2 to 3 columns after dictionary encoding.
a) If we can hold the RowID + whole SORT_COLUMNS in 8 bytes, it will be most efficient to leverage CPU cache to do sorting, use in-place update approach while sorting. So no extra storage is needed, and the RowID + whole SORT_COLUMNS will be sorted.
b) If we can only hold the RowID + partial SORT_COLUMNS in 8 bytes, we can employ strategy like the sorting in Spark Tungsten project. (first compare the 8 bytes in cache, if it equals then compare remaining bytes in memory)

Regards,
Jacky

> 在 2017年5月22日，上午10:19，David CaiQiang <[hidden email]> 写道：
>
> As I known, System.arrayCopy of object array is a shallow copy, so I think
> both KeyPage and TablePage maybe have the same performance on Arrays.sort.
>
>
> -----
> Best Regards
> David Cai
> --
> View this message in context: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-Data-loading-improvement-tp11429p13056.html
> Sent from the Apache CarbonData Dev Mailing List archive mailing list archive at Nabble.com.