Posted by
Jacky Li on
May 22, 2017; 4:01am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSS-Data-loading-improvement-tp11429p13058.html
For sorting, I think more optimization we can do, I am currently thinking these:
1. Do not sort the whole TablePage, only KeyPage is required as the sort key
2. Should find a more memory efficient sorting algorithm than System.arraycopy which requires doubling space.
3. Should try to hold the KeyPage as well as the RowId in a compact data structure, it is best if it fits in CPU cache. Modern L3 CPU is larger than 8MB. For this, I am thinking to have a 8 bytes encoded format that includes RowID and SORT_COLUMNS (partial or full), for example, 2 bytes for RowId, remaining 6 bytes for 2 to 3 columns after dictionary encoding.
a) If we can hold the RowID + whole SORT_COLUMNS in 8 bytes, it will be most efficient to leverage CPU cache to do sorting, use in-place update approach while sorting. So no extra storage is needed, and the RowID + whole SORT_COLUMNS will be sorted.
b) If we can only hold the RowID + partial SORT_COLUMNS in 8 bytes, we can employ strategy like the sorting in Spark Tungsten project. (first compare the 8 bytes in cache, if it equals then compare remaining bytes in memory)
Regards,
Jacky