Apache CarbonData Dev Mailing List archive

[Discussion] About carbon.si.segment.merge feature

Posted by Ajantha Bhat on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-About-carbon-si-segment-merge-feature-tp103161.html

Hi,

When the carbon property *carbon.si.segment.merge = true*:

*a) local_sort SI segment loading (default) [All the SI columns are
involved]*

The SI load uses the default local_sort. Data is loaded twice. The first
load queries the main table and creates the SI segment; the number of tasks
launched equals the number of carbon files present in the main table
segment, so this step currently creates many small files.
The merge operation then queries the newly created SI segment and loads the
data with local_sort again; only a few tasks are launched (one task per
node), so fewer files are created.
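To make the difference in task counts concrete, here is a minimal sketch in Python. It is only a model, not CarbonData code: the file names, the node names, and the file-to-node mapping are hypothetical, and it assumes each task writes at least one SI file.

```python
from collections import defaultdict


def tasks_per_file(files):
    """Model of the first load: one task per carbon file in the segment."""
    return [[f] for f in files]


def tasks_per_node(files, file_to_node):
    """Model of 'one node one task': group files hosted on the same node."""
    grouped = defaultdict(list)
    for f in files:
        grouped[file_to_node[f]].append(f)
    return [grouped[node] for node in sorted(grouped)]


# Hypothetical main table segment: 8 carbon files spread over 2 nodes.
files = [f"part-{i}.carbondata" for i in range(8)]
file_to_node = {f: f"node-{i % 2}" for i, f in enumerate(files)}

per_file = tasks_per_file(files)
per_node = tasks_per_node(files, file_to_node)
print(len(per_file), "tasks with one task per file")
print(len(per_node), "tasks with one task per node")
```

In this model the first load launches 8 tasks while the per-node grouping launches 2, which is why the merge pass ends up with fewer files.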

*>> So, we can optimize the first SI segment creation itself to use the
one-node-one-task logic, avoid creating small files, and remove the call to
the merge operation. With this, we can remove the carbon.si.segment.merge
property itself.*

*b) global_sort SI segment loading [All the SI columns are involved]*

The SI load uses global sort. Data is loaded twice. The first load queries
the main table and creates the SI segment; the number of tasks launched
(global_sort_partitions) equals the number of carbon files present in the
main table segment, so this step currently creates many small files.
The merge operation then queries the newly created SI segment and loads the
data with local sort again [there is no global sort logic at present]; only
a few tasks are launched (one task per node), but this disorders the
globally sorted data!
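The disordering can be illustrated with a small Python model. This is a sketch under stated assumptions, not the actual merge code: the round-robin assignment of SI files to nodes is hypothetical (the real assignment depends on where the files live), and plain integers stand in for the SI column values.

```python
def global_sort(rows, num_partitions):
    """Sort all rows, then cut them into contiguous, non-overlapping ranges."""
    ordered = sorted(rows)
    size = -(-len(ordered) // num_partitions)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def merge_with_local_sort(partitions, num_nodes):
    """Hypothetical merge: hand files to nodes round-robin, sort per node only."""
    per_node = [[] for _ in range(num_nodes)]
    for i, part in enumerate(partitions):
        per_node[i % num_nodes].extend(part)
    return [sorted(rows) for rows in per_node]


def is_globally_sorted(files):
    """True when reading the files one after another yields sorted keys."""
    flat = [key for f in files for key in f]
    return flat == sorted(flat)


rows = [7, 3, 12, 1, 9, 5, 11, 2, 8, 4, 10, 6]
si_segment = global_sort(rows, num_partitions=4)
merged = merge_with_local_sort(si_segment, num_nodes=2)

print(is_globally_sorted(si_segment))  # files cover disjoint, ordered ranges
print(is_globally_sorted(merged))      # each file is sorted, but ranges overlap
```

After the merge each file is still sorted internally, but the key ranges of the two files overlap, so the segment as a whole is no longer globally sorted.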

*>> So, the user can configure global_sort_partitions; if the user has not
configured it, the code can use global_sort_partitions = number of active
nodes to load the data, avoid creating small files, and remove the call to
the merge operation. With this, we can remove the carbon.si.segment.merge
property itself.*
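The proposed fallback could look roughly like the following Python sketch. The function name and its parameters are hypothetical, and treating a missing or non-positive value as "not configured" is an assumption made for illustration only.

```python
from typing import Optional


def resolve_global_sort_partitions(configured: Optional[int],
                                   active_nodes: int) -> int:
    """Pick the partition count for a global_sort SI load.

    Uses the user's value when one is configured; otherwise falls back to
    the number of active nodes, so one file per node is written.
    """
    if active_nodes < 1:
        raise ValueError("at least one active node is required")
    # Assumption: None or a non-positive value means 'not configured'.
    if configured is not None and configured > 0:
        return configured
    return active_nodes


print(resolve_global_sort_partitions(None, active_nodes=4))
print(resolve_global_sort_partitions(16, active_nodes=4))
```

With this rule, an unconfigured load on four active nodes would use four partitions, while an explicit user setting always takes precedence.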

*c) REFRESH INDEX <index_table> ON TABLE <main_table>*
If the user created the SI table in a previous version and it has small
files, this command can be used to merge them. But if the user drops the
index and creates it again, this command is not needed either [because
merging and creating a new SI take a similar time]. So, do we need to
support this command for global sort?
If we decide to retain the rebuild command, then for global_sort we need to
add a new implementation, as this command currently has only local sort
code.

Let me know your opinion on this.

Thanks,
Ajantha