Hi,
When the carbon property *carbon.si.segment.merge = true*:

*a) local_sort SI segment loading (default) [All the SI columns are involved]*

The SI load uses the default local_sort. Data is loaded twice. The first load queries the main table and creates the SI segment (here the number of tasks launched equals the number of carbon files present in the main table segment); during this step SI currently creates many small files. Then the merge operation queries the newly created SI segment and loads the data by local_sort again (here only a few tasks are launched, one node one task), so fewer files are created.

*>> So, we can optimize the first SI segment creation itself to use the one node one task logic, which avoids creating small files and removes the need to call the merge operation. With this, we can remove the carbon.si.segment.merge property itself.*

*b) global_sort SI segment loading [All the SI columns are involved]*

The SI load uses a global sort. Data is loaded twice. The first load queries the main table and creates the SI segment (here the number of tasks launched (global_sort_partitions) equals the number of carbon files present in the main table segment); during this step SI currently creates many small files. Then the merge operation queries the newly created SI segment and loads the data by local sort again [there is no global sort logic presently] (here only a few tasks are launched, one node one task), but this will disorder the globally sorted data!

*>> So, the user can configure the global sort partitions, but if the user didn't configure them, the code can use global_sort_partitions = number of active nodes and load the data that way, which avoids creating small files and removes the need to call the merge operation. With this, we can remove the carbon.si.segment.merge property itself.*

*c) REFRESH INDEX <index_table> ON TABLE <main_table>*

If the user created the SI table in a previous version and it has small files, this command can be used to merge them. But if the user drops the index and creates it again, this command is not needed either [because merging and creating a new SI take a similar time]. So, do we need to support this command for global sort? If we decide to retain the rebuild command, then for global_sort we need to add a new implementation, as this command has only local sort code.

Let me know your opinion on this.

Thanks,
Ajantha
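The task-count rules discussed in points a) and b) can be sketched as follows. This is only an illustration of the proposal, not CarbonData code: the function names and parameters are hypothetical, and the number of active nodes is assumed to be supplied by the caller.

```python
from typing import Optional


def current_si_load_tasks(main_segment_carbon_files: int) -> int:
    """Current behaviour as described in the mail: one task per carbon file
    present in the main table segment (hypothetical helper)."""
    return main_segment_carbon_files


def proposed_local_sort_tasks(active_nodes: int) -> int:
    """Proposed local_sort behaviour: one node, one task (hypothetical helper)."""
    if active_nodes < 1:
        raise ValueError("active_nodes must be at least 1")
    return active_nodes


def proposed_global_sort_partitions(configured: Optional[int],
                                    active_nodes: int) -> int:
    """Proposed global_sort behaviour: honour the user's setting if present,
    otherwise fall back to the number of active nodes (hypothetical helper)."""
    if active_nodes < 1:
        raise ValueError("active_nodes must be at least 1")
    if configured is not None and configured > 0:
        return configured
    return active_nodes


if __name__ == "__main__":
    # A main table segment with 200 carbon files on a 4-node cluster.
    print(current_si_load_tasks(200))
    print(proposed_local_sort_tasks(4))
    print(proposed_global_sort_partitions(None, 4))
    print(proposed_global_sort_partitions(16, 4))
```

With either proposed rule the number of writer tasks no longer grows with the number of carbon files in the main table segment, which is what makes the separate merge step unnecessary.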
A small update on merge flow,
Currently, in the local_sort SI merge, the task launch is based on size: as many tasks are launched for the merge as carbon files are formed after the merge [CarbonSIRebuildRDD.internalGetPartitions]. The global_sort merge also implements this by identifying global_sort_partitions based on how many carbon files are formed after the merge (similar to the local sort flow).

But we need to conclude whether the merge flow is really required, or whether we can just keep the SI loading itself on the 1 node 1 task logic [similar to our main table local sort] and avoid the need for the merge operation.

Thanks,
Ajantha

On Fri, Nov 6, 2020 at 4:41 PM Ajantha Bhat <[hidden email]> wrote:
> Hi,
>
> when a carbon property *carbon.si.segment.merge = true*,
> [...]
In reply to this post by Ajantha Bhat
hi Ajantha,
Agree to remove "carbon.si.segment.merge".

1. Dynamically decide the number of loading tasks
Before loading the SI segment, it is easy to estimate the total size of this SI segment. So it is better to dynamically decide the number of loading tasks to avoid small carbon files in the SI segment.

2. Can we use global_sort for SI by default?
SI is used to speed up filter queries, and global_sort can do this better. We need global_sort for SI.

3. Use reindex instead of refresh index
If refresh index is only used to merge small files, reindex will be better (point 1 should be implemented). So, can we remove refresh index too?

-----
Best Regards
David Cai
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
In reply to this post by Ajantha Bhat
Hi,
I feel it's better to remove it, as a lot of code will be avoided and we can do it right the first time. But please consider the points below.

1. Maybe we can test the time difference between the global sort load and the existing local sort load, perhaps on a per-segment basis, so that we know the overall difference in load time. Basically, if we note down the tradeoff time, that is better for future reference and from the user's perspective as well.

2. Also, can you check the time difference between refresh index and reload? We need to see whether all users are fine with dropping and recreating the index.

Regards,
Akash
@David:
a) Yes, SI can use global_sort by default.
b) Handling the original SI load itself so that tasks are launched based on the SI segment size (we need to figure out how to estimate it) is better; otherwise we have to go with the one task per node logic (similar to main table local sort). Either way, the current logic needs to be changed to avoid the small files problem.
c) Refresh index for SI is currently only for merging small files. I think we have to rename this command, as the name doesn't make sense. ReIndex is for loading the missed SI segments from the main table, so it cannot be used for merge.

@Akash:
a) The loading time difference between SI global_sort and local_sort is the same as the data loading difference between global sort and local sort for any table. We already have it.
b) Yes, after implementing the new SI load logic (task launch based on segment size), we can compare the current time with the refresh index time. If there is not much difference, we can remove refresh index support for SI.

Thanks,
Ajantha

On Mon, Nov 9, 2020 at 1:04 PM akashrn5 <[hidden email]> wrote:
> Hi,
>
> Its better to remove i feel, as lot of code will be avoided and we can do it
> right the first time we do it.
> [...]
In reply to this post by Ajantha Bhat
We can remove the merge operation on data files in the SI segment if we avoid small file creation during the SI load itself, using one of the following methods.

a) Estimate the SI load size and launch tasks based on the block size threshold for SI.
For example:
if the block size for SI is 1GB and the SI segment load size is 3GB, launch 3 tasks;
if the block size for SI is 1GB and the SI segment load size is 512MB, launch 1 task.

Problem with this method: we can only estimate the uncompressed size of an SI segment load. For example, suppose the uncompressed SI segment load size is 3GB and the block size for SI is 1GB. In this scenario we will launch 3 tasks, but it is possible that after compression this 3GB reduces to 1GB. So we will again have 3 files of approximately 333MB each. With this approach we launch more tasks than required.

b) Hardcode the number of tasks using the 1 node 1 task logic. Here we launch as many tasks as there are nodes in the cluster.

1. If SI is created with local/global sort and the main table is a non-partitioned table --> This approach is beneficial if the number of nodes in the cluster is small. But if the number of nodes is large (100 nodes) and the data is small (1GB), this results in many small files.

2. If SI is created with local/global sort and the main table is a partitioned table --> Data in the main table is partitioned on the partition column, but data in the SI segment is not partitioned. So there can be many small carbondata files inside the main table segment, depending on the cardinality of the partition column. The 1 node 1 task logic can therefore be beneficial here if the number of nodes is small. But if the number of nodes is greater than or equal to the cardinality of the partition column in the main table, it will again create many small files.
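The arithmetic behind both methods can be sketched as below. This is a rough illustration under stated assumptions, not CarbonData code: the helper names are hypothetical, each task is assumed to write exactly one file, data is assumed to be spread evenly across tasks, and sizes use binary units (so the average for the compressed 3-task case comes out slightly above the 333MB quoted above).

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def tasks_by_estimated_size(estimated_bytes: int, block_size_bytes: int) -> int:
    """Method a): one task per block of estimated (uncompressed) SI data,
    with a minimum of one task (hypothetical helper)."""
    if block_size_bytes <= 0:
        raise ValueError("block_size_bytes must be positive")
    return max(1, math.ceil(estimated_bytes / block_size_bytes))


def average_file_size(actual_bytes: int, tasks: int) -> float:
    """Average output file size, assuming one file per task and an even
    spread of data across tasks (both are simplifying assumptions)."""
    if tasks < 1:
        raise ValueError("tasks must be at least 1")
    return actual_bytes / tasks


if __name__ == "__main__":
    # Method a): task counts for the two examples from the mail.
    print(tasks_by_estimated_size(3 * GB, 1 * GB))
    print(tasks_by_estimated_size(512 * MB, 1 * GB))

    # Method a) problem: 3GB uncompressed shrinks to 1GB on disk,
    # so the 3 tasks each write a file well below the 1GB block size.
    tasks = tasks_by_estimated_size(3 * GB, 1 * GB)
    print(average_file_size(1 * GB, tasks) / MB)

    # Method b) problem: 100 nodes and only 1GB of data.
    print(average_file_size(1 * GB, 100) / MB)
```

Both failure modes come from the same gap: the task count is fixed before the compressed output size is known, so either rule can still undershoot the target block size.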