Apache CarbonData Dev Mailing List archive

[Discussion] About carbon.si.segment.merge feature

Ajantha Bhat

Nov 06, 2020; 11:11am

[Discussion] About carbon.si.segment.merge feature

Hi,

When the carbon property *carbon.si.segment.merge = true* is set:

*a) local_sort SI segment loading (default) [All the SI columns are
involved]*

The SI load uses the default local_sort. Data is loaded twice. The first
load queries the main table and creates the SI segment (the number of tasks
launched equals the number of carbon files present in the main table
segment); this step currently creates many small files.
The merge operation then queries the newly created SI segment and loads the
data with local_sort again (only a few tasks are launched, one task per
node), so fewer files are created.
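
To make the two loads concrete, here is a minimal sketch of the file counts
involved. It is only a model of the description above, not CarbonData code:
the function name, the "one task writes one file" simplification and the
sample numbers are all assumptions made for illustration.

```python
def two_pass_si_load(main_segment_files: int, active_nodes: int, si_segment_mb: float) -> dict:
    """Model the file counts of the two loads described in this thread.

    Assumption for this sketch: every task writes exactly one carbon file
    and the SI data is spread evenly across tasks.
    """
    first_pass_tasks = main_segment_files  # one task per main-table carbon file
    merge_tasks = active_nodes             # one task per node in the merge pass
    return {
        "first_pass_files": first_pass_tasks,
        "first_pass_avg_mb": si_segment_mb / first_pass_tasks,
        "merged_files": merge_tasks,
        "merged_avg_mb": si_segment_mb / merge_tasks,
    }


# Hypothetical numbers: 200 main-table files, 4 nodes, 512 MB of SI data.
result = two_pass_si_load(main_segment_files=200, active_nodes=4, si_segment_mb=512)
print(result)
```

With these made-up inputs the first load leaves 200 files of a few MB each,
while the merge pass leaves 4 files of 128 MB each.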

*>> So, we can optimize the first SI segment creation itself to use the
one node one task logic, which avoids creating small files and removes the
need to call the merge operation. With this, we can remove the
carbon.si.segment.merge property itself.*

*b) global_sort SI segment loading [All the SI columns are involved]*

The SI load uses global sort. Data is loaded twice. The first load queries
the main table and creates the SI segment (the number of tasks launched,
i.e. global_sort_partitions, equals the number of carbon files present in
the main table segment); this step currently creates many small files.
The merge operation then queries the newly created SI segment and loads the
data with local sort again [there is no global sort logic at present] (only
a few tasks are launched, one task per node), but this disorders the
globally sorted data!

*>> So, the user can configure the global sort partitions, but if the user
has not configured them, the code can use global_sort_partitions = number of
active nodes to load the data, which avoids creating small files and removes
the need to call the merge operation. With this, we can remove the
carbon.si.segment.merge property itself.*
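
The fallback proposed for (b) could be sketched roughly as follows. The
function name and the choice to treat a missing or non-positive setting as
"not configured" are assumptions for this sketch, not existing CarbonData
behaviour.

```python
from typing import Optional


def resolve_global_sort_partitions(configured: Optional[int], active_nodes: int) -> int:
    """Return the user's setting if present, else one partition per active node.

    Assumption: None or a non-positive value means "not configured".
    """
    if configured is not None and configured > 0:
        return configured
    # Fall back to the number of active nodes, but never below one partition.
    return max(1, active_nodes)


print(resolve_global_sort_partitions(None, active_nodes=5))  # falls back to node count
print(resolve_global_sort_partitions(8, active_nodes=5))     # user setting wins
```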

*c) REFRESH INDEX <index_table> ON TABLE <main_table>*
If the user created the SI table in a previous version and it has small
files, this command can be used to merge them. But if the user drops the
index and creates it again, this command is not needed either [because
merging and creating a new SI take a similar time]. So, do we need to
support this command for global sort?
If we decide to retain the rebuild command, then for global_sort we need to
add a new implementation, as this command has only local sort code.

Let me know your opinion on this.

Thanks,
Ajantha

Ajantha Bhat

Nov 06, 2020; 1:07pm

Re: [Discussion] About carbon.si.segment.merge feature

A small update on the merge flow:
Currently, in the local_sort SI merge, task launch is based on size: as many
tasks are launched for the merge as carbon files are formed after the merge
[CarbonSIRebuildRDD.internalGetPartitions].
The global_sort merge also identifies global_sort_partitions based on how
many carbon files are formed after the merge (similar to the local sort
flow).
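
As a rough illustration of "one merge task per file formed after the merge",
the sketch below groups small files into batches up to a target size and
counts the batches. This is not the actual logic of
CarbonSIRebuildRDD.internalGetPartitions; the greedy grouping, the function
name and the sizes are assumptions chosen only to show the idea.

```python
from typing import List


def plan_merge_tasks(file_sizes_mb: List[float], target_file_mb: float) -> List[List[float]]:
    """Group small files, in order, into batches of at most target_file_mb.

    Each batch stands for one merge task that writes one merged file.
    A single file larger than the target gets a batch of its own.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    current_size = 0.0
    for size in file_sizes_mb:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_file_mb:
            batches.append(current)
            current, current_size = [], 0.0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical segment: ten 30 MB files merged towards 100 MB outputs.
tasks = plan_merge_tasks([30.0] * 10, target_file_mb=100.0)
print(len(tasks), [len(batch) for batch in tasks])
```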

But we need to conclude whether the merge flow is really required, or
whether we can just keep SI loading itself on the 1 node 1 task logic
[similar to our main table local sort] and avoid the need for the merge
operation.

Thanks,
Ajantha

David CaiQiang

Nov 09, 2020; 7:05am

Re: [Discussion] About carbon.si.segment.merge feature

In reply to this post by Ajantha Bhat

Hi Ajantha,
I agree to remove "carbon.si.segment.merge".

1. Dynamically decide the number of loading tasks
Before loading the SI segment, it is easy to estimate the total size of
this SI segment.
So it is better to decide the number of loading tasks dynamically, to avoid
small carbon files in the SI segment.

2. Can we use global_sort for SI by default?
SI is used to speed up filter queries, and global_sort does that better.
We need global_sort for SI.

3. Use reindex instead of refresh index
If Refresh index is only used to merge small files, reindex will be
better (point 1 should be implemented).
So, can we remove Refresh index too?

-----
Best Regards
David Cai
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

akashrn5

Nov 09, 2020; 7:34am

Re: [Discussion] About carbon.si.segment.merge feature

In reply to this post by Ajantha Bhat

Hi,

I feel it is better to remove it, as a lot of code will be avoided and we
can do it right the first time.

But please consider the points below.

1. Maybe we can test the load time difference between global sort and the
existing local sort, perhaps on a per-segment basis, so that we know the
overall difference in load time. If we note down the tradeoff, that is
better for future reference and from the user's perspective too.

2. Also, can you check the time difference between refresh index and
reload, because we need to see whether all users are fine with dropping and
recreating the index.

Regards,
Akash

Ajantha Bhat

Nov 10, 2020; 12:01pm

Re: [Discussion] About carbon.si.segment.merge feature

@David:
a) Yes, SI can use global sort by default.
b) Handling the original SI load itself so that tasks are launched based on
the SI segment size (we need to figure out how to estimate it) is better;
otherwise we have to go with the one task per node logic (similar to main
table local sort). Either way, the current logic needs to be changed to
avoid the small files problem.
c) Refresh Index for SI is currently only for merging small files. I think
we have to rename this command, as the name doesn't make sense.
ReIndex is for loading the missed SI segments from the main table, so it
cannot be used for merge.

@Akash:
a) The loading time difference between SI global_sort and local_sort is the
same as the data loading difference between global sort and local sort for
any table. We already have it.
b) Yes, after implementing the new SI load logic (task launch based on
segment size), we can compare the current flow with the refresh index time.
If there is not much difference, we can remove refresh index support for SI.

Thanks,
Ajantha

Karan-c980

Jan 29, 2021; 11:05am

Re: [Discussion] About carbon.si.segment.merge feature

This post was updated on Jan 29, 2021; 12:04pm.

In reply to this post by Ajantha Bhat

We can remove the merge operation on data files in the SI segment if we
avoid small file creation during the SI load itself, using one of the
following methods.

a) Estimate the SI load size and launch tasks based on the block size
threshold for SI. For example:
if the block size for SI is 1GB and the SI segment load size is 3GB, launch
3 tasks;
if the block size for SI is 1GB and the SI segment load size is 512MB,
launch 1 task.

Problem with this method: we can only estimate the uncompressed size of an
SI segment load. For example, say the uncompressed SI segment load size is
3GB and the block size for SI is 1GB. In this scenario we launch 3 tasks,
but it is possible that after compression the 3GB shrinks to 1GB. We then
again end up with 3 files of approximately 333MB each, so this approach
launches more tasks than required.
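
The arithmetic in (a) and its compression problem can be written out as a
small sketch. The function names and the compression_ratio parameter are
hypothetical; the sizes are the ones from the example above, with 1GB taken
as 1024MB.

```python
import math


def estimate_si_tasks(uncompressed_mb: float, block_size_mb: float) -> int:
    """Tasks to launch: estimated load size divided by block size, rounded up."""
    return max(1, math.ceil(uncompressed_mb / block_size_mb))


def resulting_file_mb(uncompressed_mb: float, block_size_mb: float,
                      compression_ratio: float) -> float:
    """Approximate size of each output file once the data is compressed.

    compression_ratio is uncompressed size / compressed size (an assumed
    input here; the thread notes it is not known before the load).
    """
    tasks = estimate_si_tasks(uncompressed_mb, block_size_mb)
    return uncompressed_mb / compression_ratio / tasks


print(estimate_si_tasks(3072, 1024))  # 3GB load, 1GB block size
print(estimate_si_tasks(512, 1024))   # 512MB load, 1GB block size
print(resulting_file_mb(3072, 1024, compression_ratio=3.0))
```

With a 3x compression ratio, the 3GB load still launches 3 tasks, but each
file ends up at roughly a third of the 1GB block size, which is the
over-provisioning described above.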

b) Hardcode the number of tasks with the 1 node 1 task logic. Here we launch
as many tasks as there are nodes in the cluster.

1. If the SI is created with local/global sort and the main table is a
non-partitioned table --> this approach is beneficial if the number of nodes
in the cluster is small. But if there are many nodes (e.g. 100) and little
data (e.g. 1GB), it will result in many small files.
2. If the SI is created with local/global sort and the main table is a
partitioned table --> data in the main table is partitioned on the partition
column, but data in the SI segment is not partitioned. So the main table
segment can contain many small carbondata files, depending on the
cardinality of the partition column. The 1 node 1 task logic can therefore
be beneficial here if the number of nodes is small. But again, if the number
of nodes is greater than or equal to the cardinality of the partition column
in the main table, it will create many small files.
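
For the non-partitioned case in (b), a quick sketch shows why a large
cluster with little data produces small files. It assumes the data is
spread evenly and that each node's single task writes one file; the function
name is made up for this illustration.

```python
def one_task_per_node_file_mb(segment_mb: float, nodes: int) -> float:
    """Average file size when every node runs one task and data is spread evenly."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return segment_mb / nodes


# 1GB (1024MB) of SI data on a 100-node cluster versus a 4-node cluster.
print(one_task_per_node_file_mb(1024, nodes=100))
print(one_task_per_node_file_mb(1024, nodes=4))
```

On 100 nodes each file is only about 10MB, while on 4 nodes each file is
256MB.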
