Apache CarbonData Dev Mailing List archive

Improving show segment info

Classic

List

Threaded

12 messages Options

Jacky Li

Improving show segment info

Hi community,

Currently for SHOW SEGMENT command, carbon will print:

+---+---------+-----------------------+-----------------------+---------+-----------+---------+----------+----+
|SegmentSequenceId |Status |Load Start Time |Load End Time |Merged To|File Format|Data Size|Index Size|Path|
+---+---------+-----------------------+-----------------------+---------+-----------+---------+----------+----+
|5 |Compacted|2020-02-15 22:19:06.327|2020-02-15 22:19:06.516|4.1 |COLUMNAR_V3|630.0B |421.0B |NA |
|4.1|Success |2020-02-15 22:19:06.327|2020-02-15 22:19:06.659|NA |COLUMNAR_V3|640.0B |425.0B |NA |
|4 |Compacted|2020-02-15 22:19:05.894|2020-02-15 22:19:06.188|4.1 |COLUMNAR_V3|630.0B |421.0B |NA |
|3 |Compacted|2020-02-15 22:19:05.151|2020-02-15 22:19:05.391|2.1 |COLUMNAR_V3|630.0B |421.0B |NA |
|2.1|Compacted|2020-02-15 22:19:05.151|2020-02-15 22:19:05.592|0.2 |COLUMNAR_V3|640.0B |425.0B |NA |
|2 |Compacted|2020-02-15 22:19:04.804|2020-02-15 22:19:05.023|2.1 |COLUMNAR_V3|630.0B |421.0B |NA |
|1 |Compacted|2020-02-15 22:19:04.006|2020-02-15 22:19:04.242|0.1 |COLUMNAR_V3|630.0B |421.0B |NA |
|0.2|Success |2020-02-15 22:19:05.151|2020-02-15 22:19:05.79 |NA |COLUMNAR_V3|658.0B |425.0B |NA |
|0.1|Compacted|2020-02-15 22:19:04.006|2020-02-15 22:19:04.704|0.2 |COLUMNAR_V3|640.0B |425.0B |NA |
|0 |Compacted|2020-02-15 22:19:02.335|2020-02-15 22:19:03.775|0.1 |COLUMNAR_V3|630.0B |421.0B |NA |
+---+---------+-----------------------+-----------------------+---------+-----------+---------+----------+——+

While it prints basic info of segments, it does not support following usecase:

UserCase 1. Show for partition table
For partition table, have no way to know about the which partition that each segment mapped to. The partition-segment mapping is two way mapping, user may want to check which segments is belong to certain partition.

==> Suggest to add partition information for each segment when showing segments

UserCase 2. Show load delay
Have no way to know about the event time in the segment. By event time, I mean the timestamp column value in the record being loaded. In many cases, user wanted to know the load delay which indicate whether the system loading is catching up with the incoming data.

==> For this requirement, one suggestion is that we can show the minmax of the event time column in SHOW SEGMENTS command, then user can know about the difference of the Load Start Time and Event Time in the record. This solution depends on Segment level minmax (PR3584), and we may need user to specify the event_time column name in table property when creating table or use alter table to add it for existing table.

UserCase 3. Show load speed
Currenlty user need to calculate the load speed by Load Start Time and Load End Time, it is not so convenient for user

==> We can keep the Load Start Time and calculate the elapsed time and throughput(in MBps) by SHOW SEGMENTS automatically.

UserCase 4. Show more information and give more control to user, like filtering and sorting
Besides above information, actually there is more information kept in segment metadata but have not shown to the user. And sometimes user wanted to sort the output by certain field, like sort by load time

==> To show more detail info of the segment, I can think of two ways:
1) add SHOW SEGMENTS EXTENDED option
2) To avoid making the output table too wide to read, I prefer to add DESC SEGMENTS command to show detail info for one segment by specifying the segment id by user.

And, to give more control for user, we can add some options for sorting and filter.

In summary, to support all above use cases, I suggest to change the command like:

Example 1:
// show basic info
SHOW SEGMENTS ON table1
will output
+---+---------+-----------------------+-----------------------+------+-----------+
|ID |Status |Load Start Time |Event Time Min |Spent |Throughput |
+---+---------+-----------------------+-----------------------+------+-----------+
|2 |Success |2020-02-15 22:10:00.327|2020-02-15 21:06:00.327|3m |28MBps |
|1 |Success |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|28s |30MBps |
|0 |Success |2020-02-15 22:20:25.894|2020-02-15 22:17:06.621|4m6s |25MBps |
+---+---------+-----------------------+-----------------------+------+-----------+

this is short enough so that it can fit in most of the console line width

Example 2:
// To show partition for each segment
SHOW SEGMENTS ON table1 PARTITION
output:
+---+---------+-----------------------+-----------------------+--------------------+
|ID |Status |Load Start Time |Event Time Min |PARTITION |
+---+---------+-----------------------+-----------------------+--------------------+
|2 |Success |2020-02-15 22:10:00.327|2020-02-15 21:06:00.327|dt=29,planet=earth |
|1 |Success |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|dt=30,planet=earth |
|0 |Success |2020-02-15 22:20:25.894|2020-02-15 22:17:06.621|dt=34,planet=mars |
+---+---------+-----------------------+-----------------------+--------------------+

Example 3:
// To filter out one partition
SHOW SEGMENTS ON table1 OPTOINS (‘partition’=‘dt=30,planet=earth’)

+---+---------+-----------------------+-----------------------+--------------------+
|ID |Status |Load Start Time |Event Time Min |PARTITION |
+---+---------+-----------------------+-----------------------+--------------------+
|1 |Success |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|dt=30,planet=earth |
+---+---------+-----------------------+-----------------------+——————————+

Example 4:
// To sort by load start time
SHOW SEGMENTS ON table1 OPTOINS (’sort_by’=‘load_start_time')

+---+---------+-----------------------+-----------------------+--------------------+
|ID |Status |Load Start Time |Event Time Min |PARTITION |
+---+---------+-----------------------+-----------------------+--------------------+
|0 |Success |2020-02-15 22:20:25.894|2020-02-15 22:17:06.621|dt=34,planet=mars |
|1 |Success |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|dt=30,planet=earth |
|2 |Success |2020-02-15 22:10:00.327|2020-02-15 21:06:00.327|dt=29,planet=earth |
+---+---------+-----------------------+-----------------------+--------------------+

The sort field can be: id, load_start_time, event_time_min, event_time_max, data_size, index_size, and column_name_min or column_name_max, where column_name can be any literal string and exist in the table schema.

Example 5:
// To show more detail info
DESC SEGMENT 2 ON table1

This will show all information for segment id 2. Information including not only table status file but also info in segment file.

I hope this suggestion can make segment info more visible to the user.
What do you think?

Regards,
Jacky

akashrn5

Re: Improving show segment info

This post was updated on .

Hi likun,

Thanks for proposing this

+1, its a good way and its better to provide user more info about segment.

I have following doubts and suggestions.

1. You have mentioned DDL as Show segments On table, but currently it is
show segments for table, i suggest not to change the current one,we can add
more optional fields for more segment info.

1. Incase of multiple partiton columns,the partition info can be huge, in
that case what is the format you are trying to show in sow segment with
partition.
Can we ask user to give partition column also as optional field in command?
So that if there are multiple partition columns, we can reduce the info in
output to make the info more clean.

When the optional field of partiton column is not given, then we should have
a better formatting to show the huge information.

DDL should be something like

SHOW SEGMENTS ON table1 PARTITION (workgroupcategory)

SHOW SEGMENTS ON table1 PARTITION (workgroupcategory, empname,
designation)

2. You have shown the comand which filter out the partition, i think its
better if we can filter out the partition columns also as mentioned in above
case, what do you say?

3. already show segments have 9 columns as output and for show history its
10 columns, now to these basic info event time and load speed is getting
added, if i am not wrong.
in the sample output its shown that there is a spent column, i think its not
required as we have throughput column.

One suggestion is, why cant we merge loadstart and endtime, event time min
and max, data and index size as single columns with proper formatted data?
With this we can
reduce the size of output in cnsole size and include more columns in basic
output.

Thanks,

Regards,
Akash R Nilugal

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Jacky Li

Re: Improving show segment info

> 2020年2月16日下午4:58，akashrn5 <[hidden email]> 写道：
>
> Hi likun,
>
> Thanks for proposing this
>
> +1, its a good way and its better to provide user more info about segment.
>
> I have following doubts and suggestions.
>
> 1. You have mentioned DDL as Show segments On table, but currently it is
> show segments for table, i suggest not to change the current one,we can add
> more optional fields for more segment info.

Agree

>
> 1. Incase of multiple partiton columns,the partition info can be huge, in
> that case what is the format you are trying to show in sow segment with
> partition.
> Can we ask user to give partition column also as optional field in command?
> So that if there are multiple partition columns, we can reduce the info in
> output to make the info more clean.
>
> When the optional field of partiton column is not given, then we should have
> a better formatting to show the huge information.
>
> DDL should be something like
>
> SHOW SEGMENTS ON table1 PARTITION (workgroupcategory int)
>
> SHOW SEGMENTS ON table1 PARTITION (workgroupcategory int, empname String,
> designation String)
>

I got your point, but given the partition column by user does not help reducing the information. If we want to reduce the amount of the information, we should ask user to give the filter on partition column like example 3 in my original mail.

> 2. You have shown the comand which filter out the partition, i think its
> better if we can filter out the partition columns also as mentioned in above
> case, what do you say?

Do you mean skip the partition columns in the SHOW SEGMENTS result? For example 3.

>
> 3. already show segments have 9 columns as output and for show history its
> 10 columns, now to these basic info event time and load speed is getting
> added, if i am not wrong.
> in the sample output its shown that there is a spent column, i think its not
> required as we have throughput column.

No, by showing example 1, actually I want to change the default output of the SHOW SEGMENTS to those 6 fields only in example 1.
I suggest having both `spent` and `throughput` so that user does not need to calculate himself.

>
> One suggestion is, why cant we merge loadstart and endtime, event time min
> and max, data and index size as single columns with proper formatted data?
> With this we can
> reduce the size of output in cnsole size and include more columns in basic
> output.

In example 1, only 6 fields are shown, and some field name changed (like for SegmentSequenceId changed to ID) to make it shorter.
I have tested these 6 fields can fit in console in one line.

>
>
> Thanks,
>
> Regards,
> Akash R Nilugal
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>

akashrn5

Re: Improving show segment info

Hi,

>I got your point, but given the partition column by user does not help
reducing the information. If we want to reduce the >amount of the
information, we should ask user to give the filter on partition column like
example 3 in my original mail.

1. my concern was if there are more partition column and instead of
partition value filter, i was thinking of having filter on partition column

>Do you mean skip the partition columns in the SHOW SEGMENTS result? For
example 3.
Same as above point 1.

>No, by showing example 1, actually I want to change the default output of
the SHOW SEGMENTS to those 6 fields only >in example 1.
>I suggest having both `spent` and `throughput` so that user does not need
to calculate himself.

I agree, then what about the old default infos which we have now, like data
size, index size, mergedTo,format? All these infos will be moved to *DESC
SEGMENT 2 ON table1*, if i am not wrong?

Regards,
Akash R Nilugal

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Ajantha Bhat

Re: Improving show segment info

Hi Likun,

I think this display command is hard to maintain if we provide all these
options manually.

*1. How about creating a "tableName.segmentInfo" child table for each main
table?* user can query this table and easy to support filter, group by. we
just have to finalize the schema of this table.

2. For each partition to find out which all the segments it is mapped to,
currently we don't store this information anywhere. so, where are you
planning to store it? I don't think we need to calculate it every time.

Thanks,
Ajantha

On Sun, Feb 16, 2020 at 10:48 PM akashrn5 <[hidden email]> wrote:

> Hi,
>
> >I got your point, but given the partition column by user does not help
> reducing the information. If we want to reduce the >amount of the
> information, we should ask user to give the filter on partition column like
> example 3 in my original mail.
>
> 1. my concern was if there are more partition column and instead of
> partition value filter, i was thinking of having filter on partition column
>
>
> >Do you mean skip the partition columns in the SHOW SEGMENTS result? For
> example 3.
> Same as above point 1.
>
> >No, by showing example 1, actually I want to change the default output of
> the SHOW SEGMENTS to those 6 fields only >in example 1.
> >I suggest having both `spent` and `throughput` so that user does not need
> to calculate himself.
>
> I agree, then what about the old default infos which we have now, like data
> size, index size, mergedTo,format? All these infos will be moved to *DESC
> SEGMENT 2 ON table1*, if i am not wrong?
>
> Regards,
> Akash R Nilugal
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>

Ajantha Bhat

Re: Improving show segment info

3. And about event time. I don't think we need to keep it for every row. It
is a waste of storage size. can we keep in loadMetadetails or file level ?

On Mon, Feb 17, 2020 at 11:10 AM Ajantha Bhat <[hidden email]> wrote:

> Hi Likun,
>
> I think this display command is hard to maintain if we provide all these
> options manually.
>
> *1. How about creating a "tableName.segmentInfo" child table for each main
> table?* user can query this table and easy to support filter, group by.
> we just have to finalize the schema of this table.
>
> 2. For each partition to find out which all the segments it is mapped to,
> currently we don't store this information anywhere. so, where are you
> planning to store it? I don't think we need to calculate it every time.
>
> Thanks,
> Ajantha
>
>
> On Sun, Feb 16, 2020 at 10:48 PM akashrn5 <[hidden email]> wrote:
>
>> Hi,
>>
>> >I got your point, but given the partition column by user does not help
>> reducing the information. If we want to reduce the >amount of the
>> information, we should ask user to give the filter on partition column
>> like
>> example 3 in my original mail.
>>
>> 1. my concern was if there are more partition column and instead of
>> partition value filter, i was thinking of having filter on partition
>> column
>>
>>
>> >Do you mean skip the partition columns in the SHOW SEGMENTS result? For
>> example 3.
>> Same as above point 1.
>>
>> >No, by showing example 1, actually I want to change the default output of
>> the SHOW SEGMENTS to those 6 fields only >in example 1.
>> >I suggest having both `spent` and `throughput` so that user does not need
>> to calculate himself.
>>
>> I agree, then what about the old default infos which we have now, like
>> data
>> size, index size, mergedTo,format? All these infos will be moved to *DESC
>> SEGMENT 2 ON table1*, if i am not wrong?
>>
>> Regards,
>> Akash R Nilugal
>>
>>
>>
>>
>> --
>> Sent from:
>> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>>
>

akashrn5

Re: Improving show segment info

In reply to this post by Ajantha Bhat

Hi,

>>*1. How about creating a "tableName.segmentInfo" child table for each main
>>table?* user can query this table and easy to support filter, group by. we
>>just have to finalize the schema of this table.
We already have many things like index tables, datamap tables, just to store
this metadata, no need to create any table again,
maintaining would be difficult. Moreover show segments is not a more often
query, so better not to go for this.

>>2. For each partition to find out which all the segments it is mapped to,
>>currently we don't store this information anywhere. so, where are you
>>planning to store it? I don't think we need to calculate it every time.

We have a mapping right already, like table status file contain the load
name and segment file and corresponding segment file contains the partition
info with location.

Regards,
Akash R Nilugal

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

akashrn5

Re: Improving show segment info

In reply to this post by Ajantha Bhat

Hi Ajantha,

I think event time comes into picture when the user has the timestamp
column, like in timeseries. So only in that case, this column makes sense.

Else it won't be there.

@Likun, correct me if my understanding is wrong.

Regards,
Akash R Nilugal

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Jacky Li

Re: Improving show segment info

> 2020年2月17日下午2:03，akashrn5 <[hidden email]> 写道：
>
> Hi Ajantha,
>
> I think event time comes into picture when the user has the timestamp
> column, like in timeseries. So only in that case, this column makes sense.
>
> Else it won't be there.
>
> @Likun, correct me if my understanding is wrong.
>

Yes, you are right. If the user table has timestamp column that indicate the event time, user can specify it by setting a new table property called ‘event_time’, then carbon will know which column is event time column.

Then we can show it by reading the segment file (after PR3584, we will have minmax info in the segment file)/

>
> Regards,
> Akash R Nilugal
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>

Jacky Li

Re: Improving show segment info

In reply to this post by akashrn5

> 2020年2月17日下午2:00，akashrn5 <[hidden email]> 写道：
>
> Hi,
>
>>> *1. How about creating a "tableName.segmentInfo" child table for each main
>>> table?* user can query this table and easy to support filter, group by. we
>>> just have to finalize the schema of this table.
> We already have many things like index tables, datamap tables, just to store
> this metadata, no need to create any table again,
> maintaining would be difficult. Moreover show segments is not a more often
> query, so better not to go for this.

I agree. Initially I think of the way Ajantha suggested (adding a table to store the segment info), then I realize this work equals to refactoring to store the table status file into a Database, which required more effort and we decided not to do it in current phase.

We can do it like Ajantha suggested after moving the table status file.

>
>>> 2. For each partition to find out which all the segments it is mapped to,
>>> currently we don't store this information anywhere. so, where are you
>>> planning to store it? I don't think we need to calculate it every time.
>
> We have a mapping right already, like table status file contain the load
> name and segment file and corresponding segment file contains the partition
> info with location.

Yes, segment file has partition mapping info.
Now the new problem is, will it be very slow when reading a lot segment file? Suppose there are more than 5000 segment files, how can we execute the SHOW SEGMENTS faster?

>
> Regards,
> Akash R Nilugal
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>

haomarch

Re: Improving show segment info

In reply to this post by Jacky Li

The problem users concerned is more likely: What partitions do not have
completely loaded?

If wer just add partition-segment list in "show partitions", or
"min/max(timecolumn)" in "show segemnts", the above question can't be
answered clearly.

The recommendation is, add a new cammand "show partitions with stage
details", to show whether each partition is loaded completely, for example

| dtm-20200219/hh=13 | incompletely load
| dtm-20200219/hh=14 | completely load
| dtm-20200219/hh=15 | completely load
| dtm-20200219/hh=16 | completely load
| dtm?20200219/hh=17 | incompletely load

From above, hh=13 is not loaded completely for some reason, It is clear that
The delay is start from hh=13.

===============================
Sum up, shall we add a new cammand "show partitions with stage details", to
show below infos:
1. whether loaded completely?
2. how many stages files have not been loaded
3. the start load time and end load time
4. the data volume of the partitions
5. so on....

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

haomarch

Re: Improving show segment info

In reply to this post by Jacky Li

1. show partition enhanced
I think it's more useful to see the data volume of each partiton,
futhermore, the data volume of every day will an important metric which
cann't be count now.
Sum up, if the partition column is "day, hour"
Shall we focus on these two problem:
Q1: How can we find the data volumn of each hour? ------we can aggreagete
the data volumne of each segment belong to this parition, it is easy.
Q2: How can we find the data volumn of each day? ------ May be add a option
in "show partiitons", like "show paritions groupby DAY"?

2. show load delay
Shall we add an option (dryRun = true) in the "insert stage" command, to
output the statitis of stages aren't loaded.

the output can be
| dtm-20200219/hh=13 | incompletely load, there are still 200 stages
waiting for loading
| dtm-20200219/hh=14 | completely load
| dtm-20200219/hh=15 | completely load
| dtm-20200219/hh=16 | completely load
| dtm?20200219/hh=17 | incompletely load, there are still 1800 stages
waiting for loading

2. show segment enhanced
Only coarse-grained statistics maybe better. Shall we just show the
paritition each segment belonging to, rather than outputing the
min(collect_time).

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/