Improving show segment info
Posted by Jacky Li on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Improving-show-segment-info-tp91874.html
Hi community,
Currently for SHOW SEGMENT command, carbon will print:
+---+---------+-----------------------+-----------------------+---------+-----------+---------+----------+----+
|SegmentSequenceId |Status |Load Start Time |Load End Time |Merged To|File Format|Data Size|Index Size|Path|
+---+---------+-----------------------+-----------------------+---------+-----------+---------+----------+----+
|5 |Compacted|2020-02-15 22:19:06.327|2020-02-15 22:19:06.516|4.1 |COLUMNAR_V3|630.0B |421.0B |NA |
|4.1|Success |2020-02-15 22:19:06.327|2020-02-15 22:19:06.659|NA |COLUMNAR_V3|640.0B |425.0B |NA |
|4 |Compacted|2020-02-15 22:19:05.894|2020-02-15 22:19:06.188|4.1 |COLUMNAR_V3|630.0B |421.0B |NA |
|3 |Compacted|2020-02-15 22:19:05.151|2020-02-15 22:19:05.391|2.1 |COLUMNAR_V3|630.0B |421.0B |NA |
|2.1|Compacted|2020-02-15 22:19:05.151|2020-02-15 22:19:05.592|0.2 |COLUMNAR_V3|640.0B |425.0B |NA |
|2 |Compacted|2020-02-15 22:19:04.804|2020-02-15 22:19:05.023|2.1 |COLUMNAR_V3|630.0B |421.0B |NA |
|1 |Compacted|2020-02-15 22:19:04.006|2020-02-15 22:19:04.242|0.1 |COLUMNAR_V3|630.0B |421.0B |NA |
|0.2|Success |2020-02-15 22:19:05.151|2020-02-15 22:19:05.79 |NA |COLUMNAR_V3|658.0B |425.0B |NA |
|0.1|Compacted|2020-02-15 22:19:04.006|2020-02-15 22:19:04.704|0.2 |COLUMNAR_V3|640.0B |425.0B |NA |
|0 |Compacted|2020-02-15 22:19:02.335|2020-02-15 22:19:03.775|0.1 |COLUMNAR_V3|630.0B |421.0B |NA |
+---+---------+-----------------------+-----------------------+---------+-----------+---------+----------+——+
While it prints basic info of segments, it does not support following usecase:
UserCase 1. Show for partition table
For partition table, have no way to know about the which partition that each segment mapped to. The partition-segment mapping is two way mapping, user may want to check which segments is belong to certain partition.
==> Suggest to add partition information for each segment when showing segments
UserCase 2. Show load delay
Have no way to know about the event time in the segment. By event time, I mean the timestamp column value in the record being loaded. In many cases, user wanted to know the load delay which indicate whether the system loading is catching up with the incoming data.
==> For this requirement, one suggestion is that we can show the minmax of the event time column in SHOW SEGMENTS command, then user can know about the difference of the Load Start Time and Event Time in the record. This solution depends on Segment level minmax (PR3584), and we may need user to specify the event_time column name in table property when creating table or use alter table to add it for existing table.
UserCase 3. Show load speed
Currenlty user need to calculate the load speed by Load Start Time and Load End Time, it is not so convenient for user
==> We can keep the Load Start Time and calculate the elapsed time and throughput(in MBps) by SHOW SEGMENTS automatically.
UserCase 4. Show more information and give more control to user, like filtering and sorting
Besides above information, actually there is more information kept in segment metadata but have not shown to the user. And sometimes user wanted to sort the output by certain field, like sort by load time
==> To show more detail info of the segment, I can think of two ways:
1) add SHOW SEGMENTS EXTENDED option
2) To avoid making the output table too wide to read, I prefer to add DESC SEGMENTS command to show detail info for one segment by specifying the segment id by user.
And, to give more control for user, we can add some options for sorting and filter.
In summary, to support all above use cases, I suggest to change the command like:
Example 1:
// show basic info
SHOW SEGMENTS ON table1
will output
+---+---------+-----------------------+-----------------------+------+-----------+
|ID |Status |Load Start Time |Event Time Min |Spent |Throughput |
+---+---------+-----------------------+-----------------------+------+-----------+
|2 |Success |2020-02-15 22:10:00.327|2020-02-15 21:06:00.327|3m |28MBps |
|1 |Success |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|28s |30MBps |
|0 |Success |2020-02-15 22:20:25.894|2020-02-15 22:17:06.621|4m6s |25MBps |
+---+---------+-----------------------+-----------------------+------+-----------+
this is short enough so that it can fit in most of the console line width
Example 2:
// To show partition for each segment
SHOW SEGMENTS ON table1 PARTITION
output:
+---+---------+-----------------------+-----------------------+--------------------+
|ID |Status |Load Start Time |Event Time Min |PARTITION |
+---+---------+-----------------------+-----------------------+--------------------+
|2 |Success |2020-02-15 22:10:00.327|2020-02-15 21:06:00.327|dt=29,planet=earth |
|1 |Success |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|dt=30,planet=earth |
|0 |Success |2020-02-15 22:20:25.894|2020-02-15 22:17:06.621|dt=34,planet=mars |
+---+---------+-----------------------+-----------------------+--------------------+
Example 3:
// To filter out one partition
SHOW SEGMENTS ON table1 OPTOINS (‘partition’=‘dt=30,planet=earth’)
+---+---------+-----------------------+-----------------------+--------------------+
|ID |Status |Load Start Time |Event Time Min |PARTITION |
+---+---------+-----------------------+-----------------------+--------------------+
|1 |Success |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|dt=30,planet=earth |
+---+---------+-----------------------+-----------------------+——————————+
Example 4:
// To sort by load start time
SHOW SEGMENTS ON table1 OPTOINS (’sort_by’=‘load_start_time')
+---+---------+-----------------------+-----------------------+--------------------+
|ID |Status |Load Start Time |Event Time Min |PARTITION |
+---+---------+-----------------------+-----------------------+--------------------+
|0 |Success |2020-02-15 22:20:25.894|2020-02-15 22:17:06.621|dt=34,planet=mars |
|1 |Success |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|dt=30,planet=earth |
|2 |Success |2020-02-15 22:10:00.327|2020-02-15 21:06:00.327|dt=29,planet=earth |
+---+---------+-----------------------+-----------------------+--------------------+
The sort field can be: id, load_start_time, event_time_min, event_time_max, data_size, index_size, and column_name_min or column_name_max, where column_name can be any literal string and exist in the table schema.
Example 5:
// To show more detail info
DESC SEGMENT 2 ON table1
This will show all information for segment id 2. Information including not only table status file but also info in segment file.
I hope this suggestion can make segment info more visible to the user.
What do you think?
Regards,
Jacky