Apache CarbonData Dev Mailing List archive

Feature Proposal: CarbonCli tool

Posted by Jacky Li on Sep 04, 2018; 5:10pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Feature-Proposal-CarbonCli-tool-tp61383.html

Hi All,

When tuning CarbonData performance, I often want to check the metadata in carbon files without launching a Spark shell or running SQL. To do that, I am writing a tool that prints the metadata of a given data folder.
Currently, I am planning an interface like this:

usage: CarbonCli
 -a,--all                    print all information
 -b,--tblProperties          print table properties
 -c,--column <column name>   column to print statistics
 -cmd <command name>         command to execute, supported commands are:
                             summary
 -d,--detailSize             print each blocklet size
 -h,--help                   print this message
 -m,--showSegment            print segment information
 -p,--path <path>            the path which contains carbondata files,
                             nested folder is supported
 -s,--schema                 print the schema

In the first phase, I think the “summary” command is the highest priority; developers can add more commands in the future.

An example of the summary command is shown below. One nice feature is that it visualizes each blocklet's column min/max range as a percentage bar drawn with "-" characters, so that users can better understand how effective the sort_columns set in CREATE TABLE are.
Please share any suggestions you have for this tool.
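
To make the min/max visualization concrete, here is a minimal sketch of how such a bar could be rendered. It is not the actual CarbonCli code: the class name RangeBar, the method render, and the assumption that each blocklet's range is scaled against the column's overall min/max across a 100-character line are all illustrative.

```java
import java.util.Arrays;

// Hypothetical helper, not part of CarbonData: draws one blocklet's
// [min, max] range as '-' characters on a fixed-width line.
class RangeBar {
    // Returns n copies of c (works on Java 8, unlike String.repeat).
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // Assumption: the line spans [globalMin, globalMax] of the column,
    // and the blocklet's range is mapped linearly onto 'width' characters.
    static String render(double min, double max,
                         double globalMin, double globalMax, int width) {
        double span = globalMax - globalMin;
        if (span <= 0) {
            // Every value is identical, so the range covers the whole line.
            return repeat('-', width);
        }
        int start = (int) Math.floor((min - globalMin) * width / span);
        int end = (int) Math.ceil((max - globalMin) * width / span);
        // Clamp to the line and always draw at least one character.
        start = Math.max(0, Math.min(start, width - 1));
        end = Math.max(start + 1, Math.min(end, width));
        return repeat(' ', start) + repeat('-', end - start);
    }

    public static void main(String[] args) {
        // Made-up min/max values for three blocklets of a sorted column.
        double[][] blocklets = {{0, 10}, {10, 21}, {50, 100}};
        for (double[] b : blocklets) {
            System.out.println(render(b[0], b[1], 0, 100, 100));
        }
    }
}
```

With a layout like this, a column that is well sorted would tend to show short bars that step from left to right with little overlap, while long overlapping bars would suggest that min/max pruning on that column is less effective.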

➜ target git:(summary) java -jar carbondata-sdk.jar org.apache.carbondata.CarbonCli -cmd summary -p /opt/carbonstore/tpchcarbon_default/lineitem -a -c l_orderkey
Data Folder: /Users/jacky/code/spark-2.2.1-bin-hadoop2.7/carbonstore/tpchcarbon_default/lineitem
## Summary
10 blocks, 1 shards, 10 blocklets, 375 pages, 11,997,996 rows, 514.64MB

## Schema
schema in part-0-0_batchno0-0-1-1535726689954.carbondata
version: V3
timestamp: 2018-08-31 22:08:48.268
Column Name      Data Type  Column Type  Property             Encoding                                         Schema Ordinal  Id
l_orderkey       INT        dimension    {sort_columns=true}  [INVERTED_INDEX]                                 0               *0587
l_linenumber     INT        dimension    {sort_columns=true}  [INVERTED_INDEX]                                 3               *c981
l_suppkey        STRING     dimension                         [INVERTED_INDEX]                                 2               *75ae
l_returnflag     STRING     dimension                         [INVERTED_INDEX]                                 8               *4ae9
l_linestatus     STRING     dimension                         [INVERTED_INDEX]                                 9               *d358
l_shipdate       DATE       dimension                         [DICTIONARY, DIRECT_DICTIONARY, INVERTED_INDEX]  10              *7cd0
l_commitdate     DATE       dimension                         [DICTIONARY, DIRECT_DICTIONARY, INVERTED_INDEX]  11              *b192
l_receiptdate    DATE       dimension                         [DICTIONARY, DIRECT_DICTIONARY, INVERTED_INDEX]  12              *b0dd
l_shipinstruct   STRING     dimension                         [INVERTED_INDEX]                                 13              *5db3
l_shipmode       STRING     dimension                         [INVERTED_INDEX]                                 14              *2308
l_comment        STRING     dimension                         [INVERTED_INDEX]                                 15              *4cef
l_partkey        INT        measure                           []                                               1               *9bc7
l_quantity       DOUBLE     measure                           []                                               4               *418c
l_extendedprice  DOUBLE     measure                           []                                               5               *bf2c
l_discount       DOUBLE     measure                           []                                               6               *2085
l_tax            DOUBLE     measure                           []                                               7               *ad33

## Segment
SegmentID  Status             Load Start  Load End    Merged To  Format       Data Size  Index Size
0          Marked for Delete  2018-08-31  2018-08-31  NA         COLUMNAR_V3  NA         NA
1          Success            2018-08-31  2018-08-31  NA         COLUMNAR_V3  514.64MB   6.40KB

## Table Properties
Property Name              Property Value
'sort_columns'             'l_orderkey,l_linenumber'
'table_blocksize'          '64'
'comment'                  ''
'bad_records_path'         ''
'local_dictionary_enable'  'false'

## Block Detail
Shard #1 (0_batchno0-0-1-1535726689954)
Block (PartNo)  Blocklet  #Pages  #Rows    Size
0               0         40      1280000  54.90MB
1               0         40      1280000  54.89MB
2               0         40      1280000  54.90MB
3               0         40      1280000  54.89MB
4               0         40      1280000  54.90MB
5               0         40      1280000  54.90MB
6               0         40      1280000  54.90MB
7               0         40      1280000  54.91MB
8               0         40      1280000  54.90MB
9               0         15      477996   20.50MB

## Column Statistics
Shard #1 (0_batchno0-0-1-1535726689954)
Block (PartNo)  Blocklet  Min/Max Range (100 characters)
0               0         ----------
1               0         -----------
2               0         ----------
3               0         -----------
4               0         -----------
5               0         -----------
6               0         ----------
7               0         -----------
8               0         -----------
9               0         ----
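
As a cross-check of the output above, the counts in the Summary line can be recomputed from the Block Detail rows. The sketch below is only an illustration of that arithmetic, not CarbonCli code; the class name SummaryCheck and its fields are hypothetical, and the shard count is passed in by hand. Size is left out because the per-blocklet sizes are printed rounded.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical cross-check, not part of CarbonData: rebuilds the counts of
// the "## Summary" line from the rows listed under "## Block Detail".
class SummaryCheck {
    // Values copied from the Block Detail listing (one entry per blocklet).
    static final int[] BLOCK_IDS = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    static final int[] PAGES = {40, 40, 40, 40, 40, 40, 40, 40, 40, 15};
    static final long[] ROWS = {1280000, 1280000, 1280000, 1280000, 1280000,
                                1280000, 1280000, 1280000, 1280000, 477996};

    static String summarize(int shards, int[] blockIds, int[] pages, long[] rows) {
        // A block may hold several blocklets, so count distinct block ids.
        Set<Integer> blocks = new HashSet<>();
        for (int id : blockIds) {
            blocks.add(id);
        }
        int totalPages = 0;
        for (int p : pages) {
            totalPages += p;
        }
        long totalRows = 0;
        for (long r : rows) {
            totalRows += r;
        }
        // Locale.US makes "%,d" group digits with commas, as in the output above.
        return String.format(Locale.US,
                "%d blocks, %d shards, %d blocklets, %d pages, %,d rows",
                blocks.size(), shards, blockIds.length, totalPages, totalRows);
    }

    public static void main(String[] args) {
        // Prints: 10 blocks, 1 shards, 10 blocklets, 375 pages, 11,997,996 rows
        System.out.println(summarize(1, BLOCK_IDS, PAGES, ROWS));
    }
}
```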