Hi All,
When tuning Carbon performance, I often want to check the metadata in carbon files without launching spark shell or SQL. To do that, I am writing a tool that prints the metadata of a given data folder. Currently, I am planning the usage like this:

usage: CarbonCli
 -a,--all                   print all information
 -b,--tblProperties         print table properties
 -c,--column <column name>  column to print statistics
 -cmd <command name>        command to execute; supported commands are: summary
 -d,--detailSize            print each blocklet size
 -h,--help                  print this message
 -m,--showSegment           print segment information
 -p,--path <path>           the path which contains carbondata files; nested folders are supported
 -s,--schema                print the schema

In the first phase, I think the "summary" command is the highest priority, and developers can add more commands in the future. An example of the summary command is below. One good thing is that it visualizes each column's min/max range as a percentage bar ("----"), so users can better understand how effective the sort_columns set at create-table time are. Please suggest if you have any good ideas for this tool.
➜ target git:(summary) java -jar carbondata-sdk.jar org.apache.carbondata.CarbonCli -cmd summary -p /opt/carbonstore/tpchcarbon_default/lineitem -a -c l_orderkey
Data Folder: /Users/jacky/code/spark-2.2.1-bin-hadoop2.7/carbonstore/tpchcarbon_default/lineitem

## Summary
10 blocks, 1 shards, 10 blocklets, 375 pages, 11,997,996 rows, 514.64MB

## Schema
schema in part-0-0_batchno0-0-1-1535726689954.carbondata
version: V3
timestamp: 2018-08-31 22:08:48.268
Column Name      Data Type  Column Type  Property             Encoding                                         Schema Ordinal  Id
l_orderkey       INT        dimension    {sort_columns=true}  [INVERTED_INDEX]                                 0               *0587
l_linenumber     INT        dimension    {sort_columns=true}  [INVERTED_INDEX]                                 3               *c981
l_suppkey        STRING     dimension                         [INVERTED_INDEX]                                 2               *75ae
l_returnflag     STRING     dimension                         [INVERTED_INDEX]                                 8               *4ae9
l_linestatus     STRING     dimension                         [INVERTED_INDEX]                                 9               *d358
l_shipdate       DATE       dimension                         [DICTIONARY, DIRECT_DICTIONARY, INVERTED_INDEX]  10              *7cd0
l_commitdate     DATE       dimension                         [DICTIONARY, DIRECT_DICTIONARY, INVERTED_INDEX]  11              *b192
l_receiptdate    DATE       dimension                         [DICTIONARY, DIRECT_DICTIONARY, INVERTED_INDEX]  12              *b0dd
l_shipinstruct   STRING     dimension                         [INVERTED_INDEX]                                 13              *5db3
l_shipmode       STRING     dimension                         [INVERTED_INDEX]                                 14              *2308
l_comment        STRING     dimension                         [INVERTED_INDEX]                                 15              *4cef
l_partkey        INT        measure                           []                                               1               *9bc7
l_quantity       DOUBLE     measure                           []                                               4               *418c
l_extendedprice  DOUBLE     measure                           []                                               5               *bf2c
l_discount       DOUBLE     measure                           []                                               6               *2085
l_tax            DOUBLE     measure                           []                                               7               *ad33

## Segment
SegmentID  Status             Load Start  Load End    Merged To  Format       Data Size  Index Size
0          Marked for Delete  2018-08-31  2018-08-31  NA         COLUMNAR_V3  NA         NA
1          Success            2018-08-31  2018-08-31  NA         COLUMNAR_V3  514.64MB   6.40KB

## Table Properties
Property Name              Property Value
'sort_columns'             'l_orderkey,l_linenumber'
'table_blocksize'          '64'
'comment'                  ''
'bad_records_path'         ''
'local_dictionary_enable'  'false'

## Block Detail
Shard #1 (0_batchno0-0-1-1535726689954)
Block (PartNo)  Blocklet  #Pages  #Rows    Size
0               0         40      1280000  54.90MB
1               0         40      1280000  54.89MB
2               0         40      1280000  54.90MB
3               0         40      1280000  54.89MB
4               0         40      1280000  54.90MB
5               0         40      1280000  54.90MB
6               0         40      1280000  54.90MB
7               0         40      1280000  54.91MB
8               0         40      1280000  54.90MB
9               0         15      477996   20.50MB

## Column Statistics
Shard #1 (0_batchno0-0-1-1535726689954)
Block (PartNo)  Blocklet  Min/Max Range (100 characters)
0               0         ----------
1               0         -----------
2               0         ----------
3               0         -----------
4               0         -----------
5               0         -----------
6               0         ----------
7               0         -----------
8               0         -----------
9               0         ----
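As a rough illustration of how the Min/Max Range bars above could be produced, here is a minimal sketch (class and method names are hypothetical, not taken from the actual tool): each blocklet's [min, max] is projected onto a fixed-width character scale spanning the column's global min/max, and the covered positions are marked with '-'.

```java
// Hypothetical sketch of the min/max percentage bar; not the actual
// CarbonCli implementation.
public class MinMaxBar {

  // Render a blocklet's [blockletMin, blockletMax] range as dashes on a
  // `width`-character scale spanning [globalMin, globalMax].
  static String render(long globalMin, long globalMax,
                       long blockletMin, long blockletMax, int width) {
    char[] bar = new char[width];
    java.util.Arrays.fill(bar, ' ');
    double span = (double) (globalMax - globalMin);
    int start = (int) Math.floor((blockletMin - globalMin) / span * (width - 1));
    int end = (int) Math.ceil((blockletMax - globalMin) / span * (width - 1));
    for (int i = start; i <= end; i++) {
      bar[i] = '-';
    }
    return new String(bar);
  }

  public static void main(String[] args) {
    // Two blocklets of a well-sorted column: their bars do not overlap.
    System.out.println(render(0, 100, 0, 9, 20));
    System.out.println(render(0, 100, 10, 19, 20));
  }
}
```

With sorted data, each blocklet's bar occupies a distinct slice of the scale; with unsorted data, the bars pile up over the same positions, which is exactly the overlap the tool wants to make visible.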
In the above example, you specify one directory and get two segments.
But it only shows one schema. I thought the number of schemas would be the same as the number of data directories. Since you mentioned that nested folders can be supported, what if the schemas in these files are not the same?

Another problem:

SegmentID  Status             Load Start  Load End    Merged To  Format       Data Size  Index Size
0          Marked for Delete  2018-08-31  2018-08-31  NA         COLUMNAR_V3  NA         NA

Why is the data size for segment #0 NA? Will it affect the total data size of the carbon table?

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
For the "summary" command, I just pick the first carbondata file and read the schema from its header. The intention here is just to show one schema, assuming the schema of all data files in this folder is the same. If there is a need to validate the schema in all files, we can add a "validate" command; it would be easy to add to the CarbonCli tool.

I have raised PR2683 for this feature.

Regards,
Jacky
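The "pick the first carbondata file" behavior described above could look like this sketch (the class and method names are illustrative; the real CarbonCli may order or choose files differently):

```java
// Illustrative sketch only: given the file names found in a data folder,
// choose the first *.carbondata file in name order to read the schema from.
import java.util.Arrays;

public class FirstDataFile {

  // Returns the first data file name, or null if the folder has none.
  static String firstCarbonDataFile(String[] names) {
    return Arrays.stream(names)
        .filter(n -> n.endsWith(".carbondata"))
        .sorted()
        .findFirst()
        .orElse(null);
  }
}
```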
-c,--column <column name> column to print statistics
--- Do we support multiple columns, and how would that be used?

I think the current short names for the cli options are not clear. Before the tool is stable, I recommend using the full names instead of the short names. For example, -b for --tblProperties is not suitable.
Hi,

I have the following doubts and suggestions for this tool.

1. In which module are you planning to keep this tool? Ideally, it should be under a tools folder, and going forward we can add more tools like this under it.
2. Which file's schema are you printing? Are you randomly choosing the file to read? It would be better to also take a file name as input and read the schema from that file. It will be useful for debugging.
3. I don't get how the percentage is calculated with min/max, and how it will be helpful to the user. Can you give an example?
4. It would be better to also print each column's size. It will be helpful for debugging.

Regards,
Ravindra.
Reply inline.

> On Sep 5, 2018, at 15:46, ravipesala <[hidden email]> wrote:
>
> 1. In which module are you planning to keep this tool? Ideally, it should
> be under a tools folder, and going forward we can add more tools like this
> under it.

Sure, I will create a tools module and put it there.

> 2. Which file's schema are you printing? Are you randomly choosing the
> file to read? It would be better to also take a file name as input and
> read the schema from that file. It will be useful for debugging.

I am printing the schema of the first file. OK, printing the schema of a specified file can be added.

> 3. I don't get how the percentage is calculated with min/max, and how it
> will be helpful to the user. Can you give an example?

It prints the min/max range of each blocklet as "----", so users can see whether the ranges overlap. If they overlap, the min/max index is not so effective.

> 4. It would be better to also print each column's size. It will be helpful
> for debugging.

Yes, it will print the column size when you give "-c columnName".
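To make the overlap point concrete, here is a minimal sketch (names are illustrative, not from the actual tool): a filter value that falls inside the intersection of two blocklets' [min, max] ranges cannot prune either blocklet, so overlapping ranges mean the min/max index is less effective.

```java
// Illustrative sketch: check whether two blocklets' [min, max] ranges
// intersect. Overlapping ranges cannot be distinguished by min/max pruning.
public class MinMaxOverlap {

  static boolean overlaps(long min1, long max1, long min2, long max2) {
    // Two closed intervals intersect iff each starts before the other ends.
    return min1 <= max2 && min2 <= max1;
  }
}
```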