Hi,
In CarbonData, the LOG4J level "STATISTICS" is currently available for logging.
However, the information it provides is incomplete for debugging performance
problems, and it is not easy to see the statistics and profiling information
of a single query in one place.
So we need to revisit and improve statistics and profiling.
I have put down some pointers below so that we can discuss them.
What to collect
---------------
1) Statistics of tables/columns,
such as number of files, number of blocks, number of blocklets.
2) Profiling information required to debug performance issues and resource
utilization:
Scan statistics such as row size, number of blocks or blocklets scanned,
distribution info, scan buffer size.
I/O and CPU/compute cost.
Driver index effectiveness: number of blocks hit.
Executor index effectiveness: number of blocklets hit.
Decoding and decompression cost and memory required.
Cache statistics: hits, misses, memory occupied.
Dictionary statistics: number of entries, dictionary load time, memory
occupied.
Btree statistics: number of entries, Btree load time, lookup cost, memory
occupied.
3) Data load:
load time, memory required, encoding and compression cost.
4) Spark time and Shuffle cost.
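
To make the list above more concrete, here is a rough sketch of how per-query profiling data could be grouped into one record. It is written in Python only for brevity; all class and field names (ScanStats, CacheStats, QueryProfile, blocks_hit, etc.) are hypothetical and not existing CarbonData names.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ScanStats:
    """Scan-side counters (hypothetical field names)."""
    blocks_total: int = 0
    blocks_hit: int = 0        # blocks selected by the driver-side index
    blocklets_total: int = 0
    blocklets_hit: int = 0     # blocklets selected by the executor-side index
    io_millis: float = 0.0
    decode_millis: float = 0.0

    def driver_pruning_ratio(self) -> float:
        """Fraction of blocks skipped thanks to the driver index."""
        if self.blocks_total == 0:
            return 0.0
        return 1.0 - self.blocks_hit / self.blocks_total


@dataclass
class CacheStats:
    """Cache counters (hypothetical field names)."""
    hits: int = 0
    misses: int = 0
    memory_bytes: int = 0

    def hit_ratio(self) -> float:
        """Fraction of lookups served from the cache."""
        lookups = self.hits + self.misses
        return self.hits / lookups if lookups else 0.0


@dataclass
class QueryProfile:
    """Everything collected for one query, kept in one place."""
    query_id: str
    scan: ScanStats = field(default_factory=ScanStats)
    cache: CacheStats = field(default_factory=CacheStats)

    def to_dict(self) -> dict:
        """Plain nested dict, convenient for logging or JSON output."""
        return asdict(self)
```

Dictionary, Btree and data load statistics could be added as further nested records in the same way.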
How to collect:
---------------
Check whether it can be plugged in to the Spark metrics/counters system.
Have a decorator statistics RDD in between to wrap each RDD and collect
statistics, or use any other method to get them from Spark.
Make it pluggable so that it can integrate with other processing frameworks
and we can get end-to-end statistics.
Something like log4j, with clean interfaces to put and retrieve information.
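
As an illustration of the "clean interfaces to put and retrieve" idea, a minimal sketch of a pluggable collector could look like the following. This is only an assumption about the shape of such an interface, written in Python for brevity; StatisticsCollector, InMemoryCollector, put, retrieve and timed are hypothetical names, not existing CarbonData or Spark APIs.

```python
import time
from abc import ABC, abstractmethod
from collections import defaultdict
from contextlib import contextmanager


class StatisticsCollector(ABC):
    """Framework-neutral interface: callers only put and retrieve values."""

    @abstractmethod
    def put(self, query_id: str, name: str, value: float) -> None:
        """Record a value for one named statistic of one query."""

    @abstractmethod
    def retrieve(self, query_id: str) -> dict:
        """Return all statistics recorded so far for one query."""

    @contextmanager
    def timed(self, query_id: str, name: str):
        """Measure the wall-clock time of a code block in milliseconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.put(query_id, name, elapsed_ms)


class InMemoryCollector(StatisticsCollector):
    """Simplest possible backend: accumulates values per query in memory."""

    def __init__(self) -> None:
        self._stats = defaultdict(lambda: defaultdict(float))

    def put(self, query_id: str, name: str, value: float) -> None:
        # Repeated puts for the same name are summed, like a counter.
        self._stats[query_id][name] += value

    def retrieve(self, query_id: str) -> dict:
        # .get() avoids creating an empty entry for unknown queries.
        return dict(self._stats.get(query_id, {}))
```

A Spark-backed implementation could subclass the same interface and forward put() to Spark's own counters, while scan code would only ever see StatisticsCollector.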
Where to store:
---------------
In a separate table.
In logs.
As history information, the way it is stored in Spark (maybe JSON). Is Spark's
history statistics logging separate enough to be used across frameworks?
The collector can collect statistics and decide where to store them.
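
One possible way to let the collector decide where to store is to hide each destination behind a small sink interface. The sketch below is only an assumption, written in Python for brevity; StatisticsSink, JsonHistorySink, LogSink and store_all are hypothetical names, and the JSON-lines layout is just one option, not Spark's actual history format.

```python
import json
import logging
from typing import Iterable, Protocol, TextIO


class StatisticsSink(Protocol):
    """Anything that can persist one statistics record."""

    def store(self, record: dict) -> None: ...


class JsonHistorySink:
    """Appends one JSON object per line to a text stream (file, socket, ...)."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def store(self, record: dict) -> None:
        self._stream.write(json.dumps(record, sort_keys=True) + "\n")


class LogSink:
    """Writes the record to a logger, similar to today's log-based output."""

    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger

    def store(self, record: dict) -> None:
        self._logger.info("STATISTICS %s", json.dumps(record, sort_keys=True))


def store_all(record: dict, sinks: Iterable[StatisticsSink]) -> None:
    """Send the same record to every configured destination."""
    for sink in sinks:
        sink.store(record)
```

A table-backed sink would implement the same store() method, so the collector code does not change when the destination does.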
How to see:
-----------
A command to retrieve various statistics and profiling info.
Connecting to other metrics displays such as the Spark UI or Ganglia.
Links:
------
Profiling support in Impala.
http://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_explain_plan.html#perf_profile
Table and column statistics in Impala.
http://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_stats.html#perf_table_stats
Spark metrics collection.
http://spark.apache.org/docs/latest/monitoring.html#metrics
Regards,
Venkata Ramana Gollamudi