[DISCUSS] Improve Statistics and Profiling support


Venkata Gollamudi
Hi,


In CarbonData, a Log4j level "STATISTICS" is currently available for logging.
However, the information it provides is incomplete for debugging performance
problems, and it is not easy to see the statistics and profiling information
for one query in one place.
So we need to revisit and improve statistics and profiling support.
I have put down some pointers below and would like to discuss them.



What to collect
---------------
1) Statistics of tables/columns:

 number of files, number of blocks, number of blocklets.


2) Profiling information required to debug performance issues and resource
utilization:

 scan statistics, like row size, number of blocks or blocklets scanned,
distribution info, scan buffer size.

 I/O and CPU/compute cost.

 driver index effectiveness: number of blocks hit.

 executor index effectiveness: number of blocklets hit.

 decoding and decompression cost and memory required.

 cache statistics: hits, misses, memory occupied.

 dictionary statistics: number of entries, dictionary load time, memory
occupied.

 B-tree statistics: number of entries, B-tree load time, lookup cost, memory
occupied.

3) Data load:

 load time, memory required, encode and compress cost.

4) Spark time and shuffle cost.

(A rough sketch of how these per-query numbers could be grouped together
follows this list.)
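
To make the discussion concrete, here is a minimal sketch (in Scala) of how
these per-query numbers could be grouped into a single holder. All names and
fields are placeholders for discussion, not existing CarbonData classes.

// Illustrative per-query statistics holder; names and fields are
// placeholders for discussion, not existing CarbonData classes.
case class QueryStatistics(
    queryId: String,
    filesScanned: Long,          // table statistics: number of files touched
    blocksHit: Long,             // driver index effectiveness
    blockletsHit: Long,          // executor index effectiveness
    rowsScanned: Long,
    scanTimeMs: Long,            // scan / I/O cost
    decodeTimeMs: Long,          // decoding and decompression cost
    dictionaryLoadTimeMs: Long,
    btreeLoadTimeMs: Long,
    cacheHits: Long,
    cacheMisses: Long,
    memoryUsedBytes: Long)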


How to collect:
---------------
Check whether this can be plugged into the Spark metrics/counters system.
Have a decorator statistics RDD in between, wrapping each RDD to collect
statistics, or use any other method to get them from Spark (see the sketch
below).
Make it pluggable so it can integrate with other processing frameworks, so
that we can get end-to-end statistics.
Something like Log4j, with clean interfaces to record and retrieve
information.
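
As one possible hook at the RDD level, below is a rough sketch of such a
decorator RDD, assuming Spark 2.x accumulators; the class name StatisticsRDD
and the measured quantities are only illustrative. It passes rows through
unchanged while counting them and timing the scan of each partition, and the
driver can read the accumulators back after the job and merge them into the
per-query holder above.

import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.util.LongAccumulator

// Sketch only: wraps a child RDD, counting rows and timing each partition.
class StatisticsRDD[T: ClassTag](
    child: RDD[T],
    rowsScanned: LongAccumulator,
    scanTimeMs: LongAccumulator)
  extends RDD[T](child) {

  override protected def getPartitions: Array[Partition] = child.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val start = System.currentTimeMillis()
    val iter = child.iterator(split, context)
    new Iterator[T] {
      override def hasNext: Boolean = {
        val more = iter.hasNext
        // Elapsed wall time until the partition is fully consumed (this also
        // includes downstream processing of the partition's rows).
        if (!more) scanTimeMs.add(System.currentTimeMillis() - start)
        more
      }
      override def next(): T = {
        rowsScanned.add(1)
        iter.next()
      }
    }
  }
}

The scan RDD could then be wrapped as
new StatisticsRDD(scanRdd, sc.longAccumulator("rowsScanned"), sc.longAccumulator("scanTimeMs")),
keeping the query plan itself untouched.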



Where to store:
---------------
In a separate table.
In logs.
As history information, like it is stored in Spark (maybe as JSON). Is
Spark's history statistics logging separable, so it can be used across
frameworks?
A collector can gather the statistics and decide where to store them (one
possible interface is sketched below).
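
For the "collector decides where to store" point, one option is a small
pluggable recorder interface, as in the sketch below; the names are again
only placeholders, and a log-backed implementation is just the simplest
possible choice.

// Illustrative pluggable recorder: where the statistics end up (log, table,
// JSON history file, ...) is up to the implementation.
trait StatisticsRecorder {
  def record(stats: QueryStatistics): Unit
  def get(queryId: String): Option[QueryStatistics]
}

// Simplest possible implementation: keep statistics in memory per query and
// also write them to the log.
class LogStatisticsRecorder extends StatisticsRecorder {
  private val byQuery = scala.collection.mutable.Map[String, QueryStatistics]()

  override def record(stats: QueryStatistics): Unit = {
    byQuery(stats.queryId) = stats
    println(s"STATISTICS: $stats") // or route through Log4j at the STATISTICS level
  }

  override def get(queryId: String): Option[QueryStatistics] = byQuery.get(queryId)
}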



How to see:
-----------
A command to retrieve various statistics and profiling info (a rough helper
is sketched below).
Connecting to other metrics displays like the Spark UI or Ganglia.
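
For the retrieval command, a driver-side helper over the recorder sketch
above could look roughly like this (purely illustrative):

// Illustrative only: print a one-line summary of the statistics collected
// for a query, using the StatisticsRecorder sketch above.
object ShowQueryStatistics {
  def show(recorder: StatisticsRecorder, queryId: String): Unit = {
    recorder.get(queryId) match {
      case Some(s) =>
        println(s"query=$queryId blocksHit=${s.blocksHit} " +
          s"blockletsHit=${s.blockletsHit} rowsScanned=${s.rowsScanned} " +
          s"scanTimeMs=${s.scanTimeMs} cacheHits=${s.cacheHits}")
      case None =>
        println(s"no statistics recorded for query $queryId")
    }
  }
}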


Links:
------
Profiling support in Impala:
http://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_explain_plan.html#perf_profile
Table and column statistics in Impala:
http://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_stats.html#perf_table_stats
Spark metrics collection:
http://spark.apache.org/docs/latest/monitoring.html#metrics


Regards,

Venkata Ramana Gollamudi