Apache CarbonData Dev Mailing List archive

carbondata performance test under benchmark tpc-ds

Posted by 李寅威 on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/carbondata-performance-test-under-benchmark-tpc-ds-tp7703.html

Hi all,

I ran a simple performance test with the TPC-DS benchmark on Spark 2.1.0 + CarbonData 1.0.0, and the results seem unsatisfactory. The details are as follows:

About Env:
Hadoop 2.7.2 + Spark 2.1.0 + CarbonData 1.0.0
Cluster: 5 nodes, 32 GB memory per node
About TPC-DS:
Data size: 1 GB (test data generation script: ./dsdgen -scale 1 -suffix '.csv' -dir /data/tpc-ds/data/)
Largest table: inventory, with 11,745,000 records
About Performance Tuning:
Spark:
SPARK_WORKER_MEMORY=4g
SPARK_WORKER_INSTANCES=4
Carbondata:
Left at the defaults to avoid configuration differences.
About Performance Test Result:
SQL queries that run without modification: 70% (using the netezza SQL template)
Max duration: 39.00s
Min duration: 2.18s
Average duration: 9.99s
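
To make the hardware question easier to discuss, here is a rough sketch of the memory that the Spark worker settings above make available. It assumes the same SPARK_WORKER_MEMORY and SPARK_WORKER_INSTANCES values are applied on all 5 nodes, which is an assumption and not something measured. The function name spark_memory_budget is only illustrative.

```python
# Rough memory budget implied by the Spark standalone worker settings above.
# Assumption: all 5 nodes run the identical worker configuration.
NODES = 5
NODE_MEMORY_GB = 32
WORKER_INSTANCES_PER_NODE = 4   # SPARK_WORKER_INSTANCES
WORKER_MEMORY_GB = 4            # SPARK_WORKER_MEMORY


def spark_memory_budget(nodes, node_memory_gb, instances, worker_memory_gb):
    """Return (GB per node for Spark, GB across the cluster, fraction of node RAM)."""
    per_node_gb = instances * worker_memory_gb
    cluster_gb = nodes * per_node_gb
    fraction_of_node = per_node_gb / node_memory_gb
    return per_node_gb, cluster_gb, fraction_of_node


per_node_gb, cluster_gb, fraction_of_node = spark_memory_budget(
    NODES, NODE_MEMORY_GB, WORKER_INSTANCES_PER_NODE, WORKER_MEMORY_GB
)
print(f"Spark worker memory per node: {per_node_gb} GB "
      f"({fraction_of_node:.0%} of {NODE_MEMORY_GB} GB)")
print(f"Spark worker memory across cluster: {cluster_gb} GB")
```

Under that assumption the workers can use 16 GB per node (half of the 32 GB) and 80 GB across the cluster. This is only the ceiling set for the workers; how much each executor actually gets depends on the application settings, which are not listed here.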

I'd like to raise the following topics for discussion:
1. Is the cluster hardware reasonable? (What is a common per-node hardware configuration for a Spark/CarbonData cluster?)
2. Are the performance test results reasonable and explicable?
3. For interactive queries, is Spark + CarbonData an acceptable solution?
4. For interactive queries, what other solutions might work well? (The average query duration should probably be under 5s, or even lower.)

Thanks very much.