carbondata performance test under benchmark tpc-ds

Posted by 李寅威 on Feb 20, 2017; 1:52am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/carbondata-performance-test-under-benchmark-tpc-ds-tp7703.html

Hi all,


  I ran a simple performance test with the TPC-DS benchmark on Spark 2.1.0 + CarbonData 1.0.0, and the results seem unsatisfactory. The details are as follows:


  About Env:
    Hadoop 2.7.2 + Spark 2.1.0 + CarbonData 1.0.0
    Cluster: 5 nodes, 32 GB memory per node
  About TPC-DS:
    Data size: 1 GB (test data generation script: ./dsdgen -scale 1 -suffix '.csv' -dir /data/tpc-ds/data/)
    Largest table: inventory, 11,745,000 records
  About Performance Tuning:
    Spark:
      SPARK_WORKER_MEMORY=4g
      SPARK_WORKER_INSTANCES=4
    Carbondata:
      Left at the defaults to avoid configuration differences.
  About Performance Test Result:
    Queries that run without modification: 70% (using the netezza SQL template)
    Max duration: 39.00s
    Min duration: 2.18s
    Average duration: 9.99s
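
For anyone who wants to reproduce figures like these, here is a minimal sketch of how the max/min/average durations could be collected. It is not the script used for the numbers above. `run_query` is a hypothetical callable supplied by the caller (for example, something that submits the SQL to a Spark session and waits for the result), and each query is run only once, so first-run warm-up is included in the timing.

```python
import time
from statistics import mean


def time_queries(queries, run_query):
    """Run each query once and return {query name: elapsed seconds}.

    `queries` maps a name to its SQL text. `run_query` is a hypothetical
    callable that executes one SQL string and blocks until it finishes.
    """
    durations = {}
    for name, sql in queries.items():
        start = time.perf_counter()
        run_query(sql)
        durations[name] = time.perf_counter() - start
    return durations


def summarize(durations):
    """Return the max, min and average of the measured durations."""
    values = list(durations.values())
    return {"max": max(values), "min": min(values), "avg": mean(values)}
```

Running each query several times and reporting the median would reduce the effect of warm-up and caching on the result.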


  Well, I want to raise a discussion about the following topics:
    1. Is the cluster hardware reasonable? (What is a common per-node hardware configuration for a Spark/CarbonData cluster?)
    2. Is the result of the performance test reasonable and explicable?
    3. For interactive queries, is Spark + CarbonData an acceptable solution?
    4. For interactive queries, what other solutions may work well? (The average query duration should probably be less than 5s, or even lower.)
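
As background for question 1, the worker memory implied by the Spark settings above can be worked out with a short calculation. This is only a sketch under two assumptions that the post does not confirm: that Spark runs in standalone mode (where SPARK_WORKER_MEMORY and SPARK_WORKER_INSTANCES apply) and that all 5 nodes run workers. The function name is made up for illustration.

```python
def worker_memory_gb(nodes, instances_per_node, memory_per_instance_gb):
    """Return (worker memory per node, worker memory across the cluster) in GB."""
    per_node = instances_per_node * memory_per_instance_gb
    return per_node, per_node * nodes


# Values from the post: SPARK_WORKER_INSTANCES=4, SPARK_WORKER_MEMORY=4g.
# Assumption: all 5 nodes run workers.
per_node, total = worker_memory_gb(nodes=5, instances_per_node=4, memory_per_instance_gb=4)

print(per_node)       # 16 GB offered to Spark per node
print(total)          # 80 GB offered to Spark across the cluster
print(per_node / 32)  # 0.5, i.e. half of each node's 32 GB
```

Note that this is only the memory the workers may hand out. How much a query actually gets depends on the executor settings, such as spark.executor.memory, which the post does not list.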


  Thanks very much!