carbondata performance test under benchmark tpc-ds

carbondata performance test under benchmark tpc-ds

李寅威
Hi all,

  I ran a simple performance test with the TPC-DS benchmark on Spark 2.1.0 + CarbonData 1.0.0, and the results seem unsatisfactory. The details are as follows:

  About Env:
    Hadoop 2.7.2 + Spark 2.1.0 + CarbonData 1.0.0
    Cluster: 5 nodes, 32G mem per node
  About TPC-DS:
    Data size: 1G (test data generation script: ./dsdgen -scale 1 -suffix '.csv' -dir /data/tpc-ds/data/)
    Largest table: inventory, 11,745,000 records
  About Performance Tuning:
    Spark:
      SPARK_WORKER_MEMORY=4g
      SPARK_WORKER_INSTANCES=4
    Carbondata:
      Left at the defaults to avoid configuration differences.
  About Performance Test Result:
    SQL that can execute without modification: 70% (using the netezza SQL template)
    Max duration: 39.00s
    Min duration: 2.18s
    Average duration: 9.99s

  I'd like to raise a discussion about the following topics:
    1. Is the cluster hardware reasonable? (What is a common per-node hardware configuration for a Spark/CarbonData cluster?)
    2. Is the performance test result reasonable and explicable?
    3. For interactive queries, is Spark + CarbonData an acceptable solution?
    4. For interactive queries, what other solutions might work well? (The average query duration should probably be under 5s, or even less.)

  Thanks very much ~
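
As a rough aid for question 1, the Spark settings above can be turned into a per-node memory budget. This is a minimal sketch, assuming Spark standalone mode (where SPARK_WORKER_MEMORY applies to each worker instance) and assuming all 5 nodes run workers; neither is stated in the post, and the function name is hypothetical.

```python
def worker_memory_budget(node_mem_gb, worker_mem_gb, workers_per_node, nodes):
    """Return how much memory the Spark workers may hand out to executors."""
    per_node = worker_mem_gb * workers_per_node
    return {
        "per_node_gb": per_node,                  # memory claimed by workers on one node
        "cluster_gb": per_node * nodes,           # assumes every node runs workers
        "share_of_node": per_node / node_mem_gb,  # fraction of physical memory used
    }


# Values from the post: 32G per node, 4 workers of 4g each, 5 nodes.
budget = worker_memory_budget(node_mem_gb=32, worker_mem_gb=4,
                              workers_per_node=4, nodes=5)
print(budget)
```

Under these assumptions the workers can use only part of each node's memory, which may be worth checking before comparing against other engines.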

Re: carbondata performance test under benchmark tpc-ds

李寅威
up↑


haha~~~




------------------ Original ------------------
From:  "ﻬ.贝壳里的海";<[hidden email]>;
Date:  Mon, Feb 20, 2017 09:52 AM
To:  "dev"<[hidden email]>;

Subject:  carbondata performance test under benchmark tpc-ds




Re: carbondata performance test under benchmark tpc-ds

ravipesala
Hi,

We are working on a TPC-H performance report now, and have improved
performance with the new format. We have already raised PRs (584 and 586)
for this; they are still under review and will be merged soon. Once these
PRs are merged, we will start verifying TPC-DS performance as well.

Regards,
Ravindra.

On 21 February 2017 at 13:48, Yinwei Li <[hidden email]> wrote:




--
Thanks & Regards,
Ravi

Re: carbondata performance test under benchmark tpc-ds

李寅威
Hi Ravindra,


Thanks for your reply. I'm excited that you're working on this important job, and I'm looking forward to your performance test report based on TPC-H and TPC-DS.




------------------ Original ------------------
From:  "Ravindra Pesala";<[hidden email]>;
Date:  Tue, Feb 21, 2017 05:35 PM
To:  "dev"<[hidden email]>;

Subject:  Re: carbondata performance test under benchmark tpc-ds




carbondata vs. impala performance test under benchmark tpc-ds

李寅威
Hi all,

  I ran a simple performance test with the TPC-DS benchmark on Spark 2.1.0 + CarbonData 1.0.0 and on Impala 2.7.0 + Parquet, and the results seem unsatisfactory. The details are as follows:

  About Env:
    Hadoop 2.7.2 + Spark 2.1.0 + CarbonData 1.0.0
    Impala 2.7.0
    Cluster: 5 nodes, 32G mem per node
  About TPC-DS:
    Data size: 1G (test data generation script: ./dsdgen -scale 1 -suffix '.csv' -dir /data/tpc-ds/data/)
    Largest table: inventory, 11,745,000 records
  About Performance Tuning:
    Spark:
      SPARK_WORKER_MEMORY=4g
      SPARK_WORKER_INSTANCES=4
    Carbondata:
      Left at the defaults to avoid configuration differences.

  About Performance Test Result [Spark + CarbonData]:
    SQL that can execute without modification: 70% (using the netezza SQL template)
    Max duration: 39.00s
    Min duration: 2.18s
    Average duration: 9.99s

  About Performance Test Result [Impala + Parquet]:
    SQL that can execute without modification: 70% (using the netezza SQL template)
    Max duration: 16.75s
    Min duration: 0.42s
    Average duration: 2.18s

  You can find the details in the attachment to this e-mail.

Sheet 1

Durations in seconds; "-" marks a query with no recorded result.

SQL    Spark + CarbonData  Impala + Parquet
1                   16.51             16.75
2                   14.43             16.48
3                   28.01              8.87
4                    3.53                 -
5                   15.37              9.78
6                   11.19              1.84
7                       -                 -
8                       -              0.75
9                    5.86              8.54
10                   6.95              1.42
11                  17.40              1.06
12                   5.05                 -
13                      -              1.05
14                   5.13              4.59
15                      -              1.13
16                   3.42              0.72
17                   6.25                 -
18                      -              3.88
19                   4.04                 -
20                   5.61                 -
21                  39.00              6.58
22                   5.03                 -
23                   5.42              0.94
24                   7.13              1.06
25                      -              5.06
26                      -              1.08
27                  11.28                 -
28                   6.43              0.72
29                   9.87              1.07
30                   3.52              0.75
31                   5.49              0.96
32                      -              1.21
33                      -              0.51
34                   7.24              8.47
35                   3.96                 -
36                  11.89                 -
37                      -              0.81
38                      -              0.70
39                   6.07              0.57
40                      -                 -
41                      -              0.68
42                   9.70              1.42
43                  32.35              1.89
44                      -              2.80
45                      -              0.68
46                   4.27              0.51
47                   9.67              1.37
48                  15.03                 -
49                      -              1.97
50                   4.07                 -
51                  17.81              2.17
52                  36.15              1.34
53                   6.17              0.71
54                   9.78              1.52
55                  12.23                 -
56                   9.40                 -
57                   7.17              0.62
58                   4.48              0.82
59                      -                 -
60                   2.48              0.53
61                      -              0.92
62                   2.18              0.52
63                      -                 -
64                   5.24                 -
65                      -              0.43
66                      -              0.42
67                      -              1.61
68                      -              1.62
69                      -                 -
70                      -                 -
71                  22.34                 -
72                  17.34                 -
73                  22.30              1.48
74                   4.72              1.16
75                   5.41              1.03
76                   4.51              0.72
77                   4.85                 -
78                      -              0.79
79                  16.80              1.17
80                  11.38                 -
81                      -                 -
82                   3.81              0.71
83                      -                 -
84                   5.17                 -
85                   2.22              0.53
86                   3.59              0.74
87                      -                 -
88                   5.39              0.82
89                      -              0.83
90                      -                 -
91                   5.57              0.72
92                   6.16              0.95
93                  10.12                 -
94                   6.21                 -
95                   7.63                 -
96                   6.48                 -
97                  30.74              1.63
98                      -              0.85
99                   9.01              1.16
100                 10.71              0.65
101                     -              1.20
102                     -                 -
103                  7.67              1.37
AVG                  9.99              2.18
COUNT                  69                69
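
The AVG row averages each engine over its own 69 queries, and the two sets differ (query 4 has only a Spark figure, query 8 only an Impala one), so the two averages do not cover the same workload. A minimal sketch of restricting the comparison to queries that both engines completed; only the first six rows of the sheet are included for brevity, and the helper name is hypothetical.

```python
from statistics import mean

# First six rows of the sheet: query -> (Spark + CarbonData, Impala + Parquet),
# in seconds. None marks a query with no recorded result for that engine.
timings = {
    1: (16.51, 16.75),
    2: (14.43, 16.48),
    3: (28.01, 8.87),
    4: (3.53, None),
    5: (15.37, 9.78),
    6: (11.19, 1.84),
}


def compare_on_common(timings):
    """Average both engines over only the queries that both completed."""
    common = {q: t for q, t in timings.items() if None not in t}
    spark_avg = mean(s for s, _ in common.values())
    impala_avg = mean(i for _, i in common.values())
    return {
        "common_queries": sorted(common),
        "spark_avg": spark_avg,
        "impala_avg": impala_avg,
        "ratio": spark_avg / impala_avg,  # > 1 means Spark + CarbonData is slower
    }


result = compare_on_common(timings)
print(result)
```

Feeding in all 103 rows the same way would give a like-for-like average for the full run.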



------------------ Original ------------------
From:  "ﻬ.贝壳里的海";<[hidden email]>;
Date:  Tue, Feb 21, 2017 05:11 PM
To:  "dev"<[hidden email]>;
Subject:  Re: carbondata performance test under benchmark tpc-ds


Re: carbondata vs. impala performance test under benchmark tpc-ds

Liang Chen
Administrator
Hi,

Thank you for sharing the test results.
The comparison would be more meaningful if both runs used the same compute engine:
Spark 2.1 + Parquet vs. Spark 2.1 + CarbonData.
Would you be interested in running this test (CarbonData, Parquet) together with us?

Regards
Liang
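
The suggestion above is to hold the compute engine fixed and vary only the storage format. A minimal sketch of such a harness, assuming each backend is exposed as a plain callable that executes one SQL string; the function name, the backend labels, and the stand-in runners are hypothetical, and real runners would wrap a Spark 2.1 session pointed at CarbonData or Parquet tables.

```python
import time
from statistics import mean


def run_suite(backends, queries, clock=time.perf_counter):
    """Time every query on every backend and summarise the durations.

    backends: name -> callable taking a SQL string (hypothetical runners).
    queries:  query id -> SQL text.
    clock:    time source, injectable so the harness can be tested.
    """
    report = {}
    for name, run_query in backends.items():
        durations = {}
        for qid, sql in queries.items():
            start = clock()
            try:
                run_query(sql)
            except Exception:
                # Skip queries this backend cannot run unmodified.
                continue
            durations[qid] = clock() - start
        values = list(durations.values())
        report[name] = {
            "durations": durations,
            "count": len(values),
            "max": max(values) if values else None,
            "min": min(values) if values else None,
            "avg": mean(values) if values else None,
        }
    return report


# Stand-in runners that do nothing; replace them with real query execution.
demo = run_suite(
    {"spark+carbondata": lambda sql: None, "spark+parquet": lambda sql: None},
    {1: "SELECT 1"},
)
print({name: stats["count"] for name, stats in demo.items()})
```

Because both backends receive the same query set, the per-query durations can be compared directly, and any query that fails on one side is visible as a missing key rather than silently shifting the average.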
