Carbon data vs parquet performance


Swapnil Shinde
Hello
   I am not sure what I am doing wrong, but I am observing parquet running
faster than carbon data -

*Carbondata version -* 1.1.0
*Data cardinality-*
     lineorder - 6B rows & date - 39,000 rows
*Query -*
     select sum(loExtendedprice * loDiscount) as revenue
     from lineorder, date
     where loOrderdate = dDatekey and dYear = 1993
       and loDiscount between 1 and 3 and loQuantity < 25
*Filter factor for this query -*
     FF = (1/7)*0.5*(3/11) = 0.0194805.
     Number of lineorder rows selected, for SF = 1000, is
0.0194805 * 6,000,000,000 ~ 116,883,000
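
(These factors follow the SSB spec's uniform distributions, assuming 7
distinct years, discount values 0 to 10, and quantities 1 to 50; a quick
Scala check of the arithmetic:)

    // Filter-factor arithmetic, assuming SSB's uniform value ranges:
    // dYear = 1993 is 1 of 7 years, loQuantity < 25 is about half of
    // 1..50, loDiscount between 1 and 3 is 3 of the 11 values 0..10.
    val ff = (1.0 / 7) * 0.5 * (3.0 / 11)  // ~ 0.0194805
    val rows = (ff * 6e9).toLong           // ~ 116,883,000 rows at SF = 1000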

*Lineorder carbon data table -*
    case class LineOrder(
      loOrderkey: Long, loLinenumber: Int, loCustkey: Int, loPartkey: Int,
      loSuppkey: Int, loOrderdate: Int, loOrderpriority: String,
      loShippriority: Int, loQuantity: Int, loExtendedprice: Int,
      loOrdtotalprice: Int, loDiscount: Int, loRevenue: Int,
      loSupplycost: Int, loTax: Int, loCommitdate: Int,
      loShipmode: String, dummy: String)
    *Options*: TABLE_BLOCKSIZE=1024, DICTIONARY_INCLUDE = loLinenumber,
loCustkey, loPartkey, loSuppkey, loOrderdate, loQuantity, loDiscount,
loRevenue, loCommitdate

*Date carbon data table -*
    case class Date(
      dDatekey: Int, dDate: String, dDayofweek: String, dMonth: String,
      dYear: Int, dYearmonthnum: Int, dYearmonth: String,
      dDaynuminweek: Int, dDaynuminmonth: Int, dDaynuminyear: Int,
      dMonthnuminyear: Int, dWeeknuminyear: Int, dSellingseason: String,
      dLastdayinweekfl: Int, dLastdayinmonthfl: Int, dHolidayfl: Int,
      dWeekdayfl: Int, dummy: String)
     *Options:* TABLE_BLOCKSIZE=1024, DICTIONARY_INCLUDE = dDatekey, dYear,
dYearmonthnum, dDaynuminweek, dDaynuminmonth, dDaynuminyear,
dMonthnuminyear, dWeeknuminyear, dLastdayinweekfl, dLastdayinmonthfl,
dHolidayfl, dWeekdayfl
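
(For clarity, a minimal sketch of the DDL these options imply, in CarbonData
1.x syntax; the actual CREATE TABLE statements are not shown in this thread:)

    // Minimal sketch, assuming a CarbonSession named `carbon`; the date
    // table would follow the same pattern.
    carbon.sql("""
      CREATE TABLE IF NOT EXISTS lineorder (
        loOrderkey BIGINT, loLinenumber INT, loCustkey INT, loPartkey INT,
        loSuppkey INT, loOrderdate INT, loOrderpriority STRING,
        loShippriority INT, loQuantity INT, loExtendedprice INT,
        loOrdtotalprice INT, loDiscount INT, loRevenue INT,
        loSupplycost INT, loTax INT, loCommitdate INT,
        loShipmode STRING, dummy STRING
      )
      STORED BY 'carbondata'
      TBLPROPERTIES (
        'TABLE_BLOCKSIZE'='1024',
        'DICTIONARY_INCLUDE'='loLinenumber,loCustkey,loPartkey,loSuppkey,loOrderdate,loQuantity,loDiscount,loRevenue,loCommitdate'
      )""")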

*Spark runtime configurations (same for both parquet and carbon data) -*
    executor-memory = 15g, executor-cores = 6, dynamic allocation (I tried
different configurations as well)
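
(For reference, a sketch of those settings expressed through the session
builder; assume they were actually passed on the spark-submit command line,
since executor sizing only takes effect at application launch:)

    import org.apache.spark.sql.SparkSession

    // Executor sizing as stated above; dynamic allocation also needs the
    // external shuffle service enabled on the cluster.
    val spark = SparkSession.builder()
      .config("spark.executor.memory", "15g")
      .config("spark.executor.cores", "6")
      .config("spark.dynamicAllocation.enabled", "true")
      .getOrCreate()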

Parquet runtime  = ~17 seconds
Carbon runtime = ~45 seconds.
    I tried changing TABLE_BLOCKSIZE to 256MB and 512MB, but performance is
still >40 seconds. Both filter columns (loDiscount, loQuantity) are
dimensions. I didn't know about the sort_columns property before, and I
don't know whether including filter columns in "dictionary_include" is
counterproductive.

   Please suggest any other configurations or options to improve the
performance of the above query. Help is very much appreciated.

Thanks
Swapnil

Re: Carbon data vs parquet performance

ravipesala
Hi Swapnil,


Yes, keeping measures as dimensions can be counterproductive, depending on
the following points:
 1. The position of those columns in the MDK (multi-dimensional key).
 2. The selectivity of the filter applied on those columns. For example, if
loDiscount between 1 and 3 matches thousands of dictionary values, all of
them must be applied as an IN filter, which is slower.

So it is better not to include measure columns in dictionary include or
exclude; every extra column in DICTIONARY_INCLUDE/DICTIONARY_EXCLUDE adds
IO overhead. In version 1.1.0 there is no measure filter optimization, so
queries that filter on measures are slower. Version 1.1.1 is already
released and includes the measure filter optimization, so it would be
better to move to that version.

It is better to keep the table block size at 300MB and the blocklet size at
128MB (configured through carbon.blockletgroup.size.in.mb in the carbon
properties file).
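
(A minimal sketch combining the two suggestions above; trimming
DICTIONARY_INCLUDE down to the two date keys is one reading of the advice,
not an exact recipe from this thread:)

    // 128MB blocklet groups, set programmatically here; equivalently put
    // carbon.blockletgroup.size.in.mb=128 in the carbon properties file.
    import org.apache.carbondata.core.util.CarbonProperties
    CarbonProperties.getInstance()
      .addProperty("carbon.blockletgroup.size.in.mb", "128")

    // Revised lineorder definition: 300MB blocks and measures dropped
    // from the dictionary, keeping only the two date keys.
    carbon.sql("""
      CREATE TABLE IF NOT EXISTS lineorder_v2 (
        loOrderkey BIGINT, loLinenumber INT, loCustkey INT, loPartkey INT,
        loSuppkey INT, loOrderdate INT, loOrderpriority STRING,
        loShippriority INT, loQuantity INT, loExtendedprice INT,
        loOrdtotalprice INT, loDiscount INT, loRevenue INT,
        loSupplycost INT, loTax INT, loCommitdate INT,
        loShipmode STRING, dummy STRING
      )
      STORED BY 'carbondata'
      TBLPROPERTIES (
        'TABLE_BLOCKSIZE'='300',
        'DICTIONARY_INCLUDE'='loOrderdate,loCommitdate'
      )""")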


And regarding the sort_columns feature: it does not exist in any released
version, so please set it aside unless you want to try it from the master
branch. We are planning to release version 1.2.0 next month, and this
feature will be included in it.

Also, please check the following (a quick sketch follows this list):
1. The date table should be small, so Spark ought to do a broadcast join;
please verify whether the carbon query actually uses a broadcast join.
2. How many tasks are launched for carbon compared to parquet after the
above configuration changes, and whether any tasks run much slower than
the others.
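
(A sketch of how to check both points; spark.sql.autoBroadcastJoinThreshold
is the standard Spark setting, not a carbon-specific one:)

    // Check whether the carbon query plans a broadcast join.
    val q = """select sum(loExtendedprice * loDiscount) as revenue
               from lineorder, date
               where loOrderdate = dDatekey and dYear = 1993
                 and loDiscount between 1 and 3 and loQuantity < 25"""
    carbon.sql(q).explain()  // look for BroadcastHashJoin in the plan
    // If the 39,000-row date table is not broadcast, raise the threshold:
    carbon.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")
    // Per-stage task counts and per-task times are in the Spark UI.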


Regards,
Ravindra.

--
Thanks & Regards,
Ravi