
Carbon data vs parquet performance

Posted by Swapnil Shinde on Jul 23, 2017; 2:53am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Carbon-data-vs-parquet-performance-tp18687.html

Hello
   I am not sure what I am doing wrong, but I am observing Parquet running
faster than CarbonData -

*Carbondata version - *1.1.0
*Data cardinality-*
     lineorder - 6B rows & date - 39,000 rows
*Query -*
     select sum(loExtendedprice * loDiscount) as revenue
     from lineorder, date
     where loOrderdate = dDatekey
       and dYear = 1993
       and loDiscount between 1 and 3
       and loQuantity < 25
*Filter factor for this query -*
     FF = (1/7) * 0.5 * (3/11) = 0.0194805
     Number of lineorder rows selected, for SF = 1000, is
0.0194805 * 6,000,000,000 ≈ 116,883,000

*Lineorder carbon data table -*
    case class LineOrder(
      loOrderkey: Long, loLinenumber: Int, loCustkey: Int, loPartkey: Int,
      loSuppkey: Int, loOrderdate: Int, loOrderpriority: String,
      loShippriority: Int, loQuantity: Int, loExtendedprice: Int,
      loOrdtotalprice: Int, loDiscount: Int, loRevenue: Int, loSupplycost: Int,
      loTax: Int, loCommitdate: Int, loShipmode: String, dummy: String)
    *Options*: TABLE_BLOCKSIZE=1024, DICTIONARY_INCLUDE = loLinenumber,
loCustkey, loPartkey, loSuppkey, loOrderdate, loQuantity, loDiscount,
loRevenue, loCommitdate
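
    In case it clarifies the setup, the table was created with DDL roughly
along these lines (a sketch from memory, assuming the Spark 2.x CarbonSession
API; the store path is just a placeholder and the column types mirror the
case class above):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.CarbonSession._

    // Placeholder store path; adjust for your cluster.
    val carbon = SparkSession.builder()
      .getOrCreateCarbonSession("hdfs:///user/carbon/store")

    // Column types mirror the LineOrder case class; options as listed above.
    carbon.sql("""
      CREATE TABLE IF NOT EXISTS lineorder (
        loOrderkey BIGINT, loLinenumber INT, loCustkey INT, loPartkey INT,
        loSuppkey INT, loOrderdate INT, loOrderpriority STRING,
        loShippriority INT, loQuantity INT, loExtendedprice INT,
        loOrdtotalprice INT, loDiscount INT, loRevenue INT, loSupplycost INT,
        loTax INT, loCommitdate INT, loShipmode STRING, dummy STRING)
      STORED BY 'carbondata'
      TBLPROPERTIES (
        'TABLE_BLOCKSIZE'='1024',
        'DICTIONARY_INCLUDE'='loLinenumber,loCustkey,loPartkey,loSuppkey,loOrderdate,loQuantity,loDiscount,loRevenue,loCommitdate')
    """)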

*Date carbon data table -*
    case class Date(
      dDatekey: Int, dDate: String, dDayofweek: String, dMonth: String,
      dYear: Int, dYearmonthnum: Int, dYearmonth: String, dDaynuminweek: Int,
      dDaynuminmonth: Int, dDaynuminyear: Int, dMonthnuminyear: Int,
      dWeeknuminyear: Int, dSellingseason: String, dLastdayinweekfl: Int,
      dLastdayinmonthfl: Int, dHolidayfl: Int, dWeekdayfl: Int, dummy: String)
     *Options:* TABLE_BLOCKSIZE=1024, DICTIONARY_INCLUDE = dDatekey, dYear,
dYearmonthnum, dDaynuminweek, dDaynuminmonth, dDaynuminyear,
dMonthnuminyear, dWeeknuminyear, dLastdayinweekfl, dLastdayinmonthfl,
dHolidayfl, dWeekdayfl

*Spark runtime configurations (same for both Parquet and CarbonData) -*
    executor-memory = 15g, executor-cores = 6, dynamic allocation enabled (I
tried different configurations as well)
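
    For reference, the session was configured roughly like this (a sketch of
the settings above applied at build time; the same values could equally be
passed as spark-submit flags, and the shuffle-service setting is only there
because dynamic allocation needs it on YARN):

    import org.apache.spark.sql.SparkSession

    // Rough sketch of the runtime settings described above.
    val spark = SparkSession.builder()
      .config("spark.executor.memory", "15g")
      .config("spark.executor.cores", "6")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")
      .getOrCreate()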

Parquet runtime = ~17 seconds
Carbon runtime = ~45 seconds
    I tried changing TABLE_BLOCKSIZE to 256MB and 512MB, but the runtime is
still >40 seconds. Both filter columns (loDiscount, loQuantity) are
dimensions. I was not aware of the SORT_COLUMNS property before, and I don't
know whether including the filter columns in DICTIONARY_INCLUDE is
counterproductive.
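Would recreating lineorder with SORT_COLUMNS along these lines be the right
direction? This is an untested sketch; the column order is just my guess at
putting the join/filter columns of this query first:

    // Untested sketch: lineorder recreated with the query's join/filter
    // columns leading the sort order, instead of relying on DICTIONARY_INCLUDE.
    // `carbon` is the CarbonSession from the earlier DDL sketch.
    carbon.sql("""
      CREATE TABLE IF NOT EXISTS lineorder_sorted (
        loOrderkey BIGINT, loLinenumber INT, loCustkey INT, loPartkey INT,
        loSuppkey INT, loOrderdate INT, loOrderpriority STRING,
        loShippriority INT, loQuantity INT, loExtendedprice INT,
        loOrdtotalprice INT, loDiscount INT, loRevenue INT, loSupplycost INT,
        loTax INT, loCommitdate INT, loShipmode STRING, dummy STRING)
      STORED BY 'carbondata'
      TBLPROPERTIES (
        'TABLE_BLOCKSIZE'='512',
        'SORT_COLUMNS'='loOrderdate,loDiscount,loQuantity')
    """)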

   Please suggest any other configurations or options to improve the
performance of the above query. Help is very much appreciated.

Thanks
Swapnil