Carbon data vs parquet performance
Posted by Swapnil Shinde on Jul 23, 2017; 2:53am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Carbon-data-vs-parquet-performance-tp18687.html
Hello
I am not sure what I am doing wrong, but I am observing Parquet running
faster than CarbonData -
*CarbonData version -* 1.1.0
*Data cardinality-*
lineorder - 6B rows & date - 39,000 rows
*Query -*
select sum(loExtendedprice*loDiscount) as revenue
from lineorder, date
where loOrderdate = dDatekey
  and dYear = 1993
  and loDiscount between 1 and 3
  and loQuantity < 25
*Filter factor for this query -*
FF = (1/7)*0.5*(3/11) = 0.0194805.
Number of lineorder rows selected, for SF = 1000 (lineorder = 6B rows), is
0.0194805 * 6,000,000,000 ~ 116,883,000
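(To check that estimate against reality, I suppose a count with the same predicates could be run on both the Parquet and the carbon tables — the ~116.9M figure above is only the analytic filter-factor estimate:)

```sql
-- Count the rows the predicates actually select, to compare
-- against the estimated ~116,883,000 from the filter factor.
select count(*)
from lineorder, date
where loOrderdate = dDatekey
  and dYear = 1993
  and loDiscount between 1 and 3
  and loQuantity < 25
```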
*Lineorder carbon data table -*
case class LineOrder ( loOrderkey: Long, loLinenumber: Int,
loCustkey: Int, loPartkey: Int, loSuppkey: Int, loOrderdate: Int,
loOrderpriority: String, loShippriority: Int, loQuantity: Int,
loExtendedprice: Int, loOrdtotalprice: Int, loDiscount: Int,
loRevenue: Int, loSupplycost: Int, loTax: Int, loCommitdate: Int,
loShipmode: String, dummy: String)
*Options*: TABLE_BLOCKSIZE=1024, DICTIONARY_INCLUDE = loLinenumber,
loCustkey, loPartkey, loSuppkey, loOrderdate, loQuantity, loDiscount,
loRevenue, loCommitdate
*Date carbon data table -*
case class Date ( dDatekey: Int, dDate: String, dDayofweek:
String, dMonth: String, dYear: Int, dYearmonthnum: Int, dYearmonth:
String, dDaynuminweek: Int, dDaynuminmonth: Int, dDaynuminyear: Int,
dMonthnuminyear: Int, dWeeknuminyear: Int, dSellingseason: String,
dLastdayinweekfl: Int, dLastdayinmonthfl: Int, dHolidayfl: Int,
dWeekdayfl: Int, dummy: String)
*Options:* TABLE_BLOCKSIZE=1024, DICTIONARY_INCLUDE = dDatekey, dYear,
dYearmonthnum, dDaynuminweek, dDaynuminmonth, dDaynuminyear,
dMonthnuminyear, dWeeknuminyear, dLastdayinweekfl, dLastdayinmonthfl,
dHolidayfl, dWeekdayfl
*Spark runtime configurations (Same for both parquet and carbon data) - *
executor-memory = 15g, executor-cores = 6, dynamic allocation (I tried
different configurations as well)
Parquet runtime = ~17 seconds
Carbon runtime = ~45 seconds.
I tried changing TABLE_BLOCKSIZE to 256 MB and 512 MB, but performance is
still >40 seconds. Both filter columns (loDiscount, loQuantity) are
dimensions. I didn't know about the SORT_COLUMNS property before, and I
don't know whether including the filter columns in DICTIONARY_INCLUDE is
counterproductive.
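If SORT_COLUMNS would help, I am considering recreating the lineorder table along these lines — just a sketch of what I understand the DDL to be, and I'm not sure the property values are right:

```sql
-- Sketch (my assumption): put the filter columns first in SORT_COLUMNS
-- so min/max indexes can prune blocklets on dYear/discount/quantity.
CREATE TABLE lineorder (
  loOrderkey BIGINT, loLinenumber INT, loCustkey INT, loPartkey INT,
  loSuppkey INT, loOrderdate INT, loOrderpriority STRING,
  loShippriority INT, loQuantity INT, loExtendedprice INT,
  loOrdtotalprice INT, loDiscount INT, loRevenue INT,
  loSupplycost INT, loTax INT, loCommitdate INT,
  loShipmode STRING, dummy STRING
)
STORED BY 'carbondata'
TBLPROPERTIES (
  'SORT_COLUMNS' = 'loOrderdate,loDiscount,loQuantity',
  'TABLE_BLOCKSIZE' = '1024'
)
```

Would ordering SORT_COLUMNS this way (join/filter key first) be the right approach, or should DICTIONARY_INCLUDE be dropped for these columns at the same time?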
Please suggest any other configurations or options to improve the
performance of the above query. Help is very much appreciated.
Thanks
Swapnil