Hello
I am not sure what I am doing wrong, but I am observing parquet running faster than carbon data.

*Carbondata version -* 1.1.0

*Data cardinality -* lineorder - 6B rows & date - 39,000 rows

*Query -*
select sum(loExtendedprice*loDiscount) as revenue
from lineorder, date
where loOrderdate = dDatekey and dYear = 1993 and loDiscount between 1 and 3 and loQuantity < 25

*Filter factor for this query -*
FF = (1/7)*0.5*(3/11) = 0.0194805. Number of lineorder rows selected, for SF = 1000, is 0.0194805 * 6,000,000,000 ~ 116,883,000.

*Lineorder carbon data table -*
case class LineOrder (
  loOrderkey: Long, loLinenumber: Int, loCustkey: Int, loPartkey: Int, loSuppkey: Int,
  loOrderdate: Int, loOrderpriority: String, loShippriority: Int, loQuantity: Int,
  loExtendedprice: Int, loOrdtotalprice: Int, loDiscount: Int, loRevenue: Int,
  loSupplycost: Int, loTax: Int, loCommitdate: Int, loShipmode: String, dummy: String)
*Options:* TABLE_BLOCKSIZE=1024, DICTIONARY_INCLUDE = loLinenumber, loCustkey, loPartkey, loSuppkey, loOrderdate, loQuantity, loDiscount, loRevenue, loCommitdate

*Date carbon data table -*
case class Date (
  dDatekey: Int, dDate: String, dDayofweek: String, dMonth: String, dYear: Int,
  dYearmonthnum: Int, dYearmonth: String, dDaynuminweek: Int, dDaynuminmonth: Int,
  dDaynuminyear: Int, dMonthnuminyear: Int, dWeeknuminyear: Int, dSellingseason: String,
  dLastdayinweekfl: Int, dLastdayinmonthfl: Int, dHolidayfl: Int, dWeekdayfl: Int, dummy: String)
*Options:* TABLE_BLOCKSIZE=1024, DICTIONARY_INCLUDE = dDatekey, dYear, dYearmonthnum, dDaynuminweek, dDaynuminmonth, dDaynuminyear, dMonthnuminyear, dWeeknuminyear, dLastdayinweekfl, dLastdayinmonthfl, dHolidayfl, dWeekdayfl

*Spark runtime configurations (same for both parquet and carbon data) -*
executor-memory = 15g, executor-cores = 6, dynamic allocation (I tried different configurations as well)

Parquet runtime = ~17 seconds
Carbon runtime = ~45 seconds

I tried changing TABLE_BLOCKSIZE to 256MB and 512MB, but performance is still >40 seconds. Both filter columns (loDiscount, loQuantity) are dimensions. I wasn't aware of the sort_columns property before, and I don't know whether including the filter columns in DICTIONARY_INCLUDE is counterproductive.

Please suggest any other configurations or options to improve the performance of the above query. Any help is very much appreciated.

Thanks
Swapnil
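[For reference, a minimal sketch of how the schemas and options listed above map onto CarbonData DDL and the query. The original mail does not show the exact statements, so the CarbonSession setup, the store path, and the statements themselves are a reconstruction under CarbonData 1.1.x on Spark 2.1, not necessarily what was actually run. Data loading is omitted.]

// Reconstructed sketch only - session setup and store path are assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val carbon = SparkSession.builder()
  .appName("ssb-q1")
  .getOrCreateCarbonSession("hdfs:///user/carbon/store")  // assumed store path

// Fact table with the options quoted in the mail (TABLE_BLOCKSIZE is in MB).
carbon.sql("""
  CREATE TABLE IF NOT EXISTS lineorder (
    loOrderkey BIGINT, loLinenumber INT, loCustkey INT, loPartkey INT, loSuppkey INT,
    loOrderdate INT, loOrderpriority STRING, loShippriority INT, loQuantity INT,
    loExtendedprice INT, loOrdtotalprice INT, loDiscount INT, loRevenue INT,
    loSupplycost INT, loTax INT, loCommitdate INT, loShipmode STRING, dummy STRING)
  STORED BY 'carbondata'
  TBLPROPERTIES (
    'TABLE_BLOCKSIZE'='1024',
    'DICTIONARY_INCLUDE'='loLinenumber,loCustkey,loPartkey,loSuppkey,loOrderdate,loQuantity,loDiscount,loRevenue,loCommitdate')
""")

// Dimension table with the options quoted in the mail.
carbon.sql("""
  CREATE TABLE IF NOT EXISTS date (
    dDatekey INT, dDate STRING, dDayofweek STRING, dMonth STRING, dYear INT,
    dYearmonthnum INT, dYearmonth STRING, dDaynuminweek INT, dDaynuminmonth INT,
    dDaynuminyear INT, dMonthnuminyear INT, dWeeknuminyear INT, dSellingseason STRING,
    dLastdayinweekfl INT, dLastdayinmonthfl INT, dHolidayfl INT, dWeekdayfl INT, dummy STRING)
  STORED BY 'carbondata'
  TBLPROPERTIES (
    'TABLE_BLOCKSIZE'='1024',
    'DICTIONARY_INCLUDE'='dDatekey,dYear,dYearmonthnum,dDaynuminweek,dDaynuminmonth,dDaynuminyear,dMonthnuminyear,dWeeknuminyear,dLastdayinweekfl,dLastdayinmonthfl,dHolidayfl,dWeekdayfl')
""")

// The query from the mail (SSB Q1.1 style).
carbon.sql("""
  SELECT SUM(loExtendedprice * loDiscount) AS revenue
  FROM lineorder, date
  WHERE loOrderdate = dDatekey
    AND dYear = 1993
    AND loDiscount BETWEEN 1 AND 3
    AND loQuantity < 25
""").show()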
Hi Swapnil,
Yes, keeping the measures as dimensions can be counterproductive, depending on the following points:
1. The location of those columns in the MDK (multi-dimensional key).
2. The selectivity of the filter applied on those columns. For example, if loDiscount BETWEEN 1 AND 3 resolves to thousands of dictionary values, all of them have to be applied as an IN filter, which is slower.

So it is better not to use DICTIONARY_INCLUDE or DICTIONARY_EXCLUDE on measure columns; putting more columns in the dictionary include/exclude lists always adds an extra IO burden.

In version 1.1.0 there is no measure filter optimization, so queries that filter on measures are slower. Version 1.1.1 has already been released and includes the measure filter optimization, so it would be better to move to that version.

It is better to keep the table block size at 300 MB and the blocklet size at 128 MB (configured through carbon.blockletgroup.size.in.mb in the carbon properties file). Regarding the sort_columns feature, it does not exist in any released version, so please leave it aside unless you want to try it on the master branch; we are planning to release version 1.2.0 next month and this feature will be included in it.

Also, please check the following:
1. The date table should be small, so Spark should do a broadcast join. Please verify whether the CarbonData query actually uses a broadcast join.
2. How many tasks are launched for CarbonData compared to parquet after the above configuration changes, and whether any tasks are much slower than the others.

Regards,
Ravindra.
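[Putting the suggestions above together, a minimal sketch of the tuning and verification steps. TABLE_BLOCKSIZE and carbon.blockletgroup.size.in.mb come from the reply; the table name lineorder_tuned, the store path, and the use of explain() and spark.sql.autoBroadcastJoinThreshold to check and force the broadcast join are assumptions about how one might verify this, not steps taken from the thread.]

// Sketch assuming CarbonData 1.1.1 on Spark 2.1.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val carbon = SparkSession.builder()
  .getOrCreateCarbonSession("hdfs:///user/carbon/store")  // assumed store path

// In the carbon properties file (the "carbon property file" mentioned above):
//   carbon.blockletgroup.size.in.mb=128

// Re-create the fact table with a 300 MB block size and without dictionary
// encoding on the measure-like filter columns (loDiscount, loQuantity), so they
// stay plain measures and benefit from the 1.1.1 measure filter optimization.
carbon.sql("""
  CREATE TABLE IF NOT EXISTS lineorder_tuned (
    loOrderkey BIGINT, loLinenumber INT, loCustkey INT, loPartkey INT, loSuppkey INT,
    loOrderdate INT, loOrderpriority STRING, loShippriority INT, loQuantity INT,
    loExtendedprice INT, loOrdtotalprice INT, loDiscount INT, loRevenue INT,
    loSupplycost INT, loTax INT, loCommitdate INT, loShipmode STRING, dummy STRING)
  STORED BY 'carbondata'
  TBLPROPERTIES ('TABLE_BLOCKSIZE'='300')
""")

// 1. Check whether the small date table is broadcast: the physical plan should
//    show a BroadcastHashJoin rather than a SortMergeJoin.
carbon.sql("""
  SELECT SUM(loExtendedprice * loDiscount) AS revenue
  FROM lineorder_tuned, date
  WHERE loOrderdate = dDatekey
    AND dYear = 1993
    AND loDiscount BETWEEN 1 AND 3
    AND loQuantity < 25
""").explain()

// If it is not broadcast, raising the broadcast threshold (Spark's default is 10 MB)
// is one way to force it:
carbon.conf.set("spark.sql.autoBroadcastJoinThreshold", (64 * 1024 * 1024).toString)

// 2. Compare the number of tasks and per-task times for the carbon and parquet runs
//    in the Spark UI (Stages tab) to spot skewed or unusually slow tasks.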