Hi Community,
I found that a bloomfilter datamap combined with a pre-aggregate datamap breaks a normal GROUP BY query, and even breaks some other queries on the Thrift server side. But if I drop the bloomfilter datamap, the query works.

Demo SQL:

```sql
CREATE TABLE IF NOT EXISTS store(
    market_code STRING,
    device_code STRING,
    country_code STRING,
    category_id INTEGER,
    product_id LONG,
    date TIMESTAMP,
    est_free_app_download LONG,
    est_paid_app_download LONG,
    est_revenue LONG
)
STORED BY 'carbondata'
TBLPROPERTIES(
    'SORT_COLUMNS'='market_code, device_code, country_code, category_id, date, product_id',
    'NO_INVERTED_INDEX'='est_free_app_download, est_paid_app_download, est_revenue',
    'DICTIONARY_INCLUDE'='market_code, device_code, country_code, category_id, product_id',
    'SORT_SCOPE'='GLOBAL_SORT',
    'CACHE_LEVEL'='BLOCKLET',
    'TABLE_BLOCKSIZE'='256',
    'GLOBAL_SORT_PARTITIONS'='2'
);

CREATE DATAMAP IF NOT EXISTS agg_by_day ON TABLE store
USING 'timeSeries'
DMPROPERTIES (
    'EVENT_TIME'='date',
    'DAY_GRANULARITY'='1')
AS SELECT date, market_code, device_code, country_code, category_id,
    COUNT(date), COUNT(est_free_app_download), COUNT(est_paid_app_download), COUNT(est_revenue),
    SUM(est_free_app_download), MIN(est_free_app_download), MAX(est_free_app_download),
    SUM(est_paid_app_download), MIN(est_paid_app_download), MAX(est_paid_app_download),
    SUM(est_revenue), MIN(est_revenue), MAX(est_revenue)
FROM store
GROUP BY date, market_code, device_code, country_code, category_id;

CREATE DATAMAP IF NOT EXISTS bloomfilter_all_dimensions ON TABLE store
USING 'bloomfilter'
DMPROPERTIES (
    'INDEX_COLUMNS'='market_code, device_code, country_code, category_id, date, product_id',
    'BLOOM_SIZE'='640000',
    'BLOOM_FPP'='0.000001',
    'BLOOM_COMPRESS'='true'
);
```

This is the failing query and the start of its stack trace:

```scala
carbon.time(carbon.sql(
  s"""
     |SELECT date, market_code, device_code, country_code, category_id, sum(est_free_app_download)
     |FROM store
     |WHERE date BETWEEN '2016-09-01' AND '2016-09-03' AND device_code='ios-phone'
     |  AND country_code='EE' AND category_id=100021
     |  AND product_id IN (590416158, 590437560)
     |GROUP BY date, market_code, device_code, country_code, category_id"""
    .stripMargin).show(truncate=false)
)
```

```
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(date#21, market_code#16, device_code#17, country_code#18, category_id#19, 2)
+- *(1) HashAggregate(keys=[date#21, market_code#16, device_code#17, country_code#18, category_id#19], functions=[partial_sum(est_free_app_download#22L)], output=[date#21, market_code#16, device_code#17, country_code#18, category_id#19, sum#74L])
   +- *(1) CarbonDictionaryDecoder [default_store], IncludeProfile(ArrayBuffer(category_id#19)), CarbonAliasDecoderRelation(), org.apache.spark.sql.CarbonSession@213d5189
      +- *(1) Project [market_code#16, device_code#17, country_code#18, category_id#19, date#21, est_free_app_download#22L]
         +- *(1) FileScan carbondata default.store[category_id#19,market_code#16,country_code#18,device_code#17,est_free_app_download#22L,date#21] PushedFilters: [IsNotNull(date), IsNotNull(device_code), IsNotNull(country_code), IsNotNull(category_id), Greate...
```
```
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
  at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:150)
  at org.apache.spark.sql.CarbonDictionaryDecoder.inputRDDs(CarbonDictionaryDecoder.scala:244)
  at org.apache.spark.sql.execution.BaseLimitExec$class.inputRDDs(limit.scala:62)
  at org.apache.spark.sql.execution.LocalLimitExec.inputRDDs(limit.scala:97)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:337)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3273)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:725)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:702)
  at $anonfun$1.apply$mcV$sp(<console>:39)
  at $anonfun$1.apply(<console>:39)
  at $anonfun$1.apply(<console>:39)
  at org.apache.spark.sql.SparkSession.time(SparkSession.scala:676)
  ... 57 elided
Caused by: java.lang.NullPointerException
  at org.apache.carbondata.datamap.bloom.BloomCoarseGrainDataMap.createQueryModel(BloomCoarseGrainDataMap.java:269)
  at org.apache.carbondata.datamap.bloom.BloomCoarseGrainDataMap.createQueryModel(BloomCoarseGrainDataMap.java:270)
  at org.apache.carbondata.datamap.bloom.BloomCoarseGrainDataMap.createQueryModel(BloomCoarseGrainDataMap.java:270)
  at org.apache.carbondata.datamap.bloom.BloomCoarseGrainDataMap.createQueryModel(BloomCoarseGrainDataMap.java:270)
  at org.apache.carbondata.datamap.bloom.BloomCoarseGrainDataMap.createQueryModel(BloomCoarseGrainDataMap.java:270)
  at org.apache.carbondata.datamap.bloom.BloomCoarseGrainDataMap.prune(BloomCoarseGrainDataMap.java:181)
  at org.apache.carbondata.core.datamap.TableDataMap.prune(TableDataMap.java:136)
  at org.apache.carbondata.core.datamap.dev.expr.DataMapExprWrapperImpl.prune(DataMapExprWrapperImpl.java:53)
  at org.apache.carbondata.core.datamap.dev.expr.AndDataMapExprWrapper.prune(AndDataMapExprWrapper.java:51)
  at org.apache.carbondata.core.datamap.dev.expr.AndDataMapExprWrapper.prune(AndDataMapExprWrapper.java:51)
  at org.apache.carbondata.hadoop.api.CarbonInputFormat.getPrunedBlocklets(CarbonInputFormat.java:515)
  at org.apache.carbondata.hadoop.api.CarbonInputFormat.getDataBlocksOfSegment(CarbonInputFormat.java:412)
  at org.apache.carbondata.hadoop.api.CarbonTableInputFormat.getSplits(CarbonTableInputFormat.java:528)
  at org.apache.carbondata.hadoop.api.CarbonTableInputFormat.getSplits(CarbonTableInputFormat.java:219)
  at org.apache.carbondata.spark.rdd.CarbonScanRDD.internalGetPartitions(CarbonScanRDD.scala:124)
  at org.apache.carbondata.spark.rdd.CarbonRDD.getPartitions(CarbonRDD.scala:61)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
  at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:318)
  at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:91)
  at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:128)
  at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
  ... 94 more
```

After dropping the bloomfilter datamap, the query works:

```
scala> carbon.sql("drop datamap bloomfilter_all_dimensions on table store")
18/09/24 05:06:17 AUDIT CarbonDropDataMapCommand: [ec2-dca-aa-p-sdn-16.appannie.org][hadoop][Thread-1]Deleting datamap [bloomfilter_all_dimensions] under table [store]
res1: org.apache.spark.sql.DataFrame = []

scala> carbon.time(carbon.sql(
     |   s"""
     |      |SELECT product_id, sum(est_free_app_download)
     |      |FROM store
     |      |WHERE date BETWEEN '2016-09-01' AND '2016-09-03' AND device_code='ios-phone' AND country_code='EE' AND category_id=100021 AND product_id IN (590416158, 590437560)
     |      |GROUP BY product_id"""
     |     .stripMargin).show(truncate=false)
     | )
+----------+--------------------------+
|product_id|sum(est_free_app_download)|
+----------+--------------------------+
|590416158 |2                         |
|590437560 |null                      |
+----------+--------------------------+
```

-- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Hi aaron, thanks for your feedback.
Which version of CarbonData are you using?
I use 1.5.0-SNAPSHOT, but I'm not sure about 1.4.1 (I forget whether I have tested it or not).
In reply to this post by aaron
Yeah, I am able to reproduce this problem using the current master code. I'll look into it.
In reply to this post by aaron
Hi aaron,

Actually, your query will not use the timeseries datamap, since the filter uses the field 'product_id', which is not contained in your preagg datamap. Even if I remove the preagg datamap, the query with the bloomfilter datamap still fails with the same error logs as in your post.

Then I added some logs in `BloomCoarseGrainDataMap.createQueryModel` to print the input parameter 'expression', and found the root cause. CarbonData parses the query, and in DataMapChooser it combines the filters into a tree which contains an expression 'TRUE'. This expression is a 'TrueExpression' with two children: the left is NULL and the right is a LiteralExpression. The bloomfilter datamap tries to dissolve the expression by applying `createQueryModel` to each child expression recursively, so eventually it hits an NPE when applying the function to NULL.

I'm not sure about the reason for the TrueExpression, but I'm sure it is the DataMapChooser that causes this problem. Actually, about one month ago we wanted to merge another optimization for datamap pruning: the current DataMapChooser forwards too many expressions to a datamap, even ones the datamap does not support. We will optimize this by forwarding only supported expressions to the datamap.

You can apply PR#2665 and test again. I've verified this and it is OK now. Please give your feedback once you have a result.
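The failure mode described above can be sketched in isolation. The classes below are simplified stand-ins, not the actual CarbonData types: a recursive walk that descends into every child of a 'TrueExpression' dereferences the NULL left child and throws, while a null-guarded walk does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for CarbonData's expression tree nodes.
class Expr {
    final String name;
    final List<Expr> children;
    Expr(String name, Expr... children) {
        this.name = name;
        this.children = Arrays.asList(children);
    }
}

public class BloomNpeSketch {
    // Mirrors the failing pattern: recurse into every child unconditionally.
    static void visitUnsafe(Expr e, List<String> out) {
        out.add(e.name);                       // NPE here when e is null
        for (Expr c : e.children) visitUnsafe(c, out);
    }

    // Defensive variant: skip null children instead of recursing into them.
    static void visitSafe(Expr e, List<String> out) {
        if (e == null) return;
        out.add(e.name);
        for (Expr c : e.children) visitSafe(c, out);
    }

    public static void main(String[] args) {
        // A 'true' expression with a null left child and a literal right
        // child, as observed in the logs from the thread.
        Expr tree = new Expr("and",
                new Expr("device_code = ios-phone"),
                new Expr("true", null, new Expr("literal")));

        List<String> names = new ArrayList<>();
        boolean npe = false;
        try {
            visitUnsafe(tree, names);
        } catch (NullPointerException ex) {
            npe = true;                        // reproduces the reported failure
        }
        System.out.println("unsafe walk threw NPE: " + npe);

        names.clear();
        visitSafe(tree, names);
        System.out.println("safe walk visited: " + names);
    }
}
```

Note this only illustrates why the recursion NPEs; the actual fix in PR#2665 goes further and stops forwarding such expressions to the datamap in the first place.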
More details about this issue. I've added some logs in `BloomCoarseGrainDataMap.createQueryModel` to print the input parameter 'expression'.

# Before applying PR2665

```
XU expression: org.apache.carbondata.core.scan.expression.logical.AndExpression@3b035d0c
XU expression statement:(((((((category_id <> null and device_code <> null) and date <> null) and country_code <> null) and date >= 1472688000000000 between date <= 1472860800000000) and true) and device_code = ios-phone) and country_code = EE)
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.logical.AndExpression@304f4888
XU expression statement:((((((category_id <> null and device_code <> null) and date <> null) and country_code <> null) and date >= 1472688000000000 between date <= 1472860800000000) and true) and device_code = ios-phone)
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.logical.AndExpression@35b97c69
XU expression statement:(((((category_id <> null and device_code <> null) and date <> null) and country_code <> null) and date >= 1472688000000000 between date <= 1472860800000000) and true)
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.logical.AndExpression@2c07277f
XU expression statement:((((category_id <> null and device_code <> null) and date <> null) and country_code <> null) and date >= 1472688000000000 between date <= 1472860800000000)
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.logical.AndExpression@d4df4ce
XU expression statement:(((category_id <> null and device_code <> null) and date <> null) and country_code <> null)
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.logical.AndExpression@470ce6e7
XU expression statement:((category_id <> null and device_code <> null) and date <> null)
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.logical.AndExpression@39a8905b
XU expression statement:(category_id <> null and device_code <> null)
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.conditional.NotEqualsExpression@2c8174ce
XU expression statement:category_id <> null
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.ColumnExpression@4e881e14
XU expression statement:category_id
XU expression children size:0
XU expression: org.apache.carbondata.core.scan.expression.LiteralExpression@6e13e2fc
XU expression statement:null
XU expression children size:0
XU expression: org.apache.carbondata.core.scan.expression.conditional.NotEqualsExpression@13448d2d
XU expression statement:device_code <> null
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.ColumnExpression@3444ac8f
XU expression statement:device_code
XU expression children size:0
XU expression: org.apache.carbondata.core.scan.expression.LiteralExpression@3ab26cad
XU expression statement:null
XU expression children size:0
XU expression: org.apache.carbondata.core.scan.expression.conditional.NotEqualsExpression@4b477d05
XU expression statement:date <> null
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.ColumnExpression@7c5dbca5
XU expression statement:date
XU expression children size:0
XU expression: org.apache.carbondata.core.scan.expression.LiteralExpression@6d6c4775
XU expression statement:null
XU expression children size:0
XU expression: org.apache.carbondata.core.scan.expression.conditional.NotEqualsExpression@44929971
XU expression statement:country_code <> null
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.ColumnExpression@344d6bb3
XU expression statement:country_code
XU expression children size:0
XU expression: org.apache.carbondata.core.scan.expression.LiteralExpression@2564410b
XU expression statement:null
XU expression children size:0
XU expression: org.apache.carbondata.core.scan.expression.logical.RangeExpression@2a3ced3d
XU expression statement:date >= 1472688000000000 between date <= 1472860800000000
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.conditional.GreaterThanEqualToExpression@7ab5b01a
XU expression statement:date >= 1472688000000000
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.ColumnExpression@25fa5c0c
XU expression statement:date
XU expression children size:0
XU expression: org.apache.carbondata.core.scan.expression.LiteralExpression@22112da1
XU expression statement:1472688000000000
XU expression children size:0
XU expression: org.apache.carbondata.core.scan.expression.conditional.LessThanEqualToExpression@6f0969db
XU expression statement:date <= 1472860800000000
XU expression children size:2
XU expression: org.apache.carbondata.core.scan.expression.ColumnExpression@38eb2140
XU expression statement:date
XU expression children size:0
XU expression: org.apache.carbondata.core.scan.expression.LiteralExpression@4f06006d
XU expression statement:1472860800000000
XU expression children size:0
XU expression: org.apache.carbondata.core.scan.expression.logical.TrueExpression@64514009
XU expression statement:true
XU expression children size:2
XU expression: null    ------ **which causes the problem**
```

# After applying PR2665

```
XU expression: org.apache.carbondata.core.scan.expression.conditional.EqualToExpression@4cb42a5a
XU expression statement:device_code = ios-phone
XU expression children size:2
Read 2 bloom indices from D:/01_workspace/carbondata2/integration/spark-common/target/warehouse/carbon_bloom/bloom_dm/0/mergeShard\device_code.bloomindexmerge
XU expression: org.apache.carbondata.core.scan.expression.conditional.EqualToExpression@6dfe4788
XU expression statement:country_code = EE
XU expression children size:2
Read 2 bloom indices from D:/01_workspace/carbondata2/integration/spark-common/target/warehouse/carbon_bloom/bloom_dm/0/mergeShard\country_code.bloomindexmerge
XU expression: org.apache.carbondata.core.scan.expression.conditional.EqualToExpression@4095ebff
XU expression statement:category_id = 100021
XU expression children size:2
Read 2 bloom indices from D:/01_workspace/carbondata2/integration/spark-common/target/warehouse/carbon_bloom/bloom_dm/0/mergeShard\category_id.bloomindexmerge
XU expression: org.apache.carbondata.core.scan.expression.conditional.InExpression@1def2d16
XU expression statement:product_id in (LiteralExpression(590416158);LiteralExpression(590437560);)
XU expression children size:
```

We can see that only expressions supported by the bloomfilter have been forwarded.
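The gist of the change visible in the before/after logs is a filtering step before expressions reach the index datamap. A minimal sketch of that idea (the names below are hypothetical, not the actual DataMapChooser API): forward only filters whose column is one of the datamap's index columns, so synthetic nodes like the injected TrueExpression never reach the bloomfilter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SupportedExprFilter {
    // Keep only filter columns the index datamap actually covers.
    static List<String> forwardable(List<String> filterColumns, Set<String> indexColumns) {
        return filterColumns.stream()
                .filter(indexColumns::contains)   // drops unsupported/synthetic entries
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Index columns from the bloomfilter datamap definition in this thread.
        Set<String> indexCols = new HashSet<>(Arrays.asList(
                "market_code", "device_code", "country_code",
                "category_id", "date", "product_id"));
        // "true" stands in for the synthetic TrueExpression the planner injected.
        List<String> filters = Arrays.asList(
                "device_code", "country_code", "category_id", "product_id", "true");
        System.out.println(forwardable(filters, indexCols));
    }
}
```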
In reply to this post by xuchuanyin
Great! Thanks for your quick response! I will have a try. Do you mean that I should merge https://github.com/apache/carbondata/pull/2665?

Thanks,
aaron
You can download the patch and apply it to master, then you can rebuild the jar and perform testing.
> On Tue, Sep 25, 2018 at 5:02 PM +0800, "aaron" <[hidden email]> wrote:
> Great! thanks for your so quick response! I will have a try. Do you mean
> that I merge https://github.com/apache/carbondata/pull/2665? Thanks aaron
Yes, you're right. The fix makes master work now.
In reply to this post by xuchuanyin
But one more comment: it seems that the bloomfilter datamap disappears from the query plan of a detail query? So in what case is the bloomfilter used?
Did you use the query in the first post? I tested it and it's OK; we can see the bloomfilter while explaining.

1. If the bloomfilter is not there, the reason may be that the main datamap has already pruned all the blocklets. In this case, the following index datamaps are skipped as a shortcut. Often, the query result will be empty.
2. Or your query hit the preagg datamap, which will not use an index datamap, because the query plan has already been rewritten to query the preagg table.

Can you post the explain result if you still have problems?

PS: the data I used for testing:

```sql
INSERT INTO store VALUES
  ("market_code_str", "device_code_str", "country_code_str", 100000, 590416100, "2016-08-01 12:00:00", 101, 1001, 10001),
  ("market_code_str", "ios-phone", "CC", 100000, 590416100, "2016-09-02 12:00:00", 101, 1001, 10001),
  ("market_code_str", "ios-phone", "EE", 100001, 590416158, "2016-09-02 12:00:00", 101, 1001, 10001),
  ("market_code_str", "ios-phone", "EE", 100021, 590416158, "2016-09-02 12:00:00", 101, 1001, 10001),
  ("market_code_str", "ios-phone", "FF", 100031, 590437560, "2016-09-02 12:00:00", 101, 1001, 10001),
  ("market_code_str", "ios-phone", "EE", 100021, 590437560, "2016-09-02 12:00:00", 101, 1001, 10001);
```

This data will always make the mentioned query hit a blocklet in the default blocklet datamap, so pruning will proceed to the bloom datamap.
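Point 1 above (later index datamaps being skipped once pruning yields nothing) can be illustrated with a small sketch. The names and structure here are hypothetical, not CarbonData's actual pruning code: datamaps are applied in order, and the loop short-circuits when the candidate blocklet list becomes empty, which is why the skipped datamap never shows up in the plan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class PruneShortCircuit {
    // Records which datamaps actually ran, for inspection.
    static List<String> applied = new ArrayList<>();

    // Apply datamaps in order; stop as soon as nothing is left to prune.
    static List<String> prune(List<String> blocklets,
                              List<UnaryOperator<List<String>>> datamaps,
                              List<String> names) {
        List<String> current = blocklets;
        for (int i = 0; i < datamaps.size(); i++) {
            if (current.isEmpty()) break;      // shortcut: skip remaining datamaps
            current = datamaps.get(i).apply(current);
            applied.add(names.get(i));
        }
        return current;
    }

    public static void main(String[] args) {
        // The main (blocklet) datamap prunes everything, so the bloom
        // datamap never runs for this query.
        UnaryOperator<List<String>> mainDm = b -> new ArrayList<>();   // prunes all
        UnaryOperator<List<String>> bloomDm = b -> b;                  // would keep all
        List<String> result = prune(
                Arrays.asList("blocklet-0", "blocklet-1"),
                Arrays.asList(mainDm, bloomDm),
                Arrays.asList("blocklet-datamap", "bloom-datamap"));
        System.out.println("result: " + result);
        System.out.println("applied: " + applied);   // bloom-datamap is absent
    }
}
```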
Based on that fix: after dropping the existing table and data and re-creating the table and datamap, it is exactly as you said, no problem. But yesterday I did not delete the data and table, I just created a new datamap, and then there were some problems.