jackylk opened a new pull request #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625

### Why is this PR needed?
Materialized View is a feature built on top of Spark, so it should support not only the carbondata format but also other formats.

### What changes were proposed in this PR?
1. Added an API in CarbonMetaStore to look up any relation, not just a carbon relation.
2. When creating an MV, use the newly added relation lookup to get all parent table relations. Use CatalogTable instead of CarbonTable whenever possible.
3. Skip the segment handling when loading an MV.

### Does this PR introduce any user interface change?
- No

### Is any new testcase added?
- Yes (WIP, need to add more testcases)

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email]

With regards, Apache Git Services

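For readers skimming the thread, the three proposed changes can be pictured with a toy model. This is only a sketch in plain Scala: `Relation`, `CarbonRelation`, `GenericRelation`, `lookupAnyRelation` and `loadSteps` are hypothetical names made up for illustration, not the CarbonMetaStore or Spark APIs, and it assumes that "skip the segment handling" applies to parent tables that are not carbon tables.

```scala
// Toy model only: none of these names are CarbonData or Spark APIs.
sealed trait Relation { def tableName: String }
final case class CarbonRelation(tableName: String, segments: Seq[String]) extends Relation
final case class GenericRelation(tableName: String, provider: String) extends Relation

object MvSketch {
  // Hypothetical stand-in for a metastore lookup that returns any kind of
  // relation instead of only carbon relations. Keys are lower-case table names.
  def lookupAnyRelation(catalog: Map[String, Relation], name: String): Option[Relation] =
    catalog.get(name.toLowerCase)

  // Describe the load steps per parent table: segment tracking is kept for
  // carbon relations and skipped for everything else (an assumption here).
  def loadSteps(parents: Seq[Relation]): Seq[String] = parents.map {
    case CarbonRelation(name, segments) =>
      s"$name: track ${segments.size} segment(s), then load"
    case GenericRelation(name, provider) =>
      s"$name: skip segment handling ($provider), then load"
  }
}
```

The point of the sealed trait is that the MV code can pattern match once on the kind of parent table and branch from there, rather than assuming every parent is a carbon table.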
CarbonDataQA1 commented on issue #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#issuecomment-586884120
Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/317/

CarbonDataQA1 commented on issue #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#issuecomment-586909581
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2019/

CarbonDataQA1 commented on issue #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#issuecomment-586933814
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/318/

CarbonDataQA1 commented on issue #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#issuecomment-586955255
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2020/

CarbonDataQA1 commented on issue #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#issuecomment-587468262
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/339/

CarbonDataQA1 commented on issue #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#issuecomment-587505158
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2041/

Indhumathi27 commented on a change in pull request #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#discussion_r382534896

##########
File path: datamap/mv/core/src/test/scala/org/apache/carbondata/mv/rewrite/MVCreateTestCase.scala
##########
@@ -104,6 +104,58 @@ class MVCreateTestCase extends QueryTest with BeforeAndAfterAll {
     sql(s"""LOAD DATA local inpath '$resourcesPath/data_big.csv' INTO TABLE fact_table6 OPTIONS('DELIMITER'= ',', 'QUOTECHAR'= '"')""")
   }
+  test("test create mv on parquet spark table") {
+    sql("drop materialized view if exists mv1")
+    sql("drop table if exists source")
+    sql("create table source using parquet as select * from fact_table1")
+    sql("create materialized view mv1 as select empname, deptname, avg(salary) from source group by empname, deptname")
+    val df = sql("select empname, avg(salary) from source group by empname")

Review comment:
Please add a test case with the following steps:
1. Create a parquet table, create an MV, check the results
2. Load to the parquet table
3. Check the results

jackylk commented on a change in pull request #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#discussion_r382912976

##########
File path: datamap/mv/core/src/test/scala/org/apache/carbondata/mv/rewrite/MVCreateTestCase.scala
##########
@@ -104,6 +104,58 @@ class MVCreateTestCase extends QueryTest with BeforeAndAfterAll {
     sql(s"""LOAD DATA local inpath '$resourcesPath/data_big.csv' INTO TABLE fact_table6 OPTIONS('DELIMITER'= ',', 'QUOTECHAR'= '"')""")
   }
+  test("test create mv on parquet spark table") {
+    sql("drop materialized view if exists mv1")
+    sql("drop table if exists source")
+    sql("create table source using parquet as select * from fact_table1")
+    sql("create materialized view mv1 as select empname, deptname, avg(salary) from source group by empname, deptname")
+    val df = sql("select empname, avg(salary) from source group by empname")

Review comment:
ok, added

jackylk commented on a change in pull request #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#discussion_r382912976

##########
File path: datamap/mv/core/src/test/scala/org/apache/carbondata/mv/rewrite/MVCreateTestCase.scala
##########
@@ -104,6 +104,58 @@ class MVCreateTestCase extends QueryTest with BeforeAndAfterAll {
     sql(s"""LOAD DATA local inpath '$resourcesPath/data_big.csv' INTO TABLE fact_table6 OPTIONS('DELIMITER'= ',', 'QUOTECHAR'= '"')""")
   }
+  test("test create mv on parquet spark table") {
+    sql("drop materialized view if exists mv1")
+    sql("drop table if exists source")
+    sql("create table source using parquet as select * from fact_table1")
+    sql("create materialized view mv1 as select empname, deptname, avg(salary) from source group by empname, deptname")
+    val df = sql("select empname, avg(salary) from source group by empname")

Review comment:
ok, added. See the test case named "test create mv on parquet spark table".

CarbonDataQA1 commented on issue #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#issuecomment-589958619
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/400/

CarbonDataQA1 commented on issue #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#issuecomment-589965909
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2101/

CarbonDataQA1 commented on issue #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#issuecomment-590418277
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/440/

CarbonDataQA1 commented on issue #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#issuecomment-590463974
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2140/

akashrn5 commented on a change in pull request #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#discussion_r383437391

##########
File path: mv/core/src/main/scala/org/apache/carbondata/mv/extension/MVHelper.scala
##########
@@ -610,13 +616,12 @@ object MVHelper {
       val relationList = fields._2.columnTableRelationList
       // check if columns present in datamap are long_string_col in parent table. If they are
       // long_string_columns in parent, make them long_string_columns in datamap
-      if (aggFunc.isEmpty &&
-          relationList.size == 1 &&
-          longStringColumnInParents.contains(relationList.head.parentColumnName)) {
-        varcharDatamapFields += relationList.head.parentColumnName
+      if (aggFunc.isEmpty && relationList.size == 1 && longStringColumnInParents
+        .contains(relationList.head.columnName)) {
+        varcharDatamapFields += relationList.head.columnName
       }
     })
-    if (varcharDatamapFields.nonEmpty) {
+    if (!varcharDatamapFields.isEmpty) {

Review comment:
Revert this change; the old one was correct.

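As background on this comment: on Scala collections `nonEmpty` is defined as the negation of `isEmpty`, so the two conditions in the diff behave the same and the request is about readability. A minimal sketch follows; `NonEmptyDemo` and `describe` are illustrative names, and `ArrayBuffer` is only an assumed stand-in for whatever mutable collection `varcharDatamapFields` actually is.

```scala
import scala.collection.mutable.ArrayBuffer

object NonEmptyDemo {
  // Reads positively: "if there are varchar fields, join them".
  def describe(varcharFields: ArrayBuffer[String]): String =
    if (varcharFields.nonEmpty) varcharFields.mkString(",") else "<none>"

  // The negated form gives the same answer for every input.
  def sameAnswer(varcharFields: ArrayBuffer[String]): Boolean =
    varcharFields.nonEmpty == !varcharFields.isEmpty
}
```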
akashrn5 commented on a change in pull request #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#discussion_r383437109

##########
File path: mv/core/src/main/scala/org/apache/carbondata/mv/extension/MVHelper.scala
##########
@@ -568,17 +573,18 @@ object MVHelper {
     generatePartitionerField(allPartitionColumn.toList, Seq.empty)
   }
-  private def inheritTablePropertiesFromMainTable(parentTable: CarbonTable,
+  private def inheritTablePropertiesFromMainTable(
+      parentTable: CarbonTable,
       fields: Seq[Field],
       fieldRelationMap: scala.collection.mutable.LinkedHashMap[Field, MVField],
       tableProperties: mutable.Map[String, String]): Unit = {
     var neworder = Seq[String]()
-    val parentOrder = parentTable.getSortColumns.asScala
+    val parentOrder = parentTable.getSortColumns().asScala
     parentOrder.foreach(parentcol =>
       fields.filter(col => fieldRelationMap(col).aggregateFunction.isEmpty &&
                            fieldRelationMap(col).columnTableRelationList.size == 1 &&
                            parentcol.equalsIgnoreCase(
-                             fieldRelationMap(col).columnTableRelationList(0).parentColumnName))
+                             fieldRelationMap(col).columnTableRelationList(0).columnName))

Review comment:
replace with `.head`

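A short illustration of the suggestion, using a plain `Seq[String]` as an assumed stand-in for `columnTableRelationList` (`HeadDemo` and `firstColumn` are hypothetical names). On a non-empty sequence `xs(0)` and `xs.head` return the same element, and the code under review already checks `size == 1` first, so `.head` is safe there. Where emptiness is possible, `headOption` avoids the exception altogether.

```scala
object HeadDemo {
  val relations: Seq[String] = Seq("empname")

  // Same element either way on a non-empty Seq.
  val byIndex: String = relations(0)
  val byHead: String = relations.head

  // Returns None instead of throwing when the Seq is empty.
  def firstColumn(rs: Seq[String]): Option[String] = rs.headOption
}
```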
akashrn5 commented on a change in pull request #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#discussion_r383443633

##########
File path: mv/core/src/test/scala/org/apache/carbondata/mv/rewrite/MVCreateTestCase.scala
##########
@@ -104,6 +104,79 @@ class MVCreateTestCase extends QueryTest with BeforeAndAfterAll {
     sql(s"""LOAD DATA local inpath '$resourcesPath/data_big.csv' INTO TABLE fact_table6 OPTIONS('DELIMITER'= ',', 'QUOTECHAR'= '"')""")
   }
+  test("test create mv on parquet spark table") {
+    sql("drop materialized view if exists mv1")
+    sql("drop table if exists source")
+    sql("create table source using parquet as select * from fact_table1")
+    sql("create materialized view mv1 as select empname, deptname, avg(salary) from source group by empname, deptname")
+    var df = sql("select empname, avg(salary) from source group by empname")
+    assert(TestUtil.verifyMVDataMap(df.queryExecution.optimizedPlan, "mv1"))
+    checkAnswer(df, sql("select empname, avg(salary) from fact_table2 group by empname"))
+
+    // load to parquet table and check again
+    sql("insert into source select * from fact_table1")
+    df = sql("select empname, avg(salary) from source group by empname")
+    assert(TestUtil.verifyMVDataMap(df.queryExecution.optimizedPlan, "mv1"))
+    checkAnswer(df, sql("select empname, avg(salary) from fact_table2 group by empname"))
+
+    sql(s"drop materialized view mv1")
+    sql("drop table source")
+  }
+
+  test("test create mv on orc spark table") {
+    sql("drop materialized view if exists mv1")
+    sql("drop table if exists source")
+    sql("create table source using orc as select * from fact_table1")
+    sql("create materialized view mv1 as select empname, deptname, avg(salary) from source group by empname, deptname")
+    var df = sql("select empname, avg(salary) from source group by empname")
+    assert(TestUtil.verifyMVDataMap(df.queryExecution.optimizedPlan, "mv1"))
+    checkAnswer(df, sql("select empname, avg(salary) from fact_table2 group by empname"))
+
+    // load to orc table and check again
+    sql("insert into source select * from fact_table1")
+    df = sql("select empname, avg(salary) from source group by empname")
+    assert(TestUtil.verifyMVDataMap(df.queryExecution.optimizedPlan, "mv1"))
+    checkAnswer(df, sql("select empname, avg(salary) from fact_table2 group by empname"))
+
+    sql(s"drop materialized view mv1")
+    sql("drop table source")
+  }
+
+  test("test create mv on parquet hive table") {
+    sql("drop materialized view if exists mv1")
+    sql("drop table if exists source")
+    sql("create table source stored as parquet as select * from fact_table1")
+    sql("create materialized view mv1 as select empname, deptname, avg(salary) from source group by empname, deptname")
+    var df = sql("select empname, avg(salary) from source group by empname")
+    assert(TestUtil.verifyMVDataMap(df.queryExecution.optimizedPlan, "mv1"))
+    checkAnswer(df, sql("select empname, avg(salary) from fact_table2 group by empname"))
+
+    // load to parquet table and check again
+    sql("insert into source select * from fact_table1")
+    df = sql("select empname, avg(salary) from source group by empname")
+    assert(TestUtil.verifyMVDataMap(df.queryExecution.optimizedPlan, "mv1"))
+    checkAnswer(df, sql("select empname, avg(salary) from fact_table2 group by empname"))
+
+    sql(s"drop materialized view mv1")
+    sql("drop table source")
+  }
+

Review comment:
Can you also add a test case for partition?

akashrn5 commented on a change in pull request #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#discussion_r383447624

##########
File path: mv/core/src/main/scala/org/apache/carbondata/mv/extension/MVHelper.scala
##########
@@ -72,42 +69,41 @@ object MVHelper {
       dataMapSchema: DataMapSchema,
       queryString: String,
       ifNotExistsSet: Boolean = false): Unit = {
-    val dmProperties = dataMapSchema.getProperties.asScala
-    if (dmProperties.contains("streaming") && dmProperties("streaming").equalsIgnoreCase("true")) {
+    val properties = dataMapSchema.getProperties.asScala
+    if (properties.contains("streaming") && properties("streaming").equalsIgnoreCase("true")) {
       throw new MalformedCarbonCommandException(
         s"Materialized view does not support streaming"
       )
     }
     val mvUtil = new MVUtil
-    mvUtil.validateDMProperty(dmProperties)
-    val logicalPlan = dropDummyFunc(
-      MVParser.getMVPlan(queryString, sparkSession))
+    mvUtil.validateDMProperty(properties)
+    val queryPlan = dropDummyFunc(MVParser.getMVPlan(queryString, sparkSession))
     // if there is limit in MV ctas query string, throw exception, as its not a valid usecase
-    logicalPlan match {
+    queryPlan match {
       case Limit(_, _) =>
         throw new MalformedCarbonCommandException("Materialized view does not support the query " +
                                                   "with limit")
       case _ =>
     }
-    val selectTables = getTables(logicalPlan)
-    if (selectTables.isEmpty) {
+    val mainTables = getCatalogTables(queryPlan)
+    if (mainTables.isEmpty) {
       throw new MalformedCarbonCommandException(
         s"Non-Carbon table does not support creating MV datamap")
     }
-    val modularPlan = validateMVQuery(sparkSession, logicalPlan)
+    val modularPlan = validateMVQuery(sparkSession, queryPlan)
     val updatedQueryWithDb = modularPlan.asCompactSQL
-    val (timeSeriesColumn, granularity): (String, String) = validateMVTimeSeriesQuery(
-      logicalPlan,
-      dataMapSchema)
-    val fullRebuild = isFullReload(logicalPlan)
+    val (timeSeriesColumn, granularity) = validateMVTimeSeriesQuery(queryPlan, dataMapSchema)
+    val isQueryNeedFullRebuild = checkIsQueryNeedFullRebuild(queryPlan)
+    val isHasNonCarbonProvider = checkIsMainTableHasNonCarbonSource(mainTables)

Review comment:
Just `hasNonCarbonProvider` would be fine, I think.

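To make the naming suggestion concrete, here is a rough sketch of what such a check could look like. Everything in it is an assumption for illustration: `TableMeta` is a hypothetical stand-in for the catalog table metadata (not Spark's `CatalogTable`), the provider string "carbondata" is assumed, and the real `checkIsMainTableHasNonCarbonSource` in the PR may be written quite differently.

```scala
// Hypothetical stand-in for catalog table metadata; not Spark's CatalogTable.
final case class TableMeta(name: String, provider: Option[String])

object ProviderCheck {
  // True if at least one main table is not backed by the carbon provider.
  // A table with no provider recorded is treated as non-carbon (an assumption).
  def hasNonCarbonProvider(mainTables: Seq[TableMeta]): Boolean =
    mainTables.exists(t => !t.provider.exists(_.equalsIgnoreCase("carbondata")))
}
```

A boolean named `hasNonCarbonProvider` already reads as a predicate at the call site, for example `if (hasNonCarbonProvider) ...`, so the extra `is` prefix adds nothing.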
akashrn5 commented on a change in pull request #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#discussion_r383454314

##########
File path: mv/core/src/main/scala/org/apache/carbondata/mv/extension/MVHelper.scala
##########
@@ -568,17 +573,18 @@ object MVHelper {
     generatePartitionerField(allPartitionColumn.toList, Seq.empty)
   }
-  private def inheritTablePropertiesFromMainTable(parentTable: CarbonTable,
+  private def inheritTablePropertiesFromMainTable(

Review comment:
We can skip this method for non-carbon tables, right?

akashrn5 commented on a change in pull request #3625: [CARBONDATA-3705] Support create and load MV for spark datasource table
URL: https://github.com/apache/carbondata/pull/3625#discussion_r383454639

##########
File path: mv/core/src/main/scala/org/apache/carbondata/mv/extension/MVHelper.scala
##########
@@ -568,17 +573,18 @@ object MVHelper {
     generatePartitionerField(allPartitionColumn.toList, Seq.empty)
   }
-  private def inheritTablePropertiesFromMainTable(parentTable: CarbonTable,
+  private def inheritTablePropertiesFromMainTable(
+      parentTable: CarbonTable,
       fields: Seq[Field],
       fieldRelationMap: scala.collection.mutable.LinkedHashMap[Field, MVField],
       tableProperties: mutable.Map[String, String]): Unit = {
     var neworder = Seq[String]()
-    val parentOrder = parentTable.getSortColumns.asScala
+    val parentOrder = parentTable.getSortColumns().asScala

Review comment:
Please revert this change; the old one was correct.

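Some context for this comment: Scala lets a zero-argument Java method be called with or without parentheses, and the usual style convention is to drop them for side-effect-free accessors. The sketch below uses `java.util.List.size` purely as a stand-in for a Java getter such as `getSortColumns`; `ParenDemo` is an illustrative name.

```scala
object ParenDemo {
  val cols: java.util.List[String] = java.util.Arrays.asList("empname", "deptname")

  // A side-effect-free Java accessor reads naturally without parentheses.
  val withoutParens: Int = cols.size

  // This also compiles, but the parentheses add nothing for a pure getter.
  val withParens: Int = cols.size()
}
```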