niuge01 commented on a change in pull request #3720: URL: https://github.com/apache/carbondata/pull/3720#discussion_r412743071 ########## File path: docs/mv-guide.md ########## @@ -0,0 +1,342 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to you under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +# CarbonData Materialized View + +* [Quick Example](#quick-example) +* [Introduction](#introduction) +* [Loading Data](#loading-data) +* [Querying Data](#querying-data) +* [Compaction](#compacting) +* [Data Management](#data-management) +* [Time Series Support](#time-series-support) +* [Time Series RollUp Support](#time-series-rollup-support) + +## Quick example + + Start spark-sql in terminal and run the following queries, + + ``` + CREATE TABLE maintable(a int, b string, c int) stored as carbondata; + INSERT INTO maintable SELECT 1, 'ab', 2; + CREATE MATERIALIZED VIEW view1 AS SELECT a, sum(b) FROM maintable GROUP BY a; + SELECT a, sum(b) FROM maintable GROUP BY a; + // NOTE: run explain query and check if query hits the Index table from the plan + EXPLAIN SELECT a, sum(b) FROM maintable GROUP BY a; + ``` + +## Introduction + + Materialized views are created as sub-queries. User can create limitless materialized view to + improve query performance provided the storage requirements and loading time is acceptable. + + Materialized view can be refreshed on commit or on manual. Once materialized views are created, + CarbonData's MVRewriteRule helps to select the most efficient materialized view based on + the user query and rewrite the SQL to select the data from materialized view instead of + fact tables. Since the data size of materialized view is smaller and data is pre-processed, + user queries are much faster. + + For instance, fact table called **sales** which is defined as. + + ``` + CREATE TABLE sales ( + order_time timestamp, + user_id string, + sex string, + country string, + quantity int, + price bigint) + STORED AS carbondata + ``` + + User can create materialized view using the CREATE MATERIALIZED VIEW statement. + + ``` + CREATE MATERIALIZED VIEW agg_sales + PROPERTIES('TABLE_BLOCKSIZE'='256 MB','LOCAL_DICTIONARY_ENABLE'='false') + AS + SELECT country, sex, sum(quantity), avg(price) + FROM sales + GROUP BY country, sex + ``` + + **NOTE**: + * Group by and Order by columns has to be provided in projection list while creating materialized view. + * If only single fact table is involved in materialized view creation, then TableProperties of + fact table (if not present in a aggregate function like sum(col)) listed below will be + inherited to materialized view. + 1. SORT_COLUMNS + 2. SORT_SCOPE + 3. TABLE_BLOCKSIZE + 4. FLAT_FOLDER + 5. LONG_STRING_COLUMNS + 6. LOCAL_DICTIONARY_ENABLE + 7. LOCAL_DICTIONARY_THRESHOLD + 8. LOCAL_DICTIONARY_EXCLUDE + 9. INVERTED_INDEX + 10. NO_INVERTED_INDEX + 11. COLUMN_COMPRESSOR + * Creating materialized view with select query containing only project of all columns of fact + table is unsupported. + **Example:** + If table 'x' contains columns 'a,b,c', then creating MV Index with below queries is not supported. + 1. ```SELECT a,b,c FROM x``` + 2. ```SELECT * FROM x``` + * TableProperties can be provided in Properties excluding LOCAL_DICTIONARY_INCLUDE, + LOCAL_DICTIONARY_EXCLUDE, INVERTED_INDEX, NO_INVERTED_INDEX, SORT_COLUMNS, LONG_STRING_COLUMNS, + RANGE_COLUMN & COLUMN_META_CACHE. + * TableProperty given in Properties will be considered for materialized view creation, even though + if same property is inherited from fact table, which allows user to provide different table + properties for materialized view. + * Materialized view creation with limit or union all CTAS queries is unsupported. + * Materialized view does not support streaming. + +#### How materialized views are selected + + When a user query is submitted, during query planning phase, CarbonData will collect modular plan + candidates and process the the ModularPlan based on registered summary data sets. Then, + materialized view for this query will be selected among the candidates. + + For the fact table **sales** and materialized view **agg_sales** created above, following queries + ``` + SELECT country, sex, sum(quantity), avg(price) FROM sales GROUP BY country, sex + SELECT sex, sum(quantity) FROM sales GROUP BY sex + SELECT avg(price), country FROM sales GROUP BY country + ``` + + will be transformed by CarbonData's query planner to query against materialized view **agg_sales** + instead of the fact table **sales**. + + However, for following queries + + ``` + SELECT user_id, country, sex, sum(quantity), avg(price) FROM sales GROUP BY user_id, country, sex + SELECT sex, avg(quantity) FROM sales GROUP BY sex + SELECT country, max(price) FROM sales GROUP BY country + ``` + + will query against fact table **sales** only, because it does not satisfy materialized view + selection logic. + +## Loading data + +### Loading data on commit + + In case of WITHOUT DEFERRED REFRESH, for existing table with loaded data, data load to materialized + view will be triggered by the CREATE MATERIALIZED VIEW statement when user creates the materialized + view. + + For incremental loads to fact table, data to materialized view will be loaded once the + corresponding fact table load is completed. + +### Loading data on manual + + In case of WITH DEFERRED REFRESH, data load to materialized view will be triggered by the refresh + command. Materialized view will be in DISABLED state in below scenarios. + + * when materialized view is created. + * when data of fact table and materialized view are not in sync. + + User should fire REFRESH MATERIALIZED VIEW command to sync all segments of fact table with + materialized view and which ENABLES the materialized view for query. + + Command example: + ``` + REFRESH MATERIALIZED VIEW agg_sales + ``` + +### Loading data to multiple materialized views + + During load to fact table, if anyone of the load to materialized view fails, then that + corresponding materialized view will be DISABLED and load to other materialized views mapped + to fact table will continue. + + User can fire REFRESH MATERIALIZED VIEW command to sync or else the subsequent table load + will load the old failed loads along with current load and enable the disabled materialized view. + + **NOTE**: + * In case of InsertOverwrite/Update operation on fact table, all segments of materialized view + will be MARKED_FOR_DELETE and reload to Index table will happen by REFRESH MATERIALIZED VIEW, + in case of materialized view which refresh on manual and once the InsertOverwrite/Update + operation on fact table is finished, in case of materialized view which refresh on commit. + * In case of full scan query, Data Size and Index Size of fact table and materialized view + will not the same, as fact table and materialized view has different column names. + +## Querying data + + Queries are to be made on fact table. While doing query planning, internally CarbonData will check + materialized views which associated the fact table, and do query plan transformation accordingly. Review comment: OK ########## File path: docs/mv-guide.md ########## @@ -0,0 +1,342 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to you under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +# CarbonData Materialized View + +* [Quick Example](#quick-example) +* [Introduction](#introduction) +* [Loading Data](#loading-data) +* [Querying Data](#querying-data) +* [Compaction](#compacting) +* [Data Management](#data-management) +* [Time Series Support](#time-series-support) +* [Time Series RollUp Support](#time-series-rollup-support) + +## Quick example + + Start spark-sql in terminal and run the following queries, + + ``` + CREATE TABLE maintable(a int, b string, c int) stored as carbondata; + INSERT INTO maintable SELECT 1, 'ab', 2; + CREATE MATERIALIZED VIEW view1 AS SELECT a, sum(b) FROM maintable GROUP BY a; + SELECT a, sum(b) FROM maintable GROUP BY a; + // NOTE: run explain query and check if query hits the Index table from the plan + EXPLAIN SELECT a, sum(b) FROM maintable GROUP BY a; + ``` + +## Introduction + + Materialized views are created as sub-queries. User can create limitless materialized view to + improve query performance provided the storage requirements and loading time is acceptable. + + Materialized view can be refreshed on commit or on manual. Once materialized views are created, + CarbonData's MVRewriteRule helps to select the most efficient materialized view based on + the user query and rewrite the SQL to select the data from materialized view instead of + fact tables. Since the data size of materialized view is smaller and data is pre-processed, + user queries are much faster. + + For instance, fact table called **sales** which is defined as. + + ``` + CREATE TABLE sales ( + order_time timestamp, + user_id string, + sex string, + country string, + quantity int, + price bigint) + STORED AS carbondata + ``` + + User can create materialized view using the CREATE MATERIALIZED VIEW statement. + + ``` + CREATE MATERIALIZED VIEW agg_sales + PROPERTIES('TABLE_BLOCKSIZE'='256 MB','LOCAL_DICTIONARY_ENABLE'='false') + AS + SELECT country, sex, sum(quantity), avg(price) + FROM sales + GROUP BY country, sex + ``` + + **NOTE**: + * Group by and Order by columns has to be provided in projection list while creating materialized view. + * If only single fact table is involved in materialized view creation, then TableProperties of + fact table (if not present in a aggregate function like sum(col)) listed below will be + inherited to materialized view. + 1. SORT_COLUMNS + 2. SORT_SCOPE + 3. TABLE_BLOCKSIZE + 4. FLAT_FOLDER + 5. LONG_STRING_COLUMNS + 6. LOCAL_DICTIONARY_ENABLE + 7. LOCAL_DICTIONARY_THRESHOLD + 8. LOCAL_DICTIONARY_EXCLUDE + 9. INVERTED_INDEX + 10. NO_INVERTED_INDEX + 11. COLUMN_COMPRESSOR + * Creating materialized view with select query containing only project of all columns of fact + table is unsupported. + **Example:** + If table 'x' contains columns 'a,b,c', then creating MV Index with below queries is not supported. + 1. ```SELECT a,b,c FROM x``` + 2. ```SELECT * FROM x``` + * TableProperties can be provided in Properties excluding LOCAL_DICTIONARY_INCLUDE, + LOCAL_DICTIONARY_EXCLUDE, INVERTED_INDEX, NO_INVERTED_INDEX, SORT_COLUMNS, LONG_STRING_COLUMNS, + RANGE_COLUMN & COLUMN_META_CACHE. + * TableProperty given in Properties will be considered for materialized view creation, even though + if same property is inherited from fact table, which allows user to provide different table + properties for materialized view. + * Materialized view creation with limit or union all CTAS queries is unsupported. + * Materialized view does not support streaming. + +#### How materialized views are selected + + When a user query is submitted, during query planning phase, CarbonData will collect modular plan + candidates and process the the ModularPlan based on registered summary data sets. Then, + materialized view for this query will be selected among the candidates. + + For the fact table **sales** and materialized view **agg_sales** created above, following queries + ``` + SELECT country, sex, sum(quantity), avg(price) FROM sales GROUP BY country, sex + SELECT sex, sum(quantity) FROM sales GROUP BY sex + SELECT avg(price), country FROM sales GROUP BY country + ``` + + will be transformed by CarbonData's query planner to query against materialized view **agg_sales** + instead of the fact table **sales**. + + However, for following queries + + ``` + SELECT user_id, country, sex, sum(quantity), avg(price) FROM sales GROUP BY user_id, country, sex + SELECT sex, avg(quantity) FROM sales GROUP BY sex + SELECT country, max(price) FROM sales GROUP BY country + ``` + + will query against fact table **sales** only, because it does not satisfy materialized view + selection logic. + +## Loading data + +### Loading data on commit + + In case of WITHOUT DEFERRED REFRESH, for existing table with loaded data, data load to materialized + view will be triggered by the CREATE MATERIALIZED VIEW statement when user creates the materialized + view. + + For incremental loads to fact table, data to materialized view will be loaded once the + corresponding fact table load is completed. + +### Loading data on manual + + In case of WITH DEFERRED REFRESH, data load to materialized view will be triggered by the refresh + command. Materialized view will be in DISABLED state in below scenarios. + + * when materialized view is created. + * when data of fact table and materialized view are not in sync. + + User should fire REFRESH MATERIALIZED VIEW command to sync all segments of fact table with + materialized view and which ENABLES the materialized view for query. Review comment: OK ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
Free forum by Nabble | Edit this page |