Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] [carbondata] niuge01 commented on a change in pull request #3720: [CARBONDATA-3775] Update materialized view document

Classic

List

Threaded

1 message

GitBox

[GitHub] [carbondata] niuge01 commented on a change in pull request #3720: [CARBONDATA-3775] Update materialized view document

niuge01 commented on a change in pull request #3720:
URL: https://github.com/apache/carbondata/pull/3720#discussion_r411841503

##########
File path: docs/mv-guide.md
##########
@@ -0,0 +1,342 @@
+
+
+# CarbonData Materialized View
+
+* [Quick Example](#quick-example)
+* [Introduction](#introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Compaction](#compacting)
+* [Data Management](#data-management)
+* [Time Series Support](#time-series-support)
+* [Time Series RollUp Support](#time-series-rollup-support)
+
+## Quick example
+
+ Start spark-sql in terminal and run the following queries,
+
+ ```
+ CREATE TABLE maintable(a int, b string, c int) stored as carbondata;
+ INSERT INTO maintable SELECT 1, 'ab', 2;
+ CREATE MATERIALIZED VIEW view1 AS SELECT a, sum(b) FROM maintable GROUP BY a;
+ SELECT a, sum(b) FROM maintable GROUP BY a;
+ // NOTE: run explain query and check if query hits the Index table from the plan
+ EXPLAIN SELECT a, sum(b) FROM maintable GROUP BY a;
+ ```
+
+## Introduction
+
+ Materialized views are created as sub-queries. User can create limitless materialized view to
+ improve query performance provided the storage requirements and loading time is acceptable.
+
+ Materialized view can be refreshed on commit or on manual. Once materialized views are created,
+ CarbonData's MVRewriteRule helps to select the most efficient materialized view based on
+ the user query and rewrite the SQL to select the data from materialized view instead of
+ related tables. Since the data size of materialized view is smaller and data is pre-processed,
+ user queries are much faster.
+
+ For instance, related table called **sales** which is defined as.
+
+ ```
+ CREATE TABLE sales (
+ order_time timestamp,
+ user_id string,
+ sex string,
+ country string,
+ quantity int,
+ price bigint)
+ STORED AS carbondata
+ ```
+
+ User can create materialized view using the CREATE MATERIALIZED VIEW statement.
+
+ ```
+ CREATE MATERIALIZED VIEW agg_sales
+ PROPERTIES('TABLE_BLOCKSIZE'='256 MB','LOCAL_DICTIONARY_ENABLE'='false')
+ AS
+ SELECT country, sex, sum(quantity), avg(price)
+ FROM sales
+ GROUP BY country, sex
+ ```
+
+ **NOTE**:
+ * Group by and Order by columns has to be provided in projection list while creating materialized view.
+ * If only single related table is involved in materialized view creation, then TableProperties of
+ related table (if not present in a aggregate function like sum(col)) listed below will be
+ inherited to materialized view.
+ 1. SORT_COLUMNS
+ 2. SORT_SCOPE
+ 3. TABLE_BLOCKSIZE
+ 4. FLAT_FOLDER
+ 5. LONG_STRING_COLUMNS
+ 6. LOCAL_DICTIONARY_ENABLE
+ 7. LOCAL_DICTIONARY_THRESHOLD
+ 8. LOCAL_DICTIONARY_EXCLUDE
+ 9. INVERTED_INDEX
+ 10. NO_INVERTED_INDEX
+ 11. COLUMN_COMPRESSOR
+ * Creating materialized view with select query containing only project of all columns of related
+ table is unsupported.
+ **Example:**
+ If table 'x' contains columns 'a,b,c', then creating MV Index with below queries is not supported.
+ 1. ```SELECT a,b,c FROM x```
+ 2. ```SELECT * FROM x```
+ * TableProperties can be provided in Properties excluding LOCAL_DICTIONARY_INCLUDE,
+ LOCAL_DICTIONARY_EXCLUDE, INVERTED_INDEX, NO_INVERTED_INDEX, SORT_COLUMNS, LONG_STRING_COLUMNS,
+ RANGE_COLUMN & COLUMN_META_CACHE.
+ * TableProperty given in Properties will be considered for materialized view creation, even though
+ if same property is inherited from related table, which allows user to provide different table
+ properties for materialized view.
+ * Materialized view creation with limit or union all CTAS queries is unsupported.
+ * Materialized view does not support streaming.
+
+#### How materialized views are selected
+
+ When a user query is submitted, during query planning phase, CarbonData will collect modular plan
+ candidates and process the the ModularPlan based on registered summary data sets. Then,
+ materialized view for this query will be selected among the candidates.
+
+ For the related table **sales** and materialized view **agg_sales** created above, following queries
+ ```
+ SELECT country, sex, sum(quantity), avg(price) FROM sales GROUP BY country, sex
+ SELECT sex, sum(quantity) FROM sales GROUP BY sex
+ SELECT avg(price), country FROM sales GROUP BY country
+ ```
+
+ will be transformed by CarbonData's query planner to query against materialized view **agg_sales**
+ instead of the related table **sales**.
+
+ However, for following queries
+
+ ```
+ SELECT user_id, country, sex, sum(quantity), avg(price) FROM sales GROUP BY user_id, country, sex
+ SELECT sex, avg(quantity) FROM sales GROUP BY sex
+ SELECT country, max(price) FROM sales GROUP BY country
+ ```
+
+ will query against related table **sales** only, because it does not satisfy materialized view
+ selection logic.
+
+## Loading data
+
+### Loading data on commit
+
+ In case of WITHOUT DEFERRED REFRESH, for existing table with loaded data, data load to materialized
+ view will be triggered by the CREATE MATERIALIZED VIEW statement when user creates the materialized
+ view.
+
+ For incremental loads to related table, data to materialized view will be loaded once the
+ corresponding related table load is completed.
+
+### Loading data on manual
+
+ In case of WITH DEFERRED REFRESH, data load to materialized view will be triggered by the refresh
+ command. Materialized view will be in DISABLED state in below scenarios.
+
+ * when materialized view is created.
+ * when data of related table and materialized view are not in sync.
+
+ User should fire REFRESH MATERIALIZED VIEW command to sync all segments of related table with
+ materialized view and which ENABLES the materialized view for query.
+
+ Command example:
+ ```
+ REFRESH MATERIALIZED VIEW agg_sales
+ ```
+
+### Loading data to multiple materialized views
+
+ During load to related table, if anyone of the load to materialized view fails, then that
+ corresponding materialized view will be DISABLED and load to other materialized views mapped
+ to related table will continue.
+
+ User can fire REFRESH MATERIALIZED VIEW command to sync or else the subsequent table load
+ will load the old failed loads along with current load and enable the disabled materialized view.
+
+ **NOTE**:
+ * In case of InsertOverwrite/Update operation on related table, all segments of materialized view
+ will be MARKED_FOR_DELETE and reload to Index table will happen by REFRESH MATERIALIZED VIEW,
+ in case of materialized view which refresh on manual and once the InsertOverwrite/Update
+ operation on related table is finished, in case of materialized view which refresh on commit.
+ * In case of full scan query, Data Size and Index Size of related table and materialized view
+ will not the same, as related table and materialized view has different column names.
+
+## Querying data
+
+ Queries are to be made on related table. While doing query planning, internally CarbonData will check
+ materialized views which associated the related table, and do query plan transformation accordingly.
+
+ User can verify whether a query can leverage materialized view or not by executing `EXPLAIN` command,
+ which will show the transformed logical plan, and thus user can check whether materialized view
+ is selected.
+
+## Compacting
+
+ Running Compaction command (`ALTER TABLE COMPACT`)[COMPACTION TYPE-> MINOR/MAJOR] on related table
+ will automatically compact the materialized view created on the related table, once compaction
+ on related table is done.
+
+## Data Management
+
+ In current implementation, data consistency needs to be maintained for both related table and
+ materialized views.
+
+ Once there is materialized view created on the related table, following command on the related
+ table is not supported:
+
+ 1. Data management command: `DELETE SEGMENT`.
+ 2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`,
+ `ALTER TABLE RENAME`, `ALTER COLUMN RENAME`. Note that adding a new column is supported, and for
+ dropping columns and change datatype command, CarbonData will check whether it will impact the
+ materialized view, if not, the operation is allowed, otherwise operation will be rejected by
+ throwing exception.
+ 3. Partition management command: `ALTER TABLE ADD/DROP PARTITION`. Note that dropping a partition
+ will be allowed only if partition is participating in all indexes associated with related table.
+ Drop Partition is not allowed, if any materialized view is associated with more than one
+ related table. Drop Partition directly on materialized view is not allowed.
+ 4. Complex Datatype's for materialized view is not supported.
+
+ However, there is still way to support these operations on related table, in current CarbonData
+ release, user can do as following:
+
+ 1. Remove the materialized by `DROP MATERIALIZED VIEW` command.
+ 2. Carry out the data management operation on related table.
+ 3. Create the materialized view again by `CREATE MATERIALIZED VIEW` command.
+
+ Basically, user can manually trigger the operation by re-building the materialized view.
+
+## Time Series Support
+
+ Time series data are simply measurements or events that are tracked, monitored, down sampled, and
+ aggregated over time. Materialized views with automatic refresh mode supports TimeSeries queries.
+
+ CarbonData provides built-in time-series udf with the below definition.
+
+ ```
+ timeseries(event_time_column, 'granularity')
+ ```
+
+ Event time columns provided in time series udf should be of TimeStamp/Date type.
+
+ Below table describes the time hierarchy and levels that can be provided in a time-series udf,
+ so that it supports automatic roll-up in time dimension for query.
+
+ | Granularity | Description |
+ |----------------|-------------------------------------------------------|
+ | year | Data will be aggregated over year |
+ | month | Data will be aggregated over month |
+ | week | Data will be aggregated over week |
+ | day | Data will be aggregated over day |
+ | hour | Data will be aggregated over hour |
+ | thirty_minute | Data will be aggregated over every thirty minutes |
+ | fifteen_minute | Data will be aggregated over every fifteen minutes |
+ | ten_minute | Data will be aggregated over every ten minutes |
+ | five_minute | Data will be aggregated over every five minutes |
+ | minute | Data will be aggregated over every one minute |
+ | second | Data will be aggregated over every second |
+
+ Time series udf having column as Date type support's only year, month, day and week granularities.
+
+ Below is the sample data loaded to the fact table **sales**.
+
+ ```
+ order_time, user_id, sex, country, quantity, price
+ 2016-02-23 09:01:30, c001, male, xxx, 100, 2
+ 2016-02-23 09:01:50, c002, male, yyy, 200, 5
+ 2016-02-23 09:03:30, c003, female, xxx, 400, 1
+ 2016-02-23 09:03:50, c004, male, yyy, 300, 5
+ 2016-02-23 09:07:50, c005, female, xxx, 500, 5
+ ```
+
+ Users can create materialized views with time series queries like the below example:
+
+ ```
+ CREATE MATERIALIZED VIEW agg_sales
+ SELECT timeseries(order_time, 'minute'),avg(price)
+ FROM sales
+ GROUP BY timeseries(order_time, 'minute')
+ ```
+ And execute the below query to check time series data. In this example, a materialized view of
+ aggregated table on price column will be created, which will be aggregated on every one minute.
+
+ ```
+ SELECT timeseries(order_time,'minute'), avg(price)
+ FROM sales
+ GROUP BY timeseries(order_time,'minute')
+ ```
+ Find below the result of above query aggregated over minute.
+
+ ```
+ +---------------------------------------+----------------+
+ |UDF:timeseries(projectjoindate, minute)|avg(projectcode)|

Review comment:
OK

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]