GitHub user sraghunandan opened a pull request:
https://github.com/apache/carbondata/pull/2022

[CARBONDATA-2098] Optimize pre-aggregate documentation

Optimize the pre-aggregate documentation: move it to a separate file and add more examples.

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

- [x] Any interfaces changed? No
- [x] Any backward compatibility impacted? No
- [x] Document update required? Updating docs
- [x] Testing done
  Please provide details on
  - Whether new unit test cases have been added or why no new tests are required?
  - How it is tested? Please attach test report.
  - Is it a performance related change? Please attach the performance test report.
  - Any additional information to help reviewers in testing this change.
  NA
- [x] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
  NA

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sraghunandan/carbondata-1 agg_doc_new_file

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2022.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2022

----

commit 742359d1640bab97b3c0d40d948b0bedf8fe6a30
Author: sraghunandan <carbondatacontributions@...>
Date: 2018-03-02T11:32:39Z

    optimize pre-aggregate documentation; move to separate file; add more examples

----

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2022

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2795/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2022

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4041/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2022

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2800/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2022

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4046/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2022

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4050/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2022

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2804/

---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2022#discussion_r172000793

--- Diff: docs/preaggregate-guide.md ---
@@ -0,0 +1,313 @@

# CarbonData Pre-aggregate tables

## Quick example
Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export `$SPARK_HOME`.

Package the carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to `$SPARK_HOME/jars`:
```shell
mvn clean package -DskipTests -Pspark-2.2
```

Start spark-shell in a new terminal, type `:paste`, then copy and run the following code.
```scala
import java.io.File
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val warehouse = new File("./warehouse").getCanonicalPath
val metastore = new File("./metastore").getCanonicalPath

val spark = SparkSession
  .builder()
  .master("local")
  .appName("preAggregateExample")
  .config("spark.sql.warehouse.dir", warehouse)
  .getOrCreateCarbonSession(warehouse, metastore)

spark.sparkContext.setLogLevel("ERROR")

// drop the table if it exists from a previous run
spark.sql(s"DROP TABLE IF EXISTS sales")

// create the main carbon table
spark.sql(
  s"""
     | CREATE TABLE sales (
     | user_id string,
     | country string,
     | quantity int,
     | price bigint)
     | STORED BY 'carbondata'""".stripMargin)

// create a pre-aggregate table as a datamap on the main table
spark.sql(
  s"""
     | CREATE DATAMAP agg_sales
     | ON TABLE sales
     | USING "preaggregate"
     | AS
     | SELECT country, sum(quantity), avg(price)
     | FROM sales
     | GROUP BY country""".stripMargin)

import spark.implicits._
import org.apache.spark.sql.SaveMode
import scala.util.Random

val r = new Random()
val df = spark.sparkContext.parallelize(1 to 10)
  .map(x => ("ID." + r.nextInt(100000), "country" + x % 8, x % 50, x % 60))
  .toDF("user_id", "country", "quantity", "price")

// load data into the main table; the pre-aggregate table is loaded automatically
df.write.format("carbondata")
  .option("tableName", "sales")
  .option("compress", "true")
  .mode(SaveMode.Append).save()

// this aggregate query is transparently served from the pre-aggregate table
spark.sql(
  s"""
     | SELECT country, sum(quantity), avg(price)
     | FROM sales GROUP BY country""".stripMargin).show

spark.stop()
```

## PRE-AGGREGATE TABLES
CarbonData supports pre-aggregation of data so that OLAP-style queries can fetch data much faster. Aggregate tables are created as datamaps so that they are handled as efficiently as the other indexing support. Users can create as many aggregate tables as they require as datamaps to improve their query performance, provided the storage requirements and loading speeds are acceptable.

For a main table called **sales** which is defined as

```
CREATE TABLE sales (
  order_time timestamp,
  user_id string,
  sex string,
  country string,
  quantity int,
  price bigint)
STORED BY 'carbondata'
```

the user can create pre-aggregate tables using the DDL

```
CREATE DATAMAP agg_sales
ON TABLE sales
USING "preaggregate"
AS
  SELECT country, sex, sum(quantity), avg(price)
  FROM sales
  GROUP BY country, sex
```

<b><p align="left">Functions supported in pre-aggregate tables</p></b>

| Function | Rollup supported |
|----------|------------------|
| SUM      | Yes              |
| AVG      | Yes              |
| MAX      | Yes              |
| MIN      | Yes              |
| COUNT    | Yes              |

##### How pre-aggregate tables are selected
For the main table **sales** and the pre-aggregate table **agg_sales** created above, queries of the kind

```
SELECT country, sex, sum(quantity), avg(price) from sales GROUP BY country, sex

SELECT sex, sum(quantity) from sales GROUP BY sex

SELECT sum(price), country from sales GROUP BY country
```

will be transformed by the query planner to fetch data from the pre-aggregate table **agg_sales**.

But queries of the kind

```
SELECT user_id, country, sex, sum(quantity), avg(price) from sales GROUP BY user_id, country, sex

SELECT sex, avg(quantity) from sales GROUP BY sex

SELECT country, max(price) from sales GROUP BY country
```

will fetch the data from the main table **sales**, because they reference columns or aggregate expressions that are not present in **agg_sales**.

##### Loading data to pre-aggregate tables
For an existing table with loaded data, the data load to the pre-aggregate table is triggered by the CREATE DATAMAP statement when the user creates the pre-aggregate table. For incremental loads after the aggregate tables are created, loading data to the main table triggers the load to the pre-aggregate tables once the main table loading is complete. These loads are atomic, meaning that data in the main table and the aggregate tables is only visible to the user after all tables are loaded.
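For example, a minimal incremental load sequence might look like the following sketch; the CSV path here is hypothetical, and the load options depend on the actual file:

```
-- Hypothetical incremental load: loading the main table automatically
-- triggers the load of agg_sales once the main table load completes
LOAD DATA INPATH 'hdfs://hacluster/data/sales.csv' INTO TABLE sales
OPTIONS('DELIMITER'=',');

-- once both loads finish, this aggregate query can be served
-- from the freshly loaded pre-aggregate table
SELECT country, sum(quantity), avg(price) FROM sales GROUP BY country;
```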
##### Querying data from pre-aggregate tables
Pre-aggregate tables cannot be queried directly. Queries must be made on the main table. Internally, CarbonData checks the pre-aggregate tables associated with the main table, and if a pre-aggregate table satisfies the query, the plan is transformed automatically to use the pre-aggregate table to fetch the data.

##### Compacting pre-aggregate tables
The compaction command (ALTER TABLE COMPACT) needs to be run separately on each pre-aggregate table. Running the compaction command on the main table will **not automatically** compact the pre-aggregate tables. Compaction is an optional operation for pre-aggregate tables. If compaction is performed on the main table but not on the pre-aggregate tables, all queries can still benefit from the pre-aggregate tables. To further improve performance of the pre-aggregate tables, compaction can be triggered on them directly; it will merge the segments inside the pre-aggregate table.
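As a sketch, compacting the main table and a pre-aggregate table separately might look like this; the child table name `sales_agg_sales` assumes the usual `<main_table>_<datamap_name>` naming convention and should be verified before use:

```
-- compacting the main table does NOT compact its pre-aggregate tables
ALTER TABLE sales COMPACT 'minor';

-- compact the pre-aggregate table separately to merge its segments;
-- sales_agg_sales assumes the <main_table>_<datamap_name> convention,
-- verify the actual child table name with SHOW TABLES
ALTER TABLE sales_agg_sales COMPACT 'minor';
```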
##### Update/Delete Operations on pre-aggregate tables
This functionality is not supported.

  NOTE (<b>RESTRICTION</b>):
  Update/delete operations are <b>not supported</b> on a main table which has pre-aggregate tables created on it. All the pre-aggregate tables <b>will have to be dropped</b> before update/delete operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually after the update/delete operations are completed.

##### Delete Segment Operations on pre-aggregate tables
This functionality is not supported.

  NOTE (<b>RESTRICTION</b>):
  Delete segment operations are <b>not supported</b> on a main table which has pre-aggregate tables created on it. All the pre-aggregate tables <b>will have to be dropped</b> before delete segment operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually after the delete segment operations are completed.

##### Alter Table Operations on pre-aggregate tables
This functionality is not supported.

  NOTE (<b>RESTRICTION</b>):
  Adding a new column to the main table does not have any effect on the pre-aggregate tables. However, if dropping or renaming a column impacts a pre-aggregate table, such operations will be rejected and an error will be thrown. All the pre-aggregate tables <b>will have to be dropped</b> before such alter operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually after the alter table operations are completed.

### Supporting timeseries data (Alpha feature in 1.3.0)

--- End diff --

I think it is better we create a datamap folder under the docs folder and put the pre-aggregate guide and the timeseries guide docs separately in the datamap folder.

---