GitHub user sraghunandan opened a pull request:
https://github.com/apache/carbondata/pull/2022

[CARBONDATA-2098] Optimize pre-aggregate documentation

Optimize the pre-aggregate documentation: move it to a separate file and add more examples.

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

- [x] Any interfaces changed? No
- [x] Any backward compatibility impacted? No
- [x] Document update required? Updating docs
- [x] Testing done
  Please provide details on
  - Whether new unit test cases have been added or why no new tests are required?
  - How it is tested? Please attach test report.
  - Is it a performance related change? Please attach the performance test report.
  - Any additional information to help reviewers in testing this change.
  NA
- [x] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
  NA

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sraghunandan/carbondata-1 agg_doc_new_file

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2022.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2022

----

commit 742359d1640bab97b3c0d40d948b0bedf8fe6a30
Author: sraghunandan <carbondatacontributions@...>
Date: 2018-03-02T11:32:39Z

    optimize pre-aggregate documentation; move to separate file; add more examples

----

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2022

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2795/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2022

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4041/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2022

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2800/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2022

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4046/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2022

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4050/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2022

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2804/

---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2022#discussion_r172000793

--- Diff: docs/preaggregate-guide.md ---
@@ -0,0 +1,313 @@

# CarbonData Pre-aggregate tables

## Quick example
Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export `$SPARK_HOME`.

Package the carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to `$SPARK_HOME/jars`:
```shell
mvn clean package -DskipTests -Pspark-2.2
```

Start spark-shell in a new terminal, type `:paste`, then copy and run the following code.
```scala
import java.io.File
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val warehouse = new File("./warehouse").getCanonicalPath
val metastore = new File("./metastore").getCanonicalPath

val spark = SparkSession
  .builder()
  .master("local")
  .appName("preAggregateExample")
  .config("spark.sql.warehouse.dir", warehouse)
  .getOrCreateCarbonSession(warehouse, metastore)

spark.sparkContext.setLogLevel("ERROR")

// drop the table if it exists from a previous run
spark.sql(s"DROP TABLE IF EXISTS sales")

// create the main carbon table
spark.sql(
  s"""
     | CREATE TABLE sales (
     | user_id string,
     | country string,
     | quantity int,
     | price bigint)
     | STORED BY 'carbondata'""".stripMargin)

// create a pre-aggregate table as a datamap on the main table
spark.sql(
  s"""
     | CREATE DATAMAP agg_sales
     | ON TABLE sales
     | USING "preaggregate"
     | AS
     | SELECT country, sum(quantity), avg(price)
     | FROM sales
     | GROUP BY country""".stripMargin)

import spark.implicits._
import org.apache.spark.sql.SaveMode
import scala.util.Random

val r = new Random()
val df = spark.sparkContext.parallelize(1 to 10)
  .map(x => ("ID." + r.nextInt(100000), "country" + x % 8, x % 50, x % 60))
  .toDF("user_id", "country", "quantity", "price")

// load data into the main table; the pre-aggregate table is loaded automatically
df.write.format("carbondata")
  .option("tableName", "sales")
  .option("compress", "true")
  .mode(SaveMode.Append).save()

// this aggregate query is transparently served from the pre-aggregate table
spark.sql(
  s"""
     | SELECT country, sum(quantity), avg(price)
     | FROM sales GROUP BY country""".stripMargin).show

spark.stop()
```

## PRE-AGGREGATE TABLES
CarbonData supports pre-aggregation of data so that OLAP-style queries can fetch data much faster. Aggregate tables are created as datamaps so that they are handled as efficiently as the other indexing support. Users can create as many aggregate tables as they require as datamaps to improve their query performance, provided the storage requirements and loading speeds are acceptable.

For a main table called **sales** which is defined as

```
CREATE TABLE sales (
  order_time timestamp,
  user_id string,
  sex string,
  country string,
  quantity int,
  price bigint)
STORED BY 'carbondata'
```

the user can create pre-aggregate tables using the DDL

```
CREATE DATAMAP agg_sales
ON TABLE sales
USING "preaggregate"
AS
  SELECT country, sex, sum(quantity), avg(price)
  FROM sales
  GROUP BY country, sex
```

<b><p align="left">Functions supported in pre-aggregate tables</p></b>

| Function | Rollup supported |
|----------|------------------|
| SUM      | Yes              |
| AVG      | Yes              |
| MAX      | Yes              |
| MIN      | Yes              |
| COUNT    | Yes              |

##### How pre-aggregate tables are selected
For the main table **sales** and the pre-aggregate table **agg_sales** created above, queries of the kind

```
SELECT country, sex, sum(quantity), avg(price) from sales GROUP BY country, sex

SELECT sex, sum(quantity) from sales GROUP BY sex

SELECT sum(price), country from sales GROUP BY country
```

will be transformed by the query planner to fetch data from the pre-aggregate table **agg_sales**.

But queries of the kind

```
SELECT user_id, country, sex, sum(quantity), avg(price) from sales GROUP BY user_id, country, sex

SELECT sex, avg(quantity) from sales GROUP BY sex

SELECT country, max(price) from sales GROUP BY country
```

will fetch the data from the main table **sales**, because they reference columns or aggregate expressions that are not present in **agg_sales**.

##### Loading data to pre-aggregate tables
For an existing table with loaded data, the data load to the pre-aggregate table is triggered by the CREATE DATAMAP statement when the user creates the pre-aggregate table. For incremental loads after the aggregate tables are created, loading data to the main table triggers the load to the pre-aggregate tables once the main table loading is complete. These loads are atomic, meaning that data in the main table and the aggregate tables is only visible to the user after all tables are loaded.
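For example, a minimal incremental load sequence might look like the following sketch; the CSV path here is hypothetical, and the load options depend on the actual file:

```
-- Hypothetical incremental load: loading the main table automatically
-- triggers the load of agg_sales once the main table load completes
LOAD DATA INPATH 'hdfs://hacluster/data/sales.csv' INTO TABLE sales
OPTIONS('DELIMITER'=',');

-- once both loads finish, this aggregate query can be served
-- from the freshly loaded pre-aggregate table
SELECT country, sum(quantity), avg(price) FROM sales GROUP BY country;
```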
##### Querying data from pre-aggregate tables
Pre-aggregate tables cannot be queried directly. Queries must be made on the main table. Internally, CarbonData checks the pre-aggregate tables associated with the main table, and if a pre-aggregate table satisfies the query, the plan is transformed automatically to use the pre-aggregate table to fetch the data.

##### Compacting pre-aggregate tables
The compaction command (ALTER TABLE COMPACT) needs to be run separately on each pre-aggregate table. Running the compaction command on the main table will **not automatically** compact the pre-aggregate tables. Compaction is an optional operation for pre-aggregate tables. If compaction is performed on the main table but not on the pre-aggregate tables, all queries can still benefit from the pre-aggregate tables. To further improve performance of the pre-aggregate tables, compaction can be triggered on them directly; it will merge the segments inside the pre-aggregate table.
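As a sketch, compacting the main table and a pre-aggregate table separately might look like this; the child table name `sales_agg_sales` assumes the usual `<main_table>_<datamap_name>` naming convention and should be verified before use:

```
-- compacting the main table does NOT compact its pre-aggregate tables
ALTER TABLE sales COMPACT 'minor';

-- compact the pre-aggregate table separately to merge its segments;
-- sales_agg_sales assumes the <main_table>_<datamap_name> convention,
-- verify the actual child table name with SHOW TABLES
ALTER TABLE sales_agg_sales COMPACT 'minor';
```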
##### Update/Delete Operations on pre-aggregate tables
This functionality is not supported.

  NOTE (<b>RESTRICTION</b>):
  Update/delete operations are <b>not supported</b> on a main table which has pre-aggregate tables created on it. All the pre-aggregate tables <b>will have to be dropped</b> before update/delete operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually after the update/delete operations are completed.

##### Delete Segment Operations on pre-aggregate tables
This functionality is not supported.

  NOTE (<b>RESTRICTION</b>):
  Delete segment operations are <b>not supported</b> on a main table which has pre-aggregate tables created on it. All the pre-aggregate tables <b>will have to be dropped</b> before delete segment operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually after the delete segment operations are completed.

##### Alter Table Operations on pre-aggregate tables
This functionality is not supported.

  NOTE (<b>RESTRICTION</b>):
  Adding a new column to the main table does not have any effect on the pre-aggregate tables. However, if dropping or renaming a column impacts a pre-aggregate table, such operations will be rejected and an error will be thrown. All the pre-aggregate tables <b>will have to be dropped</b> before such alter operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually after the alter table operations are completed.

### Supporting timeseries data (Alpha feature in 1.3.0)

--- End diff --

I think it is better we create a datamap folder under the docs folder and put the pre-aggregate guide and the timeseries guide docs separately in the datamap folder.

---