GitHub user akashrn5 opened a pull request:
https://github.com/apache/carbondata/pull/2215

[WIP] Add documentation for lucene datamap

Added documentation for lucene datamap.

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:
- [ ] Any interfaces changed?
- [ ] Any backward compatibility impacted?
- [ ] Document update required?
- [ ] Testing done
      Please provide details on
      - Whether new unit test cases have been added or why no new tests are required?
      - How it is tested? Please attach test report.
      - Is it a performance related change? Please attach the performance test report.
      - Any additional information to help reviewers in testing this change.
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/akashrn5/incubator-carbondata doc_lucene

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2215.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2215

----

commit 5403c832ca98569f60acf42a95c42ae21d8d3be5
Author: akashrn5 <akashnilugal@...>
Date: 2018-04-23T13:57:56Z

    add documentation for lucene datamap

----
---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215

SDV Build Failed. Please check CI: http://144.76.159.231:8080/job/ApacheSDVTests/4490/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Failed with Spark 2.2.1. Please check CI: http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4175/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Failed with Spark 2.1.0. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5343/
---
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183616213

--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
--- End diff --

The content below is a procedure, so put it in a numbered list:
Step 1:
Step 2:
---
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183615908

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
--- End diff --

These are procedure steps, so we can use a numbered list.
---
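For illustration, the quick-example steps might be numbered like this (a sketch assembled only from the guide's quick-example text as quoted in this thread):

```
1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
2. Package the carbon jar and copy it to $SPARK_HOME/jars:
   mvn clean package -DskipTests -Pspark-2.2
3. Start spark-shell in a new terminal, type :paste, then copy and run the example code.
```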
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183617096

--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+   .builder()
+   .master("local")
+   .appName("preAggregateExample")
+   .config("spark.sql.warehouse.dir", warehouse)
+   .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+   s"""
+      |CREATE TABLE datamap_test (
+      |name string,
+      |age int,
+      |city string,
+      |country string)
+      |STORED BY 'carbondata'
+    """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+   s"""
+      |CREATE DATAMAP dm
+      |ON TABLE datamap_test
+      |USING "lucene"
+      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+   .toDF("name", "age", "city", "country")
+   .write
+   .format("carbondata")
+   .option("tableName", "datamap_test")
+   .option("compress", "true")
+   .mode(SaveMode.Append)
+   .save()
+
+ spark.sql(
+   s"""
+      |SELECT *
+      |from datamap_test where
+      |TEXT_MATCH('name:c10')
+    """.stripMargin).show
+
+ spark.stop
+```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+  ```
+  CREATE DATAMAP [IF NOT EXISTS] datamap_name
+  ON TABLE main_table
+  USING "lucene"
+  DMPROPERTIES ('text_columns'='city, name', ...)
+  ```
+
+DataMap can be dropped using following DDL
+  ```
+  DROP DATAMAP [IF EXISTS] datamap_name
+  ON TABLE main_table
+  ```
+To show all DataMaps created, use:
+  ```
+  SHOW DATAMAP
+  ON TABLE main_table
+  ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+  Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
+  User can create as many lucene datamaps required to improve query performance,
+  provided the storage requirements and loading speeds are acceptable.
+
+  Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
+  row level for the filter query by launching a spark datamap job. This pruned data will be read to
+  give the proper and faster result
--- End diff --

End all sentences with a period (.).
---
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183617702

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+ For instance, main table called **sales** which is defined as
+
+  ```
+  CREATE TABLE datamap_test (
+    name string,
+    age int,
+    city string,
+    country string)
+  STORED BY 'carbondata'
+  ```
+
+ User can create Lucene datamap using the Create DataMap DDL
+
+  ```
+  CREATE DATAMAP dm
+  ON TABLE datamap_test
+  USING "lucene"
+  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+  ```
+
+## Loading data
+When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
+then lucene index files will be generated for all the text_columns (String Columns) given in
+DMProperties which contains information about the blocklet_id, page_id and row_id and for all the
+data of text_columns. These index files will be written inside a folder named as datamap name inside
+each segment directories.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. An UDF called TEXT_MATCH is registered in spark session, so
+when a query with TEXT_MATCH() is fired, While doing query planning, TEXT_MATCH will be treated as
+pushed filters. It checks for all the lucene datamaps, and a job is fired for pruning and for each
+blocklet a temporary file will be generated which has information till row level, but prune will
+return blocklets finally.
+
+When query reaches executor side, the temporary files written will be read and bitset groups are
+formed to return the query result.
+
+User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
+filter is applied on query or not.
+
+
+## Data Management with pre-aggregate tables
+Once there is lucene datamap is created on the main
--- End diff --

Once lucene datamap is created on the main table, the following commands on the main table are not supported:
---
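As a concrete illustration of the `EXPLAIN` check described at the end of the quoted section, a minimal sketch (table name taken from the guide's example; the exact plan text depends on the CarbonData version):

```sql
-- If the lucene datamap can serve the query, the transformed logical plan
-- printed by EXPLAIN should show TEXT_MATCH('name:c10') as a pushed filter;
-- if lucene index writing failed during load, the filter will be absent.
EXPLAIN SELECT * FROM datamap_test WHERE TEXT_MATCH('name:c10');
```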
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183617239

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+ User can create Lucene datamap using the Create DataMap DDL
--- End diff --

User can create Lucene datamap using the Create DataMap DDL:
---
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183617996

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+1. Data management command: `UPDATE/DELETE`.
+2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`,
+`ALTER TABLE RENAME`. Note that adding a new column is supported, and for dropping columns and
--- End diff --

**Note:** Use this format for notes, and start the note on a new line.
---
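For illustration, the note might be restructured as suggested (a markdown sketch built only from the quoted text):

```
2. Schema management commands: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`,
   `ALTER TABLE RENAME`.

   **Note:** Adding a new column is supported. For drop column and change datatype
   commands, CarbonData checks whether the operation impacts the lucene datamap;
   if not, the operation is allowed, otherwise it is rejected with an exception.
```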
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183616317

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
--- End diff --

Close all sentences with a period (.). This applies to all sentences in this topic.
---
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183617201

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+ For instance, main table called **sales** which is defined as
--- End diff --

For instance, main table called **sales** which is defined as:
---
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183616769

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+DataMap can be dropped using following DDL
--- End diff --

DataMap can be dropped using following DDL:
---
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183616653

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+ spark.sql(
+   s"""
+      |SELECT *
+      |from datamap_test where
+      |TEXT_MATCH('name:c10')
+    """.stripMargin).show
--- End diff --

Why a red background? Please check once.
---
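The red background is most likely a syntax-highlighting artifact: the CREATE DATAMAP statement earlier in the same Scala block opens an s""" string that is never closed, so the highlighter lexes everything after it as one string literal. A sketch of the fix, adding the missing terminator:

```scala
// Create lucene datamap on the main table. The closing """.stripMargin)
// is what the quoted diff is missing; restoring it fixes both the red
// highlighting and the compile error in the pasted example.
spark.sql(
  s"""
     |CREATE DATAMAP dm
     |ON TABLE datamap_test
     |USING "lucene"
     |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
   """.stripMargin)
```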
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183616698

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+#### DataMap Management
+Lucene DataMap can be created using following DDL
--- End diff --

Lucene DataMap can be created using following DDL:
---
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183618083

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+## Data Management with pre-aggregate tables
+Once there is lucene datamap is created on the main
+table
+is not supported:
+1. Data management command: `UPDATE/DELETE`.
+2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`,
+`ALTER TABLE RENAME`. Note that adding a new column is supported, and for dropping columns and
+change datatype command, CarbonData will check whether it will impact the lucene datamap, if
+ not, the operation is allowed, otherwise operation will be rejected by throwing exception.
+3. Partition management command: `ALTER TABLE ADD/DROP PARTITION`
+
+However, there is still way to support these operations on main table, in current CarbonData
+release, user can do as following:
+1. Remove the lucene datamap by `DROP DATAMAP` command
--- End diff --

End all sentences with a period (.).
---
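A sketch of how that workaround might look in practice (identifiers taken from the guide's example; steps 2 and 3 are assumed continuations of the quoted step 1, and the DELETE statement stands in for any of the blocked commands):

```sql
-- 1. Remove the lucene datamap so the blocked command can run.
DROP DATAMAP IF EXISTS dm ON TABLE datamap_test;

-- 2. Run the otherwise-unsupported command on the main table.
DELETE FROM datamap_test WHERE name = 'c11';

-- 3. Recreate the datamap so TEXT_MATCH queries are accelerated again.
CREATE DATAMAP dm
ON TABLE datamap_test
USING "lucene"
DMPROPERTIES ('TEXT_COLUMNS' = 'name, country');
```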
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2215

The word wrap is strange; better to write a paragraph and let the editor do the rest.
---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r184317255

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+When query reaches executor side, the temporary files written will be read and bitset groups are
+formed to return the query result.
--- End diff --

Please mention the cleanup procedure for the temp files.
---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r184316398

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. An UDF called TEXT_MATCH is registered in spark session, so
+when a query with TEXT_MATCH() is fired, While doing query planning, TEXT_MATCH will be treated as
--- End diff --

Please add details mentioning that the supported syntax is the lucene query syntax, and list a few example queries that cover tokenizer-based search and LIKE-style queries.
---
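A few illustrative queries of the kind the reviewer asks for (a sketch, assuming TEXT_MATCH passes its argument through to Lucene's query parser; values follow the guide's example data):

```sql
-- Tokenizer-based term search on an indexed column.
SELECT * FROM datamap_test WHERE TEXT_MATCH('name:c10');

-- Wildcard search, covering LIKE-style "starts with" filters.
SELECT * FROM datamap_test WHERE TEXT_MATCH('name:c1*');

-- Boolean combination across two indexed columns.
SELECT * FROM datamap_test WHERE TEXT_MATCH('name:c10 AND country:country1*');
```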
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r184316832

--- Diff: docs/datamap/lucene-datamap-guide.md ---
...
+## Loading data
+When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
--- End diff --

Mention what new configurations are added and how they can impact data load. Example: compression types.
---