Github user akashrn5 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r184359005

--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example

--- End diff --

Added numbering for the steps; sub-headings are kept as they are.
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4256/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4264/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5431/
---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215

SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4564/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5442/
---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r184665011

--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,204 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
+
+2. Package the carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
+   ```shell
+   mvn clean package -DskipTests -Pspark-2.2
+   ```
+
+3. Start spark-shell in a new terminal, type :paste, then copy and run the following code.
+   ```scala
+   import java.io.File
+   import org.apache.spark.sql.{CarbonEnv, SparkSession}
+   import org.apache.spark.sql.CarbonSession._
+   import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+   import org.apache.carbondata.core.util.path.CarbonStorePath
+
+   val warehouse = new File("./warehouse").getCanonicalPath
+   val metastore = new File("./metastore").getCanonicalPath
+
+   val spark = SparkSession
+     .builder()
+     .master("local")
+     .appName("luceneDatamapExample")
+     .config("spark.sql.warehouse.dir", warehouse)
+     .getOrCreateCarbonSession(warehouse, metastore)
+
+   spark.sparkContext.setLogLevel("ERROR")
+
+   // drop the table if it exists
+   spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+   // create the main table
+   spark.sql(
+     s"""
+        |CREATE TABLE datamap_test (
+        |name string,
+        |age int,
+        |city string,
+        |country string)
+        |STORED BY 'carbondata'
+      """.stripMargin)
+
+   // create a lucene datamap on the main table
+   spark.sql(
+     s"""
+        |CREATE DATAMAP dm
+        |ON TABLE datamap_test
+        |USING "lucene"
+        |DMPROPERTIES ('TEXT_COLUMNS' = 'name,country')
+      """.stripMargin)
+
+   import spark.implicits._
+   import org.apache.spark.sql.SaveMode
+   import scala.util.Random
+
+   // load data into the main table; if lucene index writing fails,
+   // the datamap will be disabled in queries
+   val r = new Random()
+   spark.sparkContext.parallelize(1 to 10)
+     .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+     .toDF("name", "age", "city", "country")
+     .write
+     .format("carbondata")
+     .option("tableName", "datamap_test")
+     .option("compress", "true")
+     .mode(SaveMode.Append)
+     .save()
+
+   spark.sql(
+     s"""
+        |SELECT *
+        |FROM datamap_test WHERE
+        |TEXT_MATCH('name:c10')
+      """.stripMargin).show
+
+   spark.stop
+   ```
+
+#### DataMap Management
+A Lucene DataMap can be created using the following DDL:
+  ```
+  CREATE DATAMAP [IF NOT EXISTS] datamap_name
+  ON TABLE main_table
+  USING "lucene"
+  DMPROPERTIES ('text_columns'='city, name', ...)
+  ```
+
+A DataMap can be dropped using the following DDL:
+  ```
+  DROP DATAMAP [IF EXISTS] datamap_name
+  ON TABLE main_table
+  ```
+To show all DataMaps created, use:
+  ```
+  SHOW DATAMAP
+  ON TABLE main_table
+  ```
+It will show all DataMaps created on the main table.
+
+## Lucene DataMap Introduction
+  Lucene is a high-performance, full-featured text search engine. Lucene is integrated into carbon
+  as an index datamap and managed along with main tables by CarbonData. Users can create lucene
+  datamaps to improve query performance on string columns.
+
+  For instance, a main table called **datamap_test** is defined as:
+
+  ```
+  CREATE TABLE datamap_test (
+    name string,
+    age int,
+    city string,
+    country string)
+  STORED BY 'carbondata'
+  ```
+
+  Users can create a Lucene datamap using the Create DataMap DDL:
+
+  ```
+  CREATE DATAMAP dm
+  ON TABLE datamap_test
+  USING "lucene"
+  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+  ```
+
+## Loading data
+When loading data into the main table, lucene index files are generated for all the
+text_columns (string columns) given in DMPROPERTIES; they record the data location of the
+text_columns. These index files are written inside a folder named after the datamap, inside
+each segment folder.
+
+A system-level configuration, carbon.lucene.compression.mode, can be set for better compression
+of lucene index files. The default value is speed, which favors index writing speed; if the
+value is compression, the index files will be smaller.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on the main table. When a query with TEXT_MATCH() is fired, two jobs are fired.

--- End diff --

Now there is one more UDF added (TEXT_MATCH_WITH_LIMIT); please add it also.
---
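[Editor's note: for readers trying the quoted guide in spark-shell, the carbon.lucene.compression.mode setting described under "Loading data" would normally go into carbon.properties before starting the session. A minimal sketch of setting it programmatically instead, assuming the property is read through the CarbonProperties singleton like other carbon.* options (an assumption worth verifying against the release in use):

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Ask the lucene index writer to favor smaller index files over writing
// speed; the default mode is "speed". Assumed behavior: this must be set
// before the data load that builds the index, or it has no effect on it.
CarbonProperties.getInstance()
  .addProperty("carbon.lucene.compression.mode", "compression")
```
]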
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215

SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4577/
---
Github user chenliang613 commented on the issue:
https://github.com/apache/carbondata/pull/2215

retest this please
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5597/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4436/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4438/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5599/
---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215

SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4692/
---
Github user chenliang613 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r185711684

--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,213 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
+
+2. Package the carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
+   ```shell
+   mvn clean package -DskipTests -Pspark-2.2
+   ```
+
+3. Start spark-shell in a new terminal, type :paste, then copy and run the following code.
+   ```scala
+   import java.io.File
+   import org.apache.spark.sql.{CarbonEnv, SparkSession}
+   import org.apache.spark.sql.CarbonSession._
+   import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+   import org.apache.carbondata.core.util.path.CarbonStorePath
+
+   val warehouse = new File("./warehouse").getCanonicalPath
+   val metastore = new File("./metastore").getCanonicalPath
+
+   val spark = SparkSession
+     .builder()
+     .master("local")
+     .appName("luceneDatamapExample")
+     .config("spark.sql.warehouse.dir", warehouse)
+     .getOrCreateCarbonSession(warehouse, metastore)
+
+   spark.sparkContext.setLogLevel("ERROR")
+
+   // drop the table if it exists
+   spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+   // create the main table
+   spark.sql(
+     s"""
+        |CREATE TABLE datamap_test (
+        |name string,
+        |age int,
+        |city string,
+        |country string)
+        |STORED BY 'carbondata'
+      """.stripMargin)
+
+   // create a lucene datamap on the main table
+   spark.sql(
+     s"""
+        |CREATE DATAMAP dm
+        |ON TABLE datamap_test
+        |USING "lucene"
+        |DMPROPERTIES ('TEXT_COLUMNS' = 'name,country')
+      """.stripMargin)
+
+   import spark.implicits._
+   import org.apache.spark.sql.SaveMode
+   import scala.util.Random
+
+   // load data into the main table; if lucene index writing fails,
+   // the datamap will be disabled in queries
+   val r = new Random()
+   spark.sparkContext.parallelize(1 to 10)
+     .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+     .toDF("name", "age", "city", "country")
+     .write
+     .format("carbondata")
+     .option("tableName", "datamap_test")
+     .option("compress", "true")
+     .mode(SaveMode.Append)
+     .save()
+
+   spark.sql(
+     s"""
+        |SELECT *
+        |FROM datamap_test WHERE
+        |TEXT_MATCH('name:c10')
+      """.stripMargin).show
+
+   spark.sql(
+     s"""
+        |SELECT *
+        |FROM datamap_test WHERE
+        |TEXT_MATCH('name:c10', 10)
+      """.stripMargin).show
+
+   spark.stop
+   ```
+
+#### DataMap Management
+A Lucene DataMap can be created using the following DDL:
+  ```
+  CREATE DATAMAP [IF NOT EXISTS] datamap_name
+  ON TABLE main_table
+  USING "lucene"
+  DMPROPERTIES ('text_columns'='city, name', ...)
+  ```
+
+A DataMap can be dropped using the following DDL:
+  ```
+  DROP DATAMAP [IF EXISTS] datamap_name
+  ON TABLE main_table
+  ```
+To show all DataMaps created, use:
+  ```
+  SHOW DATAMAP
+  ON TABLE main_table
+  ```
+It will show all DataMaps created on the main table.
+
+## Lucene DataMap Introduction
+  Lucene is a high-performance, full-featured text search engine. Lucene is integrated into carbon
+  as an index datamap and managed along with main tables by CarbonData. Users can create lucene
+  datamaps to improve query performance on string columns.
+
+  For instance, a main table called **datamap_test** is defined as:
+
+  ```
+  CREATE TABLE datamap_test (
+    name string,
+    age int,
+    city string,
+    country string)
+  STORED BY 'carbondata'
+  ```
+
+  Users can create a Lucene datamap using the Create DataMap DDL:
+
+  ```
+  CREATE DATAMAP dm
+  ON TABLE datamap_test
+  USING "lucene"
+  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+  ```
+
+## Loading data
+When loading data into the main table, lucene index files are generated for all the
+text_columns (string columns) given in DMPROPERTIES; they record the data location of the
+text_columns. These index files are written inside a folder named after the datamap, inside
+each segment folder.
+
+A system-level configuration, carbon.lucene.compression.mode, can be set for better compression
+of lucene index files. The default value is speed, which favors index writing speed; if the
+value is compression, the index files will be smaller.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on the main table. When a query with TEXT_MATCH('name:c10') or
+TEXT_MATCH('name:n10', 10) [the second parameter is the number of results to return; if the
+user does not specify it, all results are returned without any limit] is fired, two jobs are
+fired. The first job writes temporary files into a folder created at the table level, holding
+lucene's search results; the second job reads these files to give faster results. These
+temporary files are cleared once the query finishes.
+
+Users can verify whether a query can leverage the Lucene datamap by executing the `EXPLAIN`
+command, which shows the transformed logical plan, so users can check whether the TEXT_MATCH()
+filter is applied to the query.
+
+The LIKE queries below can be converted to TEXT_MATCH queries as follows:
+```
+select * from datamap_test where name='n10'
+
+select * from datamap_test where name like 'n1%'
+
+select * from datamap_test where name like '%10'
+
+select * from datamap_test where name like '%n%'
+
+select * from datamap_test where name like '%10' and name not like '%n%'
+```
+Lucene TEXT_MATCH queries:
+```
+select * from datamap_test where TEXT_MATCH('name:n10')
+
+select * from datamap_test where TEXT_MATCH('name:n1*')
+
+select * from datamap_test where TEXT_MATCH('name:*10')
+
+select * from datamap_test where TEXT_MATCH('name:*n*')
+
+select * from datamap_test where TEXT_MATCH('name:*10 and -name:*n*')
+```

--- End diff --

The syntax is wrong: "and" is not needed; it should be TEXT_MATCH('name:*10 -name:*n*')
---
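[Editor's note: to make the `EXPLAIN` verification step from the quoted section concrete, here is a minimal spark-shell sketch. It assumes the datamap_test table and dm datamap from the quick example already exist; the second query uses the corrected negation syntax from the review comment above.

```scala
// Print the transformed plan; if the Lucene datamap can serve the query,
// the TEXT_MATCH() filter should be visible in the plan output.
spark.sql(
  "EXPLAIN SELECT * FROM datamap_test WHERE TEXT_MATCH('name:n10')"
).show(false)

// Negation without "and", as pointed out in the review comment.
spark.sql(
  "EXPLAIN SELECT * FROM datamap_test WHERE TEXT_MATCH('name:*10 -name:*n*')"
).show(false)
```
]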
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4447/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5608/
---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215

SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4701/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5699/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4539/
---