Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189435741

--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
--- End diff --

It's incorrect here: `data-management-with-pre-aggregate-tables`. It should be `data-management-with-lucene-datamap`.
---
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189435815

--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
--- End diff --

@jackylk I think it would be better to add a separate document describing the common operations for index datamaps, since the descriptions of `Data Management`, `REBUILD DATAMAP`, and `WITH DEFERRED REBUILD` are the same for `BloomFilterDataMap` and `LuceneDataMap`.
---
Github user akashrn5 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189507570

--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('index_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL:
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
+ an index datamap and managed along with main tables by CarbonData.User can create lucene datamap
+ to improve query performance on string columns which has content of more length.
+
+ For instance, main table called **datamap_test** which is defined as:
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL:
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('INDEX_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, lucene index files will be generated for all the
+index_columns(String Columns) given in DMProperties which contains information about the data
+location of index_columns. These index files will be written inside a folder named with datamap name
+inside each segment folders.
+
+A system level configuration carbon.lucene.compression.mode can be added for best compression of
+lucene index files. The default value is speed, where the index writing speed will be more. If the
+value is compression, the index file size will be compressed.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or
+TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be
+returned, if user does not specify this value, all results will be returned without any limit] is
+fired, two jobs are fired.The first job writes the temporary files in folder created at table level
+which contains lucene's seach results and these files will be read in second job to give faster
+results. These temporary files will be cleared once the query finishes.
+
+User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
+filter is applied on query or not.
+
+Note: The filter columns in TEXT_MATCH or TEXT_MATCH_WITH_LIMIT must be always in lower case and
+filter condition like 'AND','OR' must be in upper case.
+
+Ex: ```
+ select * from datamap_test where TEXT_MATCH('name:*10 AND name:*n*')
+ ```
+
+Below like queries can be converted to text_match queries as following:
+```
+select * from datamap_test where name='n10'
+
+select * from datamap_test where name like 'n1%'
+
+select * from datamap_test where name like '%10'
+
+select * from datamap_test where name like '%n%'
+
+select * from datamap_test where name like '%10' and name not like '%n%'
+```
+Lucene TEXT_MATCH Queries:
+```
+select * from datamap_test where TEXT_MATCH('name:n10')
+
+select * from datamap_test where TEXT_MATCH('name:n1*')
+
+select * from datamap_test where TEXT_MATCH('name:*10')
+
+select * from datamap_test where TEXT_MATCH('name:*n*')
+
+select * from datamap_test where TEXT_MATCH('name:*10 -name:*n*')
--- End diff --

Added a link that provides details of all these queries.
---
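Editor's illustration: the correspondence between the LIKE queries and the TEXT_MATCH queries quoted in the diff above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: `like_to_lucene_term` and `text_match_filter` are hypothetical helper names, not part of CarbonData, and the sketch handles only the `%` wildcard and the AND / NOT combinations that appear in the quoted examples. It follows the note in the diff that column names must be lower case and operators such as `AND` upper case.

```python
def like_to_lucene_term(column, like_pattern):
    """Turn one (column, LIKE pattern) pair into a Lucene wildcard term.

    Hypothetical helper for illustration only. It lower-cases the column
    name and maps the SQL '%' wildcard to the Lucene '*' wildcard.
    A pattern without '%' (plain equality) passes through unchanged.
    Quote characters inside the pattern are NOT escaped.
    """
    return f"{column.lower()}:{like_pattern.replace('%', '*')}"


def text_match_filter(must, must_not=()):
    """Build a TEXT_MATCH(...) predicate string.

    must     -- (column, pattern) pairs that all have to match (joined by AND)
    must_not -- (column, pattern) pairs that must not match ('-' prefix)
    """
    query = " AND ".join(like_to_lucene_term(c, p) for c, p in must)
    for column, pattern in must_not:
        query += " -" + like_to_lucene_term(column, pattern)
    return f"TEXT_MATCH('{query}')"


if __name__ == "__main__":
    # name like '%10' and name not like '%n%'
    predicate = text_match_filter([("name", "%10")], [("name", "%n%")])
    print("select * from datamap_test where " + predicate)
```

Keeping the per-term translation separate from the predicate builder means the same term function covers the equality, prefix, suffix, and contains cases listed in the diff; only the combination step needs to know about `AND` and the `-` exclusion prefix.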
Github user akashrn5 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189507649

--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
--- End diff --

Yes, I think the same. I'm also not sure how refresh works, so this PR will be specific to Lucene.
---
Github user akashrn5 commented on the issue:
https://github.com/apache/carbondata/pull/2215

@xuchuanyin and @jackylk please review
---
Github user akashrn5 commented on the issue:
https://github.com/apache/carbondata/pull/2215

@chenliang613 please review and merge
---
Github user chenliang613 commented on the issue:
https://github.com/apache/carbondata/pull/2215

LGTM
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6005/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4846/
---