[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

30 messages

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
GitHub user praveenmeenakshi56 opened a pull request:

    https://github.com/apache/carbondata/pull/2632

    [CARBONDATA-2206] Enhanced document on Lucene datamap Support

    Enhanced documentation of Lucene DataMap
   
     - [ ] Any interfaces changed?
    NA
     - [ ] Any backward compatibility impacted?
    NA
     - [ ] Document update required?
    Document Updated
     - [ ] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
    NA
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
    NA


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/praveenmeenakshi56/carbondata lucene_doc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2632.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2632
   
----
commit 15f2929f0dafbf7b7d7f8a62c5ae9d2f66955528
Author: praveenmeenakshi56 <praveenmeenakshi56@...>
Date:   2018-08-13T07:15:02Z

    Updated document on Lucene datamap Support

----


---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6252/



---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6253/



---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7896/



---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6620/



---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6255/



---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6623/



---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6256/



---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7899/



---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r210796794
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -44,12 +44,8 @@ To show all DataMaps created, use:
       ```
     It will show all DataMaps created on main table.
     
    -
     ## Lucene DataMap Introduction
    -  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
    -  an index datamap and managed along with main tables by CarbonData.User can create lucene datamap
    -  to improve query performance on string columns which has content of more length. So, user can
    -  search tokenized word or pattern of it using lucene query on text content.
    +  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as an index datamap and managed along with main tables by CarbonData.User can create lucene datamap to improve query performance on string columns which has content of more length. So, user can search tokenized word or pattern of it using lucene query on text content.
    --- End diff --
   
    is it for string content of more length? or is it natural language sentences stored as string?


---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r210796724
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -44,12 +44,8 @@ To show all DataMaps created, use:
       ```
     It will show all DataMaps created on main table.
     
    -
     ## Lucene DataMap Introduction
    -  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
    -  an index datamap and managed along with main tables by CarbonData.User can create lucene datamap
    -  to improve query performance on string columns which has content of more length. So, user can
    -  search tokenized word or pattern of it using lucene query on text content.
    +  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as an index datamap and managed along with main tables by CarbonData.User can create lucene datamap to improve query performance on string columns which has content of more length. So, user can search tokenized word or pattern of it using lucene query on text content.
    --- End diff --
   
    use consistent naming.some places carbon, some places carbondata


---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r210797479
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
       USING 'lucene'
       DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
       ```
    -
    -**DMProperties**
    -1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
    -2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
    -   aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
    -   cardinality dimensions.
    -3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
    -   folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
    -   also it makes lucene small chunks of data.
    +**Properties for Lucene DataMap**
    +
    +| Property | Is Required | Default Value | Description |
    +|-------------|----------|--------|---------|
    +| INDEX_COLUMNS | YES |  | Carbondata will generate Lucene index on these string columns. |
    +| FLUSH_CACHE | NO | -1 | It defines the size of the cache to maintain in Lucene writer. If specified, it tries to aggregate the unique data till the cache limit and then flushes to Lucene. It is recommended to define FLUSH_CACHE for low cardinality dimensions.|
    +| SPLIT_BLOCKLET | NO | TRUE | When SPLIT_BLOCKLET is defined as "TRUE", folders are created per blocklet by using the blockletID. This eliminates indexing blockletID by lucene by storing only pageID and rowID, thus reducing the size of indexes created by lucene. |
    --- End diff --
   
    what happens when false?


---
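[Editor's note: the DMPROPERTIES table quoted in the diff above lists INDEX_COLUMNS, FLUSH_CACHE, and SPLIT_BLOCKLET separately, but no single creation statement in the thread combines them. A minimal sketch, assuming the guide's properties; the datamap name, table name, and FLUSH_CACHE value here are hypothetical:]

```sql
-- Sketch only: dm_text, main_table, and the cache size are illustrative
CREATE DATAMAP dm_text ON TABLE main_table
USING 'lucene'
DMPROPERTIES (
  'INDEX_COLUMNS'  = 'name, country',  -- required: string columns Lucene indexes
  'FLUSH_CACHE'    = '8096',           -- optional: writer cache size (default -1, i.e. off)
  'SPLIT_BLOCKLET' = 'TRUE'            -- optional: one index folder per blocklet (default TRUE)
)
```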

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r210797397
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
       USING 'lucene'
       DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
       ```
    -
    -**DMProperties**
    -1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
    -2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
    -   aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
    -   cardinality dimensions.
    -3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
    -   folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
    -   also it makes lucene small chunks of data.
    +**Properties for Lucene DataMap**
    +
    +| Property | Is Required | Default Value | Description |
    +|-------------|----------|--------|---------|
    +| INDEX_COLUMNS | YES |  | Carbondata will generate Lucene index on these string columns. |
    +| FLUSH_CACHE | NO | -1 | It defines the size of the cache to maintain in Lucene writer. If specified, it tries to aggregate the unique data till the cache limit and then flushes to Lucene. It is recommended to define FLUSH_CACHE for low cardinality dimensions.|
    --- End diff --
   
    explanation is not clear.why it is recommended for low cardinality columns?


---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r210798605
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -152,25 +144,17 @@ select * from datamap_test where TEXT_MATCH('name:*10 -name:*n*')
     **Note:** For lucene queries and syntax, refer to [lucene-syntax](www.lucenetutorial.com/lucene-query-syntax.html)
     
     ## Data Management with lucene datamap
    -Once there is lucene datamap is created on the main table, following command on the main
    -table
    -is not supported:
    +Once lucene datamap is created on the main table, following command on the main table is not supported:
     1. Data management command: `UPDATE/DELETE`.
     2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`,
     `ALTER TABLE RENAME`.
     
    -**Note**: Adding a new column is supported, and for dropping columns and change datatype
    -command, CarbonData will check whether it will impact the lucene datamap, if not, the operation
    -is allowed, otherwise operation will be rejected by throwing exception.
    -
    +**Note**: Adding a new column is supported, and for dropping columns and change datatype command, CarbonData will check whether it will impact the lucene datamap, if not, the operation is allowed, otherwise operation will be rejected by throwing exception.
     
     3. Partition management command: `ALTER TABLE ADD/DROP PARTITION`.
     
    -However, there is still way to support these operations on main table, in current CarbonData
    -release, user can do as following:
    +However, there is still way to support these operations on main table, in current CarbonData release, user can do as following:
    --- End diff --
   
    not the right sentence to specify how to achieve the functionality


---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r210797873
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
       USING 'lucene'
       DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
       ```
    -
    -**DMProperties**
    -1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
    -2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
    -   aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
    -   cardinality dimensions.
    -3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
    -   folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
    -   also it makes lucene small chunks of data.
    +**Properties for Lucene DataMap**
    --- End diff --
   
    we need to mention what type of data types are supported as a separate section


---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r210797946
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
       USING 'lucene'
       DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
       ```
    -
    -**DMProperties**
    -1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
    -2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
    -   aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
    -   cardinality dimensions.
    -3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
    -   folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
    -   also it makes lucene small chunks of data.
    +**Properties for Lucene DataMap**
    +
    +| Property | Is Required | Default Value | Description |
    +|-------------|----------|--------|---------|
    +| INDEX_COLUMNS | YES |  | Carbondata will generate Lucene index on these string columns. |
    +| FLUSH_CACHE | NO | -1 | It defines the size of the cache to maintain in Lucene writer. If specified, it tries to aggregate the unique data till the cache limit and then flushes to Lucene. It is recommended to define FLUSH_CACHE for low cardinality dimensions.|
    +| SPLIT_BLOCKLET | NO | TRUE | When SPLIT_BLOCKLET is defined as "TRUE", folders are created per blocklet by using the blockletID. This eliminates indexing blockletID by lucene by storing only pageID and rowID, thus reducing the size of indexes created by lucene. |
    +
    +**Folder Structure for lucene datamap:**
    +  * Location of index files when Split BlockletId is TRUE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/blockletID/..
    +
    +  * Location of index files when Split BlockletId is FALSE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/..
       
     ## Loading data
    -When loading data to main table, lucene index files will be generated for all the
    -index_columns(String Columns) given in DMProperties which contains information about the data
    -location of index_columns. These index files will be written inside a folder named with datamap name
    -inside each segment folders.
    +When loading data to main table, lucene index files will be generated for all the index_columns(String Columns) given in DMProperties which contains information about the data location of index_columns. These index files will be written into the path mentioned above.
    --- End diff --
   
    do we configure location?


---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r210798420
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
       USING 'lucene'
       DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
       ```
    -
    -**DMProperties**
    -1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
    -2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
    -   aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
    -   cardinality dimensions.
    -3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
    -   folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
    -   also it makes lucene small chunks of data.
    +**Properties for Lucene DataMap**
    +
    +| Property | Is Required | Default Value | Description |
    +|-------------|----------|--------|---------|
    +| INDEX_COLUMNS | YES |  | Carbondata will generate Lucene index on these string columns. |
    +| FLUSH_CACHE | NO | -1 | It defines the size of the cache to maintain in Lucene writer. If specified, it tries to aggregate the unique data till the cache limit and then flushes to Lucene. It is recommended to define FLUSH_CACHE for low cardinality dimensions.|
    +| SPLIT_BLOCKLET | NO | TRUE | When SPLIT_BLOCKLET is defined as "TRUE", folders are created per blocklet by using the blockletID. This eliminates indexing blockletID by lucene by storing only pageID and rowID, thus reducing the size of indexes created by lucene. |
    +
    +**Folder Structure for lucene datamap:**
    +  * Location of index files when Split BlockletId is TRUE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/blockletID/..
    +
    +  * Location of index files when Split BlockletId is FALSE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/..
       
     ## Loading data
    -When loading data to main table, lucene index files will be generated for all the
    -index_columns(String Columns) given in DMProperties which contains information about the data
    -location of index_columns. These index files will be written inside a folder named with datamap name
    -inside each segment folders.
    +When loading data to main table, lucene index files will be generated for all the index_columns(String Columns) given in DMProperties which contains information about the data location of index_columns. These index files will be written into the path mentioned above.
     
    -A system level configuration carbon.lucene.compression.mode can be added for best compression of
    -lucene index files. The default value is speed, where the index writing speed will be more. If the
    -value is compression, the index file size will be compressed.
    +A system level configuration carbon.lucene.compression.mode can be added for best compression of lucene index files. The default value is speed, where the index writing speed will be more. If the value is compression, the index file size will be compressed.
     
     ## Querying data
     As a technique for query acceleration, Lucene indexes cannot be queried directly.
    -Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or
    -TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be
    -returned, if user does not specify this value, all results will be returned without any limit] is
    -fired, two jobs are fired.The first job writes the temporary files in folder created at table level
    -which contains lucene's seach results and these files will be read in second job to give faster
    -results. These temporary files will be cleared once the query finishes.
    +Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be returned, if user does not specify this value, all results will be returned without any limit] is fired, two jobs are fired. The first job performs pruning based on filter values and writes the lucene search results into temporary files in the dataMap folder created at table level. These files will be read during the second job (filter execution) to give faster results. These temporary files will be cleared once the query finishes.
    +
    +User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN` command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH() filter is applied on query or not.
     
    -User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
    -command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
    -filter is applied on query or not.
    +**NOTE:** Temporary files will contain blockletId, pageId, and rowId of filter query.
    --- End diff --
   
    whats the use of this note?


---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r210797610
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
       USING 'lucene'
       DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
       ```
    -
    -**DMProperties**
    -1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
    -2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
    -   aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
    -   cardinality dimensions.
    -3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
    -   folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
    -   also it makes lucene small chunks of data.
    +**Properties for Lucene DataMap**
    +
    +| Property | Is Required | Default Value | Description |
    +|-------------|----------|--------|---------|
    +| INDEX_COLUMNS | YES |  | Carbondata will generate Lucene index on these string columns. |
    +| FLUSH_CACHE | NO | -1 | It defines the size of the cache to maintain in Lucene writer. If specified, it tries to aggregate the unique data till the cache limit and then flushes to Lucene. It is recommended to define FLUSH_CACHE for low cardinality dimensions.|
    +| SPLIT_BLOCKLET | NO | TRUE | When SPLIT_BLOCKLET is defined as "TRUE", folders are created per blocklet by using the blockletID. This eliminates indexing blockletID by lucene by storing only pageID and rowID, thus reducing the size of indexes created by lucene. |
    --- End diff --
   
    configuring false would increase the size of indexes?


---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r210798045
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
       USING 'lucene'
       DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
       ```
    -
    -**DMProperties**
    -1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
    -2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
    -   aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
    -   cardinality dimensions.
    -3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
    -   folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
    -   also it makes lucene small chunks of data.
    +**Properties for Lucene DataMap**
    +
    +| Property | Is Required | Default Value | Description |
    +|-------------|----------|--------|---------|
    +| INDEX_COLUMNS | YES |  | Carbondata will generate Lucene index on these string columns. |
    +| FLUSH_CACHE | NO | -1 | It defines the size of the cache to maintain in Lucene writer. If specified, it tries to aggregate the unique data till the cache limit and then flushes to Lucene. It is recommended to define FLUSH_CACHE for low cardinality dimensions.|
    +| SPLIT_BLOCKLET | NO | TRUE | When SPLIT_BLOCKLET is defined as "TRUE", folders are created per blocklet by using the blockletID. This eliminates indexing blockletID by lucene by storing only pageID and rowID, thus reducing the size of indexes created by lucene. |
    +
    +**Folder Structure for lucene datamap:**
    +  * Location of index files when Split BlockletId is TRUE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/blockletID/..
    +
    +  * Location of index files when Split BlockletId is FALSE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/..
       
     ## Loading data
    -When loading data to main table, lucene index files will be generated for all the
    -index_columns(String Columns) given in DMProperties which contains information about the data
    -location of index_columns. These index files will be written inside a folder named with datamap name
    -inside each segment folders.
    +When loading data to main table, lucene index files will be generated for all the index_columns(String Columns) given in DMProperties which contains information about the data location of index_columns. These index files will be written into the path mentioned above.
     
    -A system level configuration carbon.lucene.compression.mode can be added for best compression of
    -lucene index files. The default value is speed, where the index writing speed will be more. If the
    -value is compression, the index file size will be compressed.
    +A system level configuration carbon.lucene.compression.mode can be added for best compression of lucene index files. The default value is speed, where the index writing speed will be more. If the value is compression, the index file size will be compressed.
    --- End diff --
   
    what are the other possible options?


---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r210798390
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
       USING 'lucene'
       DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
       ```
    -
    -**DMProperties**
    -1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
    -2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
    -   aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
    -   cardinality dimensions.
    -3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
    -   folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
    -   also it makes lucene small chunks of data.
    +**Properties for Lucene DataMap**
    +
    +| Property | Is Required | Default Value | Description |
    +|-------------|----------|--------|---------|
    +| INDEX_COLUMNS | YES |  | Carbondata will generate Lucene index on these string columns. |
    +| FLUSH_CACHE | NO | -1 | It defines the size of the cache to maintain in Lucene writer. If specified, it tries to aggregate the unique data till the cache limit and then flushes to Lucene. It is recommended to define FLUSH_CACHE for low cardinality dimensions.|
    +| SPLIT_BLOCKLET | NO | TRUE | When SPLIT_BLOCKLET is defined as "TRUE", folders are created per blocklet by using the blockletID. This eliminates indexing blockletID by lucene by storing only pageID and rowID, thus reducing the size of indexes created by lucene. |
    +
    +**Folder Structure for lucene datamap:**
    +  * Location of index files when Split BlockletId is TRUE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/blockletID/..
    +
    +  * Location of index files when Split BlockletId is FALSE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/..
       
     ## Loading data
    -When loading data to main table, lucene index files will be generated for all the
    -index_columns(String Columns) given in DMProperties which contains information about the data
    -location of index_columns. These index files will be written inside a folder named with datamap name
    -inside each segment folders.
    +When loading data to main table, lucene index files will be generated for all the index_columns(String Columns) given in DMProperties which contains information about the data location of index_columns. These index files will be written into the path mentioned above.
     
    -A system level configuration carbon.lucene.compression.mode can be added for best compression of
    -lucene index files. The default value is speed, where the index writing speed will be more. If the
    -value is compression, the index file size will be compressed.
    +A system level configuration carbon.lucene.compression.mode can be added for best compression of lucene index files. The default value is speed, where the index writing speed will be more. If the value is compression, the index file size will be compressed.
     
     ## Querying data
     As a technique for query acceleration, Lucene indexes cannot be queried directly.
    -Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or
    -TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be
    -returned, if user does not specify this value, all results will be returned without any limit] is
    -fired, two jobs are fired.The first job writes the temporary files in folder created at table level
    -which contains lucene's seach results and these files will be read in second job to give faster
    -results. These temporary files will be cleared once the query finishes.
    +Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be returned, if user does not specify this value, all results will be returned without any limit] is fired, two jobs are fired. The first job performs pruning based on filter values and writes the lucene search results into temporary files in the dataMap folder created at table level. These files will be read during the second job (filter execution) to give faster results. These temporary files will be cleared once the query finishes.
    --- End diff --
   
    sentence can be written to specify the lucene UDFs we are using as means of firing query to lucene


---
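[Editor's note: the review comments above discuss the TEXT_MATCH UDFs without a complete query in one place. A minimal sketch of the query-side usage described in the quoted guide text, with hypothetical table and column names:]

```sql
-- Filter the main table through the Lucene datamap (names are illustrative)
SELECT * FROM main_table WHERE TEXT_MATCH('name:n10*');

-- Same query, but the second argument caps the number of Lucene results returned
SELECT * FROM main_table WHERE TEXT_MATCH_WITH_LIMIT('name:n10', 10);

-- Per the quoted guide, EXPLAIN shows whether the TEXT_MATCH() filter
-- is applied, i.e. whether the query can leverage the Lucene datamap
EXPLAIN SELECT * FROM main_table WHERE TEXT_MATCH('name:n10');
```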