[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

30 messages

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/25/



---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r216156161
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
       USING 'lucene'
       DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
       ```
    -
    -**DMProperties**
    -1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
    -2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
    -   aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
    -   cardinality dimensions.
    -3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
    -   folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
    -   also it makes lucene small chunks of data.
    +**Properties for Lucene DataMap**
    +
    +| Property | Is Required | Default Value | Description |
    +|-------------|----------|--------|---------|
    +| INDEX_COLUMNS | YES |  | Carbondata will generate Lucene index on these string columns. |
    +| FLUSH_CACHE | NO | -1 | It defines the size of the cache to maintain in Lucene writer. If specified, it tries to aggregate the unique data till the cache limit and then flushes to Lucene. It is recommended to define FLUSH_CACHE for low cardinality dimensions.|
    --- End diff --
   
    Also, what does the default value '-1' mean? Does it mean the maximum cache size will be used, or that no cache will be used?
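
    For reference, a minimal CREATE DATAMAP statement combining these properties might look like the following sketch (the datamap name, table name, and property values are illustrative assumptions, not taken from the PR):

    ```sql
    -- Illustrative only: names and values are assumptions, not from the PR.
    CREATE DATAMAP dm_lucene ON TABLE main_table
    USING 'lucene'
    DMPROPERTIES (
      'INDEX_COLUMNS'  = 'name, country',
      'FLUSH_CACHE'    = '10',    -- example cache size; documented default is -1
      'SPLIT_BLOCKLET' = 'true'   -- documented default per the table above
    )
    ```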


---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r216156187
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
       USING 'lucene'
       DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
       ```
    -
    -**DMProperties**
    -1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
    -2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
    -   aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
    -   cardinality dimensions.
    -3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
    -   folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
    -   also it makes lucene small chunks of data.
    +**Properties for Lucene DataMap**
    +
    +| Property | Is Required | Default Value | Description |
    +|-------------|----------|--------|---------|
    +| INDEX_COLUMNS | YES |  | Carbondata will generate Lucene index on these string columns. |
    +| FLUSH_CACHE | NO | -1 | It defines the size of the cache to maintain in Lucene writer. If specified, it tries to aggregate the unique data till the cache limit and then flushes to Lucene. It is recommended to define FLUSH_CACHE for low cardinality dimensions.|
    +| SPLIT_BLOCKLET | NO | TRUE | When SPLIT_BLOCKLET is defined as "TRUE", folders are created per blocklet by using the blockletID. This eliminates indexing blockletID by lucene by storing only pageID and rowID, thus reducing the size of indexes created by lucene. |
    +
    +**Folder Structure for lucene datamap:**
    +  * Location of index files when Split BlockletId is TRUE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/blockletID/..
    +
    +  * Location of index files when Split BlockletId is FALSE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/..
       
     ## Loading data
    -When loading data to main table, lucene index files will be generated for all the
    -index_columns(String Columns) given in DMProperties which contains information about the data
    -location of index_columns. These index files will be written inside a folder named with datamap name
    -inside each segment folders.
    +When loading data to main table, lucene index files will be generated for all the index_columns(String Columns) given in DMProperties which contains information about the data location of index_columns. These index files will be written into the path mentioned above.
    --- End diff --
   
    for all the index_columns(String Columns)
    ---
    I think there is no need to mention 'String Columns' again, since it is already mentioned in DMProperties.


---

[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r216156194
 
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
       USING 'lucene'
       DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
       ```
    -
    -**DMProperties**
    -1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
    -2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
    -   aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
    -   cardinality dimensions.
    -3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
    -   folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
    -   also it makes lucene small chunks of data.
    +**Properties for Lucene DataMap**
    +
    +| Property | Is Required | Default Value | Description |
    +|-------------|----------|--------|---------|
    +| INDEX_COLUMNS | YES |  | Carbondata will generate Lucene index on these string columns. |
    +| FLUSH_CACHE | NO | -1 | It defines the size of the cache to maintain in Lucene writer. If specified, it tries to aggregate the unique data till the cache limit and then flushes to Lucene. It is recommended to define FLUSH_CACHE for low cardinality dimensions.|
    +| SPLIT_BLOCKLET | NO | TRUE | When SPLIT_BLOCKLET is defined as "TRUE", folders are created per blocklet by using the blockletID. This eliminates indexing blockletID by lucene by storing only pageID and rowID, thus reducing the size of indexes created by lucene. |
    +
    +**Folder Structure for lucene datamap:**
    +  * Location of index files when Split BlockletId is TRUE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/blockletID/..
    +
    +  * Location of index files when Split BlockletId is FALSE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/..
       
     ## Loading data
    -When loading data to main table, lucene index files will be generated for all the
    -index_columns(String Columns) given in DMProperties which contains information about the data
    -location of index_columns. These index files will be written inside a folder named with datamap name
    -inside each segment folders.
    +When loading data to main table, lucene index files will be generated for all the index_columns(String Columns) given in DMProperties which contains information about the data location of index_columns. These index files will be written into the path mentioned above.
     
    -A system level configuration carbon.lucene.compression.mode can be added for best compression of
    -lucene index files. The default value is speed, where the index writing speed will be more. If the
    -value is compression, the index file size will be compressed.
    +A system level configuration carbon.lucene.compression.mode can be added for best compression of lucene index files. The default value is speed, where the index writing speed will be more. If the value is compression, the index file size will be compressed.
    --- End diff --
   
    You can quote the configuration like this: `carbon.lucene.compression.mode`.
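
    For illustration, this property would typically be set in the carbon.properties file (a sketch; the value shown is the documented default, and the comments summarize the description quoted above):

    ```
    # carbon.properties (illustrative placement)
    # 'speed' favors index writing speed; 'compression' favors smaller index files.
    carbon.lucene.compression.mode = speed
    ```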


---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    Build Failed  with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8783/



---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/716/



---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/538/



---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1989/



---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2166/



---

[GitHub] carbondata issue #2632: [CARBONDATA-2206] Enhanced document on Lucene datama...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2632
 
    Build Failed  with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10241/



---