[GitHub] carbondata pull request #2206: [WIP] Improve Lucene datamap performance by e...


GitHub user ravipesala opened a pull request:

https://github.com/apache/carbondata/pull/2206

[WIP] Improve Lucene datamap performance by eliminating blockid while writing and reading index.

This PR depends on PR https://github.com/apache/carbondata/pull/2204
Problem:
Currently, DataMap interface implementations write both blockid and blockletid to index files. The blockid does not need to be stored, because only the blockletid is required, so storing it increases the memory and disk footprint of the index files.

Solution:
Use the task name as the index name. Filter the blocklets directly, without going through blockids, and pass the task name as the index name so that the blockid can be identified from the blockletdatamap.

Corrected the implementations of LuceneDatamap, CGDataMap, FGDataMap and MinMaxDataMap
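
As a rough illustration of the idea in the Solution above, the sketch below keys index entries by a shard (task) name and stores only blocklet IDs, with no block ID. The class `ShardIndexSketch`, its methods, and the shard names are hypothetical and chosen for illustration; this is not the code from this PR or a CarbonData API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: an index keyed by shard (task) name that stores only blocklet IDs. */
class ShardIndexSketch {
  // shard name -> (indexed term -> IDs of the blocklets that contain the term)
  private final Map<String, Map<String, List<Integer>>> shards = new HashMap<>();

  /** Records that the given term occurs in the given blocklet of the given shard. */
  void add(String shardName, String term, int blockletId) {
    shards.computeIfAbsent(shardName, k -> new HashMap<>())
        .computeIfAbsent(term, k -> new ArrayList<>())
        .add(blockletId);
  }

  /** Returns the blocklet IDs of one shard that may contain the term; no block ID is involved. */
  List<Integer> prune(String shardName, String term) {
    Map<String, List<Integer>> terms = shards.get(shardName);
    if (terms == null) {
      return Collections.emptyList();
    }
    List<Integer> hits = terms.get(term);
    if (hits == null) {
      return Collections.emptyList();
    }
    return hits;
  }
}
```

Because each entry carries only a blocklet ID, the block a blocklet belongs to would have to be resolved elsewhere from the shard name, which is the role the PR description assigns to the blocklet datamap.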

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

- [ ] Any interfaces changed?

- [ ] Any backward compatibility impacted?

- [ ] Document update required?

- [ ] Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests are required?
- How it is tested? Please attach test report.
- Is it a performance related change? Please attach the performance test report.
- Any additional information to help reviewers in testing this change.

- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ravipesala/incubator-carbondata improved-lucene

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/carbondata/pull/2206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2206

----
commit 219c217950737f42fd6e98f48f1a6968c0721774
Author: ravipesala <ravi.pesala@...>
Date: 2018-04-21T16:29:50Z

Added CG prune before FG prune.

commit 22c154791949676dc55a1ef1d704e05ec88552d8
Author: ravipesala <ravi.pesala@...>
Date: 2018-04-22T04:49:53Z

Improved Lucene datamap by compacting index size by eliminating blockId

----

---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2206

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4115/

---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2206

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5295/

---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/2206

@ravipesala Do you mean the blockletId will keep increasing within one task during one data load, even if the blocklets are in different blocks?

---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/2206

@ravipesala The Bloom datamap has been merged, so you need to correct it as well.

---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2206#discussion_r183230325

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/Segment.java ---
@@ -39,6 +41,11 @@

private String segmentFileName;

+ /**
+ * List of tasks which are already got filtered through CG index operation.
+ */
+ private Set<String> filteredTaskNames = new HashSet<>();
--- End diff --

Instead of `taskName`, can we use a more formal name? I can suggest two: `indexShardName` or `segmentIndexName`.

---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2206#discussion_r183233543

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/Segment.java ---
@@ -39,6 +41,11 @@

private String segmentFileName;

+ /**
+ * List of tasks which are already got filtered through CG index operation.
+ */
+ private Set<String> filteredTaskNames = new HashSet<>();
--- End diff --

ok

---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2206

@xuchuanyin Yes, blockletids keep increasing within one task. Blocklets are now numbered with respect to one task (index file).
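
A minimal sketch of that numbering, assuming a per-task counter that is not reset when a new block starts. The class `TaskBlockletCounter`, its method names, and the block names in the usage are hypothetical; the actual writer code in the PR may assign IDs differently.

```java
/** Hypothetical sketch: blocklet IDs that keep increasing across blocks within one task. */
class TaskBlockletCounter {
  private int nextBlockletId = 0;

  /** Called when a new block starts; the counter is deliberately left untouched. */
  void onBlockStart(String blockId) {
    // No reset here, so IDs stay unique within the task's single index file.
  }

  /** Returns the ID for the next blocklet written by this task. */
  int onBlockletStart() {
    return nextBlockletId++;
  }
}
```

With two blocklets in a first block and one in a second, the IDs come out as 0, 1 and 2 rather than 0, 1 and 0, so a blocklet ID alone identifies a blocklet within the task.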

---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2206

@xuchuanyin I have fixed the Bloom Filter as well.

---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2206

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4124/

---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2206

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5304/

---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2206#discussion_r183239230

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMapWriter.java ---
@@ -54,7 +54,7 @@ public DataMapWriter(AbsoluteTableIdentifier identifier, Segment segment,
*
* @param blockId file name of the carbondata file
*/
- public abstract void onBlockStart(String blockId, long taskId) throws IOException;
+ public abstract void onBlockStart(String blockId, String taskName) throws IOException;
--- End diff --

Every place in this PR that uses `taskName` should use `indexShardName` instead, right?

---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2206#discussion_r183269416

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMapWriter.java ---
@@ -54,7 +54,7 @@ public DataMapWriter(AbsoluteTableIdentifier identifier, Segment segment,
*
* @param blockId file name of the carbondata file
*/
- public abstract void onBlockStart(String blockId, long taskId) throws IOException;
+ public abstract void onBlockStart(String blockId, String taskName) throws IOException;
--- End diff --

ok

---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

Github user QiangCai commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2206#discussion_r183271230

--- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapWriter.java ---
@@ -102,24 +96,25 @@
this.indexedCarbonColumns = indexedCarbonColumns;
}

- private String getIndexPath(long taskId) {
+ private String getIndexPath(String taskName) {
if (isFineGrain) {
- return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName, taskId);
+ return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName,
+ taskName);
} else {
// TODO: where write data in coarse grain data map
- return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName, taskId);
+ return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName,
+ taskName);
}
}

/**
* Start of new block notification.
*/
- public void onBlockStart(String blockId, long taskId) throws IOException {
+ public void onBlockStart(String blockId, String indexShardName) throws IOException {
--- End diff --

In this method, we should initialize indexWriter only once, so that all blocks of this task share the same index writer.

---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

Github user QiangCai commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2206#discussion_r183271696

--- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapWriter.java ---
@@ -102,24 +96,25 @@
this.indexedCarbonColumns = indexedCarbonColumns;
}

- private String getIndexPath(long taskId) {
+ private String getIndexPath(String taskName) {
if (isFineGrain) {
- return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName, taskId);
+ return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName,
+ taskName);
} else {
// TODO: where write data in coarse grain data map
- return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName, taskId);
+ return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName,
+ taskName);
}
}

/**
* Start of new block notification.
*/
- public void onBlockStart(String blockId, long taskId) throws IOException {
+ public void onBlockStart(String blockId, String indexShardName) throws IOException {
--- End diff --

If indexWriter != null, there is no need to set it again.
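
The two review comments on `onBlockStart` ask for the same thing: create the writer once and reuse it for every block of the task. A minimal sketch of that guard follows, using a hypothetical `StubIndexWriter` in place of the real Lucene writer; the class names, fields, and block names are assumptions for illustration and not the PR's code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a real index writer; this is not the Lucene IndexWriter API. */
class StubIndexWriter {
  final String shardName;
  final List<String> blocks = new ArrayList<>();

  StubIndexWriter(String shardName) {
    this.shardName = shardName;
  }
}

/** Hypothetical sketch: all blocks of one task share a single, lazily created writer. */
class SharedWriterSketch {
  private StubIndexWriter indexWriter;
  private int writersCreated = 0;

  /** Start-of-block notification; the writer is created only on the first call. */
  void onBlockStart(String blockId, String indexShardName) {
    if (indexWriter == null) {
      indexWriter = new StubIndexWriter(indexShardName);
      writersCreated++;
    }
    // Later blocks of the same task reuse the writer created above.
    indexWriter.blocks.add(blockId);
  }

  int getWritersCreated() {
    return writersCreated;
  }

  StubIndexWriter getIndexWriter() {
    return indexWriter;
  }
}
```

Calling `onBlockStart` for two blocks of the same shard leaves `getWritersCreated()` at 1, and both block IDs end up in the one shared writer.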

---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2206

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4135/

---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2206

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5315/

---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

Github user jackylk commented on the issue:

https://github.com/apache/carbondata/pull/2206

LGTM. CI has a random failure.

---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

Github user asfgit closed the pull request at:

https://github.com/apache/carbondata/pull/2206

---