GitHub user kevinjmh opened a pull request:
    https://github.com/apache/carbondata/pull/2851

    [CARBONDATA-3040][BloomDataMap] Add checking before merging bloom index

    *Scene*
    There is a bug which causes query failure when we create two bloom datamaps on the same table that already contains data.

    *Analysis*
    Since the table already has data, each CREATE DATAMAP statement triggers a rebuild datamap task, which in turn triggers bloom index file merging. By debugging, we found that the first datamap's bloom index files were merged twice, and the second merge left the bloom index files empty.

    *Solution*
    Send the datamap name in the rebuild event so the listener can filter for the target datamap, and add a file check before merging.

    Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

    - [ ] Any interfaces changed?
    - [ ] Any backward compatibility impacted?
    - [ ] Document update required?
    - [ ] Testing done
          Please provide details on
          - Whether new unit test cases have been added or why no new tests are required?
          - How it is tested? Please attach test report.
          - Is it a performance related change? Please attach the performance test report.
          - Any additional information to help reviewers in testing this change.
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kevinjmh/carbondata fix_multi_bloom

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2851.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2851

----

commit bcab5ac630e39a7dadee09d5b9157642d061b5e1
Author: Manhua <kevinjmh@...>
Date:   2018-10-24T08:20:13Z

    only rebuild target datamap and add file check

----

---
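The failure mode described above can be illustrated with a minimal simulation (plain Java collections standing in for index files; `merge` is a hypothetical stand-in, not CarbonData's actual merge code): the old merge step deleted any previously merged output and rebuilt it from the remaining shards, so a second merge over an already-merged datamap yields an empty index.

```java
// Hypothetical simulation of the double-merge bug: shard contents are merged
// into a single output, and the shards are consumed by the merge. Running the
// merge a second time rebuilds the output from zero remaining shards.
import java.util.ArrayList;
import java.util.List;

public class DoubleMergeBug {
    public static List<String> merge(List<String> shards, List<String> mergedOut) {
        mergedOut.clear();        // old behaviour: delete mergeShard if it exists
        mergedOut.addAll(shards); // rebuild from whatever shards remain
        shards.clear();           // shard files are gone after a successful merge
        return mergedOut;
    }

    public static void main(String[] args) {
        List<String> shards = new ArrayList<>(List.of("bloom_0", "bloom_1"));
        List<String> merged = new ArrayList<>();
        merge(shards, merged);
        System.out.println(merged); // [bloom_0, bloom_1] -- first rebuild is fine
        merge(shards, merged);      // second rebuild event hits the same datamap
        System.out.println(merged); // [] -- empty bloom index, queries fail
    }
}
```

This is why the PR both filters rebuild events by datamap name and adds a file check before merging.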
Github user xuchuanyin commented on the issue:
    https://github.com/apache/carbondata/pull/2851

    Please add, in the analysis part of your PR description, an explanation of why adding the datamap name solves the problem.

---
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:
    https://github.com/apache/carbondata/pull/2851

    Another related question: what will happen if we create multiple datamaps concurrently on the same table?

---
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/2851

    Build Success with Spark 2.1.0. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/992/

---
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/2851

    Build Success with Spark 2.2.1. Please check CI: http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1205/

---
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/2851

    Build Success with Spark 2.3.1. Please check CI: http://136.243.101.176:8080/job/carbondataprbuilder2.3/9258/

---
Github user kevinjmh commented on the issue:
    https://github.com/apache/carbondata/pull/2851

    Description updated.

---
Github user xuchuanyin commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/2851#discussion_r228010573

    --- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/events/DataMapEvents.scala ---
    @@ -60,7 +60,8 @@ case class BuildDataMapPreExecutionEvent(sparkSession: SparkSession,
      * example: bloom datamap, Lucene datamap
      */
     case class BuildDataMapPostExecutionEvent(sparkSession: SparkSession,
    -    identifier: AbsoluteTableIdentifier, segmentIdList: Seq[String], isFromRebuild: Boolean)
    +    identifier: AbsoluteTableIdentifier, segmentIdList: Seq[String],
    +    isFromRebuild: Boolean, dmName: String)
    --- End diff --

    You can adjust the sequence of the parameters by moving `dmName` after `identifier`. This will make the signature easier to understand: for some table's datamap, rebuild (or not) the corresponding segments.

---
Github user xuchuanyin commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/2851#discussion_r228010750

    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomIndexFileStore.java ---
    @@ -70,27 +73,37 @@ public boolean accept(CarbonFile file) {
           }
         });
    +    // check whether need to merge
         String mergeShardPath = dmSegmentPathString + File.separator + MERGE_BLOOM_INDEX_SHARD_NAME;
         String mergeInprogressFile = dmSegmentPathString + File.separator + MERGE_INPROGRESS_FILE;
         try {
    -      // delete mergeShard folder if exists
    -      if (FileFactory.isFileExist(mergeShardPath)) {
    -        FileFactory.deleteFile(mergeShardPath, FileFactory.getFileType(mergeShardPath));
    +      if (shardPaths.length == 0 || FileFactory.isFileExist(mergeShardPath)) {
    +        LOGGER.info("No shard data to merge or already merged for path " + mergeShardPath);
    --- End diff --

    When will this line be reached?

---
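The guard in the diff above can be reduced to a small decision function. This is an illustrative sketch, not CarbonData's code; `shouldMerge` is a hypothetical name standing in for the `shardPaths.length == 0 || FileFactory.isFileExist(mergeShardPath)` check:

```java
// Sketch of the merge guard added by the PR: skip merging when there are no
// shard files left to merge, or when a merged shard already exists because a
// previous merge pass completed. This makes the merge idempotent across
// repeated rebuild events for the same segment.
public class MergeGuard {
    public static boolean shouldMerge(int shardCount, boolean mergeShardExists) {
        return shardCount > 0 && !mergeShardExists;
    }

    public static void main(String[] args) {
        System.out.println(shouldMerge(2, false)); // true: fresh shards, no merge yet
        System.out.println(shouldMerge(0, false)); // false: nothing to merge
        System.out.println(shouldMerge(2, true));  // false: already merged
    }
}
```

Under this sketch, the logged branch in the diff is reached exactly when `shouldMerge` would return false, e.g. when a second rebuild event arrives for a datamap whose shards were already merged.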
Github user xuchuanyin commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/2851#discussion_r228009613

    --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/events/MergeBloomIndexEventListener.scala ---
    @@ -48,8 +48,14 @@ class MergeBloomIndexEventListener extends OperationEventListener with Logging {
             _.getDataMapSchema.getProviderName.equalsIgnoreCase(
               DataMapClassProvider.BLOOMFILTER.getShortName))
    -    // for load process, filter lazy datamap
    -    if (!datamapPostEvent.isFromRebuild) {
    +    if (datamapPostEvent.isFromRebuild) {
    +      if (null != datamapPostEvent.dmName) {
    +        // for rebuild process, event will be called for each datamap
    --- End diff --

    'for each datamap' or 'only for the specific datamap'?

---
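The selection logic being discussed can be sketched as a pure function over datamap names (hypothetical helper, not the listener's actual code): a rebuild event carries the name of the single datamap being rebuilt, while an event without a name, such as a data load, applies to all bloom datamaps on the table.

```java
// Sketch of the event filtering the PR introduces: dmName == null models a
// load event (no specific target); a non-null name models a rebuild event
// that targets exactly one datamap.
import java.util.List;
import java.util.stream.Collectors;

public class BloomEventFilter {
    public static List<String> select(List<String> bloomDataMapNames, String dmName) {
        if (dmName != null) {
            // rebuild: merge only the datamap named in the event
            return bloomDataMapNames.stream()
                    .filter(name -> name.equalsIgnoreCase(dmName))
                    .collect(Collectors.toList());
        }
        // load: merge every bloom datamap on the table
        return bloomDataMapNames;
    }

    public static void main(String[] args) {
        List<String> all = List.of("dm1", "dm2");
        System.out.println(select(all, "dm1")); // [dm1]  only the rebuilt datamap merges
        System.out.println(select(all, null));  // [dm1, dm2]  a load merges all of them
    }
}
```

This is why passing `dmName` in the rebuild event prevents the first datamap's index from being merged again when the second datamap is created.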
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/2851

    Build Success with Spark 2.1.0. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/999/

---
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/2851

    Build Success with Spark 2.2.1. Please check CI: http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1212/

---
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/2851

    Build Success with Spark 2.3.1. Please check CI: http://136.243.101.176:8080/job/carbondataprbuilder2.3/9265/

---
Github user xuchuanyin commented on the issue:
    https://github.com/apache/carbondata/pull/2851

    LGTM

---