[GitHub] carbondata pull request #2851: [CARBONDATA-3040][BloomDataMap] Add checking ...

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2851: [CARBONDATA-3040][BloomDataMap] Add checking ...

qiuchenjian-2
GitHub user kevinjmh opened a pull request:

    https://github.com/apache/carbondata/pull/2851

    [CARBONDATA-3040][BloomDataMap] Add checking before merging bloom index

    *Scene*
    There is a bug which causes query failure when we create two bloom datamaps on same table with data.  
   
    *Analyse*
    Since we already have data, each create datamap will trigger rebuild datamap task and then trigger bloom index file merging. By debuging, we found the first datamap's bloom index files would be merged two times and the second time made bloom index file empty.
   
    *Solution*
    Send the datamap name in rebuild event for filter. And add file check when merging.
   
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
   
     - [ ] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kevinjmh/carbondata fix_multi_bloom

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2851.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2851
   
----
commit bcab5ac630e39a7dadee09d5b9157642d061b5e1
Author: Manhua <kevinjmh@...>
Date:   2018-10-24T08:20:13Z

    only rebuild target datamap and add file check

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2851: [CARBONDATA-3040][BloomDataMap] Add checking before ...

qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2851
 
    please add description about why adding datamap name will solve the problem in the analyze part of your PR description


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2851: [CARBONDATA-3040][BloomDataMap] Add checking before ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2851
 
    Another related question: What will happen if we create multiple datamaps concurrently for the same table?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2851: [CARBONDATA-3040][BloomDataMap] Add checking before ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2851
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/992/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2851: [CARBONDATA-3040][BloomDataMap] Add checking before ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2851
 
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1205/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2851: [CARBONDATA-3040][BloomDataMap] Add checking before ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2851
 
    Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9258/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2851: [CARBONDATA-3040][BloomDataMap] Add checking before ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kevinjmh commented on the issue:

    https://github.com/apache/carbondata/pull/2851
 
    description updated


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2851: [CARBONDATA-3040][BloomDataMap] Add checking ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2851#discussion_r228010573
 
    --- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/events/DataMapEvents.scala ---
    @@ -60,7 +60,8 @@ case class BuildDataMapPreExecutionEvent(sparkSession: SparkSession,
      * example: bloom datamap, Lucene datamap
      */
     case class BuildDataMapPostExecutionEvent(sparkSession: SparkSession,
    -    identifier: AbsoluteTableIdentifier, segmentIdList: Seq[String], isFromRebuild: Boolean)
    +    identifier: AbsoluteTableIdentifier, segmentIdList: Seq[String],
    +    isFromRebuild: Boolean, dmName: String)
    --- End diff --
   
    You can adjust the sequence of the parameters by moving `dmName` after `identifier`. This will make the method easy to understand: for some table's some datamap, for the corresponding segments doing rebuild or not


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2851: [CARBONDATA-3040][BloomDataMap] Add checking ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2851#discussion_r228010750
 
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomIndexFileStore.java ---
    @@ -70,27 +73,37 @@ public boolean accept(CarbonFile file) {
           }
         });
     
    +    // check whether need to merge
         String mergeShardPath = dmSegmentPathString + File.separator + MERGE_BLOOM_INDEX_SHARD_NAME;
         String mergeInprogressFile = dmSegmentPathString + File.separator + MERGE_INPROGRESS_FILE;
         try {
    -      // delete mergeShard folder if exists
    -      if (FileFactory.isFileExist(mergeShardPath)) {
    -        FileFactory.deleteFile(mergeShardPath, FileFactory.getFileType(mergeShardPath));
    +      if (shardPaths.length == 0 || FileFactory.isFileExist(mergeShardPath)) {
    +        LOGGER.info("No shard data to merge or already merged for path " + mergeShardPath);
    --- End diff --
   
    when will this line be reached?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2851: [CARBONDATA-3040][BloomDataMap] Add checking ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2851#discussion_r228009613
 
    --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/events/MergeBloomIndexEventListener.scala ---
    @@ -48,8 +48,14 @@ class MergeBloomIndexEventListener extends OperationEventListener with Logging {
               _.getDataMapSchema.getProviderName.equalsIgnoreCase(
                 DataMapClassProvider.BLOOMFILTER.getShortName))
     
    -        // for load process, filter lazy datamap
    -        if (!datamapPostEvent.isFromRebuild) {
    +        if (datamapPostEvent.isFromRebuild) {
    +          if (null != datamapPostEvent.dmName) {
    +            // for rebuild process, event will be called for each datamap
    --- End diff --
   
    'for each datamap' or 'only for specific datamap'?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2851: [CARBONDATA-3040][BloomDataMap] Fix bug for merging ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2851
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/999/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2851: [CARBONDATA-3040][BloomDataMap] Fix bug for merging ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2851
 
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1212/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2851: [CARBONDATA-3040][BloomDataMap] Fix bug for merging ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2851
 
    Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9265/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2851: [CARBONDATA-3040][BloomDataMap] Fix bug for merging ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2851
 
    LGTM


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2851: [CARBONDATA-3040][BloomDataMap] Fix bug for m...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user asfgit closed the pull request at:

    https://github.com/apache/carbondata/pull/2851


---