[GitHub] carbondata pull request #2206: [WIP] Improve Lucene datamap performance by e...

qiuchenjian-2
GitHub user ravipesala opened a pull request:

    https://github.com/apache/carbondata/pull/2206

    [WIP] Improve Lucene datamap performance by eliminating blockid while writing and reading index.

    This PR depends on PR https://github.com/apache/carbondata/pull/2204
    Problem:
    Currently, DataMap interface implementations use both blockid and blockletid while writing index files. The blockid does not need to be stored in the index files, because only the blockletid is required. Storing it increases the memory and disk size of the index files.
   
    Solution:
    Use the task name as the index name. Filter the blocklets directly, without going through blockids, and pass the taskName as the index name so that the blockid can be identified from the blocklet datamap.
   
    Corrected the implementations of LuceneDatamap, CGDataMap, FGDataMap and MinMaxDataMap.
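
The idea in the Solution paragraph can be sketched as follows. This is only an illustration under assumptions, not code from the PR: the class `ShardIndexSketch` and its methods `add` and `prune` are hypothetical, and a plain in-memory map stands in for the real index files. Each index is identified by its shard (task) name, and each entry stores only a blocklet id, with no blockid.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: one index per shard (task); entries hold only blocklet ids.
final class ShardIndexSketch {
  // shard name -> (indexed term -> ids of blocklets that contain the term)
  private final Map<String, Map<String, Set<Integer>>> shards = new HashMap<>();

  // Record that a term occurs in a blocklet of the given shard. No blockid is stored.
  void add(String shardName, String term, int blockletId) {
    shards.computeIfAbsent(shardName, k -> new HashMap<>())
        .computeIfAbsent(term, k -> new TreeSet<>())
        .add(blockletId);
  }

  // Return the sorted blocklet ids of one shard that may contain the term.
  List<Integer> prune(String shardName, String term) {
    Map<String, Set<Integer>> index = shards.get(shardName);
    if (index == null || !index.containsKey(term)) {
      return new ArrayList<>();
    }
    return new ArrayList<>(index.get(term));
  }

  public static void main(String[] args) {
    ShardIndexSketch sketch = new ShardIndexSketch();
    sketch.add("shard_0", "apple", 0);
    sketch.add("shard_0", "apple", 2);
    sketch.add("shard_1", "apple", 1);
    System.out.println(sketch.prune("shard_0", "apple"));
  }
}
```

Because the shard name already identifies the index, the per-entry payload shrinks to a single integer, which is where the memory and disk saving described above would come from.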
   
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
   
     - [ ] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/incubator-carbondata improved-lucene

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2206
   
----
commit 219c217950737f42fd6e98f48f1a6968c0721774
Author: ravipesala <ravi.pesala@...>
Date:   2018-04-21T16:29:50Z

    Added CG prune before FG prune.

commit 22c154791949676dc55a1ef1d704e05ec88552d8
Author: ravipesala <ravi.pesala@...>
Date:   2018-04-22T04:49:53Z

    Improved Lucene datamap by compacting index size by eliminating blockId

----


---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2206
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4115/



---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2206
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5295/



---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2206
 
    @ravipesala Do you mean the blockletId keeps increasing within one task during one data load, even if the blocklets are in different blocks?


---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2206
 
    @ravipesala The Bloom datamap has been merged, so you need to correct it as well.


---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2206#discussion_r183230325
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/Segment.java ---
    @@ -39,6 +41,11 @@
     
       private String segmentFileName;
     
    +  /**
    +   * List of tasks which are already got filtered through CG index operation.
    +   */
    +  private Set<String> filteredTaskNames = new HashSet<>();
    --- End diff --
   
    Instead of `taskName`, can we use a more formal name? I can suggest two: `indexShardName` and `segmentIndexName`.


---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2206#discussion_r183233543
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/Segment.java ---
    @@ -39,6 +41,11 @@
     
       private String segmentFileName;
     
    +  /**
    +   * List of tasks which are already got filtered through CG index operation.
    +   */
    +  private Set<String> filteredTaskNames = new HashSet<>();
    --- End diff --
   
    ok


---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2206
 
    @xuchuanyin Yes, blockletids keep increasing within one task. Blocklets are now numbered with respect to one task (index file).
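
The numbering scheme described in this reply can be sketched as below. This is a hedged illustration, not CarbonData code: `TaskBlockletCounter`, `onBlockletStart` and `blockOf` are hypothetical names, and the thread does not show how the real blocklet datamap maps a blocklet back to its block. The sketch assumes it is enough to remember the first blocklet id of each block.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: blocklet ids are assigned per task and are not reset per block.
final class TaskBlockletCounter {
  private int nextBlockletId = 0;
  // first blocklet id of a block -> block id, so a block can be recovered later
  private final TreeMap<Integer, String> firstBlockletOfBlock = new TreeMap<>();

  // A new block (carbondata file) starts; the counter deliberately keeps its value.
  void onBlockStart(String blockId) {
    firstBlockletOfBlock.put(nextBlockletId, blockId);
  }

  // A new blocklet starts; returns the task-wide id assigned to it.
  int onBlockletStart() {
    return nextBlockletId++;
  }

  // Recover the block that contains a given blocklet id (assumed lookup strategy).
  String blockOf(int blockletId) {
    if (blockletId < 0 || blockletId >= nextBlockletId) {
      throw new IllegalArgumentException("unknown blocklet id: " + blockletId);
    }
    Map.Entry<Integer, String> entry = firstBlockletOfBlock.floorEntry(blockletId);
    if (entry == null) {
      throw new IllegalStateException("no block was started before this blocklet");
    }
    return entry.getValue();
  }

  public static void main(String[] args) {
    TaskBlockletCounter counter = new TaskBlockletCounter();
    counter.onBlockStart("block-A");
    counter.onBlockletStart();
    counter.onBlockletStart();
    counter.onBlockStart("block-B");
    int id = counter.onBlockletStart();
    System.out.println(id + " belongs to " + counter.blockOf(id));
  }
}
```

With ids that are unique within the task, an index entry needs no blockid: the blocklet id alone is unambiguous inside one index file.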


---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2206
 
    @xuchuanyin I have fixed the Bloom Filter as well.


---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2206
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4124/



---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2206
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5304/



---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2206#discussion_r183239230
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMapWriter.java ---
    @@ -54,7 +54,7 @@ public DataMapWriter(AbsoluteTableIdentifier identifier, Segment segment,
        *
        * @param blockId file name of the carbondata file
        */
    -  public abstract void onBlockStart(String blockId, long taskId) throws IOException;
    +  public abstract void onBlockStart(String blockId, String taskName) throws IOException;
    --- End diff --
   
    Every place in this PR that uses `taskName` should use `indexShardName` instead, right?


---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2206#discussion_r183269416
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMapWriter.java ---
    @@ -54,7 +54,7 @@ public DataMapWriter(AbsoluteTableIdentifier identifier, Segment segment,
        *
        * @param blockId file name of the carbondata file
        */
    -  public abstract void onBlockStart(String blockId, long taskId) throws IOException;
    +  public abstract void onBlockStart(String blockId, String taskName) throws IOException;
    --- End diff --
   
    ok


---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2206#discussion_r183271230
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapWriter.java ---
    @@ -102,24 +96,25 @@
         this.indexedCarbonColumns = indexedCarbonColumns;
       }
     
    -  private String getIndexPath(long taskId) {
    +  private String getIndexPath(String taskName) {
         if (isFineGrain) {
    -      return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName, taskId);
    +      return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName,
    +          taskName);
         } else {
           // TODO: where write data in coarse grain data map
    -      return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName, taskId);
    +      return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName,
    +          taskName);
         }
       }
     
       /**
        * Start of new block notification.
        */
    -  public void onBlockStart(String blockId, long taskId) throws IOException {
    +  public void onBlockStart(String blockId, String indexShardName) throws IOException {
    --- End diff --
   
    In this method, we should initialize indexWriter only once.
    This means all blocks of this task will share the same index writer.
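
A minimal sketch of the guard suggested here, under stated assumptions: `SharedWriterSketch` is a hypothetical class, and a `StringBuilder` stands in for the Lucene index writer so the example stays self-contained. Only the `onBlockStart(String blockId, String indexShardName)` shape is taken from the diff above.

```java
// Hypothetical sketch: create the writer on the first block, reuse it afterwards.
final class SharedWriterSketch {
  // Stand-in for the real index writer; null until the first block starts.
  private StringBuilder indexWriter;
  private int writersCreated = 0;

  // Start of new block notification. The writer is created at most once per task.
  void onBlockStart(String blockId, String indexShardName) {
    if (indexWriter != null) {
      return; // already initialized by an earlier block of this task
    }
    indexWriter = new StringBuilder("shard=" + indexShardName + "\n");
    writersCreated++;
  }

  // Append one indexed row; all blocks of the task write through the same writer.
  void addRow(int blockletId, String value) {
    indexWriter.append(blockletId).append(':').append(value).append('\n');
  }

  int writersCreated() {
    return writersCreated;
  }

  String contents() {
    return indexWriter == null ? "" : indexWriter.toString();
  }

  public static void main(String[] args) {
    SharedWriterSketch writer = new SharedWriterSketch();
    writer.onBlockStart("block-A", "shard_0");
    writer.addRow(0, "a");
    writer.onBlockStart("block-B", "shard_0");
    writer.addRow(1, "b");
    System.out.print(writer.contents());
  }
}
```

The null check is what lets every block of the task append to one index instead of reopening or replacing the writer at each block boundary.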


---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2206#discussion_r183271696
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapWriter.java ---
    @@ -102,24 +96,25 @@
         this.indexedCarbonColumns = indexedCarbonColumns;
       }
     
    -  private String getIndexPath(long taskId) {
    +  private String getIndexPath(String taskName) {
         if (isFineGrain) {
    -      return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName, taskId);
    +      return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName,
    +          taskName);
         } else {
           // TODO: where write data in coarse grain data map
    -      return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName, taskId);
    +      return genDataMapStorePathOnTaskId(identifier.getTablePath(), segmentId, dataMapName,
    +          taskName);
         }
       }
     
       /**
        * Start of new block notification.
        */
    -  public void onBlockStart(String blockId, long taskId) throws IOException {
    +  public void onBlockStart(String blockId, String indexShardName) throws IOException {
    --- End diff --
   
    If `indexWriter != null`, it does not need to be set again.


---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2206
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4135/



---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2206
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5315/



---

[GitHub] carbondata issue #2206: [CARBONDATA-2376] Improve Lucene datamap performance...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user jackylk commented on the issue:

    https://github.com/apache/carbondata/pull/2206
 
    LGTM. CI has a random failure.


---

[GitHub] carbondata pull request #2206: [CARBONDATA-2376] Improve Lucene datamap perf...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user asfgit closed the pull request at:

    https://github.com/apache/carbondata/pull/2206


---