[GitHub] carbondata pull request #2327: [WIP] Bloom remove guava cache and use Carbon...

classic Classic list List threaded Threaded
28 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2327: [WIP] Bloom remove guava cache and use Carbon...

qiuchenjian-2
GitHub user ravipesala opened a pull request:

    https://github.com/apache/carbondata/pull/2327

    [WIP] Bloom remove guava cache and use CarbonCache

    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
   
     - [ ] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/incubator-carbondata bloom-remove-guava-cache

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2327.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2327
   
----
commit 28cd56e71a11386582ef5ccd877babbbca703336
Author: ravipesala <ravi.pesala@...>
Date:   2018-05-20T16:22:57Z

    Changed to hadoop bloom implementation and added compress option to compress bloom on disk.

commit 4acfbfae68737046401ffef472a91c9aecb6e26d
Author: ravipesala <ravi.pesala@...>
Date:   2018-05-21T04:38:38Z

    Added CarbonData cache instead of guava cache.

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2327: [WIP] Bloom remove guava cache and use CarbonCache

qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2327
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6008/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2327: [WIP] Bloom remove guava cache and use CarbonCache

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2327
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4849/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2327: [WIP] Bloom remove guava cache and use CarbonCache

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2327
 
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5026/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2327: [WIP] Bloom remove guava cache and use Carbon...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2327#discussion_r189769888
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/cache/CacheProvider.java ---
    @@ -101,6 +102,31 @@ public static CacheProvider getInstance() {
         return cacheTypeToCacheMap.get(cacheType);
       }
     
    +  /**
    +   * This method will check if a cache already exists for given cache type and store
    +   * if it is not present in the map
    +   */
    +  public <K, V> Cache<K, V> createCache(CacheType cacheType, String cacheClassName)
    --- End diff --
   
    Can we specify the size reserved for each type of cache?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2327: [WIP] Bloom remove guava cache and use Carbon...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2327#discussion_r189770357
 
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java ---
    @@ -94,20 +83,21 @@ public void setIndexedColumn(Set<String> indexedColumn) {
         List<BloomQueryModel> bloomQueryModels = getQueryValue(filterExp.getFilterExpression());
         for (BloomQueryModel bloomQueryModel : bloomQueryModels) {
           LOGGER.debug("prune blocklet for query: " + bloomQueryModel);
    -      BloomDataMapCache.CacheKey cacheKey = new BloomDataMapCache.CacheKey(
    +      BloomCacheKeyValue.CacheKey cacheKey = new BloomCacheKeyValue.CacheKey(
               this.indexPath.toString(), bloomQueryModel.columnName);
    -      List<BloomDMModel> bloomDMModels = this.bloomDataMapCache.getBloomDMModelByKey(cacheKey);
    -      for (BloomDMModel bloomDMModel : bloomDMModels) {
    -        boolean scanRequired = bloomDMModel.getBloomFilter().membershipTest(new Key(
    +      BloomCacheKeyValue.CacheValue cacheValue = cache.get(cacheKey);
    +      List<CarbonBloomFilter> bloomIndexList = cacheValue.getBloomFilters();
    +      for (CarbonBloomFilter bloomFilter : bloomIndexList) {
    +        boolean scanRequired = bloomFilter.membershipTest(new Key(
                 convertValueToBytes(bloomQueryModel.dataType, bloomQueryModel.filterValue)));
             if (scanRequired) {
               LOGGER.debug(String.format("BloomCoarseGrainDataMap: Need to scan -> blocklet#%s",
    -              String.valueOf(bloomDMModel.getBlockletNo())));
    -          Blocklet blocklet = new Blocklet(shardName, String.valueOf(bloomDMModel.getBlockletNo()));
    +              String.valueOf(bloomFilter.getBlockletNo())));
    --- End diff --
   
    I think `BloomDMModel` is better than `CarbonBloomFilter` in code understanding.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2327: [WIP] Bloom remove guava cache and use Carbon...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2327#discussion_r189770038
 
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCacheKeyValue.java ---
    @@ -0,0 +1,105 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.carbondata.datamap.bloom;
    +
    +import java.io.Serializable;
    +import java.util.List;
    +import java.util.Objects;
    +
    +import org.apache.carbondata.core.cache.Cacheable;
    +
    +import org.apache.hadoop.util.bloom.CarbonBloomFilter;
    +
    +/**
    + * Key and values of bloom to keep in cache.
    + */
    +public class BloomCacheKeyValue {
    +
    +  public static class CacheKey implements Serializable {
    +
    +    private static final long serialVersionUID = -1478238084352505372L;
    +    private String shardPath;
    +    private String indexColumn;
    +
    +    CacheKey(String shardPath, String indexColumn) {
    +      this.shardPath = shardPath;
    +      this.indexColumn = indexColumn;
    +    }
    +
    +    public String getShardPath() {
    +      return shardPath;
    +    }
    +
    +    public String getIndexColumn() {
    +      return indexColumn;
    +    }
    +
    +    @Override
    +    public String toString() {
    +      final StringBuilder sb = new StringBuilder("CacheKey{");
    +      sb.append("shardPath='").append(shardPath).append('\'');
    +      sb.append(", indexColumn='").append(indexColumn).append('\'');
    +      sb.append('}');
    +      return sb.toString();
    +    }
    +
    +    @Override
    +    public boolean equals(Object o) {
    +      if (this == o) return true;
    +      if (!(o instanceof CacheKey)) return false;
    +      CacheKey cacheKey = (CacheKey) o;
    +      return Objects.equals(shardPath, cacheKey.shardPath)
    +          && Objects.equals(indexColumn, cacheKey.indexColumn);
    +    }
    +
    +    @Override
    +    public int hashCode() {
    +      return Objects.hash(shardPath, indexColumn);
    +    }
    +  }
    +
    +  public static class CacheValue implements Cacheable {
    +
    +    private List<CarbonBloomFilter> bloomFilters;
    +
    +    private int size;
    +
    +    public CacheValue(List<CarbonBloomFilter> bloomFilters) {
    +      this.bloomFilters = bloomFilters;
    +      for (CarbonBloomFilter bloomFilter : bloomFilters) {
    +        size += bloomFilter.getSize();
    +      }
    +    }
    +
    +    @Override public long getFileTimeStamp() {
    --- End diff --
   
    Move `Override` to the previous line, the same for the following methods


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2327: [WIP] Bloom remove guava cache and use Carbon...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2327#discussion_r189881797
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/cache/CacheProvider.java ---
    @@ -101,6 +102,31 @@ public static CacheProvider getInstance() {
         return cacheTypeToCacheMap.get(cacheType);
       }
     
    +  /**
    +   * This method will check if a cache already exists for given cache type and store
    +   * if it is not present in the map
    +   */
    +  public <K, V> Cache<K, V> createCache(CacheType cacheType, String cacheClassName)
    --- End diff --
   
    No, we cannot specify for one type, we can only configure the whole lru cache size either on driver or executor.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2327: [WIP] Bloom remove guava cache and use Carbon...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2327#discussion_r189882515
 
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java ---
    @@ -94,20 +83,21 @@ public void setIndexedColumn(Set<String> indexedColumn) {
         List<BloomQueryModel> bloomQueryModels = getQueryValue(filterExp.getFilterExpression());
         for (BloomQueryModel bloomQueryModel : bloomQueryModels) {
           LOGGER.debug("prune blocklet for query: " + bloomQueryModel);
    -      BloomDataMapCache.CacheKey cacheKey = new BloomDataMapCache.CacheKey(
    +      BloomCacheKeyValue.CacheKey cacheKey = new BloomCacheKeyValue.CacheKey(
               this.indexPath.toString(), bloomQueryModel.columnName);
    -      List<BloomDMModel> bloomDMModels = this.bloomDataMapCache.getBloomDMModelByKey(cacheKey);
    -      for (BloomDMModel bloomDMModel : bloomDMModels) {
    -        boolean scanRequired = bloomDMModel.getBloomFilter().membershipTest(new Key(
    +      BloomCacheKeyValue.CacheValue cacheValue = cache.get(cacheKey);
    +      List<CarbonBloomFilter> bloomIndexList = cacheValue.getBloomFilters();
    +      for (CarbonBloomFilter bloomFilter : bloomIndexList) {
    +        boolean scanRequired = bloomFilter.membershipTest(new Key(
                 convertValueToBytes(bloomQueryModel.dataType, bloomQueryModel.filterValue)));
             if (scanRequired) {
               LOGGER.debug(String.format("BloomCoarseGrainDataMap: Need to scan -> blocklet#%s",
    -              String.valueOf(bloomDMModel.getBlockletNo())));
    -          Blocklet blocklet = new Blocklet(shardName, String.valueOf(bloomDMModel.getBlockletNo()));
    +              String.valueOf(bloomFilter.getBlockletNo())));
    --- End diff --
   
    I feel extra object is not required to just store blockletid, so ideally one bloomfilter belongs to one blocklet thats why moved inside carbonbloomfilter


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2327: [WIP] Bloom remove guava cache and use Carbon...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2327#discussion_r189882537
 
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCacheKeyValue.java ---
    @@ -0,0 +1,105 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.carbondata.datamap.bloom;
    +
    +import java.io.Serializable;
    +import java.util.List;
    +import java.util.Objects;
    +
    +import org.apache.carbondata.core.cache.Cacheable;
    +
    +import org.apache.hadoop.util.bloom.CarbonBloomFilter;
    +
    +/**
    + * Key and values of bloom to keep in cache.
    + */
    +public class BloomCacheKeyValue {
    +
    +  public static class CacheKey implements Serializable {
    +
    +    private static final long serialVersionUID = -1478238084352505372L;
    +    private String shardPath;
    +    private String indexColumn;
    +
    +    CacheKey(String shardPath, String indexColumn) {
    +      this.shardPath = shardPath;
    +      this.indexColumn = indexColumn;
    +    }
    +
    +    public String getShardPath() {
    +      return shardPath;
    +    }
    +
    +    public String getIndexColumn() {
    +      return indexColumn;
    +    }
    +
    +    @Override
    +    public String toString() {
    +      final StringBuilder sb = new StringBuilder("CacheKey{");
    +      sb.append("shardPath='").append(shardPath).append('\'');
    +      sb.append(", indexColumn='").append(indexColumn).append('\'');
    +      sb.append('}');
    +      return sb.toString();
    +    }
    +
    +    @Override
    +    public boolean equals(Object o) {
    +      if (this == o) return true;
    +      if (!(o instanceof CacheKey)) return false;
    +      CacheKey cacheKey = (CacheKey) o;
    +      return Objects.equals(shardPath, cacheKey.shardPath)
    +          && Objects.equals(indexColumn, cacheKey.indexColumn);
    +    }
    +
    +    @Override
    +    public int hashCode() {
    +      return Objects.hash(shardPath, indexColumn);
    +    }
    +  }
    +
    +  public static class CacheValue implements Cacheable {
    +
    +    private List<CarbonBloomFilter> bloomFilters;
    +
    +    private int size;
    +
    +    public CacheValue(List<CarbonBloomFilter> bloomFilters) {
    +      this.bloomFilters = bloomFilters;
    +      for (CarbonBloomFilter bloomFilter : bloomFilters) {
    +        size += bloomFilter.getSize();
    +      }
    +    }
    +
    +    @Override public long getFileTimeStamp() {
    --- End diff --
   
    ok


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2327: [WIP] Bloom remove guava cache and use CarbonCache

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2327
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6049/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2327: [WIP] Bloom remove guava cache and use CarbonCache

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2327
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4890/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2327: [WIP] Bloom remove guava cache and use CarbonCache

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2327
 
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5054/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2327: [WIP] Bloom remove guava cache and use Carbon...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2327#discussion_r190447991
 
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java ---
    @@ -94,20 +83,21 @@ public void setIndexedColumn(Set<String> indexedColumn) {
         List<BloomQueryModel> bloomQueryModels = getQueryValue(filterExp.getFilterExpression());
         for (BloomQueryModel bloomQueryModel : bloomQueryModels) {
           LOGGER.debug("prune blocklet for query: " + bloomQueryModel);
    -      BloomDataMapCache.CacheKey cacheKey = new BloomDataMapCache.CacheKey(
    +      BloomCacheKeyValue.CacheKey cacheKey = new BloomCacheKeyValue.CacheKey(
               this.indexPath.toString(), bloomQueryModel.columnName);
    -      List<BloomDMModel> bloomDMModels = this.bloomDataMapCache.getBloomDMModelByKey(cacheKey);
    -      for (BloomDMModel bloomDMModel : bloomDMModels) {
    -        boolean scanRequired = bloomDMModel.getBloomFilter().membershipTest(new Key(
    +      BloomCacheKeyValue.CacheValue cacheValue = cache.get(cacheKey);
    +      List<CarbonBloomFilter> bloomIndexList = cacheValue.getBloomFilters();
    +      for (CarbonBloomFilter bloomFilter : bloomIndexList) {
    +        boolean scanRequired = bloomFilter.membershipTest(new Key(
                 convertValueToBytes(bloomQueryModel.dataType, bloomQueryModel.filterValue)));
             if (scanRequired) {
               LOGGER.debug(String.format("BloomCoarseGrainDataMap: Need to scan -> blocklet#%s",
    -              String.valueOf(bloomDMModel.getBlockletNo())));
    -          Blocklet blocklet = new Blocklet(shardName, String.valueOf(bloomDMModel.getBlockletNo()));
    +              String.valueOf(bloomFilter.getBlockletNo())));
    --- End diff --
   
    If so, I don't think it's good to call it `CarbonBloomFilter`, sounds like that it's an implementation of BloomFilter in Carbondata, but actually it contains business-related(blockletId for a datamapindex) information.
    What I mean is that we can change it to another name like `BloomDMModel`


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2327: [WIP] Bloom remove guava cache and use Carbon...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2327#discussion_r190448358
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/cache/CacheProvider.java ---
    @@ -101,6 +102,31 @@ public static CacheProvider getInstance() {
         return cacheTypeToCacheMap.get(cacheType);
       }
     
    +  /**
    +   * This method will check if a cache already exists for given cache type and store
    +   * if it is not present in the map
    +   */
    +  public <K, V> Cache<K, V> createCache(CacheType cacheType, String cacheClassName)
    --- End diff --
   
    Then it will be a disadvantage compared to guava-cache. Besides, I think it's quite useful to configure the size of cache in fine granularity.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2327: [WIP] Bloom remove guava cache and use Carbon...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2327#discussion_r191163607
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/cache/CacheProvider.java ---
    @@ -101,6 +102,31 @@ public static CacheProvider getInstance() {
         return cacheTypeToCacheMap.get(cacheType);
       }
     
    +  /**
    +   * This method will check if a cache already exists for given cache type and store
    +   * if it is not present in the map
    +   */
    +  public <K, V> Cache<K, V> createCache(CacheType cacheType, String cacheClassName)
    --- End diff --
   
    It is not disadvantaged, we should give only one cache size to configure, we cannot give different cache sizes for different features. It is difficult to maintain and difficult for the user to estimate the memory he can allocate for the cache. It is one of the main reason we want to bring all caches into single configurable size.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2327: [CARBONDATA-2549] Bloom remove guava cache and use C...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2327
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4964/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2327: [CARBONDATA-2549] Bloom remove guava cache and use C...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2327
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6126/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2327: [CARBONDATA-2549] Bloom remove guava cache and use C...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2327
 
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5115/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2327: [CARBONDATA-2549] Bloom remove guava cache and use C...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2327
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6336/



---
12