[GitHub] [carbondata] kevinjmh commented on a change in pull request #3444: [CARBONDATA-3581] Support page level bloom filter

GitBox
kevinjmh commented on a change in pull request #3444: [CARBONDATA-3581] Support page level bloom filter
URL: https://github.com/apache/carbondata/pull/3444#discussion_r349481780
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/IncludeFilterExecuterImpl.java
 ##########
 @@ -217,6 +220,14 @@ public BitSet prunePages(RawBlockletColumnChunks rawBlockletColumnChunks)
           bitSet.set(i);
         }
       }
 
 Review comment:
    I thought about this again. Unlike min/max, a page bloom costs more (storage, decode) at query time, so the row-level filter may not benefit from bloom.
   
   If we use **one filter column** and project multiple columns, then once bloom says a page does not need to be scanned, nothing needs to be done for any column of that page in the direct-fill case. If bloom can skip more pages, the IO benefit for the skipped pages is `# of pages * projected columns`.
   
   As for the row-level filter, for the same query the benefit shrinks to `# of pages * 1 (the filter column)`.
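
To make the comparison concrete, here is a minimal sketch of the two estimates. `PageBloomBenefit` and its method names are hypothetical and not part of CarbonData; it only counts avoided column-page reads and ignores the extra cost of reading and decoding the bloom chunk itself.

```java
// Hypothetical helper for illustration only; not a CarbonData API.
final class PageBloomBenefit {
  private PageBloomBenefit() {}

  /** Page-level pruning: a skipped page is skipped for every projected column. */
  static long pageLevelSavedReads(long skippedPages, int projectedColumns) {
    return skippedPages * projectedColumns;
  }

  /** Row-level filtering: only the filter column's page is skipped. */
  static long rowLevelSavedReads(long skippedPages) {
    return skippedPages * 1L;
  }
}
```

For example, with 100 skipped pages and 8 projected columns, page-level pruning avoids 800 column-page reads, while the row-level filter avoids only 100.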
   
   For row level filter:
   ```
   1. original:
   
   decode page -> for-loop checking each value of this column
   
   2. with page bloom:
   
   read bloom chunk -> decode bloom bitmap -> check bloom
   
   if check result is false -> skip
   if check result is true -> decode page -> for-loop checking each value of this column
   ```
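
The second flow above could be sketched roughly as follows. Everything here is an assumption made for illustration: `RowFilterWithPageBloom`, the toy `PageBloom`, and the `Supplier` that stands in for page decoding are hypothetical names, not the classes used in this PR.

```java
import java.util.BitSet;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names are CarbonData APIs.
final class RowFilterWithPageBloom {
  private RowFilterWithPageBloom() {}

  /** Toy bloom filter over int values, using two hash functions. */
  static final class PageBloom {
    private final BitSet bits;
    private final int size;

    PageBloom(int size) {
      this.size = size;
      this.bits = new BitSet(size);
    }

    void add(int value) {
      bits.set(hash1(value));
      bits.set(hash2(value));
    }

    /** False means the value is definitely absent; true means it may be present. */
    boolean mightContain(int value) {
      return bits.get(hash1(value)) && bits.get(hash2(value));
    }

    private int hash1(int v) {
      return Math.floorMod(v * 0x9E3779B1, size);
    }

    private int hash2(int v) {
      return Math.floorMod(Integer.rotateLeft(v, 15) * 0x85EBCA6B, size);
    }
  }

  /**
   * Row-level filter for one page. The decoder is only invoked when the
   * bloom check passes, which models "skip" versus "decode page and loop".
   */
  static BitSet filterPage(PageBloom bloom, Supplier<int[]> pageDecoder, int target) {
    BitSet matches = new BitSet();
    if (!bloom.mightContain(target)) {
      return matches; // bloom says no: skip decoding this page
    }
    int[] page = pageDecoder.get(); // decode page
    for (int row = 0; row < page.length; row++) { // check each value
      if (page[row] == target) {
        matches.set(row);
      }
    }
    return matches;
  }
}
```

Note that in this sketch the bloom check is pure overhead whenever it returns true, since the page is decoded and scanned anyway; the only saving is the decode of the filter column on pages where it returns false.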
   
