[GitHub] [carbondata] VenuReddy2103 opened a new pull request #3616: Polygon expression processing using unknown expression and filtering performance improvement

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] VenuReddy2103 opened a new pull request #3616: Polygon expression processing using unknown expression and filtering performance improvement

GitBox
VenuReddy2103 opened a new pull request #3616: Polygon expression processing using unknown expression and filtering performance improvement
URL: https://github.com/apache/carbondata/pull/3616
 
 
   
    ### Why is this PR needed?
   This PR improves the query processing performance of in_polygon UDF.
   
    ### What changes were proposed in this PR?
    At present, PolygonExpression processing leverages the existing InExpression. PolygonExpression internally creates a InExpression as a child to it. InExpression is constructed/build from the result of Quad tree algorithm. Algorithm returns the list of ranges(with each range having min and max Id for that range). And this list is a sorted one.
                 InExpression constitute of 2 childs. One child is a columnExpression(for geohash column) and the other is a ListExpression( with List of LiternalExpressions. One LiteralExpression for each Id returned from algo).
   **Problems associated with this approach.**
   - We expand the list of ranges(with each range having minand max) to all individual Ids. And create LiteralExpression for each Id. Since we can have large ranges(and the numerous ranges), it consumes huge amount of memory in processing.
   - Due to same reason, it slows does the filter execution.
   
   Modifications with this PR:
   Instead we can use UnknownExpression with RowLevelFilterResolverImpl and RowLevelFilterExecuterImpl processing. And override evaluate() method to do the binary  searchon the list of ranges directly. This will significanly inprove the polygon filter query performance.
       
    ### Does this PR introduce any user interface change?
    - Yes. Need to update the design document.
   
    ### Is any new testcase added?
    - Yes. Added an end to end test case
   
       
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3616: Polygon expression processing using unknown expression and filtering performance improvement

GitBox
CarbonDataQA1 commented on issue #3616: Polygon expression processing using unknown expression and filtering performance improvement
URL: https://github.com/apache/carbondata/pull/3616#issuecomment-585178192
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/263/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3616: Polygon expression processing using unknown expression and filtering performance improvement

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3616: Polygon expression processing using unknown expression and filtering performance improvement
URL: https://github.com/apache/carbondata/pull/3616#issuecomment-585199012
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1966/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3616: [CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3616: [CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement
URL: https://github.com/apache/carbondata/pull/3616#discussion_r378268495
 
 

 ##########
 File path: geo/src/main/java/org/apache/carbondata/geo/scan/expression/PolygonExpression.java
 ##########
 @@ -41,64 +41,85 @@
  * InExpression with list of all the IDs present in those list of ranges.
  */
 @InterfaceAudience.Internal
-public class PolygonExpression extends Expression {
+public class PolygonExpression extends UnknownExpression implements ConditionalExpression {
   private String polygon;
-  private String columnName;
   private CustomIndex<List<Long[]>> handler;
-  private List<Expression> children = new ArrayList<Expression>();
+  private List<Long[]> ranges = new ArrayList<Long[]>();
+  private ColumnExpression column;
+  private ExpressionResult trueExpRes;
+  private ExpressionResult falseExpRes;
 
   public PolygonExpression(String polygon, String columnName, CustomIndex handler) {
     this.polygon = polygon;
     this.handler = handler;
-    this.columnName = columnName;
+    this.column = new ColumnExpression(columnName, DataTypes.LONG);
+    this.trueExpRes = new ExpressionResult(DataTypes.BOOLEAN, true);
+    this.falseExpRes = new ExpressionResult(DataTypes.BOOLEAN, false);
   }
 
-  private void buildExpression(List<Long[]> ranges) {
-    // Build InExpression with list of all the values present in the ranges
-    List<Expression> inList = new ArrayList<Expression>();
+  private void validate(List<Long[]> ranges) {
+    // Validate the ranges
     for (Long[] range : ranges) {
       if (range.length != 2) {
         throw new RuntimeException("Handler query must return list of ranges with each range "
             + "containing minimum and maximum values");
       }
-      for (long i = range[0]; i <= range[1]; i++) {
-        inList.add(new LiteralExpression(i, DataTypes.LONG));
-      }
     }
-    children.add(new InExpression(new ColumnExpression(columnName, DataTypes.LONG),
-        new ListExpression(inList)));
   }
 
   /**
-   * This method builds InExpression with list of all the values present in the list of ranges of
-   * IDs.
+   * This method calls the query processor and gets the list of ranges of IDs.
    */
   private void processExpression() {
-    List<Long[]> ranges;
     try {
       ranges = handler.query(polygon);
+      validate(ranges);
     } catch (Exception e) {
       throw new RuntimeException(e);
     }
-    buildExpression(ranges);
+  }
+
+  private boolean rangeBinarySearch(List<Long[]> ranges, long searchForNumber) {
 
 Review comment:
   Please use collections.binarysearch()

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3616: [CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3616: [CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement
URL: https://github.com/apache/carbondata/pull/3616#discussion_r378275250
 
 

 ##########
 File path: geo/src/main/java/org/apache/carbondata/geo/scan/expression/PolygonExpression.java
 ##########
 @@ -41,64 +41,85 @@
  * InExpression with list of all the IDs present in those list of ranges.
  */
 @InterfaceAudience.Internal
-public class PolygonExpression extends Expression {
+public class PolygonExpression extends UnknownExpression implements ConditionalExpression {
   private String polygon;
-  private String columnName;
   private CustomIndex<List<Long[]>> handler;
-  private List<Expression> children = new ArrayList<Expression>();
+  private List<Long[]> ranges = new ArrayList<Long[]>();
+  private ColumnExpression column;
+  private ExpressionResult trueExpRes;
+  private ExpressionResult falseExpRes;
 
   public PolygonExpression(String polygon, String columnName, CustomIndex handler) {
     this.polygon = polygon;
     this.handler = handler;
-    this.columnName = columnName;
+    this.column = new ColumnExpression(columnName, DataTypes.LONG);
+    this.trueExpRes = new ExpressionResult(DataTypes.BOOLEAN, true);
+    this.falseExpRes = new ExpressionResult(DataTypes.BOOLEAN, false);
   }
 
-  private void buildExpression(List<Long[]> ranges) {
-    // Build InExpression with list of all the values present in the ranges
-    List<Expression> inList = new ArrayList<Expression>();
+  private void validate(List<Long[]> ranges) {
+    // Validate the ranges
     for (Long[] range : ranges) {
       if (range.length != 2) {
         throw new RuntimeException("Handler query must return list of ranges with each range "
             + "containing minimum and maximum values");
       }
-      for (long i = range[0]; i <= range[1]; i++) {
-        inList.add(new LiteralExpression(i, DataTypes.LONG));
-      }
     }
-    children.add(new InExpression(new ColumnExpression(columnName, DataTypes.LONG),
-        new ListExpression(inList)));
   }
 
   /**
-   * This method builds InExpression with list of all the values present in the list of ranges of
-   * IDs.
+   * This method calls the query processor and gets the list of ranges of IDs.
    */
   private void processExpression() {
-    List<Long[]> ranges;
     try {
       ranges = handler.query(polygon);
+      validate(ranges);
     } catch (Exception e) {
       throw new RuntimeException(e);
     }
-    buildExpression(ranges);
+  }
+
+  private boolean rangeBinarySearch(List<Long[]> ranges, long searchForNumber) {
+    Long[] range;
+    int low = 0, mid, high = ranges.size() - 1;
+    while (low <= high) {
+      mid = low + ((high - low) / 2);
+      range = ranges.get(mid);
+      if (searchForNumber >= range[0]) {
+        if (searchForNumber <= range[1]) {
+          // Return true if the number is between min and max values of the range
+          return true;
+        } else {
+          // Number is bigger than this range's min and max. Search on the right side of the range
+          low = mid + 1;
+        }
+      } else {
+        // Number is smaller than this range's min and max. Search on the left side of the range
+        high = mid - 1;
+      }
+    }
+    return false;
   }
 
   @Override
   public ExpressionResult evaluate(RowIntf value) {
-    throw new UnsupportedOperationException("Operation not supported for Polygon expression");
+    if (rangeBinarySearch(ranges, (Long) value.getVal(0))) {
 
 Review comment:
   Data is already sorted. So may be in future can take advantage of it , instead of checking each row exist in a range.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] VenuReddy2103 commented on a change in pull request #3616: [CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement

GitBox
In reply to this post by GitBox
VenuReddy2103 commented on a change in pull request #3616: [CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement
URL: https://github.com/apache/carbondata/pull/3616#discussion_r378293172
 
 

 ##########
 File path: geo/src/main/java/org/apache/carbondata/geo/scan/expression/PolygonExpression.java
 ##########
 @@ -41,64 +41,85 @@
  * InExpression with list of all the IDs present in those list of ranges.
  */
 @InterfaceAudience.Internal
-public class PolygonExpression extends Expression {
+public class PolygonExpression extends UnknownExpression implements ConditionalExpression {
   private String polygon;
-  private String columnName;
   private CustomIndex<List<Long[]>> handler;
-  private List<Expression> children = new ArrayList<Expression>();
+  private List<Long[]> ranges = new ArrayList<Long[]>();
+  private ColumnExpression column;
+  private ExpressionResult trueExpRes;
+  private ExpressionResult falseExpRes;
 
   public PolygonExpression(String polygon, String columnName, CustomIndex handler) {
     this.polygon = polygon;
     this.handler = handler;
-    this.columnName = columnName;
+    this.column = new ColumnExpression(columnName, DataTypes.LONG);
+    this.trueExpRes = new ExpressionResult(DataTypes.BOOLEAN, true);
+    this.falseExpRes = new ExpressionResult(DataTypes.BOOLEAN, false);
   }
 
-  private void buildExpression(List<Long[]> ranges) {
-    // Build InExpression with list of all the values present in the ranges
-    List<Expression> inList = new ArrayList<Expression>();
+  private void validate(List<Long[]> ranges) {
+    // Validate the ranges
     for (Long[] range : ranges) {
       if (range.length != 2) {
         throw new RuntimeException("Handler query must return list of ranges with each range "
             + "containing minimum and maximum values");
       }
-      for (long i = range[0]; i <= range[1]; i++) {
-        inList.add(new LiteralExpression(i, DataTypes.LONG));
-      }
     }
-    children.add(new InExpression(new ColumnExpression(columnName, DataTypes.LONG),
-        new ListExpression(inList)));
   }
 
   /**
-   * This method builds InExpression with list of all the values present in the list of ranges of
-   * IDs.
+   * This method calls the query processor and gets the list of ranges of IDs.
    */
   private void processExpression() {
-    List<Long[]> ranges;
     try {
       ranges = handler.query(polygon);
+      validate(ranges);
     } catch (Exception e) {
       throw new RuntimeException(e);
     }
-    buildExpression(ranges);
+  }
+
+  private boolean rangeBinarySearch(List<Long[]> ranges, long searchForNumber) {
 
 Review comment:
   Did the binary seach on ranges directly instead of expanding ranges to values in another list and doing collections.binarysearch() on the resultant one. Tried to avoid the unnecessary expansion of ranges. It can reduce the search complexity to order of ln(number of list of ranges).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] VenuReddy2103 commented on a change in pull request #3616: [CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement

GitBox
In reply to this post by GitBox
VenuReddy2103 commented on a change in pull request #3616: [CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement
URL: https://github.com/apache/carbondata/pull/3616#discussion_r378293942
 
 

 ##########
 File path: geo/src/main/java/org/apache/carbondata/geo/scan/expression/PolygonExpression.java
 ##########
 @@ -41,64 +41,85 @@
  * InExpression with list of all the IDs present in those list of ranges.
  */
 @InterfaceAudience.Internal
-public class PolygonExpression extends Expression {
+public class PolygonExpression extends UnknownExpression implements ConditionalExpression {
   private String polygon;
-  private String columnName;
   private CustomIndex<List<Long[]>> handler;
-  private List<Expression> children = new ArrayList<Expression>();
+  private List<Long[]> ranges = new ArrayList<Long[]>();
+  private ColumnExpression column;
+  private ExpressionResult trueExpRes;
+  private ExpressionResult falseExpRes;
 
   public PolygonExpression(String polygon, String columnName, CustomIndex handler) {
     this.polygon = polygon;
     this.handler = handler;
-    this.columnName = columnName;
+    this.column = new ColumnExpression(columnName, DataTypes.LONG);
+    this.trueExpRes = new ExpressionResult(DataTypes.BOOLEAN, true);
+    this.falseExpRes = new ExpressionResult(DataTypes.BOOLEAN, false);
   }
 
-  private void buildExpression(List<Long[]> ranges) {
-    // Build InExpression with list of all the values present in the ranges
-    List<Expression> inList = new ArrayList<Expression>();
+  private void validate(List<Long[]> ranges) {
+    // Validate the ranges
     for (Long[] range : ranges) {
       if (range.length != 2) {
         throw new RuntimeException("Handler query must return list of ranges with each range "
             + "containing minimum and maximum values");
       }
-      for (long i = range[0]; i <= range[1]; i++) {
-        inList.add(new LiteralExpression(i, DataTypes.LONG));
-      }
     }
-    children.add(new InExpression(new ColumnExpression(columnName, DataTypes.LONG),
-        new ListExpression(inList)));
   }
 
   /**
-   * This method builds InExpression with list of all the values present in the list of ranges of
-   * IDs.
+   * This method calls the query processor and gets the list of ranges of IDs.
    */
   private void processExpression() {
-    List<Long[]> ranges;
     try {
       ranges = handler.query(polygon);
+      validate(ranges);
     } catch (Exception e) {
       throw new RuntimeException(e);
     }
-    buildExpression(ranges);
+  }
+
+  private boolean rangeBinarySearch(List<Long[]> ranges, long searchForNumber) {
+    Long[] range;
+    int low = 0, mid, high = ranges.size() - 1;
+    while (low <= high) {
+      mid = low + ((high - low) / 2);
+      range = ranges.get(mid);
+      if (searchForNumber >= range[0]) {
+        if (searchForNumber <= range[1]) {
+          // Return true if the number is between min and max values of the range
+          return true;
+        } else {
+          // Number is bigger than this range's min and max. Search on the right side of the range
+          low = mid + 1;
+        }
+      } else {
+        // Number is smaller than this range's min and max. Search on the left side of the range
+        high = mid - 1;
+      }
+    }
+    return false;
   }
 
   @Override
   public ExpressionResult evaluate(RowIntf value) {
-    throw new UnsupportedOperationException("Operation not supported for Polygon expression");
+    if (rangeBinarySearch(ranges, (Long) value.getVal(0))) {
 
 Review comment:
   Yeah Agreed. Will think of it in future.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on issue #3616: [CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement

GitBox
In reply to this post by GitBox
ajantha-bhat commented on issue #3616: [CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement
URL: https://github.com/apache/carbondata/pull/3616#issuecomment-585240655
 
 
   LGTM

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] asfgit closed pull request #3616: [CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement

GitBox
In reply to this post by GitBox
asfgit closed pull request #3616: [CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement
URL: https://github.com/apache/carbondata/pull/3616
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services