[GitHub] [carbondata] ajantha-bhat opened a new pull request #3932: [WIP] order by limit pushdown for array_contains filter

classic Classic list List threaded Threaded
38 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat opened a new pull request #3932: [WIP] order by limit pushdown for array_contains filter

GitBox

ajantha-bhat opened a new pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932


    ### Why is this PR needed?
   
   
    ### What changes were proposed in this PR?
   
       
    ### Does this PR introduce any user interface change?
    - No
    - Yes. (please explain the change and update document)
   
    ### Is any new testcase added?
    - No
    - Yes
   
       
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3932: [WIP] order by limit pushdown for array_contains filter

GitBox

CarbonDataQA1 commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-693381520






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

ajantha-bhat commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-693445107


   @QiangCai , @li36909 please check


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-693520481


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4108/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-693522190


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2366/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] kumarvishal09 commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

kumarvishal09 commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-694359603


   @ajantha-bhat This change will give incorrect result if table has multiple sort columns and order by is on any other sort column other on first position, as apart from first sort column rest other will be partially sorted). I think this is very specific scenario handling.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] kunal642 commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

kunal642 commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-695900793


   @ajantha-bhat TakeOrderedAndProject collects the data(taking limit into account) from multiple tasks and then performs a final sort, carbon cannot avoid this otherwise the sequence of results would be wrong


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] kunal642 commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

kunal642 commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-695900793


   @ajantha-bhat TakeOrderedAndProject collects the data(taking limit into account) from multiple tasks and then performs a final sort, carbon cannot avoid this otherwise the sequence of results would be wrong


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

ajantha-bhat commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-707070709


   @kumarvishal09 : Agree with you. It has to be the first sort_column only.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

ajantha-bhat commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-707071580


   @kunal642 : This optimization is only when first sort column is in order by column.
   so, the data in each task is completely sorted. so applying limit on each task and order by once on the limit data will give correct results.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-707209130


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4392/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-707211367


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2640/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

ajantha-bhat commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-707911553


   @QiangCai @kumarvishal09 : Yes, also it a specific scenario improvement. It has given very good improvement in one particular user scenario. Please check it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] kunal642 commented on a change in pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is a first sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

kunal642 commented on a change in pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#discussion_r506085998



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/CarbonTakeOrderedAndProjectExec.scala
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Attribute, NamedExpression, SortOrder, UnsafeProjection}
+import org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering
+import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, SinglePartition}
+import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
+import org.apache.spark.util.Utils
+
+// To skip the order at map task
+case class CarbonTakeOrderedAndProjectExec(
+    limit: Int,
+    sortOrder: Seq[SortOrder],
+    projectList: Seq[NamedExpression],
+    child: SparkPlan,
+    skipMapOrder: Boolean = false,
+    readFromHead: Boolean = true) extends UnaryExecNode {

Review comment:
       CarbonTakeOrderedAndProjectExec should extend TakeOrderedAndProjectExec, and unmodified methods should not be overridden




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] kunal642 commented on a change in pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is a first sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

kunal642 commented on a change in pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#discussion_r506086546



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala
##########
@@ -984,6 +988,80 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
           null)(sparkSession)
     }
   }
+
+  object ExtractTakeOrderedAndProjectExec {
+
+    def unapply(plan: LogicalPlan): Option[CarbonTakeOrderedAndProjectExec] = {
+      val allRelations = plan.collect { case logicalRelation: LogicalRelation => logicalRelation }
+      // push down order by limit to carbon map task,
+      // only when there are only one CarbonDatasourceHadoopRelation
+      if (allRelations.size != 1 ||
+          allRelations.exists(x => !x.relation.isInstanceOf[CarbonDatasourceHadoopRelation])) {
+        return None
+      }
+      //  check and Replace TakeOrderedAndProject with CarbonTakeOrderedAndProjectExec.
+      val relation = allRelations.head.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
+      val sparkPlan = plan match {
+        case ReturnAnswer(rootPlan) => rootPlan match {
+          case Limit(IntegerLiteral(limit), Sort(order, true, child)) =>
+            TakeOrderedAndProjectExec(limit,
+              order,
+              child.output,
+              planLater(pushLimit(limit, child)))
+          case Limit(IntegerLiteral(limit), Project(projectList, Sort(order, true, child))) =>
+            TakeOrderedAndProjectExec(limit, order, projectList, planLater(pushLimit(limit, child)))

Review comment:
       instead of TakeOrderedAndProjectExec, we should directly prepare CarbonTakeOrderedAndProjectExec instance based on the checks below. I feel this TakeOrderedAndProjectExec creation is unnecessary




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is a first sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#discussion_r508460403



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala
##########
@@ -984,6 +988,80 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
           null)(sparkSession)
     }
   }
+
+  object ExtractTakeOrderedAndProjectExec {
+
+    def unapply(plan: LogicalPlan): Option[CarbonTakeOrderedAndProjectExec] = {
+      val allRelations = plan.collect { case logicalRelation: LogicalRelation => logicalRelation }
+      // push down order by limit to carbon map task,
+      // only when there are only one CarbonDatasourceHadoopRelation
+      if (allRelations.size != 1 ||
+          allRelations.exists(x => !x.relation.isInstanceOf[CarbonDatasourceHadoopRelation])) {
+        return None
+      }
+      //  check and Replace TakeOrderedAndProject with CarbonTakeOrderedAndProjectExec.
+      val relation = allRelations.head.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
+      val sparkPlan = plan match {
+        case ReturnAnswer(rootPlan) => rootPlan match {
+          case Limit(IntegerLiteral(limit), Sort(order, true, child)) =>
+            TakeOrderedAndProjectExec(limit,
+              order,
+              child.output,
+              planLater(pushLimit(limit, child)))
+          case Limit(IntegerLiteral(limit), Project(projectList, Sort(order, true, child))) =>
+            TakeOrderedAndProjectExec(limit, order, projectList, planLater(pushLimit(limit, child)))

Review comment:
       ok. Handled.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is a first sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#discussion_r508460792



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/CarbonTakeOrderedAndProjectExec.scala
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Attribute, NamedExpression, SortOrder, UnsafeProjection}
+import org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering
+import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, SinglePartition}
+import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
+import org.apache.spark.util.Utils
+
+// To skip the order at map task
+case class CarbonTakeOrderedAndProjectExec(
+    limit: Int,
+    sortOrder: Seq[SortOrder],
+    projectList: Seq[NamedExpression],
+    child: SparkPlan,
+    skipMapOrder: Boolean = false,
+    readFromHead: Boolean = true) extends UnaryExecNode {

Review comment:
       Both are case class, so cannot extend this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is a first sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-712837186


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2784/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is a first sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-712853927


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4538/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3932: [CARBONDATA-3994] Skip Order by for map task if it is a first sort column and use limit pushdown for array_contains filter

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3932:
URL: https://github.com/apache/carbondata/pull/3932#issuecomment-713136611


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2794/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


12