[GitHub] [carbondata] ajantha-bhat opened a new pull request #3709: [HOTFIX] Avoid reading table status file again per query

classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat opened a new pull request #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
ajantha-bhat opened a new pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709
 
 
    ### Why is this PR needed?
    After add segment feature, we take snapshot twice (which leads in two times table status IO), we can refuse the original snapshot
   a. First time snapshot in LateDecodeStrategy.pruneFilterProjectRaw()
   b. second time in CarbonTableInputFormat.getSplits()
   
    ### What changes were proposed in this PR?
   Take snapshot only once and reuse in the second place
       
    ### Does this PR introduce any user interface change?
    - No
   
    ### Is any new testcase added?
    - No
   
   
       
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-613567833
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2742/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-613573305
 
 
   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1029/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-613905604
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2745/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-613906384
 
 
   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1032/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410565322
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala
 ##########
 @@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
         => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
 
 Review comment:
   do not declare isTransactionalTable, use `table.carbonTable.isTransactionalTable` in line 451

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410566145
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala
 ##########
 @@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
         => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
+    val readCommittedScope = new TableStatusReadCommittedScope(
 
 Review comment:
   This is only used to set in CarbonScanRDD, right?
   Can we move it to the constructor of CarbonScanRDD? Then we do not need to set it explicitly
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410585722
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala
 ##########
 @@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
         => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
 
 Review comment:
   ok. done

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410586507
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala
 ##########
 @@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
         => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
+    val readCommittedScope = new TableStatusReadCommittedScope(
 
 Review comment:
   I thought about it, but multiple method signature need to modified to pass readcommitted scope till scanRDD.
   So, I used set method.
   
   Currrently, directFill and vectorReader variables are also set similar way

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410593681
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala
 ##########
 @@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
         => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
+    val readCommittedScope = new TableStatusReadCommittedScope(
 
 Review comment:
   You mean many places created the readcommitted scope by themselves? But in that case, it will be hard to avoid using wrong committed scope in scanRDD.getPartition, like maybe in some flow someone created a scanRDD but forget to set the read committed scope.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410597974
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala
 ##########
 @@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
         => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
+    val readCommittedScope = new TableStatusReadCommittedScope(
 
 Review comment:
   In carbonScanRDD line 140, I have added check for tableInputFormat instance. so, it won't use wrong readcommitted scope.
   Also only TableInputFormat case will use TableStatusReadCommittedScope, other case it is LatestFileReadCommittedScope.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-615566542
 
 
   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1060/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-615572949
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2773/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on issue #3709: [HOTFIX] Avoid reading table status file again per query

GitBox
In reply to this post by GitBox
jackylk commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-616177802
 
 
   LGTM

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services