ajantha-bhat opened a new pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709

### Why is this PR needed?
After the add segment feature, we take the snapshot twice (which leads to two table status IO operations); we can reuse the original snapshot.
a. First snapshot in LateDecodeStrategy.pruneFilterProjectRaw()
b. Second snapshot in CarbonTableInputFormat.getSplits()

### What changes were proposed in this PR?
Take the snapshot only once and reuse it in the second place.

### Does this PR introduce any user interface change?
- No

### Is any new testcase added?
- No

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services
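The take-once-and-reuse idea behind the PR can be sketched roughly as follows. This is a minimal, self-contained model, not the actual patch: `TableStatusSnapshot`, `TableStatusReader`, `ScanRdd`, and their methods are all hypothetical stand-ins for CarbonData's real classes (TableStatusReadCommittedScope, CarbonScanRDD).

```scala
// Illustrative sketch only: take the table status snapshot once and inject
// it into the scan RDD, instead of re-reading the status file per phase.
// All names here are hypothetical stand-ins, not CarbonData's real API.

// A "snapshot" of the table status file; reading it is the costly IO.
case class TableStatusSnapshot(segments: Seq[String])

object TableStatusReader {
  var readCount = 0 // tracks how often the status file is actually read

  def readFromStore(): TableStatusSnapshot = {
    readCount += 1 // one IO per call
    TableStatusSnapshot(Seq("segment_0", "segment_1"))
  }
}

class ScanRdd {
  // The snapshot is injected once (via a setter, as discussed in review),
  // so getSplits() does not have to re-read the status file.
  private var scope: Option[TableStatusSnapshot] = None

  def setReadCommittedScope(s: TableStatusSnapshot): Unit = scope = Some(s)

  def getSplits(): Seq[String] =
    scope.getOrElse(TableStatusReader.readFromStore()).segments
}

object Demo extends App {
  // Phase a: pruning takes the snapshot (the first and only read).
  val snapshot = TableStatusReader.readFromStore()
  // Phase b: reuse the same snapshot in the scan RDD's getSplits().
  val rdd = new ScanRdd
  rdd.setReadCommittedScope(snapshot)
  rdd.getSplits()
  assert(TableStatusReader.readCount == 1) // only one table status IO
}
```

Without the `setReadCommittedScope` call, `getSplits()` would fall back to reading the store again, which models the double-IO behavior this PR removes.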
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-613567833

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2742/
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-613573305

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1029/
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-613905604

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2745/
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-613906384

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1032/
jackylk commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410565322

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala

@@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
       => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable

Review comment: do not declare isTransactionalTable; use `table.carbonTable.isTransactionalTable` in line 451
jackylk commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410566145

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala

@@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
       => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
+    val readCommittedScope = new TableStatusReadCommittedScope(

Review comment: This is only used to set in CarbonScanRDD, right? Can we move it to the constructor of CarbonScanRDD? Then we do not need to set it explicitly.
ajantha-bhat commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410585722

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala

@@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
       => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable

Review comment: ok, done
ajantha-bhat commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410586507

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala

@@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
       => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
+    val readCommittedScope = new TableStatusReadCommittedScope(

Review comment: I thought about it, but multiple method signatures would need to be modified to pass the read committed scope down to the scan RDD, so I used a set method. Currently, the directFill and vectorReader variables are also set in a similar way.
jackylk commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410593681

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala

@@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
       => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
+    val readCommittedScope = new TableStatusReadCommittedScope(

Review comment: You mean many places create the read committed scope by themselves? But in that case it will be hard to avoid using the wrong committed scope in scanRDD.getPartition; for example, in some flow someone might create a scanRDD but forget to set the read committed scope.
ajantha-bhat commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410597974

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala

@@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
       => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
+    val readCommittedScope = new TableStatusReadCommittedScope(

Review comment: In carbonScanRDD line 140, I have added a check for the tableInputFormat instance, so it won't use the wrong read committed scope. Also, only the TableInputFormat case will use TableStatusReadCommittedScope; in the other case it is LatestFileReadCommittedScope.
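The guard described above can be sketched as follows. This is an illustrative model only, assuming the behavior the comment describes; the types and the `scopeFor` helper are hypothetical, not CarbonData's actual `CarbonScanRDD` code.

```scala
// Illustrative sketch of guarding the reused scope: only the
// table-input-format path reuses the injected table-status-style scope;
// every other path falls back to a latest-file-style scope.
// All names below are hypothetical stand-ins for the real classes.
object ScopeSelection {
  sealed trait ReadCommittedScope
  case object TableStatusScope extends ReadCommittedScope
  case object LatestFileScope extends ReadCommittedScope

  sealed trait InputFormat
  case object TableInputFormat extends InputFormat
  case object FileInputFormat extends InputFormat

  def scopeFor(format: InputFormat,
               injected: Option[ReadCommittedScope]): ReadCommittedScope =
    format match {
      // Reuse the injected snapshot only when the format matches; this
      // avoids applying a wrong or stale scope in getPartitions when a
      // caller forgot to set it, or set it on the wrong kind of scan.
      case TableInputFormat => injected.getOrElse(TableStatusScope)
      case _                => LatestFileScope
    }
}

object ScopeDemo extends App {
  import ScopeSelection._
  // Table input format: the injected scope is reused.
  assert(scopeFor(TableInputFormat, Some(TableStatusScope)) == TableStatusScope)
  // Any other format: the injected scope is ignored, latest-file is used.
  assert(scopeFor(FileInputFormat, Some(TableStatusScope)) == LatestFileScope)
}
```

The instance check keeps the optimization opt-in: a scan that never receives the snapshot still computes a correct scope on its own, which addresses the reviewer's concern about forgotten setters.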
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-615566542

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1060/
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-615572949

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2773/
jackylk commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-616177802

LGTM