ajantha-bhat opened a new pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709

### Why is this PR needed?
After the add segment feature, we take the snapshot twice (which leads to two table status IO operations); we can reuse the original snapshot.
a. First snapshot in LateDecodeStrategy.pruneFilterProjectRaw()
b. Second snapshot in CarbonTableInputFormat.getSplits()

### What changes were proposed in this PR?
Take the snapshot only once and reuse it in the second place.

### Does this PR introduce any user interface change?
- No

### Is any new testcase added?
- No

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services
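The take-once-and-reuse idea behind the PR can be sketched roughly as follows. This is a minimal, self-contained model, not the actual patch: `TableStatusSnapshot`, `TableStatusReader`, `ScanRdd`, and their methods are all hypothetical stand-ins for CarbonData's real classes (TableStatusReadCommittedScope, CarbonScanRDD).

```scala
// Illustrative sketch only: take the table status snapshot once and inject
// it into the scan RDD, instead of re-reading the status file per phase.
// All names here are hypothetical stand-ins, not CarbonData's real API.

// A "snapshot" of the table status file; reading it is the costly IO.
case class TableStatusSnapshot(segments: Seq[String])

object TableStatusReader {
  var readCount = 0 // tracks how often the status file is actually read

  def readFromStore(): TableStatusSnapshot = {
    readCount += 1 // one IO per call
    TableStatusSnapshot(Seq("segment_0", "segment_1"))
  }
}

class ScanRdd {
  // The snapshot is injected once (via a setter, as discussed in review),
  // so getSplits() does not have to re-read the status file.
  private var scope: Option[TableStatusSnapshot] = None

  def setReadCommittedScope(s: TableStatusSnapshot): Unit = scope = Some(s)

  def getSplits(): Seq[String] =
    scope.getOrElse(TableStatusReader.readFromStore()).segments
}

object Demo extends App {
  // Phase a: pruning takes the snapshot (the first and only read).
  val snapshot = TableStatusReader.readFromStore()
  // Phase b: reuse the same snapshot in the scan RDD's getSplits().
  val rdd = new ScanRdd
  rdd.setReadCommittedScope(snapshot)
  rdd.getSplits()
  assert(TableStatusReader.readCount == 1) // only one table status IO
}
```

Without the `setReadCommittedScope` call, `getSplits()` would fall back to reading the store again, which models the double-IO behavior this PR removes.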
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-613567833

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2742/
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-613573305

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1029/
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-613905604

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2745/
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-613906384

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1032/
jackylk commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410565322

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala

@@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
       => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable

Review comment: do not declare isTransactionalTable; use `table.carbonTable.isTransactionalTable` in line 451
jackylk commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410566145

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala

@@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
       => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
+    val readCommittedScope = new TableStatusReadCommittedScope(

Review comment: This is only used to set in CarbonScanRDD, right? Can we move it to the constructor of CarbonScanRDD? Then we do not need to set it explicitly.
ajantha-bhat commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410585722

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala

@@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
       => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable

Review comment: ok, done
ajantha-bhat commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410586507

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala

@@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
       => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
+    val readCommittedScope = new TableStatusReadCommittedScope(

Review comment: I thought about it, but multiple method signatures would need to be modified to pass the read committed scope down to the scan RDD, so I used a set method. Currently, the directFill and vectorReader variables are also set in a similar way.
jackylk commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410593681

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala

@@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
       => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
+    val readCommittedScope = new TableStatusReadCommittedScope(

Review comment: You mean many places create the read committed scope by themselves? But in that case it will be hard to avoid using the wrong committed scope in scanRDD.getPartition; for example, in some flow someone might create a scanRDD but forget to set the read committed scope.
ajantha-bhat commented on a change in pull request #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#discussion_r410597974

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/CarbonLateDecodeStrategy.scala

@@ -349,8 +349,15 @@ private[sql] class CarbonLateDecodeStrategy extends SparkStrategy {
       scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter], Seq[PartitionSpec])
       => RDD[InternalRow]): CodegenSupport = {
     val table = relation.relation.asInstanceOf[CarbonDatasourceHadoopRelation]
-    val extraRdd = MixedFormatHandler.extraRDD(relation, rawProjects, filterPredicates,
-      new TableStatusReadCommittedScope(table.identifier, FileFactory.getConfiguration),
+    val isTransactionalTable = table.carbonTable.isTransactionalTable
+    val readCommittedScope = new TableStatusReadCommittedScope(

Review comment: In carbonScanRDD line 140, I have added a check for the tableInputFormat instance, so it won't use the wrong read committed scope. Also, only the TableInputFormat case will use TableStatusReadCommittedScope; in the other case it is LatestFileReadCommittedScope.
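The guard described above can be sketched as follows. This is an illustrative model only, assuming the behavior the comment describes; the types and the `scopeFor` helper are hypothetical, not CarbonData's actual `CarbonScanRDD` code.

```scala
// Illustrative sketch of guarding the reused scope: only the
// table-input-format path reuses the injected table-status-style scope;
// every other path falls back to a latest-file-style scope.
// All names below are hypothetical stand-ins for the real classes.
object ScopeSelection {
  sealed trait ReadCommittedScope
  case object TableStatusScope extends ReadCommittedScope
  case object LatestFileScope extends ReadCommittedScope

  sealed trait InputFormat
  case object TableInputFormat extends InputFormat
  case object FileInputFormat extends InputFormat

  def scopeFor(format: InputFormat,
               injected: Option[ReadCommittedScope]): ReadCommittedScope =
    format match {
      // Reuse the injected snapshot only when the format matches; this
      // avoids applying a wrong or stale scope in getPartitions when a
      // caller forgot to set it, or set it on the wrong kind of scan.
      case TableInputFormat => injected.getOrElse(TableStatusScope)
      case _                => LatestFileScope
    }
}

object ScopeDemo extends App {
  import ScopeSelection._
  // Table input format: the injected scope is reused.
  assert(scopeFor(TableInputFormat, Some(TableStatusScope)) == TableStatusScope)
  // Any other format: the injected scope is ignored, latest-file is used.
  assert(scopeFor(FileInputFormat, Some(TableStatusScope)) == LatestFileScope)
}
```

The instance check keeps the optimization opt-in: a scan that never receives the snapshot still computes a correct scope on its own, which addresses the reviewer's concern about forgotten setters.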
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-615566542

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1060/
CarbonDataQA1 commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-615572949

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2773/
jackylk commented on issue #3709: [HOTFIX] Avoid reading table status file again per query
URL: https://github.com/apache/carbondata/pull/3709#issuecomment-616177802

LGTM