Indhumathi27 opened a new pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846

### Why is this PR needed?

### What changes were proposed in this PR?

### Does this PR introduce any user interface change?
- No
- Yes. (please explain the change and update document)

### Is any new testcase added?
- No
- Yes

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [hidden email]
CarbonDataQA1 commented on pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#issuecomment-658779577

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1656/
CarbonDataQA1 commented on pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#issuecomment-658780719

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3398/
CarbonDataQA1 commented on pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#issuecomment-658879364

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3400/
CarbonDataQA1 commented on pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#issuecomment-658880484

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1658/
akashrn5 commented on a change in pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#discussion_r457857926

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala

@@ -175,6 +178,21 @@ case class CarbonMergeDataSetCommand(
       LOGGER.error("writing of update status file failed")
       throw new CarbonMergeDataSetException("writing of update status file failed")
     }
+    if (carbonTable.isHivePartitionTable) {
+      // If load count is 0 and if merge action contains delete operation, update
+      // tableUpdateStatus file name in loadMeta entry
+      if (count == 0 && hasDelAction && !tuple._1.isEmpty) {
+        val loadMetaDataDetails = SegmentStatusManager.readTableStatusFile(CarbonTablePath
+          .getTableStatusFilePath(carbonTable.getTablePath))
+        CarbonUpdateUtil.updateTableMetadataStatus(loadMetaDataDetails.map(l =>
+          new Segment(l.getMergedLoadName,

Review comment: rename variable `l` to `loadMetadataDetails`

File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala

@@ -725,6 +725,45 @@ class MergeTestCase extends QueryTest with BeforeAndAfterAll {
     checkAnswer(sql("select * from target order by key"),
       Seq(Row("c", "200"), Row("d", "3"), Row("e", "100")))
   }
+  test("check the ccd delete with partition") {
+    sql("drop table if exists target")
+
+    val initframe = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", "0"),
+      Row("b", "1"),
+      Row("c", "2"),
+      Row("d", "3")
+    ).asJava, StructType(Seq(StructField("key", StringType), StructField("value", StringType))))
+
+    initframe.write
+      .format("carbondata")
+      .option("tableName", "target")
+      .option("partitionColumns", "value")
+      .mode(SaveMode.Overwrite)
+      .save()
+    val target = sqlContext.read.format("carbondata").option("tableName", "target").load()
+    var ccd =
+      sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", null, true, 1),
+        Row("b", null, true, 2),
+        Row("c", null, true, 3),
+        Row("e", "100", false, 6)
+      ).asJava,
+      StructType(Seq(StructField("key", StringType),
+        StructField("newValue", StringType),
+        StructField("deleted", BooleanType), StructField("time", IntegerType))))
+    ccd.createOrReplaceTempView("changes")
+
+    ccd = sql("SELECT key, latest.newValue as newValue, latest.deleted as deleted FROM ( SELECT key, max(struct(time, newValue, deleted)) as latest FROM changes GROUP BY key)")
+
+    target.as("A").merge(ccd.as("B"), "A.key=B.key").
+      whenMatched("B.deleted=true").delete().execute()
+
+    assert(getDeleteDeltaFileCount("target", "0") == 0)

Review comment: if delete operation is successful, then it should contain delete delta file, please check and add proper assert for delete delta file count
CarbonDataQA1 commented on pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#issuecomment-661748809

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3449/
CarbonDataQA1 commented on pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#issuecomment-661749375

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1707/
akashrn5 commented on a change in pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#discussion_r458017848

File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala

@@ -175,6 +178,21 @@ case class CarbonMergeDataSetCommand(
       LOGGER.error("writing of update status file failed")
       throw new CarbonMergeDataSetException("writing of update status file failed")
     }
+    if (carbonTable.isHivePartitionTable) {
+      // If load count is 0 and if merge action contains delete operation, update
+      // tableUpdateStatus file name in loadMeta entry
+      if (count == 0 && hasDelAction && !tuple._1.isEmpty) {
+        val loadMetaDataDetails = SegmentStatusManager.readTableStatusFile(CarbonTablePath
+          .getTableStatusFilePath(carbonTable.getTablePath))
+        CarbonUpdateUtil.updateTableMetadataStatus(loadMetaDataDetails.map(loadMetadataDetails =>

Review comment:
```suggestion
        CarbonUpdateUtil.updateTableMetadataStatus(loadMetaDataDetails.map(loadMetadataDetail =>
```
akashrn5 commented on a change in pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#discussion_r458018601

File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala

@@ -725,6 +725,47 @@ class MergeTestCase extends QueryTest with BeforeAndAfterAll {
     checkAnswer(sql("select * from target order by key"),
       Seq(Row("c", "200"), Row("d", "3"), Row("e", "100")))
   }
+  test("check the ccd delete with partition") {

Review comment:
```suggestion
  test("check the cdc delete with partition") {
```
akashrn5 commented on a change in pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#discussion_r458018787

File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala

@@ -725,6 +725,47 @@ class MergeTestCase extends QueryTest with BeforeAndAfterAll {
     checkAnswer(sql("select * from target order by key"),
       Seq(Row("c", "200"), Row("d", "3"), Row("e", "100")))
   }
+  test("check the ccd delete with partition") {
+    sql("drop table if exists target")
+
+    val initframe = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", "0"),
+      Row("a1", "0"),
+      Row("b", "1"),
+      Row("c", "2"),
+      Row("d", "3")
+    ).asJava, StructType(Seq(StructField("key", StringType), StructField("value", StringType))))
+
+    initframe.write
+      .format("carbondata")
+      .option("tableName", "target")
+      .option("partitionColumns", "value")
+      .mode(SaveMode.Overwrite)
+      .save()
+    val target = sqlContext.read.format("carbondata").option("tableName", "target").load()
+    var ccd =

Review comment:
```suggestion
    var cdc =
```
akashrn5 commented on a change in pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#discussion_r458018898

File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala

@@ -725,6 +725,47 @@ class MergeTestCase extends QueryTest with BeforeAndAfterAll {
     checkAnswer(sql("select * from target order by key"),
       Seq(Row("c", "200"), Row("d", "3"), Row("e", "100")))
   }
+  test("check the ccd delete with partition") {
+    sql("drop table if exists target")
+
+    val initframe = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", "0"),
+      Row("a1", "0"),
+      Row("b", "1"),
+      Row("c", "2"),
+      Row("d", "3")
+    ).asJava, StructType(Seq(StructField("key", StringType), StructField("value", StringType))))
+
+    initframe.write
+      .format("carbondata")
+      .option("tableName", "target")
+      .option("partitionColumns", "value")
+      .mode(SaveMode.Overwrite)
+      .save()
+    val target = sqlContext.read.format("carbondata").option("tableName", "target").load()
+    var ccd =
+      sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", null, true, 1),
+        Row("a1", null, false, 1),
+        Row("b", null, true, 2),
+        Row("c", null, true, 3),
+        Row("e", "100", false, 6)
+      ).asJava,
+      StructType(Seq(StructField("key", StringType),
+        StructField("newValue", StringType),
+        StructField("deleted", BooleanType), StructField("time", IntegerType))))
+    ccd.createOrReplaceTempView("changes")
+
+    ccd = sql("SELECT key, latest.newValue as newValue, latest.deleted as deleted FROM ( SELECT key, max(struct(time, newValue, deleted)) as latest FROM changes GROUP BY key)")

Review comment: same as above
akashrn5 commented on a change in pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#discussion_r458019493

File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala

@@ -869,10 +910,15 @@ class MergeTestCase extends QueryTest with BeforeAndAfterAll {
     CarbonProperties.getInstance().addProperty("carbon.enable.auto.load.merge", "false")
   }

-  private def getDeleteDeltaFileCount(tableName: String, segment: String): Int = {
+  private def getDeleteDeltaFileCount(tableName: String,
+      segment: String,

Review comment: no need to change the method signature, you are already having the carbonTable object, check from it whether its partition table or not.
CarbonDataQA1 commented on pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#issuecomment-662011240

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3458/
CarbonDataQA1 commented on pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#issuecomment-662012100

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1716/
akashrn5 commented on pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846#issuecomment-662235965

LGTM
asfgit closed pull request #3846:
URL: https://github.com/apache/carbondata/pull/3846