Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] [carbondata] akashrn5 opened a new pull request #3856: [WIP]cdc improvement

Classic

List

29 messages Options

Options

12

GitBox

[GitHub] [carbondata] ravipesala commented on pull request #3856: [CARBONDATA-3929]Improve CDC performance

ravipesala commented on pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#issuecomment-679095323

LGTM

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] QiangCai commented on pull request #3856: [CARBONDATA-3929]Improve CDC performance

In reply to this post by GitBox

QiangCai commented on pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#issuecomment-679105065

why don't use row_v1 format in the streaming module?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] akashrn5 commented on pull request #3856: [CARBONDATA-3929]Improve CDC performance

In reply to this post by GitBox

akashrn5 commented on pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#issuecomment-680036760

> why don't use row_v1 format in the streaming module?
already avro is row format and same write interface it was using. I took the same with less code changes in merge command base code and as its more stable, later if we find better performance or other advantages we can try

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] akashrn5 edited a comment on pull request #3856: [CARBONDATA-3929]Improve CDC performance

In reply to this post by GitBox

akashrn5 edited a comment on pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#issuecomment-680036760

> why don't use row_v1 format in the streaming module?

already avro is row format and same write interface it was using. I took the same with less code changes in merge command base code and as its more stable, later if we find better performance or other advantages we can try

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] akashrn5 commented on pull request #3856: [CARBONDATA-3929]Improve CDC performance

In reply to this post by GitBox

akashrn5 commented on pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#issuecomment-680117709

retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3856: [CARBONDATA-3929]Improve CDC performance

In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#issuecomment-680183456

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] asfgit closed pull request #3856: [CARBONDATA-3929]Improve CDC performance

In reply to this post by GitBox

asfgit closed pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3856: [CARBONDATA-3929]Improve CDC performance

In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#discussion_r477095415

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -194,29 +210,32 @@ case class CarbonMergeDataSetCommand(
tuple._2.asJava)
}
}
- Some(UpdateTableModel(true, trxMgr.getLatestTrx,
- executorErrors, tuple._2, true))
+ Some(UpdateTableModel(isUpdate = true, trxMgr.getLatestTrx,
+ executorErrors, tuple._2, loadAsNewSegment = true))
} else {
None
}

- CarbonInsertIntoWithDf(
- databaseNameOp = Some(carbonTable.getDatabaseName),
+ val dataFrame = loadDF.select(tableCols.map(col): _*)
+ CarbonInsertIntoCommand(databaseNameOp = Some(carbonTable.getDatabaseName),
tableName = carbonTable.getTableName,
- options = Map("fileheader" -> header, "sort_scope" -> "nosort"),
+ options = Map("fileheader" -> header, "sort_scope" -> "no_sort"),

Review comment:
@akashrn5 : instead of fixing no_sort, would have analyzed why it was added. Originally it was using Target table sort scope only. #3764 added this. would have removed it here.
Now that you changed to no_sort, target table sort_scope is not used.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3856: [CARBONDATA-3929]Improve CDC performance

In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#discussion_r477095488

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -194,29 +210,32 @@ case class CarbonMergeDataSetCommand(
tuple._2.asJava)
}
}
- Some(UpdateTableModel(true, trxMgr.getLatestTrx,
- executorErrors, tuple._2, true))
+ Some(UpdateTableModel(isUpdate = true, trxMgr.getLatestTrx,
+ executorErrors, tuple._2, loadAsNewSegment = true))
} else {
None
}

- CarbonInsertIntoWithDf(
- databaseNameOp = Some(carbonTable.getDatabaseName),
+ val dataFrame = loadDF.select(tableCols.map(col): _*)
+ CarbonInsertIntoCommand(databaseNameOp = Some(carbonTable.getDatabaseName),
tableName = carbonTable.getTableName,
- options = Map("fileheader" -> header, "sort_scope" -> "nosort"),
+ options = Map("fileheader" -> header, "sort_scope" -> "no_sort"),

Review comment:
check this.
https://github.com/apache/carbondata/pull/3901

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

12