[GitHub] [carbondata] akashrn5 opened a new pull request #3856: [WIP]cdc improvement

classic Classic list List threaded Threaded
29 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ravipesala commented on pull request #3856: [CARBONDATA-3929]Improve CDC performance

GitBox

ravipesala commented on pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#issuecomment-679095323


   LGTM


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] QiangCai commented on pull request #3856: [CARBONDATA-3929]Improve CDC performance

GitBox
In reply to this post by GitBox

QiangCai commented on pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#issuecomment-679105065


   why don't use row_v1 format in the streaming module?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on pull request #3856: [CARBONDATA-3929]Improve CDC performance

GitBox
In reply to this post by GitBox

akashrn5 commented on pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#issuecomment-680036760


   > why don't use row_v1 format in the streaming module?
   already avro is row format and same write interface it was using. I took the same with less code changes in merge command base code and as its more stable, later if we find better performance or other advantages we can try
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 edited a comment on pull request #3856: [CARBONDATA-3929]Improve CDC performance

GitBox
In reply to this post by GitBox

akashrn5 edited a comment on pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#issuecomment-680036760


   > why don't use row_v1 format in the streaming module?
   
   already avro is row format and same write interface it was using. I took the same with less code changes in merge command base code and as its more stable, later if we find better performance or other advantages we can try
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on pull request #3856: [CARBONDATA-3929]Improve CDC performance

GitBox
In reply to this post by GitBox

akashrn5 commented on pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#issuecomment-680117709


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3856: [CARBONDATA-3929]Improve CDC performance

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#issuecomment-680183456






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] asfgit closed pull request #3856: [CARBONDATA-3929]Improve CDC performance

GitBox
In reply to this post by GitBox

asfgit closed pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3856: [CARBONDATA-3929]Improve CDC performance

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#discussion_r477095415



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -194,29 +210,32 @@ case class CarbonMergeDataSetCommand(
             tuple._2.asJava)
         }
       }
-      Some(UpdateTableModel(true, trxMgr.getLatestTrx,
-        executorErrors, tuple._2, true))
+      Some(UpdateTableModel(isUpdate = true, trxMgr.getLatestTrx,
+        executorErrors, tuple._2, loadAsNewSegment = true))
     } else {
       None
     }
 
-    CarbonInsertIntoWithDf(
-      databaseNameOp = Some(carbonTable.getDatabaseName),
+    val dataFrame = loadDF.select(tableCols.map(col): _*)
+    CarbonInsertIntoCommand(databaseNameOp = Some(carbonTable.getDatabaseName),
       tableName = carbonTable.getTableName,
-      options = Map("fileheader" -> header, "sort_scope" -> "nosort"),
+      options = Map("fileheader" -> header, "sort_scope" -> "no_sort"),

Review comment:
       @akashrn5 : instead of fixing no_sort, would have analyzed why it was added. Originally it was using Target table sort scope only. #3764 added this. would have removed it here.
   Now that you changed to no_sort, target table sort_scope is not used.
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3856: [CARBONDATA-3929]Improve CDC performance

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3856:
URL: https://github.com/apache/carbondata/pull/3856#discussion_r477095488



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -194,29 +210,32 @@ case class CarbonMergeDataSetCommand(
             tuple._2.asJava)
         }
       }
-      Some(UpdateTableModel(true, trxMgr.getLatestTrx,
-        executorErrors, tuple._2, true))
+      Some(UpdateTableModel(isUpdate = true, trxMgr.getLatestTrx,
+        executorErrors, tuple._2, loadAsNewSegment = true))
     } else {
       None
     }
 
-    CarbonInsertIntoWithDf(
-      databaseNameOp = Some(carbonTable.getDatabaseName),
+    val dataFrame = loadDF.select(tableCols.map(col): _*)
+    CarbonInsertIntoCommand(databaseNameOp = Some(carbonTable.getDatabaseName),
       tableName = carbonTable.getTableName,
-      options = Map("fileheader" -> header, "sort_scope" -> "nosort"),
+      options = Map("fileheader" -> header, "sort_scope" -> "no_sort"),

Review comment:
       check this.
   https://github.com/apache/carbondata/pull/3901




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


12