[GitHub] [carbondata] Indhumathi27 opened a new pull request #3835: [WIP] Fix Data update issue with SCD/CDC for partition table when directlyWriteData2Hdfs is enabled

classic Classic list List threaded Threaded
23 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Indhumathi27 opened a new pull request #3835: [WIP] Fix Data update issue with SCD/CDC for partition table when directlyWriteData2Hdfs is enabled

GitBox

Indhumathi27 opened a new pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835


   
    ### Why is this PR needed?
   
   
    ### What changes were proposed in this PR?
   
       
    ### Does this PR introduce any user interface change?
    - No
    - Yes. (please explain the change and update document)
   
    ### Is any new testcase added?
    - No
    - Yes
   
       
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3835: [WIP] Fix Data update issue with SCD/CDC for partition table when directlyWriteData2Hdfs is enabled

GitBox

CarbonDataQA1 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-656860644


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3354/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3835: [WIP] Fix Data update issue with SCD/CDC for partition table when directlyWriteData2Hdfs is enabled

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-656861252


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1613/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table when directlyWriteData2Hdfs is enabled

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657220725


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1615/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table when directlyWriteData2Hdfs is enabled

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657220952


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3356/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table when directlyWriteData2Hdfs is enabled

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657241951


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3357/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table when directlyWriteData2Hdfs is enabled

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657242838


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1616/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table when directlyWriteData2Hdfs is enabled

GitBox
In reply to this post by GitBox

akashrn5 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657399639


   @Indhumathi27
   1. I think, this is not specific to cdc, i think problem can come in case of normal partition table insert into flow
   also when enableDirectlyWriteDataToStorePath is true for writer. Please check that.
   
   2. In case of cdc , we set direct write to hdfs as false before doing insert, please check who made it enabled.
   
   if first point is true, you can change the PR description and remove cdc from there.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Indhumathi27 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table On spark cluster deploy mode

GitBox
In reply to this post by GitBox

Indhumathi27 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657410697


   @akashrn5 Fixed and updated the PR. PLease review and merge


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on a change in pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table On spark cluster deploy mode

GitBox
In reply to this post by GitBox

akashrn5 commented on a change in pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#discussion_r453499751



##########
File path: processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
##########
@@ -318,6 +335,13 @@ public void initializeWriter() throws CarbonDataWriterException {
     try {
       FileFactory.mkdirs(model.getCarbonDataDirectoryPath());
       if (enableDirectlyWriteDataToStorePath) {
+        if (model.getTableSpec().getCarbonTable().isHivePartitionTable() && model
+            .getCarbonDataDirectoryPath().endsWith("tmp")) {
+          // set carbonData file store path to partition path instead of tmp directory
+          carbonDataFileStorePath = model.getCarbonDataDirectoryPath()
+              .substring(0, model.getCarbonDataDirectoryPath().lastIndexOf("/")) + File.separator

Review comment:
       use `CarbonCommonConstans.FILE_SEPARATOR`

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -283,6 +284,13 @@ case class CarbonMergeDataSetCommand(
           val queue = new util.LinkedList[InternalRow]()
           override def hasNext: Boolean = if (!queue.isEmpty || iter.hasNext) true else {
             writer.close()
+            // revert load direct write to store path after IUD

Review comment:
       update the command as its not IUD exactly, its just an insert operation

##########
File path: processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
##########
@@ -281,12 +281,29 @@ protected void commitCurrentFile(boolean copyInCurrentThread) {
         LOGGER.error(e);
       }
     } else {
-      if (currentFileSize == 0) {
-        try {
+      try {
+        if (currentFileSize == 0) {
           handleEmptyDataFile(carbonDataFileStorePath);
-        } catch (IOException e) {
-          LOGGER.error(e);
+        } else {
+          // In case if direct write data to store path is enabled, carbonData files will be written
+          // directly to partition path. Need to add partition path and carbon data file size info
+          // to load metrics
+          if (model.getTableSpec().getCarbonTable().isHivePartitionTable() && model
+              .getCarbonDataDirectoryPath().endsWith("tmp") && carbonDataFileStorePath

Review comment:
       check for ends with ` .tmp` similar to carbonUtil APIs

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -283,6 +284,13 @@ case class CarbonMergeDataSetCommand(
           val queue = new util.LinkedList[InternalRow]()
           override def hasNext: Boolean = if (!queue.isEmpty || iter.hasNext) true else {
             writer.close()
+            // revert load direct write to store path after IUD
+            if (null == directlyWriteDataToHdfs) {

Review comment:
       no need of null check, use .getProperty(property_value, default value) , so u can check null check

##########
File path: processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
##########
@@ -440,9 +464,23 @@ protected void writeIndexFile() throws IOException, CarbonDataWriterException {
           .copyCarbonDataFileToCarbonStorePath(indexFileName, model.getCarbonDataDirectoryPath(),
               fileSizeInBytes, metrics);
       FileFactory.deleteFile(indexFileName);
+    } else if (model.getTableSpec().getCarbonTable().isHivePartitionTable() && model
+        .getCarbonDataDirectoryPath().endsWith("tmp")) {

Review comment:
       same as above

##########
File path: processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
##########
@@ -318,6 +335,13 @@ public void initializeWriter() throws CarbonDataWriterException {
     try {
       FileFactory.mkdirs(model.getCarbonDataDirectoryPath());
       if (enableDirectlyWriteDataToStorePath) {
+        if (model.getTableSpec().getCarbonTable().isHivePartitionTable() && model
+            .getCarbonDataDirectoryPath().endsWith("tmp")) {

Review comment:
       same as above,` .tmp`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table On spark cluster deploy mode

GitBox
In reply to this post by GitBox

akashrn5 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657441200


   @Indhumathi27 please update the PR description, no  need to cdc, can say partition issue when direct write to hdfs enabled, and in detailed description part can list issues.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table On spark cluster deploy mode

GitBox
In reply to this post by GitBox

Indhumathi27 commented on a change in pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#discussion_r453544294



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -283,6 +284,13 @@ case class CarbonMergeDataSetCommand(
           val queue = new util.LinkedList[InternalRow]()
           override def hasNext: Boolean = if (!queue.isEmpty || iter.hasNext) true else {
             writer.close()
+            // revert load direct write to store path after IUD

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table On spark cluster deploy mode

GitBox
In reply to this post by GitBox

Indhumathi27 commented on a change in pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#discussion_r453544523



##########
File path: processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
##########
@@ -318,6 +335,13 @@ public void initializeWriter() throws CarbonDataWriterException {
     try {
       FileFactory.mkdirs(model.getCarbonDataDirectoryPath());
       if (enableDirectlyWriteDataToStorePath) {
+        if (model.getTableSpec().getCarbonTable().isHivePartitionTable() && model
+            .getCarbonDataDirectoryPath().endsWith("tmp")) {
+          // set carbonData file store path to partition path instead of tmp directory
+          carbonDataFileStorePath = model.getCarbonDataDirectoryPath()
+              .substring(0, model.getCarbonDataDirectoryPath().lastIndexOf("/")) + File.separator

Review comment:
       done

##########
File path: processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
##########
@@ -318,6 +335,13 @@ public void initializeWriter() throws CarbonDataWriterException {
     try {
       FileFactory.mkdirs(model.getCarbonDataDirectoryPath());
       if (enableDirectlyWriteDataToStorePath) {
+        if (model.getTableSpec().getCarbonTable().isHivePartitionTable() && model
+            .getCarbonDataDirectoryPath().endsWith("tmp")) {

Review comment:
       done

##########
File path: processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
##########
@@ -440,9 +464,23 @@ protected void writeIndexFile() throws IOException, CarbonDataWriterException {
           .copyCarbonDataFileToCarbonStorePath(indexFileName, model.getCarbonDataDirectoryPath(),
               fileSizeInBytes, metrics);
       FileFactory.deleteFile(indexFileName);
+    } else if (model.getTableSpec().getCarbonTable().isHivePartitionTable() && model
+        .getCarbonDataDirectoryPath().endsWith("tmp")) {

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table On spark cluster deploy mode

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657512354


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1630/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Indhumathi27 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table On spark cluster deploy mode

GitBox
In reply to this post by GitBox

Indhumathi27 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657531694


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table On spark cluster deploy mode

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657549525


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3372/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table On spark cluster deploy mode

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657605258


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3377/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table On spark cluster deploy mode

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657606800


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1636/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Indhumathi27 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table On spark cluster deploy mode

GitBox
In reply to this post by GitBox

Indhumathi27 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657607506


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3835: [CARBONDATA-3852] [CARBONDATA-3851] Fix Data update issue with SCD/CDC for partition table On spark cluster deploy mode

GitBox
In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3835:
URL: https://github.com/apache/carbondata/pull/3835#issuecomment-657681240


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3380/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


12