[GitHub] [carbondata] ajantha-bhat opened a new pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat opened a new pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
ajantha-bhat opened a new pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694
 
 
    ### Why is this PR needed?
   For insertStageCommand, spark is reusing the internalRow as two times we transform from RDD[InternalRow] -> dataframe -> logical Plan -> RDD[InternalRow]. So, same data is inserted on other rows
   
    ### What changes were proposed in this PR?
   Copy the internalRow after the last transform.
       
    ### Does this PR introduce any user interface change?
    - No
   
    ### Is any new testcase added?
    - Yes (added validation)
   
     

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
ajantha-bhat commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#issuecomment-608491472
 
 
   @marchpure @QiangCai : please check

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#issuecomment-608528338
 
 
   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/922/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#issuecomment-608528420
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2631/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#issuecomment-608599414
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2634/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#issuecomment-608603673
 
 
   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/925/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] QiangCai commented on a change in pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
QiangCai commented on a change in pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#discussion_r405359588
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/management/CommonLoadUtils.scala
 ##########
 @@ -767,7 +773,16 @@ object CommonLoadUtils {
             internalRow.getInt(index) + DateDirectDictionaryGenerator.cutOffDate)
         }
       }
-      internalRow
+      if (isInsertFromStageCommand) {
+        // Insert stage command, logical plan already consist of LogicalRDD of internalRow.
+        // When it is converted to DataFrame, spark is reusing the same internalRow.
+        // So, need to have a copy before the last transformation.
+        // TODO: Even though copying internalRow is faster, we should avoid it
+        //  by finding a better way
+        internalRow.copy()
+      } else {
+        internalRow
+      }
 
 Review comment:
   How about moving code block to line 750?
   Because this change faced NullPointerException in some scenarios.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#discussion_r405363099
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/management/CommonLoadUtils.scala
 ##########
 @@ -767,7 +773,16 @@ object CommonLoadUtils {
             internalRow.getInt(index) + DateDirectDictionaryGenerator.cutOffDate)
         }
       }
-      internalRow
+      if (isInsertFromStageCommand) {
+        // Insert stage command, logical plan already consist of LogicalRDD of internalRow.
+        // When it is converted to DataFrame, spark is reusing the same internalRow.
+        // So, need to have a copy before the last transformation.
+        // TODO: Even though copying internalRow is faster, we should avoid it
+        //  by finding a better way
+        internalRow.copy()
+      } else {
+        internalRow
+      }
 
 Review comment:
   @QiangCai: CI all test case run. when we can get NPE ?
   Shall I just add null check in the same line 782 ?  Becuase moving to line 750, how it can avoid NPE ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#discussion_r405363099
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/management/CommonLoadUtils.scala
 ##########
 @@ -767,7 +773,16 @@ object CommonLoadUtils {
             internalRow.getInt(index) + DateDirectDictionaryGenerator.cutOffDate)
         }
       }
-      internalRow
+      if (isInsertFromStageCommand) {
+        // Insert stage command, logical plan already consist of LogicalRDD of internalRow.
+        // When it is converted to DataFrame, spark is reusing the same internalRow.
+        // So, need to have a copy before the last transformation.
+        // TODO: Even though copying internalRow is faster, we should avoid it
+        //  by finding a better way
+        internalRow.copy()
+      } else {
+        internalRow
+      }
 
 Review comment:
   @QiangCai: CI all test case run. when we can get NPE ?
   Shall I just add null check in the same line 782 ?  Becuase moving to line 750, how it can avoid NPE ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#discussion_r405381954
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/management/CommonLoadUtils.scala
 ##########
 @@ -767,7 +773,16 @@ object CommonLoadUtils {
             internalRow.getInt(index) + DateDirectDictionaryGenerator.cutOffDate)
         }
       }
-      internalRow
+      if (isInsertFromStageCommand) {
+        // Insert stage command, logical plan already consist of LogicalRDD of internalRow.
+        // When it is converted to DataFrame, spark is reusing the same internalRow.
+        // So, need to have a copy before the last transformation.
+        // TODO: Even though copying internalRow is faster, we should avoid it
+        //  by finding a better way
+        internalRow.copy()
+      } else {
+        internalRow
+      }
 
 Review comment:
   ok. Moved.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#issuecomment-610929848
 
 
   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/971/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#issuecomment-610936219
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2683/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
ajantha-bhat commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#issuecomment-611339438
 
 
   @QiangCai : PR is ready. Please check

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] QiangCai commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
QiangCai commented on issue #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694#issuecomment-611367589
 
 
   LGTM

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] asfgit closed pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command

GitBox
In reply to this post by GitBox
asfgit closed pull request #3694: [CARBONDATA-3763] Fix wrong insert result during insert stage command
URL: https://github.com/apache/carbondata/pull/3694
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services