[GitHub] [carbondata] QiangCai opened a new pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

QiangCai opened a new pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757


    ### Why is this PR needed?
    Data load jobs are missing output metrics. See JIRA CARBONDATA-3812 for details.
   
    ### What changes were proposed in this PR?
    1. Refactor OutputFilesInfoHolder into DataLoadMetrics.
    2. Add metrics: numOutputBytes and numOutputRows.
       
    ### Does this PR introduce any user interface change?
    - No
   
    ### Is any new testcase added?
    - No
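
The two proposed changes could look roughly like the following Java sketch. It is not the class from the PR: the name `DataLoadMetricsSketch`, the `outputFiles` field, and every method name are assumptions made for illustration. Only the counter names `numOutputBytes` and `numOutputRows` come from the description above.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for per-task load results; not the PR's DataLoadMetrics. */
class DataLoadMetricsSketch implements Serializable {
  private static final long serialVersionUID = 1L;

  // File-level information of the kind the old holder carried (field name assumed).
  private final List<String> outputFiles = new ArrayList<>();

  // The two counters named in the PR description.
  private long numOutputBytes;
  private long numOutputRows;

  /** Records one written file and adds its size to the byte counter. */
  synchronized void addOutputFile(String fileName, long bytes) {
    outputFiles.add(fileName);
    numOutputBytes += bytes;
  }

  /** Adds a batch of written rows to the row counter. */
  synchronized void addOutputRows(long rows) {
    numOutputRows += rows;
  }

  synchronized long getNumOutputBytes() {
    return numOutputBytes;
  }

  synchronized long getNumOutputRows() {
    return numOutputRows;
  }

  /** Returns a copy so callers cannot modify the internal list. */
  synchronized List<String> getOutputFiles() {
    return new ArrayList<>(outputFiles);
  }
}
```

Keeping the file list and the counters in one serializable object means a single instance can travel with the task and be read once the write has finished.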
   
       
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

CarbonDataQA1 commented on pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#issuecomment-626102156


   Build failed with Spark 2.4.5. Please check CI: http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1263/
   



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

CarbonDataQA1 commented on pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#issuecomment-626102265


   Build failed with Spark 2.3.4. Please check CI: http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2981/
   



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

CarbonDataQA1 commented on pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#issuecomment-626116304


   Build failed with Spark 2.4.5. Please check CI: http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1264/
   



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

CarbonDataQA1 commented on pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#issuecomment-626116407


   Build failed with Spark 2.3.4. Please check CI: http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2982/
   



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

CarbonDataQA1 commented on pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#issuecomment-626134751


   Build succeeded with Spark 2.3.4. Please check CI: http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2983/
   



[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

CarbonDataQA1 commented on pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#issuecomment-626134915


   Build succeeded with Spark 2.4.5. Please check CI: http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1265/
   



[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

ajantha-bhat commented on a change in pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#discussion_r422476618



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/NewCarbonDataLoadRDD.scala
##########
@@ -316,6 +319,7 @@ class NewDataFrameLoaderRDD[K, V](
             carbonLoadModel.getTableName,
             carbonLoadModel.getSegment.getSegmentNo))
         executor.execute(model, loader.storeLocation, recordReaders.toArray)
+        executor.close()

Review comment:
       Good catch.
   
   But it would be better to add it inside a task completion listener. Refer to `UpdateDataLoad.scala`, line 70.





[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

ajantha-bhat commented on a change in pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#discussion_r422476654



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/NewCarbonDataLoadRDD.scala
##########
@@ -160,6 +161,7 @@ class NewCarbonDataLoadRDD[K, V](
         executor.execute(model,
           loader.storeLocation,
           recordReaders)
+        executor.close()

Review comment:
       Good catch.
   
   But it would be better to add it inside a task completion listener. Refer to `UpdateDataLoad.scala`, line 70.





[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

ajantha-bhat commented on a change in pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#discussion_r422476774



##########
File path: core/src/main/java/org/apache/carbondata/core/util/DataLoadMetrics.java
##########
@@ -21,10 +21,10 @@
 import java.util.ArrayList;
 import java.util.List;
 
-public class OutputFilesInfoHolder implements Serializable {
-
-  private static final long serialVersionUID = -1401375818456585241L;
-
+/**
+ * store data loading metrics
+ */
+public class DataLoadMetrics implements Serializable {

Review comment:
       I didn't call it metrics initially because it also holds fileNames, the partition path, and so on.
   Do you think "metrics" is more suitable?





[GitHub] [carbondata] QiangCai commented on a change in pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

QiangCai commented on a change in pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#discussion_r422760244



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/NewCarbonDataLoadRDD.scala
##########
@@ -316,6 +319,7 @@ class NewDataFrameLoaderRDD[K, V](
             carbonLoadModel.getTableName,
             carbonLoadModel.getSegment.getSegmentNo))
         executor.execute(model, loader.storeLocation, recordReaders.toArray)
+        executor.close()

Review comment:
       It is already added in the listener, but we need to invoke it before uploading the metrics.
   

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/NewCarbonDataLoadRDD.scala
##########
@@ -160,6 +161,7 @@ class NewCarbonDataLoadRDD[K, V](
         executor.execute(model,
           loader.storeLocation,
           recordReaders)
+        executor.close()

Review comment:
       It is already added in the listener, but we need to invoke it before uploading the metrics.
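
The ordering described in these two replies can be sketched in plain Java with no Spark dependency. Everything here is a hypothetical stand-in: `SketchExecutor`, `SketchTask`, and the listener list are not CarbonData or Spark classes. The sketch also assumes that the writer only knows its final byte count once `close()` has flushed its buffer, which the thread implies but does not state.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical writer whose byte count is final only after close(). */
class SketchExecutor implements AutoCloseable {
  private long buffered;
  private long written;
  private boolean closed;

  /** Buffers data; nothing is counted as written yet. */
  void execute(long bytes) {
    buffered += bytes;
  }

  long getBytesWritten() {
    return written;
  }

  /** Flushes the buffer. Idempotent, so a later listener call is harmless. */
  @Override
  public void close() {
    if (closed) {
      return;
    }
    written += buffered;
    buffered = 0;
    closed = true;
  }
}

/** Hypothetical task body showing close() before the metrics are read. */
class SketchTask {
  static long run(long bytes) {
    // Stand-in for a task completion listener registry (an assumption).
    List<Runnable> completionListeners = new ArrayList<>();
    SketchExecutor executor = new SketchExecutor();
    // Safety net: the listener still closes the executor if execute() throws.
    completionListeners.add(executor::close);
    try {
      executor.execute(bytes);
      // Explicit close so the count is final before it is reported.
      executor.close();
      return executor.getBytesWritten();
    } finally {
      for (Runnable listener : completionListeners) {
        listener.run();
      }
    }
  }
}
```

With an idempotent `close()`, both suggestions can coexist: the listener covers the failure path, and the explicit call makes the counts final on the success path.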





[GitHub] [carbondata] QiangCai commented on a change in pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

QiangCai commented on a change in pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#discussion_r422761061



##########
File path: core/src/main/java/org/apache/carbondata/core/util/DataLoadMetrics.java
##########
@@ -21,10 +21,10 @@
 import java.util.ArrayList;
 import java.util.List;
 
-public class OutputFilesInfoHolder implements Serializable {
-
-  private static final long serialVersionUID = -1401375818456585241L;
-
+/**
+ * store data loading metrics
+ */
+public class DataLoadMetrics implements Serializable {

Review comment:
       Yes. For the Hadoop framework, we collect them and put them into the task message; for the Spark framework, we collect them and put them into the task metrics.
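
As an illustration only, the idea of one set of collected values reported through two framework-specific paths might be sketched as below. `MetricsSink`, `TaskMessageSink`, and `TaskMetricsSink` are hypothetical names, the message format is made up, and no Hadoop or Spark API is used; the real reporting code lives in the PR.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical reporting target for the collected load values. */
interface MetricsSink {
  void report(long numOutputBytes, long numOutputRows);
}

/** Stand-in for the Hadoop path: values are packed into a task message string. */
class TaskMessageSink implements MetricsSink {
  String message = "";

  @Override
  public void report(long numOutputBytes, long numOutputRows) {
    // The key=value format is an assumption for this sketch.
    message = "numOutputBytes=" + numOutputBytes + ",numOutputRows=" + numOutputRows;
  }
}

/** Stand-in for the Spark path: values are stored as named task metrics. */
class TaskMetricsSink implements MetricsSink {
  final Map<String, Long> metrics = new HashMap<>();

  @Override
  public void report(long numOutputBytes, long numOutputRows) {
    metrics.put("numOutputBytes", numOutputBytes);
    metrics.put("numOutputRows", numOutputRows);
  }
}
```

The collection step stays framework-neutral in this arrangement; only the final hand-off differs between the two paths.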





[GitHub] [carbondata] ajantha-bhat commented on pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

ajantha-bhat commented on pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#issuecomment-626470360


   LGTM



[GitHub] [carbondata] ajantha-bhat removed a comment on pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

ajantha-bhat removed a comment on pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#issuecomment-626470360


   LGTM



[GitHub] [carbondata] ajantha-bhat commented on pull request #3757: [CARBONDATA-3812] Set output metrics for data load spark job

GitBox

ajantha-bhat commented on pull request #3757:
URL: https://github.com/apache/carbondata/pull/3757#issuecomment-626471483


   OK. LGTM

