[GitHub] [carbondata] VenuReddy2103 opened a new pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] VenuReddy2103 opened a new pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox

VenuReddy2103 opened a new pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010


    ### Why is this PR needed?
    In createCarbonDataFileBlockMetaInfoMapping method, we get list of carbondata files in the segment, loop through all the carbon files and make a map of fileNameToMetaInfoMapping<path-string, BlockMetaInfo>
   
         In that carbon files loop, if the file is of AbstractDFSCarbonFile type, we get the org.apache.hadoop.fs.FileStatus thrice for each file. And the method to get file status is an RPC call(fileSystem.getFileStatus(path)). It takes ~2ms in the cluster for each call. Thus, incurs an overhead of ~6ms per file. So overall driver side query processing time has increased significantly when there are more carbon files. Hence caused TPC-DS queries performance degradation.
   
    ### What changes were proposed in this PR?
   Avoided redundant RPC calls to get file status in getAbsolutePath(), getSize() and getLocations() methods when CarbonFile is instantiated with fileStatus constructor
       
    ### Does this PR introduce any user interface change?
    - No
   
    ### Is any new testcase added?
    - No
   
       
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] VenuReddy2103 commented on pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox

VenuReddy2103 commented on pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#issuecomment-730136701


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

jackylk commented on a change in pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#discussion_r526620141



##########
File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java
##########
@@ -541,7 +547,10 @@ public boolean createNewLockFile() throws IOException {
   @Override
   public String[] getLocations() throws IOException {
     BlockLocation[] blkLocations;
-    FileStatus fileStatus = fileSystem.getFileStatus(path);
+    FileStatus fileStatus = this.fileStatus;

Review comment:
       How about the ` fileSystem.getFileStatus(path);` in line 533, it can be changed also




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

jackylk commented on a change in pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#discussion_r526620493



##########
File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java
##########
@@ -541,7 +547,10 @@ public boolean createNewLockFile() throws IOException {
   @Override
   public String[] getLocations() throws IOException {
     BlockLocation[] blkLocations;
-    FileStatus fileStatus = fileSystem.getFileStatus(path);
+    FileStatus fileStatus = this.fileStatus;
+    if (fileStatus == null) {
+      fileStatus = fileSystem.getFileStatus(path);

Review comment:
       one file status is got, we can cache it in the class field member




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] VenuReddy2103 commented on a change in pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

VenuReddy2103 commented on a change in pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#discussion_r526656764



##########
File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java
##########
@@ -541,7 +547,10 @@ public boolean createNewLockFile() throws IOException {
   @Override
   public String[] getLocations() throws IOException {
     BlockLocation[] blkLocations;
-    FileStatus fileStatus = fileSystem.getFileStatus(path);
+    FileStatus fileStatus = this.fileStatus;
+    if (fileStatus == null) {
+      fileStatus = fileSystem.getFileStatus(path);

Review comment:
       I think, as mentioned in PR #2465, we shouldn't get once and cache the filestatus in class field member. So have fixed to use the existing fileStatus from class field only if available (i.e., when Carbon file is instantiated with a `FileStatus` constructor `AbstractDFSCarbonFile(FileStatus fileStatus)`). This is particularly the case where we have obtained all the filestatuses in the segment and create carbonfiles with filestatus argument constructor. It happens only from implementation of
   `protected abstract CarbonFile[] getFiles(FileStatus[] listStatus);`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] VenuReddy2103 commented on a change in pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

VenuReddy2103 commented on a change in pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#discussion_r526656764



##########
File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java
##########
@@ -541,7 +547,10 @@ public boolean createNewLockFile() throws IOException {
   @Override
   public String[] getLocations() throws IOException {
     BlockLocation[] blkLocations;
-    FileStatus fileStatus = fileSystem.getFileStatus(path);
+    FileStatus fileStatus = this.fileStatus;
+    if (fileStatus == null) {
+      fileStatus = fileSystem.getFileStatus(path);

Review comment:
       I think, as mentioned in PR #2465, we shouldn't get once and cache the filestatus in class field member. So have fixed to use the existing fileStatus from class field only if available (i.e., when Carbon file is instantiated with a `FileStatus` constructor `AbstractDFSCarbonFile(FileStatus fileStatus)`). This is particularly the case where we have obtained all the filestatuses in the segment and create carbonfiles with filestatus argument constructor. It happens only from implementation of `protected abstract CarbonFile[] getFiles(FileStatus[] listStatus);`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#issuecomment-730514567


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

ajantha-bhat commented on pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#issuecomment-731244157


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#issuecomment-731302363


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3079/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#issuecomment-731306846


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4835/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] VenuReddy2103 commented on a change in pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

VenuReddy2103 commented on a change in pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#discussion_r529192088



##########
File path: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java
##########
@@ -541,7 +547,10 @@ public boolean createNewLockFile() throws IOException {
   @Override
   public String[] getLocations() throws IOException {
     BlockLocation[] blkLocations;
-    FileStatus fileStatus = fileSystem.getFileStatus(path);
+    FileStatus fileStatus = this.fileStatus;

Review comment:
       Please refer to reply in below comment.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] kunal642 commented on pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

kunal642 commented on pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#issuecomment-737215345


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#issuecomment-737277105


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/5028/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#issuecomment-737283454


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3272/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] kunal642 commented on pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

kunal642 commented on pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010#issuecomment-737674089


   LGTM


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] asfgit closed pull request #4010: [CARBONDATA-4050]Avoid redundant RPC calls to get file status when CarbonFile is instantiated with fileStatus constructor

GitBox
In reply to this post by GitBox

asfgit closed pull request #4010:
URL: https://github.com/apache/carbondata/pull/4010


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]