Posted by
GitBox on
Nov 18, 2020; 7:07pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/GitHub-carbondata-VenuReddy2103-opened-a-new-pull-request-4010-CARBONDATA-4050-Avoid-redundant-RPC-cr-tp103268.html
VenuReddy2103 opened a new pull request #4010:
URL:
https://github.com/apache/carbondata/pull/4010
### Why is this PR needed?
In the createCarbonDataFileBlockMetaInfoMapping method, we get the list of carbondata files in the segment, loop through all of them, and build the map fileNameToMetaInfoMapping<path-string, BlockMetaInfo>.
In that loop, if the file is of type AbstractDFSCarbonFile, we fetch the org.apache.hadoop.fs.FileStatus three times for each file. Each fetch is an RPC call (fileSystem.getFileStatus(path)) that takes ~2ms in the cluster, so the overhead is ~6ms per file. As a result, driver-side query processing time grows significantly with the number of carbon files, which degraded TPC-DS query performance.
### What changes were proposed in this PR?
Avoid redundant RPC calls to get the file status in the getAbsolutePath(), getSize() and getLocations() methods when the CarbonFile is instantiated through the constructor that takes a fileStatus.
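
The idea can be illustrated with a minimal, self-contained sketch. It does not reproduce the actual CarbonData or Hadoop classes: `Status`, `CountingFileSystem`, `DfsFile` and `rpcCallsFor` are hypothetical stand-ins invented for this example, and the "RPC" is only a counter. The sketch assumes that a file object built from an already fetched status can answer the three getters from that status, while a file object built from a path alone has to ask the file system each time.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FileStatusCachingSketch {

    /** Hypothetical stand-in for org.apache.hadoop.fs.FileStatus. */
    static final class Status {
        final String path;
        final long length;
        final String[] hosts;

        Status(String path, long length, String[] hosts) {
            this.path = path;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Hypothetical stand-in for a remote file system; counts simulated RPCs. */
    static final class CountingFileSystem {
        final AtomicInteger rpcCalls = new AtomicInteger();

        Status getFileStatus(String path) {
            rpcCalls.incrementAndGet(); // each call models one round trip
            return new Status(path, 1024L, new String[] {"host-1"});
        }
    }

    /** Hypothetical file wrapper (not the real AbstractDFSCarbonFile). */
    static final class DfsFile {
        private final CountingFileSystem fs;
        private final String path;
        private final Status knownStatus; // null when only the path was given

        DfsFile(CountingFileSystem fs, String path) {
            this.fs = fs;
            this.path = path;
            this.knownStatus = null;
        }

        DfsFile(CountingFileSystem fs, Status status) {
            this.fs = fs;
            this.path = status.path;
            this.knownStatus = status;
        }

        // Reuse the status passed to the constructor; fetch only if absent.
        private Status status() {
            return knownStatus != null ? knownStatus : fs.getFileStatus(path);
        }

        String getAbsolutePath() { return status().path; }
        long getSize() { return status().length; }
        String[] getLocations() { return status().hosts; }
    }

    /** Calls the three getters once and returns the number of simulated RPCs. */
    static int rpcCallsFor(boolean constructedWithStatus) {
        CountingFileSystem fs = new CountingFileSystem();
        String path = "/segment/part-0.carbondata"; // made-up path
        DfsFile file = constructedWithStatus
            ? new DfsFile(fs, new Status(path, 1024L, new String[] {"host-1"}))
            : new DfsFile(fs, path);
        file.getAbsolutePath();
        file.getSize();
        file.getLocations();
        return fs.rpcCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("path constructor:   " + rpcCallsFor(false) + " RPCs");
        System.out.println("status constructor: " + rpcCallsFor(true) + " RPCs");
    }
}
```

In this sketch the path-based object triggers one simulated call per getter, which matches the three calls per file described above, while the status-based object triggers none. At the ~2ms per call quoted above, that difference corresponds to the ~6ms per file the PR aims to save.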
### Does this PR introduce any user interface change?
- No
### Is any new testcase added?
- No
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[hidden email]