In every query, CarbonData has to scan all the segment files. This may take too much time.

Posted by areyouokfreejoe on Jul 14, 2021; 8:03am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/In-every-query-carbondata-has-to-scan-all-the-segment-file-This-may-takes-too-much-time-tp108997.html

In every query, CarbonData has to scan all the segment files.
So when there are too many segments, it takes too long to get all the file info.
The customer hopes the community can solve this.
When no segment has changed, CarbonData should not scan all the segment files.
This is the call stack:

at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1641)
at org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.<init>(AbstractDFSCarbonFile.java:77)
at org.apache.carbondata.core.datastore.filesystem.HDFSCarbonFile.<init>(HDFSCarbonFile.java:44)
at org.apache.carbondata.spark.acl.filesystem.HDFSACLCarbonFile.<init>(HDFSACLCarbonFile.java:46)
at org.apache.carbondata.spark.acl.ACLFileFactory.getCarbonFile(ACLFileFactory.java:48)
at org.apache.carbondata.core.datastore.impl.FileFactory.getCarbonFile(FileFactory.java:167)
at org.apache.carbondata.core.readcommitter.TableStatusReadCommittedScope.getCommittedSegmentRefreshInfo(TableStatusReadCommittedScope.java:97)
at org.apache.carbondata.core.datamap.Segment.getSegmentRefreshInfo(Segment.java:177)
at org.apache.carbondata.core.datamap.DataMapStoreManager$TableSegmentRefresher.isRefreshNeeded(DataMapStoreManager.java:772)
at org.apache.carbondata.core.datamap.DataMapStoreManager.getSegmentsToBeRefreshed(DataMapStoreManager.java:505)
at org.apache.carbondata.core.datamap.DataMapStoreManager.refreshSegmentCacheIfRequired(DataMapStoreManager.java:519)
at org.apache.carbondata.hadoop.api.CarbonTableInputFormat.getSplits(CarbonTableInputFormat.java:465)
at org.apache.carbondata.hadoop.api.CarbonTableInputFormat.getSplits(CarbonTableInputFormat.java:199)
at org.apache.carbondata.spark.rdd.CarbonScanRDD.internalGetPartitions(CarbonScanRDD.scala:170)
at org.apache.carbondata.spark.rdd.CarbonRDD.getPartitions(CarbonRDD.scala:68)
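
To make the request concrete, here is a minimal sketch of the kind of short-circuit being asked for. It is not CarbonData code: the class `SegmentRefreshGate`, its methods, and the lookup function are all hypothetical names for illustration. It also assumes that any segment change updates the modification time of the table status file, which would need to be confirmed against the real write path. The idea is to do one cheap check per query and only fall back to the per-segment lookups (the `getFileStatus` calls in the stack above) when that check shows a change.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

/** Hypothetical sketch only; none of these names exist in CarbonData. */
class SegmentRefreshGate {
    // Modification time of the table status file seen on the previous query.
    private long lastTableStatusMtime = -1L;
    // Last known modification time of each segment file.
    private final Map<String, Long> segmentMtimes = new HashMap<>();
    // Stand-in for the expensive per-segment file status lookup.
    private final ToLongFunction<String> segmentMtimeLookup;
    private int lookupCount = 0;

    SegmentRefreshGate(ToLongFunction<String> segmentMtimeLookup) {
        this.segmentMtimeLookup = segmentMtimeLookup;
    }

    /** Returns the segments whose cached info should be refreshed. */
    List<String> segmentsToRefresh(long tableStatusMtime, List<String> segments) {
        List<String> stale = new ArrayList<>();
        // Assumption: an unchanged table status mtime means no segment changed,
        // so the per-segment lookups can be skipped entirely.
        if (tableStatusMtime == lastTableStatusMtime) {
            return stale;
        }
        for (String segment : segments) {
            lookupCount++;
            long mtime = segmentMtimeLookup.applyAsLong(segment);
            Long previous = segmentMtimes.put(segment, mtime);
            if (previous == null || previous != mtime) {
                stale.add(segment);
            }
        }
        lastTableStatusMtime = tableStatusMtime;
        return stale;
    }

    /** Number of per-segment lookups performed so far. */
    int lookupCount() {
        return lookupCount;
    }
}
```

With a gate like this, repeated queries against an unchanged table would cost one status check instead of one per segment. Whether the table status modification time is a reliable enough signal is for the community to judge.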