Apache CarbonData Dev Mailing List archive

On every query, CarbonData has to scan all the segment files. This may take too much time.

4 messages

areyouokfreejoe

Jul 14, 2021; 8:03am

On every query, CarbonData has to scan all the segment files.
When a table has many segments, collecting the file information for all of them takes too long.
The customer hopes the community can solve this.
When no segment has changed, CarbonData should not scan all the segment files.
This is the call stack:

at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1641)
at org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.<init>(AbstractDFSCarbonFile.java:77)
at org.apache.carbondata.core.datastore.filesystem.HDFSCarbonFile.<init>(HDFSCarbonFile.java:44)
at org.apache.carbondata.spark.acl.filesystem.HDFSACLCarbonFile.<init>(HDFSACLCarbonFile.java:46)
at org.apache.carbondata.spark.acl.ACLFileFactory.getCarbonFile(ACLFileFactory.java:48)
at org.apache.carbondata.core.datastore.impl.FileFactory.getCarbonFile(FileFactory.java:167)
at org.apache.carbondata.core.readcommitter.TableStatusReadCommittedScope.getCommittedSegmentRefreshInfo(TableStatusReadCommittedScope.java:97)
at org.apache.carbondata.core.datamap.Segment.getSegmentRefreshInfo(Segment.java:177)
at org.apache.carbondata.core.datamap.DataMapStoreManager$TableSegmentRefresher.isRefreshNeeded(DataMapStoreManager.java:772)
at org.apache.carbondata.core.datamap.DataMapStoreManager.getSegmentsToBeRefreshed(DataMapStoreManager.java:505)
at org.apache.carbondata.core.datamap.DataMapStoreManager.refreshSegmentCacheIfRequired(DataMapStoreManager.java:519)
at org.apache.carbondata.hadoop.api.CarbonTableInputFormat.getSplits(CarbonTableInputFormat.java:465)
at org.apache.carbondata.hadoop.api.CarbonTableInputFormat.getSplits(CarbonTableInputFormat.java:199)
at org.apache.carbondata.spark.rdd.CarbonScanRDD.internalGetPartitions(CarbonScanRDD.scala:170)
at org.apache.carbondata.spark.rdd.CarbonRDD.getPartitions(CarbonRDD.scala:68)

areyouokfreejoe

Jul 14, 2021; 9:56am

Re: On every query, CarbonData has to scan all the segment files. This may take too much time.

I think there could be a file named LAST_MODIFY.
It would contain the last update time of the segment files.
When CarbonData tries to refresh the segment cache and finds that the update time in LAST_MODIFY is the same as the cached one, there is no need to refresh all the segment files.
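
The proposal above could be sketched roughly as follows. This is only an illustration of the idea, not CarbonData code: the class name `LastModifyMarker`, its methods, and the choice to store the timestamp as plain text inside the file are all assumptions made for the example, and it uses the local file system through `java.nio.file` rather than HDFS.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper illustrating the LAST_MODIFY idea; not part of CarbonData. */
final class LastModifyMarker {
    private final Path marker;        // path of the LAST_MODIFY file (assumed layout)
    private long cachedMillis = -1L;  // update time seen at the last cache refresh

    LastModifyMarker(Path marker) {
        this.marker = marker;
    }

    /** Writer side: record the time of the latest segment file change. */
    void recordUpdate(long millis) throws IOException {
        Files.write(marker, Long.toString(millis).getBytes(StandardCharsets.UTF_8));
    }

    /** Reader side: one small read replaces a status call per segment file. */
    boolean isRefreshNeeded() throws IOException {
        if (!Files.exists(marker)) {
            return true; // no marker yet, so fall back to a full refresh
        }
        String text = new String(Files.readAllBytes(marker), StandardCharsets.UTF_8).trim();
        long current = Long.parseLong(text);
        if (current == cachedMillis) {
            return false; // nothing changed since the last refresh
        }
        cachedMillis = current;
        return true;
    }
}
```

With this shape, a query would call `isRefreshNeeded()` once and only walk the segment files when it returns true. Every writer would have to call `recordUpdate` after changing a segment file, otherwise readers would keep a stale cache.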

vikramahuja1001

Jul 19, 2021; 5:06am

Re: On every query, CarbonData has to scan all the segment files. This may take too much time.

Hi,
This issue has already been fixed. A segment is no longer refreshed in the cache if its segment file name has not changed.
Please find the solution in the following PR: https://github.com/apache/carbondata/pull/3988
Kindly check the changes in the TableStatusReadCommittedScope.java class.

Thanks
Vikram Ahuja
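
The behaviour described in this reply (skip the refresh when the segment file name is unchanged) can be illustrated with a minimal sketch. This is not the code from the PR: the class `SegmentFileNameCache`, its method, and the sample file names are hypothetical and chosen only to show the comparison.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical illustration of a name-based refresh check; not the PR's code. */
final class SegmentFileNameCache {
    // segment id -> segment file name seen at the last refresh
    private final Map<String, String> lastSeen = new HashMap<>();

    /**
     * Returns true when the segment is new to the cache or its segment file
     * name differs from the cached one. The comparison uses names only, so
     * no file system call is made per segment.
     */
    boolean isRefreshNeeded(String segmentId, String segmentFileName) {
        boolean seen = lastSeen.containsKey(segmentId);
        String previous = lastSeen.put(segmentId, segmentFileName);
        return !seen || !Objects.equals(previous, segmentFileName);
    }
}
```

For example, with made-up names, checking segment `"0"` with `"0_1000.segment"` twice would report a refresh only the first time, and a later check with `"0_2000.segment"` would report a refresh again. The sketch assumes the segment file name changes whenever the segment's contents change.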

jiayi_wang

Jul 29, 2021; 2:26am

Re: On every query, CarbonData has to scan all the segment files. This may take too much time.

Hi Vikram, I would like to contribute to the community by working on Spark 3.1.1 support in CarbonData. First of all, I would like to get in touch with you.
I have sent you several emails but have not received a response yet.
How can I communicate with you all? Is there perhaps a Slack link?