In every query, CarbonData has to scan all the segment files. This may take too much time.

areyouokfreejoe
In every query, CarbonData has to scan all the segment files.
So when there are too many segments, it takes too much time to get all the file info.
The customer hopes the community can solve this.
When no segment has changed, CarbonData should not scan all the segment files.
This is the call stack:

at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1641)
at org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.<init>(AbstractDFSCarbonFile.java:77)
at org.apache.carbondata.core.datastore.filesystem.HDFSCarbonFile.<init>(HDFSCarbonFile.java:44)
at org.apache.carbondata.spark.acl.filesystem.HDFSACLCarbonFile.<init>(HDFSACLCarbonFile.java:46)
at org.apache.carbondata.spark.acl.ACLFileFactory.getCarbonFile(ACLFileFactory.java:48)
at org.apache.carbondata.core.datastore.impl.FileFactory.getCarbonFile(FileFactory.java:167)
at org.apache.carbondata.core.readcommitter.TableStatusReadCommittedScope.getCommittedSegmentRefreshInfo(TableStatusReadCommittedScope.java:97)
at org.apache.carbondata.core.datamap.Segment.getSegmentRefreshInfo(Segment.java:177)
at org.apache.carbondata.core.datamap.DataMapStoreManager$TableSegmentRefresher.isRefreshNeeded(DataMapStoreManager.java:772)
at org.apache.carbondata.core.datamap.DataMapStoreManager.getSegmentsToBeRefreshed(DataMapStoreManager.java:505)
at org.apache.carbondata.core.datamap.DataMapStoreManager.refreshSegmentCacheIfRequired(DataMapStoreManager.java:519)
at org.apache.carbondata.hadoop.api.CarbonTableInputFormat.getSplits(CarbonTableInputFormat.java:465)
at org.apache.carbondata.hadoop.api.CarbonTableInputFormat.getSplits(CarbonTableInputFormat.java:199)
at org.apache.carbondata.spark.rdd.CarbonScanRDD.internalGetPartitions(CarbonScanRDD.scala:170)
at org.apache.carbondata.spark.rdd.CarbonRDD.getPartitions(CarbonRDD.scala:68)

Re: In every query, CarbonData has to scan all the segment files. This may take too much time.

areyouokfreejoe
I think there could be a file named LAST_MODIFY.
It would contain the last update time of the segment files.
When Carbon tries to refresh the segment cache and finds that the update time in LAST_MODIFY is the same as the one held in the cache, there is no need to refresh all the segment files.
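
The proposal above could be sketched roughly as follows. This is only an illustration of the idea, not CarbonData code: the class name LastModifyGate and the markerReader callback are hypothetical, and the callback stands in for whatever single cheap read would fetch the timestamp stored in the proposed LAST_MODIFY file.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch of the LAST_MODIFY proposal.
 * None of these names exist in CarbonData.
 */
public class LastModifyGate {
  // Assumption: returns the timestamp stored in the proposed LAST_MODIFY file.
  // One lookup here replaces one file-status call per segment.
  private final LongSupplier markerReader;

  // Timestamp observed at the last refresh; -1 means "never refreshed".
  private long cachedTimestamp = -1L;

  public LastModifyGate(LongSupplier markerReader) {
    this.markerReader = markerReader;
  }

  /**
   * Returns true when the marker differs from the cached value, that is,
   * when the segment files have to be rescanned. The cached value is
   * updated on the assumption that the caller then performs the rescan.
   */
  public synchronized boolean isRefreshNeeded() {
    long current = markerReader.getAsLong();
    if (current == cachedTimestamp) {
      return false; // nothing changed since the last refresh
    }
    cachedTimestamp = current;
    return true;
  }
}
```

With a gate like this, the first query after startup would trigger a full scan, and later queries would skip it until a data load or update rewrites the marker. The sketch leaves open how writers would keep LAST_MODIFY up to date.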

Re: In every query, CarbonData has to scan all the segment files. This may take too much time.

vikramahuja1001
Hi,
This issue has already been fixed. Segments are not refreshed in the cache if the segment file name has not changed.
Please find the solution in the following PR: https://github.com/apache/carbondata/pull/3988
Kindly check the changes in the TableStatusReadCommittedScope.java class.

Thanks
Vikram Ahuja
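
The behaviour described in this reply, skipping the refresh when a segment's file name is unchanged, could look roughly like the sketch below. It is not the code from the PR: the class SegmentFileNameCache and its method are hypothetical, and it assumes the current segment file name is already available from loaded table metadata, so no per-segment file system call is needed.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: decide per segment whether a refresh is needed
 * by comparing segment file names. Not the implementation from the PR.
 */
public class SegmentFileNameCache {
  // segment id -> segment file name seen when the segment was last loaded
  private final Map<String, String> loadedFileNames = new HashMap<>();

  /**
   * Records the current file name for the segment and returns true
   * if it differs from the name seen previously (or if the segment is new).
   */
  public synchronized boolean isRefreshNeeded(String segmentId, String currentFileName) {
    String previous = loadedFileNames.put(segmentId, currentFileName);
    return !currentFileName.equals(previous);
  }
}
```

Compared with the single LAST_MODIFY marker proposed earlier in the thread, a per-segment comparison like this refreshes only the segments whose file name actually changed.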

Re: In every query, CarbonData has to scan all the segment files. This may take too much time.

jiayi_wang
Hi Vikram, I want to contribute to our community by working on Spark 3.1.1 support in CarbonData. First of all, I would like to get in touch with you.
I've sent you several emails but have had no response yet.
How can I communicate with you guys? Maybe a Slack link?