In every query, CarbonData has to scan all the segment files. This may take too much time.

In every query, CarbonData has to scan all the segment files.
So when there are too many segments, it takes too long to collect the file information for all of them.
The customer hopes the community can solve this.
When no segment has changed, CarbonData should not scan all the segment files.
This is the call stack:

at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(
at org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.<init>(
at org.apache.carbondata.core.datastore.filesystem.HDFSCarbonFile.<init>(
at org.apache.carbondata.spark.acl.filesystem.HDFSACLCarbonFile.<init>(
at org.apache.carbondata.spark.acl.ACLFileFactory.getCarbonFile(
at org.apache.carbondata.core.datastore.impl.FileFactory.getCarbonFile(
at org.apache.carbondata.core.readcommitter.TableStatusReadCommittedScope.getCommittedSegmentRefreshInfo(
at org.apache.carbondata.core.datamap.Segment.getSegmentRefreshInfo(
at org.apache.carbondata.core.datamap.DataMapStoreManager$TableSegmentRefresher.isRefreshNeeded(
at org.apache.carbondata.core.datamap.DataMapStoreManager.getSegmentsToBeRefreshed(
at org.apache.carbondata.core.datamap.DataMapStoreManager.refreshSegmentCacheIfRequired(
at org.apache.carbondata.hadoop.api.CarbonTableInputFormat.getSplits(
at org.apache.carbondata.hadoop.api.CarbonTableInputFormat.getSplits(
at org.apache.carbondata.spark.rdd.CarbonScanRDD.internalGetPartitions(CarbonScanRDD.scala:170)
at org.apache.carbondata.spark.rdd.CarbonRDD.getPartitions(CarbonRDD.scala:68)

Re: In every query, CarbonData has to scan all the segment files. This may take too much time.

I think there could be a file named LAST_MODIFY.
It would contain the last update time of the segment files.
When CarbonData tries to refresh the segment cache and finds that the update time in LAST_MODIFY is the same as the one held in the cache, there is no need to refresh all the segment files.
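
To make the proposal concrete, here is a minimal sketch of such a check. It is only an illustration, not CarbonData code: the class LastModifyCheck, its methods, and the plain-text marker format are all hypothetical, and it uses the local file system through java.nio instead of HDFS. The point is that an unchanged table costs one small file read instead of one getFileStatus call per segment.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the LAST_MODIFY idea; none of these names exist in CarbonData.
class LastModifyCheck {
    // Marker file name taken from the proposal; the content format is an assumption.
    static final String MARKER = "LAST_MODIFY";

    private long cachedStamp = -1L; // marker value seen at the last full scan
    int fullScans = 0;              // counts how often the expensive path ran

    // Writer side: would be called whenever any segment file changes.
    static void writeMarker(Path tableDir, long timestamp) {
        try {
            Files.write(tableDir.resolve(MARKER),
                    Long.toString(timestamp).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the stored timestamp, or -1 if the marker does not exist yet.
    static long readMarker(Path tableDir) {
        Path marker = tableDir.resolve(MARKER);
        if (!Files.exists(marker)) {
            return -1L;
        }
        try {
            String text = new String(Files.readAllBytes(marker), StandardCharsets.UTF_8);
            return Long.parseLong(text.trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader side: returns true only when a full segment scan was performed.
    boolean refreshIfRequired(Path tableDir) {
        long current = readMarker(tableDir);
        if (current != -1L && current == cachedStamp) {
            return false; // marker unchanged: skip the per-segment scan
        }
        fullScans++;      // stand-in for scanning every segment file
        cachedStamp = current;
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("carbon-sketch");
        LastModifyCheck check = new LastModifyCheck();
        writeMarker(dir, System.currentTimeMillis());
        System.out.println(check.refreshIfRequired(dir)); // first query scans
        System.out.println(check.refreshIfRequired(dir)); // marker unchanged, scan skipped
    }
}
```

In this sketch a missing marker always falls back to the full scan, so tables written before the marker existed would still be read correctly.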

Re: In every query, CarbonData has to scan all the segment files. This may take too much time.

This issue has already been fixed. Segments are not refreshed in the cache if the segment file name has not changed.
Please find the solution in the following PR:
Kindly check the changes in class
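
The PR itself is not reproduced here. As a rough sketch of the idea only, and not the actual change, the check can be pictured as comparing the segment file name currently recorded for each segment with the name remembered at the last refresh. The class SegmentNameCache and its method are hypothetical, and the sketch assumes that a segment's file name changes whenever the segment is rewritten and that the current names can be obtained without touching each segment file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; this is not the class changed in the PR.
class SegmentNameCache {
    // segment id -> segment file name seen at the last refresh
    private final Map<String, String> cachedFileNames = new HashMap<>();

    // Takes the current (segment id -> segment file name) mapping and
    // returns the ids whose file name is new or differs from the cached one.
    // Segment file names are assumed to be non-null.
    List<String> segmentsToRefresh(Map<String, String> currentFileNames) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, String> entry : currentFileNames.entrySet()) {
            // put() returns the previously cached name, or null for a new segment
            String previous = cachedFileNames.put(entry.getKey(), entry.getValue());
            if (!entry.getValue().equals(previous)) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }
}
```

With this shape, a query against an unchanged table gets an empty list back and does no per-segment file system calls. Dropping entries for deleted segments is left out of the sketch.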

Vikram Ahuja

Re: In every query, CarbonData has to scan all the segment files. This may take too much time.

Hi Vikram, I would like to contribute to our community by working on Spark 3.1.1 support in CarbonData. First of all, I would like to get in touch with you.
I have sent you several emails but have not received a response yet.
How can I communicate with you all? Is there perhaps a Slack link?