[GitHub] [carbondata] Zhangshunyu opened a new pull request #3701: [WIP] count star read index files directly when it is pure partition filter


[GitHub] [carbondata] Zhangshunyu opened a new pull request #3701: [WIP] count star read index files directly when it is pure partition filter

GitBox
Zhangshunyu opened a new pull request #3701: [WIP] count star read index files directly when it is pure partition filter
URL: https://github.com/apache/carbondata/pull/3701
 
 
    ### Why is this PR needed?
   
   
    ### What changes were proposed in this PR?
   
       
    ### Does this PR introduce any user interface change?
    - No
    - Yes. (please explain the change and update document)
   
    ### Is any new testcase added?
    - No
    - Yes
   
       
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [WIP] count star read index files directly when it is pure partition filter

GitBox
CarbonDataQA1 commented on issue #3701: [WIP] count star read index files directly when it is pure partition filter
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-611464539
 
 
   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/977/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [WIP] count star read index files directly when it is pure partition filter

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [WIP] count star read index files directly when it is pure partition filter
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-611474384
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2690/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [WIP] count star read index files directly when it is pure partition filter

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [WIP] count star read index files directly when it is pure partition filter
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-611538016
 
 
   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/980/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [WIP] count star read index files directly when it is pure partition filter

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [WIP] count star read index files directly when it is pure partition filter
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-611569375
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2693/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [WIP] count star read index files directly when it is pure partition filter

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [WIP] count star read index files directly when it is pure partition filter
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612008206
 
 
   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/994/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [WIP] count star read index files directly when it is pure partition filter

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [WIP] count star read index files directly when it is pure partition filter
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612008440
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2707/
   


[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3701: [WIP] improve pure partition count star performance

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3701: [WIP]  improve pure partition count star performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407011008
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -55,34 +67,190 @@ case class CarbonCountStar(
     val (job, tableInputFormat) = createCarbonInputFormat(absoluteTableIdentifier)
     CarbonInputFormat.setQuerySegment(job.getConfiguration, carbonTable)
 
-    // get row count
-    var rowCount = CarbonUpdateUtil.getRowCount(
-      tableInputFormat.getBlockRowCount(
-        job,
-        carbonTable,
-        CarbonFilters.getPartitions(
-          Seq.empty,
-          sparkSession,
-          TableIdentifier(
-            carbonTable.getTableName,
-            Some(carbonTable.getDatabaseName))).map(_.asJava).orNull, false),
-      carbonTable)
+    val prunedPartitionPaths = new java.util.ArrayList[String]()
+    var totalRowCount: Long = 0
+    if (predicates.nonEmpty) {
+      val names = relation.catalogTable match {
+        case Some(table) => table.partitionColumnNames
+        case _ => Seq.empty
+      }
+      // Get the current partitions from table.
+      var partitions: java.util.List[PartitionSpec] = null
+      if (names.nonEmpty) {
+        val partitionSet = AttributeSet(names
+          .map(p => relation.output.find(_.name.equalsIgnoreCase(p)).get))
+        val partitionKeyFilters = CarbonToSparkAdapter
+          .getPartitionKeyFilter(partitionSet, predicates)
+        // Update the name with lower case as it is case sensitive while getting partition info.
+        val updatedPartitionFilters = partitionKeyFilters.map { exp =>
+          exp.transform {
+            case attr: AttributeReference =>
+              CarbonToSparkAdapter.createAttributeReference(
+                attr.name.toLowerCase,
+                attr.dataType,
+                attr.nullable,
+                attr.metadata,
+                attr.exprId,
+                attr.qualifier)
+          }
+        }
+        partitions =
+          CarbonFilters.getPartitions(
+            updatedPartitionFilters.toSeq,
+            SparkSession.getActiveSession.get,
+            relation.catalogTable.get.identifier).orNull.asJava
+        if (partitions != null) {
+          for (partition <- partitions.asScala) {
+            prunedPartitionPaths.add(partition.getLocation.toString)
+          }
+          val details = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+          val validSegmentPaths = details.filter(segment =>
+            ((segment.getSegmentStatus == SegmentStatus.SUCCESS) ||
+              (segment.getSegmentStatus == SegmentStatus.LOAD_PARTIAL_SUCCESS))
+              && segment.getSegmentFile != null).map(segment => segment.getSegmentFile)
+          val tableSegmentIndexes = DataMapStoreManager.getInstance().getAllSegmentIndexes(
+            carbonTable.getTableId)
+          if (!tableSegmentIndexes.isEmpty) {
+            // clear invalid cache
+            for (segmentFilePathInCache <- tableSegmentIndexes.keySet().asScala) {
+              if (!validSegmentPaths.contains(segmentFilePathInCache)) {
+                // means invalid cache
+                tableSegmentIndexes.remove(segmentFilePathInCache)
+              }
+            }
+          }
+          // init and put absent the valid cache
+          for (validSegmentPath <- validSegmentPaths) {
+            if (tableSegmentIndexes.get(validSegmentPath) == null) {
+              val segmentIndexMeta = new SegmentIndexMeta(validSegmentPath)
+              tableSegmentIndexes.put(validSegmentPath, segmentIndexMeta)
+            }
+          }
+
+          val numThreads = Math.min(Math.max(validSegmentPaths.length, 1), 10)
 
 Review comment:
   Opening 10 threads in the driver will impact concurrent query performance!
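The thread cap under discussion can be sketched as follows. This is a minimal illustration, not the PR's code: `DriverThreadCap` and `poolFor` are hypothetical names, and the clamp mirrors the `Math.min(Math.max(validSegmentPaths.length, 1), 10)` expression in the diff above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of bounding driver-side parallelism when reading
// segment files: at least 1 thread, never more than 10 per query.
public class DriverThreadCap {
    static int numThreads(int segmentCount) {
        // clamp segmentCount into the range [1, 10]
        return Math.min(Math.max(segmentCount, 1), 10);
    }

    static ExecutorService poolFor(int segmentCount) {
        return Executors.newFixedThreadPool(numThreads(segmentCount));
    }
}
```

One way to address the reviewer's concern would be a single shared, bounded pool across all queries instead of a fresh pool per query, so concurrent count(*) queries cannot multiply the driver's thread count.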


[GitHub] [carbondata] ajantha-bhat commented on issue #3701: [WIP] improve pure partition count star performance

GitBox
In reply to this post by GitBox
ajantha-bhat commented on issue #3701: [WIP]  improve pure partition count star performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612312964
 
 
   @Zhangshunyu: What do you mean by "pure partition"? Is it just a normal "partition" filter?
   Also, please mention in the description what the bottleneck was before.


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [WIP] improve pure partition count star performance

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [WIP]  improve pure partition count star performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612336591
 
 
   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1000/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [WIP] improve pure partition count star performance

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [WIP]  improve pure partition count star performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612340549
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2712/
   


[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3701: [WIP] improve pure partition count star performance

GitBox
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3701: [WIP]  improve pure partition count star performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407285145
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -55,34 +67,190 @@ case class CarbonCountStar(
     val (job, tableInputFormat) = createCarbonInputFormat(absoluteTableIdentifier)
     CarbonInputFormat.setQuerySegment(job.getConfiguration, carbonTable)
 
-    // get row count
-    var rowCount = CarbonUpdateUtil.getRowCount(
-      tableInputFormat.getBlockRowCount(
-        job,
-        carbonTable,
-        CarbonFilters.getPartitions(
-          Seq.empty,
-          sparkSession,
-          TableIdentifier(
-            carbonTable.getTableName,
-            Some(carbonTable.getDatabaseName))).map(_.asJava).orNull, false),
-      carbonTable)
+    val prunedPartitionPaths = new java.util.ArrayList[String]()
+    var totalRowCount: Long = 0
+    if (predicates.nonEmpty) {
+      val names = relation.catalogTable match {
+        case Some(table) => table.partitionColumnNames
+        case _ => Seq.empty
+      }
+      // Get the current partitions from table.
+      var partitions: java.util.List[PartitionSpec] = null
+      if (names.nonEmpty) {
+        val partitionSet = AttributeSet(names
+          .map(p => relation.output.find(_.name.equalsIgnoreCase(p)).get))
+        val partitionKeyFilters = CarbonToSparkAdapter
+          .getPartitionKeyFilter(partitionSet, predicates)
+        // Update the name with lower case as it is case sensitive while getting partition info.
+        val updatedPartitionFilters = partitionKeyFilters.map { exp =>
+          exp.transform {
+            case attr: AttributeReference =>
+              CarbonToSparkAdapter.createAttributeReference(
+                attr.name.toLowerCase,
+                attr.dataType,
+                attr.nullable,
+                attr.metadata,
+                attr.exprId,
+                attr.qualifier)
+          }
+        }
+        partitions =
+          CarbonFilters.getPartitions(
+            updatedPartitionFilters.toSeq,
+            SparkSession.getActiveSession.get,
+            relation.catalogTable.get.identifier).orNull.asJava
+        if (partitions != null) {
+          for (partition <- partitions.asScala) {
+            prunedPartitionPaths.add(partition.getLocation.toString)
+          }
+          val details = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+          val validSegmentPaths = details.filter(segment =>
+            ((segment.getSegmentStatus == SegmentStatus.SUCCESS) ||
+              (segment.getSegmentStatus == SegmentStatus.LOAD_PARTIAL_SUCCESS))
+              && segment.getSegmentFile != null).map(segment => segment.getSegmentFile)
+          val tableSegmentIndexes = DataMapStoreManager.getInstance().getAllSegmentIndexes(
+            carbonTable.getTableId)
+          if (!tableSegmentIndexes.isEmpty) {
+            // clear invalid cache
+            for (segmentFilePathInCache <- tableSegmentIndexes.keySet().asScala) {
+              if (!validSegmentPaths.contains(segmentFilePathInCache)) {
+                // means invalid cache
+                tableSegmentIndexes.remove(segmentFilePathInCache)
+              }
+            }
+          }
+          // init and put absent the valid cache
+          for (validSegmentPath <- validSegmentPaths) {
+            if (tableSegmentIndexes.get(validSegmentPath) == null) {
+              val segmentIndexMeta = new SegmentIndexMeta(validSegmentPath)
+              tableSegmentIndexes.put(validSegmentPath, segmentIndexMeta)
+            }
+          }
+
+          val numThreads = Math.min(Math.max(validSegmentPaths.length, 1), 10)
 
 Review comment:
   > @Zhangshunyu : What do you mean pure partition ? is it just normal "partition" ?
   > Also mention what was the bottleneck before in the description.
   @ajantha-bhat
   
   We found that `select count(*)` on some partitions is costly, and slower than Parquet: currently, a count(*) whose filter columns are all partition columns loads all the datamaps of those partitions, including block info and min/max info. But there is no need to load them; since the rowCount is stored inside the index files, we can read it directly from the valid index files using partition pruning, and we can cache this info. For a no-sort partition table, min/max is almost never useful, yet it still costs time to load.
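The precondition described here ("filter whose columns are all partition columns") can be sketched as a simple containment check. This is a hypothetical illustration, not the PR's code: `PurePartitionCheck` and `isPurePartitionFilter` are made-up names, and the lower-casing mirrors the case-insensitive attribute handling in the diff above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the "pure partition filter" check: the fast path
// applies only when every column referenced by the filter is a partition
// column, so row counts can come from index files without min/max pruning.
public class PurePartitionCheck {
    static boolean isPurePartitionFilter(Set<String> filterColumns, Set<String> partitionColumns) {
        // case-insensitive containment test
        Set<String> lower = new HashSet<>();
        for (String p : partitionColumns) {
            lower.add(p.toLowerCase());
        }
        for (String c : filterColumns) {
            if (!lower.contains(c.toLowerCase())) {
                return false; // a non-partition column appears in the filter
            }
        }
        return true;
    }
}
```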


[GitHub] [carbondata] Zhangshunyu commented on issue #3701: [WIP] improve pure partition count star performance

GitBox
In reply to this post by GitBox
Zhangshunyu commented on issue #3701: [WIP]  improve pure partition count star performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612715844
 
 
   @ajantha-bhat
   We found that `select count(*)` on some partitions is costly, and slower than Parquet: currently, a count(*) whose filter columns are all partition columns loads all the datamaps of those partitions, including block info and min/max info. But there is no need to load them; since the rowCount is stored inside the index files, we can read it directly from the valid index files using partition pruning, and we can cache this info. For a no-sort partition table, min/max is almost never useful, yet it still costs time to load.


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612753027
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2714/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612760627
 
 
   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1002/
   


[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Indhumathi27 commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407349407
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -108,4 +125,173 @@ case class CarbonCountStar(
     CarbonInputFormatUtil.setDataMapJobIfConfigured(job.getConfiguration)
     (job, carbonInputFormat)
   }
+
+  // The detail of query flow as following for pure partition count star:
+  // Step 1. check whether it is pure partition count star by filter
+  // Step 2. read tablestatus to get all valid segments, remove the segment file cache of invalid
+  // segment and expired segment
+  // Step 3. use multi-thread to read segment files which not in cache and cache index files list
+  // of each segment into memory. If its index files already exist in cache, not required to
+  // read again.
+  // Step 4. use multi-thread to prune segment and partition to get pruned index file list, which
+  // can prune most index files and reduce the files num.
+  // Step 5. read the count from pruned index file directly and cache it, get from cache if exist
+  // in the index_file <-> rowCount map.
+  private def getRowCountPurePartitionPrune: Long = {
+    var rowCount: Long = 0
+    val prunedPartitionPaths = new java.util.ArrayList[String]()
+    val names = relation.catalogTable match {
 
 Review comment:
   Looks like the same code (getting the current partitions from the table) is present in CarbonLateDecodeStrategy.pruneFilterProject. It can be reused.
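Steps 4 and 5 of the query flow quoted in this diff can be sketched as follows. This is a hypothetical illustration under assumed names (`PartitionCountStar`, `countStar`, `readRowCountFromIndexFile` are all made up; the stand-in reader returns a fixed count instead of parsing a real carbonindex file):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of steps 4-5: prune the cached index-file list by
// partition location, then sum per-file row counts, memoizing counts so a
// repeated count(*) on the same partitions reads no index file twice.
public class PartitionCountStar {
    // indexFilePath -> rowCount cache (step 5)
    static final Map<String, Long> rowCountCache = new HashMap<>();

    static long readRowCountFromIndexFile(String indexFile) {
        return 100L; // stand-in for reading the rowCount stored in the index file
    }

    static long countStar(List<String> cachedIndexFiles, Set<String> prunedPartitionPaths) {
        long total = 0;
        for (String indexFile : cachedIndexFiles) {
            // step 4: keep only index files under a pruned partition directory
            boolean hit = prunedPartitionPaths.stream().anyMatch(indexFile::startsWith);
            if (!hit) {
                continue;
            }
            // step 5: read the count from the index file, or reuse the cache
            total += rowCountCache.computeIfAbsent(indexFile,
                PartitionCountStar::readRowCountFromIndexFile);
        }
        return total;
    }
}
```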


[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Indhumathi27 commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407349407
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -108,4 +125,173 @@ case class CarbonCountStar(
     CarbonInputFormatUtil.setDataMapJobIfConfigured(job.getConfiguration)
     (job, carbonInputFormat)
   }
+
+  // The detail of query flow as following for pure partition count star:
+  // Step 1. check whether it is pure partition count star by filter
+  // Step 2. read tablestatus to get all valid segments, remove the segment file cache of invalid
+  // segment and expired segment
+  // Step 3. use multi-thread to read segment files which not in cache and cache index files list
+  // of each segment into memory. If its index files already exist in cache, not required to
+  // read again.
+  // Step 4. use multi-thread to prune segment and partition to get pruned index file list, which
+  // can prune most index files and reduce the files num.
+  // Step 5. read the count from pruned index file directly and cache it, get from cache if exist
+  // in the index_file <-> rowCount map.
+  private def getRowCountPurePartitionPrune: Long = {
+    var rowCount: Long = 0
+    val prunedPartitionPaths = new java.util.ArrayList[String]()
+    val names = relation.catalogTable match {
 
 Review comment:
   Looks like the same code (getting the current partitions from the table) is present in CarbonLateDecodeStrategy.pruneFilterProject. The common code can be extracted into a new method and reused.


[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Indhumathi27 commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407351732
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datamap/DataMapStoreManager.java
 ##########
 @@ -130,6 +136,19 @@ private DataMapStoreManager() {
     return allDataMaps;
   }
 
+  /**
+   * It gives all segment indexes, a map contains segment file name to SegmentIndexMeta.
+   *
+   * @return
+   */
+  public Map<String, SegmentIndexMeta> getAllSegmentIndexes(String tableId) {
+    if (segmentIndexes.get(tableId) == null) {
+      Map<String, SegmentIndexMeta> segmentIndexMetaMap = new ConcurrentHashMap<>();
+      segmentIndexes.putIfAbsent(tableId, segmentIndexMetaMap);
 
 Review comment:
   It looks like each time we are adding a new map to `segmentIndexes`. Where do we actually add the SegmentIndexMeta info to `segmentIndexes`?
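The access pattern under review can be sketched as below. This is a hypothetical simplification (a `String` stands in for `SegmentIndexMeta`, and the class name is made up): `putIfAbsent` only installs an empty per-table map, and the caller later mutates the returned map in place, which answers the question but is easy to miss.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the per-table segment-index cache: getAllSegmentIndexes
// lazily installs an empty map per tableId; SegmentIndexMeta entries are added
// later by callers mutating the returned map.
public class SegmentIndexCache {
    private final Map<String, Map<String, String>> segmentIndexes = new ConcurrentHashMap<>();

    public Map<String, String> getAllSegmentIndexes(String tableId) {
        // computeIfAbsent is the tighter idiom than get + putIfAbsent:
        // one atomic lookup, no racy get-then-put window
        return segmentIndexes.computeIfAbsent(tableId, id -> new ConcurrentHashMap<>());
    }
}
```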


[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407353742
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -108,4 +125,173 @@ case class CarbonCountStar(
     CarbonInputFormatUtil.setDataMapJobIfConfigured(job.getConfiguration)
     (job, carbonInputFormat)
   }
+
+  // The detail of query flow as following for pure partition count star:
+  // Step 1. check whether it is pure partition count star by filter
+  // Step 2. read tablestatus to get all valid segments, remove the segment file cache of invalid
+  // segment and expired segment
+  // Step 3. use multi-thread to read segment files which not in cache and cache index files list
+  // of each segment into memory. If its index files already exist in cache, not required to
+  // read again.
+  // Step 4. use multi-thread to prune segment and partition to get pruned index file list, which
+  // can prune most index files and reduce the files num.
+  // Step 5. read the count from pruned index file directly and cache it, get from cache if exist
+  // in the index_file <-> rowCount map.
+  private def getRowCountPurePartitionPrune: Long = {
+    var rowCount: Long = 0
+    val prunedPartitionPaths = new java.util.ArrayList[String]()
+    val names = relation.catalogTable match {
 
 Review comment:
   @Indhumathi27

       val tableSegmentIndexes = DataMapStoreManager.getInstance().getAllSegmentIndexes(
         carbonTable.getTableId)
       if (!tableSegmentIndexes.isEmpty) {
         // clear invalid cache
         for (segmentFilePathInCache <- tableSegmentIndexes.keySet().asScala) {
           if (!validSegmentPaths.contains(segmentFilePathInCache)) {
             // means invalid cache
             tableSegmentIndexes.remove(segmentFilePathInCache)
           }
         }
       }
       // init and put absent the valid cache
       for (validSegmentPath <- validSegmentPaths) {
         if (tableSegmentIndexes.get(validSegmentPath) == null) {
           val segmentIndexMeta = new SegmentIndexMeta(validSegmentPath)
           tableSegmentIndexes.put(validSegmentPath, segmentIndexMeta)
         }
       }

   As shown in this code, we get that object by table id and then put the info into it.
