[GitHub] [carbondata] Zhangshunyu opened a new pull request #3701: [WIP] count star read index files directly when it is pure partition filter


[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
Indhumathi27 commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407355820
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -108,4 +125,173 @@ case class CarbonCountStar(
     CarbonInputFormatUtil.setDataMapJobIfConfigured(job.getConfiguration)
     (job, carbonInputFormat)
   }
+
+  // The detail of query flow as following for pure partition count star:
+  // Step 1. check whether it is pure partition count star by filter
+  // Step 2. read tablestatus to get all valid segments, remove the segment file cache of invalid
+  // segment and expired segment
+  // Step 3. use multi-thread to read segment files which not in cache and cache index files list
+  // of each segment into memory. If its index files already exist in cache, not required to
+  // read again.
+  // Step 4. use multi-thread to prune segment and partition to get pruned index file list, which
+  // can prune most index files and reduce the files num.
+  // Step 5. read the count from pruned index file directly and cache it, get from cache if exist
+  // in the index_file <-> rowCount map.
+  private def getRowCountPurePartitionPrune: Long = {
+    var rowCount: Long = 0
+    val prunedPartitionPaths = new java.util.ArrayList[String]()
+    val names = relation.catalogTable match {
 
 Review comment:
  we are adding info only to the local variable `tableSegmentIndexes`, right? Not to DataMapStoreManager.segmentIndexes.


[GitHub] [carbondata] ajantha-bhat commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
ajantha-bhat commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612790523
 
 
  @Indhumathi27 : we are already matching partitions first, before loading the min max (your old PR).
  Was that done only for the select * flow, not for the count(*) flow?


[GitHub] [carbondata] Zhangshunyu commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Zhangshunyu commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612791302
 
 
  @ajantha-bhat I mean count star on partition columns does not need to load min max.


[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Indhumathi27 commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407355820
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -108,4 +125,173 @@ case class CarbonCountStar(
     CarbonInputFormatUtil.setDataMapJobIfConfigured(job.getConfiguration)
     (job, carbonInputFormat)
   }
+
+  // The detail of query flow as following for pure partition count star:
+  // Step 1. check whether it is pure partition count star by filter
+  // Step 2. read tablestatus to get all valid segments, remove the segment file cache of invalid
+  // segment and expired segment
+  // Step 3. use multi-thread to read segment files which not in cache and cache index files list
+  // of each segment into memory. If its index files already exist in cache, not required to
+  // read again.
+  // Step 4. use multi-thread to prune segment and partition to get pruned index file list, which
+  // can prune most index files and reduce the files num.
+  // Step 5. read the count from pruned index file directly and cache it, get from cache if exist
+  // in the index_file <-> rowCount map.
+  private def getRowCountPurePartitionPrune: Long = {
+    var rowCount: Long = 0
+    val prunedPartitionPaths = new java.util.ArrayList[String]()
+    val names = relation.catalogTable match {
 
 Review comment:
   okay


[GitHub] [carbondata] Indhumathi27 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Indhumathi27 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612792247
 
 
  @ajantha-bhat It was done for filters on partition columns.


[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407362944
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -55,34 +67,190 @@ case class CarbonCountStar(
     val (job, tableInputFormat) = createCarbonInputFormat(absoluteTableIdentifier)
     CarbonInputFormat.setQuerySegment(job.getConfiguration, carbonTable)
 
-    // get row count
-    var rowCount = CarbonUpdateUtil.getRowCount(
-      tableInputFormat.getBlockRowCount(
-        job,
-        carbonTable,
-        CarbonFilters.getPartitions(
-          Seq.empty,
-          sparkSession,
-          TableIdentifier(
-            carbonTable.getTableName,
-            Some(carbonTable.getDatabaseName))).map(_.asJava).orNull, false),
-      carbonTable)
+    val prunedPartitionPaths = new java.util.ArrayList[String]()
+    var totalRowCount: Long = 0
+    if (predicates.nonEmpty) {
+      val names = relation.catalogTable match {
+        case Some(table) => table.partitionColumnNames
+        case _ => Seq.empty
+      }
+      // Get the current partitions from table.
+      var partitions: java.util.List[PartitionSpec] = null
+      if (names.nonEmpty) {
+        val partitionSet = AttributeSet(names
+          .map(p => relation.output.find(_.name.equalsIgnoreCase(p)).get))
+        val partitionKeyFilters = CarbonToSparkAdapter
+          .getPartitionKeyFilter(partitionSet, predicates)
+        // Update the name with lower case as it is case sensitive while getting partition info.
+        val updatedPartitionFilters = partitionKeyFilters.map { exp =>
+          exp.transform {
+            case attr: AttributeReference =>
+              CarbonToSparkAdapter.createAttributeReference(
+                attr.name.toLowerCase,
+                attr.dataType,
+                attr.nullable,
+                attr.metadata,
+                attr.exprId,
+                attr.qualifier)
+          }
+        }
+        partitions =
+          CarbonFilters.getPartitions(
+            updatedPartitionFilters.toSeq,
+            SparkSession.getActiveSession.get,
+            relation.catalogTable.get.identifier).orNull.asJava
+        if (partitions != null) {
+          for (partition <- partitions.asScala) {
+            prunedPartitionPaths.add(partition.getLocation.toString)
+          }
+          val details = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+          val validSegmentPaths = details.filter(segment =>
+            ((segment.getSegmentStatus == SegmentStatus.SUCCESS) ||
+              (segment.getSegmentStatus == SegmentStatus.LOAD_PARTIAL_SUCCESS))
+              && segment.getSegmentFile != null).map(segment => segment.getSegmentFile)
+          val tableSegmentIndexes = DataMapStoreManager.getInstance().getAllSegmentIndexes(
+            carbonTable.getTableId)
+          if (!tableSegmentIndexes.isEmpty) {
+            // clear invalid cache
+            for (segmentFilePathInCache <- tableSegmentIndexes.keySet().asScala) {
+              if (!validSegmentPaths.contains(segmentFilePathInCache)) {
+                // means invalid cache
+                tableSegmentIndexes.remove(segmentFilePathInCache)
+              }
+            }
+          }
+          // init and put absent the valid cache
+          for (validSegmentPath <- validSegmentPaths) {
+            if (tableSegmentIndexes.get(validSegmentPath) == null) {
+              val segmentIndexMeta = new SegmentIndexMeta(validSegmentPath)
+              tableSegmentIndexes.put(validSegmentPath, segmentIndexMeta)
+            }
+          }
+
+          val numThreads = Math.min(Math.max(validSegmentPaths.length, 1), 10)
 
 Review comment:
   @ajantha-bhat OK, will reduce core pool size
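
  For illustration, a minimal sketch of the bounded driver-side thread pool under discussion, assuming a hypothetical helper readIndexFiles in place of the PR's actual segment-file parsing:

    import java.util.concurrent.{Callable, Executors, TimeUnit}

    object SegmentFileReaderSketch {
      // Hypothetical stand-in: the PR itself parses the segment file to get its index file list.
      def readIndexFiles(segmentFilePath: String): Seq[String] =
        Seq(segmentFilePath + "#placeholder.carbonindexmerge")

      def readAllSegments(segmentFilePaths: Seq[String]): Map[String, Seq[String]] = {
        // Core pool size capped at 4 (the size settled on in this review), never below 1.
        val numThreads = math.min(math.max(segmentFilePaths.length, 1), 4)
        val pool = Executors.newFixedThreadPool(numThreads)
        try {
          val futures = segmentFilePaths.map { path =>
            pool.submit(new Callable[(String, Seq[String])] {
              override def call(): (String, Seq[String]) = path -> readIndexFiles(path)
            })
          }
          futures.map(_.get()).toMap
        } finally {
          pool.shutdown()
          pool.awaitTermination(1, TimeUnit.MINUTES)
        }
      }
    }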


[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407363141
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -108,4 +125,173 @@ case class CarbonCountStar(
     CarbonInputFormatUtil.setDataMapJobIfConfigured(job.getConfiguration)
     (job, carbonInputFormat)
   }
+
+  // The detail of query flow as following for pure partition count star:
+  // Step 1. check whether it is pure partition count star by filter
+  // Step 2. read tablestatus to get all valid segments, remove the segment file cache of invalid
+  // segment and expired segment
+  // Step 3. use multi-thread to read segment files which not in cache and cache index files list
+  // of each segment into memory. If its index files already exist in cache, not required to
+  // read again.
+  // Step 4. use multi-thread to prune segment and partition to get pruned index file list, which
+  // can prune most index files and reduce the files num.
+  // Step 5. read the count from pruned index file directly and cache it, get from cache if exist
+  // in the index_file <-> rowCount map.
+  private def getRowCountPurePartitionPrune: Long = {
+    var rowCount: Long = 0
+    val prunedPartitionPaths = new java.util.ArrayList[String]()
+    val names = relation.catalogTable match {
 
 Review comment:
  @Indhumathi27 ok, will reuse the existing partition prune logic.
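
  For context, a simplified model of what reusing the partition prune logic amounts to: keep only the partition specs whose values satisfy every equality filter on partition columns. The types and names below are illustrative only; the PR itself delegates to CarbonFilters.getPrunedPartitions.

    // Illustrative stand-in for a Hive-style partition spec (not a CarbonData class).
    final case class PartitionSpecLite(values: Map[String, String], location: String)

    object PartitionPruneSketch {
      // Column names are lower-cased, mirroring the case-insensitive handling in the PR;
      // the values map is assumed to be keyed by lower-cased partition column names.
      def prune(
          partitions: Seq[PartitionSpecLite],
          equalityFilters: Map[String, String]): Seq[PartitionSpecLite] = {
        val normalized = equalityFilters.map { case (col, value) => col.toLowerCase -> value }
        partitions.filter { spec =>
          normalized.forall { case (col, value) => spec.values.get(col).contains(value) }
        }
      }
    }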


[GitHub] [carbondata] Zhangshunyu removed a comment on issue #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Zhangshunyu removed a comment on issue #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612791302
 
 
  @ajantha-bhat I mean count star on partition columns does not need to load min max.


[GitHub] [carbondata] Indhumathi27 edited a comment on issue #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Indhumathi27 edited a comment on issue #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612792247
 
 
  @ajantha-bhat It was done for filters on partition columns, where we will load the index only for matched partitions.


[GitHub] [carbondata] Indhumathi27 edited a comment on issue #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Indhumathi27 edited a comment on issue #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612792247
 
 
  @ajantha-bhat It was done for select with filters on partition columns, where we will load the index only for matched partitions.


[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407403964
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -108,4 +125,145 @@ case class CarbonCountStar(
     CarbonInputFormatUtil.setDataMapJobIfConfigured(job.getConfiguration)
     (job, carbonInputFormat)
   }
+
+  // The detail of query flow as following for pure partition count star:
+  // Step 1. check whether it is pure partition count star by filter
+  // Step 2. read tablestatus to get all valid segments, remove the segment file cache of invalid
+  // segment and expired segment
+  // Step 3. use multi-thread to read segment files which not in cache and cache index files list
+  // of each segment into memory. If its index files already exist in cache, not required to
+  // read again.
+  // Step 4. use multi-thread to prune segment and partition to get pruned index file list, which
+  // can prune most index files and reduce the files num.
+  // Step 5. read the count from pruned index file directly and cache it, get from cache if exist
+  // in the index_file <-> rowCount map.
+  private def getRowCountPurePartitionPrune: Long = {
+    var rowCount: Long = 0
+    val prunedPartitionPaths = new java.util.ArrayList[String]()
+    // Get the current partitions from table.
+    val partitions = CarbonFilters.getPrunedPartitions(relation, predicates)
+    if (partitions != null) {
+      for (partition <- partitions) {
+        prunedPartitionPaths.add(partition.getLocation.toString)
+      }
+      val details = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+      val validSegmentPaths = details.filter(segment =>
+        ((segment.getSegmentStatus == SegmentStatus.SUCCESS) ||
+          (segment.getSegmentStatus == SegmentStatus.LOAD_PARTIAL_SUCCESS))
+          && segment.getSegmentFile != null).map(segment => segment.getSegmentFile)
+      val tableSegmentIndexes = DataMapStoreManager.getInstance().getAllSegmentIndexes(
+        carbonTable.getTableId)
+      if (!tableSegmentIndexes.isEmpty) {
+        // clear invalid cache
+        for (segmentFilePathInCache <- tableSegmentIndexes.keySet().asScala) {
+          if (!validSegmentPaths.contains(segmentFilePathInCache)) {
+            // means invalid cache
+            tableSegmentIndexes.remove(segmentFilePathInCache)
+          }
+        }
+      }
+      // init and put absent the valid cache
+      for (validSegmentPath <- validSegmentPaths) {
+        if (tableSegmentIndexes.get(validSegmentPath) == null) {
+          val segmentIndexMeta = new SegmentIndexMeta(validSegmentPath)
+          tableSegmentIndexes.put(validSegmentPath, segmentIndexMeta)
+        }
+      }
+
+      val numThreads = Math.min(Math.max(validSegmentPaths.length, 1), 4)
 
 Review comment:
  In case of many segments, isn't it better to launch a Spark job and let the executors do it quickly?
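
  A rough sketch of this suggestion: parallelize the valid segment paths as an RDD so the executors, rather than the driver, read each segment's index files and return per-segment counts. The helper countRowsInSegment is a hypothetical placeholder, not an existing CarbonData API.

    import org.apache.spark.sql.SparkSession

    object DistributedCountStarSketch {
      // Hypothetical placeholder: the real work would read the segment's carbonindex
      // files and sum the row counts recorded in their block metadata.
      def countRowsInSegment(segmentFilePath: String): Long = 0L

      def countStar(spark: SparkSession, validSegmentPaths: Seq[String]): Long = {
        // One RDD partition per segment, so each executor task handles one segment file.
        spark.sparkContext
          .parallelize(validSegmentPaths, math.max(validSegmentPaths.length, 1))
          .map(countRowsInSegment)
          .fold(0L)(_ + _)
      }
    }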


[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407406102
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -108,4 +125,145 @@ case class CarbonCountStar(
     CarbonInputFormatUtil.setDataMapJobIfConfigured(job.getConfiguration)
     (job, carbonInputFormat)
   }
+
+  // The detail of query flow as following for pure partition count star:
+  // Step 1. check whether it is pure partition count star by filter
+  // Step 2. read tablestatus to get all valid segments, remove the segment file cache of invalid
+  // segment and expired segment
+  // Step 3. use multi-thread to read segment files which not in cache and cache index files list
+  // of each segment into memory. If its index files already exist in cache, not required to
+  // read again.
+  // Step 4. use multi-thread to prune segment and partition to get pruned index file list, which
+  // can prune most index files and reduce the files num.
+  // Step 5. read the count from pruned index file directly and cache it, get from cache if exist
+  // in the index_file <-> rowCount map.
+  private def getRowCountPurePartitionPrune: Long = {
+    var rowCount: Long = 0
+    val prunedPartitionPaths = new java.util.ArrayList[String]()
+    // Get the current partitions from table.
+    val partitions = CarbonFilters.getPrunedPartitions(relation, predicates)
+    if (partitions != null) {
+      for (partition <- partitions) {
+        prunedPartitionPaths.add(partition.getLocation.toString)
+      }
+      val details = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+      val validSegmentPaths = details.filter(segment =>
+        ((segment.getSegmentStatus == SegmentStatus.SUCCESS) ||
+          (segment.getSegmentStatus == SegmentStatus.LOAD_PARTIAL_SUCCESS))
+          && segment.getSegmentFile != null).map(segment => segment.getSegmentFile)
+      val tableSegmentIndexes = DataMapStoreManager.getInstance().getAllSegmentIndexes(
+        carbonTable.getTableId)
+      if (!tableSegmentIndexes.isEmpty) {
+        // clear invalid cache
+        for (segmentFilePathInCache <- tableSegmentIndexes.keySet().asScala) {
+          if (!validSegmentPaths.contains(segmentFilePathInCache)) {
+            // means invalid cache
+            tableSegmentIndexes.remove(segmentFilePathInCache)
+          }
+        }
+      }
+      // init and put absent the valid cache
+      for (validSegmentPath <- validSegmentPaths) {
+        if (tableSegmentIndexes.get(validSegmentPath) == null) {
+          val segmentIndexMeta = new SegmentIndexMeta(validSegmentPath)
+          tableSegmentIndexes.put(validSegmentPath, segmentIndexMeta)
+        }
+      }
+
+      val numThreads = Math.min(Math.max(validSegmentPaths.length, 1), 4)
 
 Review comment:
  @ajantha-bhat we also need to cache, in the driver at the same time, the info of each segment, the index files of each segment, and the index file <-> count map.
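
  A hedged sketch of the driver-side cache shape being described: one entry per valid segment file, holding that segment's index file list plus a memoised index-file -> row-count map. Class and field names below are illustrative, not the actual SegmentIndexMeta / DataMapStoreManager API.

    import java.util.concurrent.{ConcurrentHashMap, CopyOnWriteArrayList}
    import scala.collection.JavaConverters._

    // Illustrative cache entry: index files of one segment and the cached row count
    // per index file, so repeated count(*) queries skip re-reading the files.
    final class SegmentIndexMetaSketch(val segmentFilePath: String) {
      val indexFiles = new CopyOnWriteArrayList[String]()
      val indexFileRowCounts = new ConcurrentHashMap[String, java.lang.Long]()
    }

    object SegmentIndexCacheSketch {
      // tableId -> (segmentFilePath -> cache entry)
      private val cache =
        new ConcurrentHashMap[String, ConcurrentHashMap[String, SegmentIndexMetaSketch]]()

      def getAllSegmentIndexes(tableId: String): ConcurrentHashMap[String, SegmentIndexMetaSketch] = {
        var segments = cache.get(tableId)
        if (segments == null) {
          segments = new ConcurrentHashMap[String, SegmentIndexMetaSketch]()
          val existing = cache.putIfAbsent(tableId, segments)
          if (existing != null) segments = existing
        }
        segments
      }

      // Step 2 of the flow: drop entries whose segment file is no longer in the valid list.
      def evictInvalid(tableId: String, validSegmentPaths: Set[String]): Unit = {
        val segments = getAllSegmentIndexes(tableId)
        segments.keySet().asScala
          .filterNot(validSegmentPaths.contains)
          .foreach(path => segments.remove(path))
      }
    }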


[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407407283
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -55,34 +67,39 @@ case class CarbonCountStar(
     val (job, tableInputFormat) = createCarbonInputFormat(absoluteTableIdentifier)
     CarbonInputFormat.setQuerySegment(job.getConfiguration, carbonTable)
 
-    // get row count
-    var rowCount = CarbonUpdateUtil.getRowCount(
-      tableInputFormat.getBlockRowCount(
-        job,
-        carbonTable,
-        CarbonFilters.getPartitions(
-          Seq.empty,
-          sparkSession,
-          TableIdentifier(
-            carbonTable.getTableName,
-            Some(carbonTable.getDatabaseName))).map(_.asJava).orNull, false),
-      carbonTable)
+    var totalRowCount: Long = 0
+    if (predicates.nonEmpty) {
 
 Review comment:
  what if it is a non-partition table?
  a. do we need to check whether it is a partition table also?
  b. what if the predicate is not on a partition column? do we need to have that check also?


[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407408215
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -55,34 +67,39 @@ case class CarbonCountStar(
     val (job, tableInputFormat) = createCarbonInputFormat(absoluteTableIdentifier)
     CarbonInputFormat.setQuerySegment(job.getConfiguration, carbonTable)
 
-    // get row count
-    var rowCount = CarbonUpdateUtil.getRowCount(
-      tableInputFormat.getBlockRowCount(
-        job,
-        carbonTable,
-        CarbonFilters.getPartitions(
-          Seq.empty,
-          sparkSession,
-          TableIdentifier(
-            carbonTable.getTableName,
-            Some(carbonTable.getDatabaseName))).map(_.asJava).orNull, false),
-      carbonTable)
+    var totalRowCount: Long = 0
+    if (predicates.nonEmpty) {
 
 Review comment:
  @ajantha-bhat please check the method isPurePartitionPrune in this PR...
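
  For readers who have not opened the PR, a simplified illustration of the kind of check such a method would perform (not the PR's actual body): the fast path applies only when the table has partition columns and every predicate references partition columns alone, so partition-level pruning answers count(*) exactly.

    import org.apache.spark.sql.catalyst.expressions.Expression

    object PurePartitionPruneCheckSketch {
      // Illustrative check: true only if there are partition columns, at least one
      // predicate exists, and every attribute referenced by every predicate is a
      // partition column (compared case-insensitively).
      def isPurePartitionPrune(
          partitionColumnNames: Seq[String],
          predicates: Seq[Expression]): Boolean = {
        val partitionCols = partitionColumnNames.map(_.toLowerCase).toSet
        partitionCols.nonEmpty && predicates.nonEmpty &&
          predicates.forall(_.references.forall(attr => partitionCols.contains(attr.name.toLowerCase)))
      }
    }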


[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407408617
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -55,34 +67,39 @@ case class CarbonCountStar(
     val (job, tableInputFormat) = createCarbonInputFormat(absoluteTableIdentifier)
     CarbonInputFormat.setQuerySegment(job.getConfiguration, carbonTable)
 
-    // get row count
-    var rowCount = CarbonUpdateUtil.getRowCount(
-      tableInputFormat.getBlockRowCount(
-        job,
-        carbonTable,
-        CarbonFilters.getPartitions(
-          Seq.empty,
-          sparkSession,
-          TableIdentifier(
-            carbonTable.getTableName,
-            Some(carbonTable.getDatabaseName))).map(_.asJava).orNull, false),
-      carbonTable)
+    var totalRowCount: Long = 0
+    if (predicates.nonEmpty) {
 
 Review comment:
   ok


[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#discussion_r407411990
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonCountStar.scala
 ##########
 @@ -108,4 +125,145 @@ case class CarbonCountStar(
     CarbonInputFormatUtil.setDataMapJobIfConfigured(job.getConfiguration)
     (job, carbonInputFormat)
   }
+
+  // The detail of query flow as following for pure partition count star:
+  // Step 1. check whether it is pure partition count star by filter
+  // Step 2. read tablestatus to get all valid segments, remove the segment file cache of invalid
+  // segment and expired segment
+  // Step 3. use multi-thread to read segment files which not in cache and cache index files list
+  // of each segment into memory. If its index files already exist in cache, not required to
+  // read again.
+  // Step 4. use multi-thread to prune segment and partition to get pruned index file list, which
+  // can prune most index files and reduce the files num.
+  // Step 5. read the count from pruned index file directly and cache it, get from cache if exist
+  // in the index_file <-> rowCount map.
+  private def getRowCountPurePartitionPrune: Long = {
+    var rowCount: Long = 0
+    val prunedPartitionPaths = new java.util.ArrayList[String]()
+    // Get the current partitions from table.
+    val partitions = CarbonFilters.getPrunedPartitions(relation, predicates)
+    if (partitions != null) {
+      for (partition <- partitions) {
+        prunedPartitionPaths.add(partition.getLocation.toString)
+      }
+      val details = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+      val validSegmentPaths = details.filter(segment =>
+        ((segment.getSegmentStatus == SegmentStatus.SUCCESS) ||
+          (segment.getSegmentStatus == SegmentStatus.LOAD_PARTIAL_SUCCESS))
+          && segment.getSegmentFile != null).map(segment => segment.getSegmentFile)
+      val tableSegmentIndexes = DataMapStoreManager.getInstance().getAllSegmentIndexes(
+        carbonTable.getTableId)
+      if (!tableSegmentIndexes.isEmpty) {
+        // clear invalid cache
+        for (segmentFilePathInCache <- tableSegmentIndexes.keySet().asScala) {
+          if (!validSegmentPaths.contains(segmentFilePathInCache)) {
+            // means invalid cache
+            tableSegmentIndexes.remove(segmentFilePathInCache)
+          }
+        }
+      }
+      // init and put absent the valid cache
+      for (validSegmentPath <- validSegmentPaths) {
+        if (tableSegmentIndexes.get(validSegmentPath) == null) {
+          val segmentIndexMeta = new SegmentIndexMeta(validSegmentPath)
+          tableSegmentIndexes.put(validSegmentPath, segmentIndexMeta)
+        }
+      }
+
+      val numThreads = Math.min(Math.max(validSegmentPaths.length, 1), 4)
 
 Review comment:
  We already pass the partition info to the prune method so that min max is loaded only for matched partitions (#3568).
  I can understand that you want to avoid loading min max at all, since it is just a count(*).

  But keeping a new cache (tableSegmentIndexes) and multithreading in the driver just for that is not efficient: driver memory usage will grow, and concurrent queries will be impacted by the extra threads.

  I still feel that loading min max for count(*) is ok, as it helps other queries. Let's see other people's opinion.
  @kunal642 , @QiangCai @jackylk
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612859476
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2722/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-612863567
 
 
   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1010/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-613907419
 
 
   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1033/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3701: [CARBONDATA-3770] Improve partition count star query performance
URL: https://github.com/apache/carbondata/pull/3701#issuecomment-613927617
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2746/
   
