[GitHub] [carbondata] Indhumathi27 opened a new pull request #3584: [WIP] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389634416

##########
File path: core/src/main/java/org/apache/carbondata/core/util/SegmentBlockMinMaxInfo.java
##########
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.core.util;
+
+import java.io.Serializable;
+
+/**
+ * Represent min, max and alter sort column properties for each column in a block
+ */
+public class SegmentBlockMinMaxInfo implements Serializable {

Review comment:
Rename it to just `BlockColumnMetadataInfo`, because it only holds info for each column in the block. There is no need for minmax in the name, since it carries other info as well.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389618409

##########
File path: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java
##########
@@ -186,6 +217,110 @@ private void getTableBlockUniqueIdentifierWrappers(List<PartitionSpec> partition
}
}

+ /**
+ * Using blockLevel minmax values, identify if segment has to be added for further pruning and to
+ * load segment index info to cache
+ * @param segment to be identified if needed for loading block datamaps
+ * @param segmentMinMaxList list of block level min max values
+ * @param filter filter expression
+ * @param identifiers tableBlockIndexUniqueIdentifiers
+ * @param tableBlockIndexUniqueIdentifierWrappers to add tableBlockIndexUniqueIdentifiers
+ */
+ private void getTableBlockIndexUniqueIdentifierUsingSegmentMinMax(Segment segment,
+ List<SegmentMinMax> segmentMinMaxList, DataMapFilter filter,
+ Set<TableBlockIndexUniqueIdentifier> identifiers,
+ List<TableBlockIndexUniqueIdentifierWrapper> tableBlockIndexUniqueIdentifierWrappers) {
+ boolean isScanRequired = false;
+ for (SegmentMinMax segmentMinMax : segmentMinMaxList) {
+ Map<String, SegmentBlockMinMaxInfo> segmentBlockMinMaxInfoMap =
+ segmentMinMax.getSegmentBlockMinMaxInfo();
+ int length = segmentBlockMinMaxInfoMap.size();
+ // Add columnSchemas based on the columns present in segment
+ List<ColumnSchema> columnSchemas = new ArrayList<>();
+ byte[][] min = new byte[length][];
+ byte[][] max = new byte[length][];
+ boolean[] minMaxFlag = new boolean[length];
+ int i = 0;
+
+ // get current columnSchema list for the table
+ Map<String, ColumnSchema> tableColumnSchemas =
+ this.getCarbonTable().getTableInfo().getFactTable().getListOfColumns().stream()
+ .collect(Collectors.toMap(ColumnSchema::getColumnUniqueId, ColumnSchema::clone));

Review comment:
Why do we need to clone and modify the column schema?

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389609287

##########
File path: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java
##########
@@ -186,6 +217,110 @@ private void getTableBlockUniqueIdentifierWrappers(List<PartitionSpec> partition
}
}

+ /**
+ * Using blockLevel minmax values, identify if segment has to be added for further pruning and to
+ * load segment index info to cache
+ * @param segment to be identified if needed for loading block datamaps
+ * @param segmentMinMaxList list of block level min max values
+ * @param filter filter expression
+ * @param identifiers tableBlockIndexUniqueIdentifiers
+ * @param tableBlockIndexUniqueIdentifierWrappers to add tableBlockIndexUniqueIdentifiers
+ */
+ private void getTableBlockIndexUniqueIdentifierUsingSegmentMinMax(Segment segment,
+ List<SegmentMinMax> segmentMinMaxList, DataMapFilter filter,
+ Set<TableBlockIndexUniqueIdentifier> identifiers,
+ List<TableBlockIndexUniqueIdentifierWrapper> tableBlockIndexUniqueIdentifierWrappers) {
+ boolean isScanRequired = false;
+ for (SegmentMinMax segmentMinMax : segmentMinMaxList) {
+ Map<String, SegmentBlockMinMaxInfo> segmentBlockMinMaxInfoMap =
+ segmentMinMax.getSegmentBlockMinMaxInfo();
+ int length = segmentBlockMinMaxInfoMap.size();
+ // Add columnSchemas based on the columns present in segment
+ List<ColumnSchema> columnSchemas = new ArrayList<>();
+ byte[][] min = new byte[length][];
+ byte[][] max = new byte[length][];
+ boolean[] minMaxFlag = new boolean[length];
+ int i = 0;
+
+ // get current columnSchema list for the table
+ Map<String, ColumnSchema> tableColumnSchemas =

Review comment:
It should be based on the `COLUMN_META_CACHE` table property. Don't always load for all the columns.
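
To illustrate restricting the load to the configured columns, here is a minimal sketch. It is not code from the PR: `ColumnMetaCacheSketch` and `columnsToLoad` are hypothetical names, and the handling of the property value (null meaning all columns, a comma-separated list otherwise) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the PR.
final class ColumnMetaCacheSketch {

  // Returns the columns whose min/max should be loaded.
  // Assumption: a null property means "all columns" and the value is a
  // comma-separated list of column names, compared case-insensitively.
  static List<String> columnsToLoad(List<String> allColumns, String columnMetaCache) {
    if (columnMetaCache == null) {
      return new ArrayList<>(allColumns);
    }
    Set<String> wanted = new LinkedHashSet<>();
    for (String name : columnMetaCache.split(",")) {
      String trimmed = name.trim().toLowerCase(Locale.ROOT);
      if (!trimmed.isEmpty()) {
        wanted.add(trimmed);
      }
    }
    // Keep the table's column order, dropping columns that were not requested.
    List<String> result = new ArrayList<>();
    for (String column : allColumns) {
      if (wanted.contains(column.toLowerCase(Locale.ROOT))) {
        result.add(column);
      }
    }
    return result;
  }
}
```

With this shape, the min, max and flag arrays in the method under review would be sized by the filtered list rather than by every column in the segment.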

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389621798

##########
File path: docs/configuration-parameters.md
##########
@@ -146,6 +146,7 @@ This section provides the details of all the configurations required for the Car
| carbon.query.prefetch.enable | true | By default this property is true, so prefetch is used in query to read next blocklet asynchronously in other thread while processing current blocklet in main thread. This can help to reduce CPU idle time. Setting this property false will disable this prefetch feature in query. |
| carbon.query.stage.input.enable | false | Stage input files are data files written by external applications (such as Flink), but have not been loaded into carbon table. Enabling this configuration makes query to include these files, thus makes query on latest data. However, since these files are not indexed, query maybe slower as full scan is required for these files. |
| carbon.driver.pruning.multi.thread.enable.files.count | 100000 | To prune in multi-thread when total number of segment files for a query increases beyond the configured value. |
+| carbon.load.all.indexes.to.cache | true | Setting this configuration to false, will prune and load only matched segment indexes to cache using segment minmax information, which decreases the usage of driver memory. |

Review comment:
When it is changed from `false` to `true`, the cache needs to be dropped. Where is that handled in the PR?

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389637371

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala
##########
@@ -479,7 +488,17 @@ object CarbonDataRDDFactory {
segmentDetails.add(new Segment(resultOfBlock._2._1.getLoadName))
}
}
- val segmentFiles = updateSegmentFiles(carbonTable, segmentDetails, updateModel.get)
+ var segmentMinMaxMap: Map[String, List[SegmentMinMax]] = Map()
+ if (!segmentMinMaxAccumulator.isZero) {
+ segmentMinMaxAccumulator.value.asScala.foreach(map => if (map.nonEmpty) {
+ segmentMinMaxMap = segmentMinMaxMap ++ map
+ })
+ }
+ val segmentFiles = updateSegmentFiles(carbonTable,

Review comment:
Segment min max for a column means a single min and max value.
Right now we are storing block-level minmax, which duplicates information already present in the file footer.
We need a comparator here to find the lowest and highest values from all the blocks and store those values in the segment file.
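
A minimal sketch of that comparator-based merge, for one column. It is not code from the PR: `SegmentRangeSketch` is a hypothetical name, and the unsigned byte-wise ordering is an assumption, since the real comparison would depend on the column's data type.

```java
import java.util.List;

// Hypothetical helper, not part of the PR: folds per-block min/max byte
// arrays for one column into a single segment-level min/max pair.
final class SegmentRangeSketch {

  // Unsigned lexicographic comparison of two byte arrays (assumed ordering).
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int x = a[i] & 0xFF;
      int y = b[i] & 0xFF;
      if (x != y) {
        return x - y;
      }
    }
    return a.length - b.length;
  }

  // Returns {min, max} over all blocks; blockMins and blockMaxs are parallel lists.
  static byte[][] merge(List<byte[]> blockMins, List<byte[]> blockMaxs) {
    if (blockMins.isEmpty() || blockMins.size() != blockMaxs.size()) {
      throw new IllegalArgumentException("expected non-empty, equally sized lists");
    }
    byte[] min = blockMins.get(0);
    byte[] max = blockMaxs.get(0);
    for (int i = 1; i < blockMins.size(); i++) {
      if (compareUnsigned(blockMins.get(i), min) < 0) {
        min = blockMins.get(i);
      }
      if (compareUnsigned(blockMaxs.get(i), max) > 0) {
        max = blockMaxs.get(i);
      }
    }
    return new byte[][] {min, max};
  }
}
```

Only the resulting pair per column would then be written to the segment file, regardless of how many blocks the load produced.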

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389640631

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala
##########
@@ -390,9 +399,9 @@ object CarbonDataRDDFactory {
carbonLoadModel,
hadoopConf)
} else if (dataFrame.isDefined) {
- loadDataFrame(sqlContext, dataFrame, None, carbonLoadModel)
+ loadDataFrame(sqlContext, dataFrame, None, carbonLoadModel, segmentMinMaxAccumulator)

Review comment:
For compaction, are we recalculating the min max again by reading each block file?
We can just find the min and max from the segments involved in the compaction. There is no need to compute the segment min max by reading each data file.

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389638371

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala
##########
@@ -479,7 +488,17 @@ object CarbonDataRDDFactory {
segmentDetails.add(new Segment(resultOfBlock._2._1.getLoadName))
}
}
- val segmentFiles = updateSegmentFiles(carbonTable, segmentDetails, updateModel.get)
+ var segmentMinMaxMap: Map[String, List[SegmentMinMax]] = Map()
+ if (!segmentMinMaxAccumulator.isZero) {
+ segmentMinMaxAccumulator.value.asScala.foreach(map => if (map.nonEmpty) {
+ segmentMinMaxMap = segmentMinMaxMap ++ map
+ })
+ }
+ val segmentFiles = updateSegmentFiles(carbonTable,

Review comment:
Because in real-time scenarios one segment can have millions of small data files. Checking each file won't improve pruning performance.
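
A minimal sketch of what a per-segment check looks like, as opposed to a per-file one. It is not code from the PR: `SegmentPruneSketch` and `Range` are hypothetical, and `long` values with an equality filter stand in for the byte-array min/max and filter expressions used in the actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the PR.
final class SegmentPruneSketch {

  // Segment-level range for one column (long values for simplicity).
  static final class Range {
    final long min;
    final long max;

    Range(long min, long max) {
      this.min = min;
      this.max = max;
    }

    boolean mayContain(long value) {
      return value >= min && value <= max;
    }
  }

  // One comparison per segment: returns the ids of segments whose range may
  // contain the filter value. Only these would need their index loaded.
  static List<String> segmentsToScan(Map<String, Range> segmentRanges, long filterValue) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Range> entry : segmentRanges.entrySet()) {
      if (entry.getValue().mayContain(filterValue)) {
        result.add(entry.getKey());
      }
    }
    return result;
  }
}
```

The cost of this check grows with the number of segments, not with the number of data files inside them.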

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389593661

##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -2406,4 +2406,14 @@ private CarbonCommonConstants() {
public static final String BUCKET_COLUMNS = "bucket_columns";
public static final String BUCKET_NUMBER = "bucket_number";

+ /**
+ * Load all indexes to carbon LRU cache
+ */
+ public static final String CARBON_LOAD_ALL_INDEX_TO_CACHE = "carbon.load.all.indexes.to.cache";

Review comment:
As it is segment level, it is better to name it `carbon.load.all.segment.indexes.to.cache`.

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389642552

##########
File path: core/src/main/java/org/apache/carbondata/core/util/SegmentMinMax.java
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.core.util;
+
+import java.io.Serializable;
+import java.util.Map;
+
+/**
+ * Represent SegmentBlockMinMaxInfo for each block in a segment
+ */
+public class SegmentMinMax implements Serializable {

Review comment:
Rename it to `BlockMetadataInfo`, because it is not just the min max; it has other info as well.

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

ajantha-bhat commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389633770

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala
##########
@@ -375,7 +380,11 @@ object CarbonDataRDDFactory {
carbonLoadModel,
hadoopConf)
} else {
- loadDataFrame(sqlContext, None, Some(convertedRdd), carbonLoadModel)
+ loadDataFrame(sqlContext,

Review comment:
The accumulator also needs to be passed to the range sort and global sort flows in this file.

GitBox

[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

Indhumathi27 commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389656759

##########
File path: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java
##########
@@ -186,6 +217,110 @@ private void getTableBlockUniqueIdentifierWrappers(List<PartitionSpec> partition
}
}

+ /**
+ * Using blockLevel minmax values, identify if segment has to be added for further pruning and to
+ * load segment index info to cache
+ * @param segment to be identified if needed for loading block datamaps
+ * @param segmentMinMaxList list of block level min max values
+ * @param filter filter expression
+ * @param identifiers tableBlockIndexUniqueIdentifiers
+ * @param tableBlockIndexUniqueIdentifierWrappers to add tableBlockIndexUniqueIdentifiers
+ */
+ private void getTableBlockIndexUniqueIdentifierUsingSegmentMinMax(Segment segment,
+ List<SegmentMinMax> segmentMinMaxList, DataMapFilter filter,
+ Set<TableBlockIndexUniqueIdentifier> identifiers,
+ List<TableBlockIndexUniqueIdentifierWrapper> tableBlockIndexUniqueIdentifierWrappers) {
+ boolean isScanRequired = false;
+ for (SegmentMinMax segmentMinMax : segmentMinMaxList) {
+ Map<String, SegmentBlockMinMaxInfo> segmentBlockMinMaxInfoMap =
+ segmentMinMax.getSegmentBlockMinMaxInfo();
+ int length = segmentBlockMinMaxInfoMap.size();
+ // Add columnSchemas based on the columns present in segment
+ List<ColumnSchema> columnSchemas = new ArrayList<>();
+ byte[][] min = new byte[length][];
+ byte[][] max = new byte[length][];
+ boolean[] minMaxFlag = new boolean[length];
+ int i = 0;
+
+ // get current columnSchema list for the table
+ Map<String, ColumnSchema> tableColumnSchemas =
+ this.getCarbonTable().getTableInfo().getFactTable().getListOfColumns().stream()
+ .collect(Collectors.toMap(ColumnSchema::getColumnUniqueId, ColumnSchema::clone));

Review comment:
If we use the same object list, it will modify the current carbonTable columnSchemas list.
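
A minimal sketch of the aliasing concern. `Col` is a hypothetical stand-in for a column schema class with a single mutable flag, not the real `ColumnSchema`; the point is only that collecting clones leaves the table's own objects untouched when the per-segment copies are modified.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration, not part of the PR.
final class CloneSketch {

  // Stand-in for a column schema; only the fields needed for the example.
  static final class Col implements Cloneable {
    final String id;
    boolean sortColumn;

    Col(String id, boolean sortColumn) {
      this.id = id;
      this.sortColumn = sortColumn;
    }

    @Override
    public Col clone() {
      try {
        return (Col) super.clone();
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e);
      }
    }
  }

  // Builds an id -> copy map so later edits do not touch the shared originals.
  static Map<String, Col> copyById(List<Col> tableColumns) {
    return tableColumns.stream().collect(Collectors.toMap(c -> c.id, Col::clone));
  }
}
```

Changing `sortColumn` on an entry of the returned map leaves the object in `tableColumns` as it was, which would not hold if the map stored the original references.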

GitBox

[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

Indhumathi27 commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389660374

##########
File path: docs/configuration-parameters.md
##########
@@ -146,6 +146,7 @@ This section provides the details of all the configurations required for the Car
| carbon.query.prefetch.enable | true | By default this property is true, so prefetch is used in query to read next blocklet asynchronously in other thread while processing current blocklet in main thread. This can help to reduce CPU idle time. Setting this property false will disable this prefetch feature in query. |
| carbon.query.stage.input.enable | false | Stage input files are data files written by external applications (such as Flink), but have not been loaded into carbon table. Enabling this configuration makes query to include these files, thus makes query on latest data. However, since these files are not indexed, query maybe slower as full scan is required for these files. |
| carbon.driver.pruning.multi.thread.enable.files.count | 100000 | To prune in multi-thread when total number of segment files for a query increases beyond the configured value. |
+| carbon.load.all.indexes.to.cache | true | Setting this configuration to false, will prune and load only matched segment indexes to cache using segment minmax information, which decreases the usage of driver memory. |

Review comment:
It is a session-level carbon property and cannot be changed dynamically. So once we restart the session, the cache will be cleared automatically.

GitBox

[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

Indhumathi27 commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389756244

##########
File path: core/src/main/java/org/apache/carbondata/core/util/SegmentMinMax.java
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.core.util;
+
+import java.io.Serializable;
+import java.util.Map;
+
+/**
+ * Represent SegmentBlockMinMaxInfo for each block in a segment
+ */
+public class SegmentMinMax implements Serializable {

Review comment:
It should be `SegmentMetaDataInfo`, right?

GitBox

[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

Indhumathi27 commented on a change in pull request #3584: [CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#discussion_r389813490

##########
File path: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java
##########
@@ -186,6 +217,110 @@ private void getTableBlockUniqueIdentifierWrappers(List<PartitionSpec> partition
}
}

+ /**
+ * Using blockLevel minmax values, identify if segment has to be added for further pruning and to
+ * load segment index info to cache
+ * @param segment to be identified if needed for loading block datamaps
+ * @param segmentMinMaxList list of block level min max values
+ * @param filter filter expression
+ * @param identifiers tableBlockIndexUniqueIdentifiers
+ * @param tableBlockIndexUniqueIdentifierWrappers to add tableBlockIndexUniqueIdentifiers
+ */
+ private void getTableBlockIndexUniqueIdentifierUsingSegmentMinMax(Segment segment,
+ List<SegmentMinMax> segmentMinMaxList, DataMapFilter filter,
+ Set<TableBlockIndexUniqueIdentifier> identifiers,
+ List<TableBlockIndexUniqueIdentifierWrapper> tableBlockIndexUniqueIdentifierWrappers) {
+ boolean isScanRequired = false;
+ for (SegmentMinMax segmentMinMax : segmentMinMaxList) {
+ Map<String, SegmentBlockMinMaxInfo> segmentBlockMinMaxInfoMap =
+ segmentMinMax.getSegmentBlockMinMaxInfo();
+ int length = segmentBlockMinMaxInfoMap.size();
+ // Add columnSchemas based on the columns present in segment
+ List<ColumnSchema> columnSchemas = new ArrayList<>();
+ byte[][] min = new byte[length][];
+ byte[][] max = new byte[length][];
+ boolean[] minMaxFlag = new boolean[length];
+ int i = 0;
+
+ // get current columnSchema list for the table
+ Map<String, ColumnSchema> tableColumnSchemas =

Review comment:
Currently the index is loaded for `COLUMN_META_CACHE` columns only. But segment minmax is written for all columns, to support segment minmax pruning on all columns.

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3584: [WIP][CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

CarbonDataQA1 commented on issue #3584: [WIP][CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#issuecomment-596741171

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/695/

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3584: [WIP][CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

CarbonDataQA1 commented on issue #3584: [WIP][CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#issuecomment-596743677

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2401/

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3584: [WIP][CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

CarbonDataQA1 commented on issue #3584: [WIP][CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#issuecomment-596934685

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/699/

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3584: [WIP][CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

CarbonDataQA1 commented on issue #3584: [WIP][CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#issuecomment-596934952

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2405/

GitBox

[GitHub] [carbondata] ajantha-bhat commented on issue #3584: [WIP][CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

ajantha-bhat commented on issue #3584: [WIP][CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#issuecomment-596937019

@Indhumathi27: make the default value of that carbon property false.
In the query flow, if the table is a transactional table and segment minmax is not set, throw a runtime exception, so that CI can catch all the missed scenarios.
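
A minimal sketch of such a fail-fast check. It is not code from the PR: `MinMaxGuardSketch`, `requireSegmentMinMax` and its parameters are hypothetical, and the choice of `IllegalStateException` as the runtime exception is an assumption.

```java
// Hypothetical guard, not part of the PR.
final class MinMaxGuardSketch {

  // Throws if a transactional table's segment has no min/max recorded, so a
  // load path that forgot to write it fails loudly in CI instead of silently
  // falling back to loading every index.
  static void requireSegmentMinMax(boolean isTransactionalTable, Object segmentMinMax,
      String segmentId) {
    if (isTransactionalTable && segmentMinMax == null) {
      throw new IllegalStateException("Segment min/max is not set for segment " + segmentId);
    }
  }
}
```

A check like this is only meant to surface missed load scenarios while the PR is under test; whether it should remain in the final query flow is a separate decision.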

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3584: [WIP][CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache

CarbonDataQA1 commented on issue #3584: [WIP][CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
URL: https://github.com/apache/carbondata/pull/3584#issuecomment-597153632

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/706/
