[GitHub] carbondata pull request #1142: [CARBONDATA-1271] Enhanced Performance for Hi...

Github user chenliang613 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1142#discussion_r126273657

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/CarbonInputFormat.java ---
@@ -444,9 +444,14 @@ protected Expression getFilterPredicates(Configuration configuration) {
}
}
}
+
+ // For Hive integration if we have to get the stats we have to fetch hive.query.id
+ String query_id = job.getConfiguration().get("query.id") != null ?
+ job.getConfiguration().get("query.id") :
+ job.getConfiguration().get("hive.query.id");
--- End diff --

Question: where is "hive.query.id" set? Is it set by the Hive engine?

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1142: [CARBONDATA-1271] Enhanced Performance for Hi...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1142#discussion_r126273765

--- Diff: integration/hive/src/main/java/org/apache/carbondata/hive/CarbonHiveRecordReader.java ---
@@ -111,58 +108,46 @@ private void initialize(InputSplit inputSplit, Configuration conf) throws IOExce
} else {
columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
}
+
+ if (valueObj == null) {
+ valueObj = new ArrayWritable(Writable.class, new Writable[columnTypes.size()]);
+ }
--- End diff --

columnTypes.size() and queryModel.getProjectionColumns().length are different. Why is this change needed?

[GitHub] carbondata pull request #1142: [CARBONDATA-1271] Enhanced Performance for Hi...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1142#discussion_r126274038

--- Diff: integration/hive/src/main/java/org/apache/carbondata/hive/DictionaryDecodeReadSupport.java ---
@@ -0,0 +1,288 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.hive;
+
+import java.io.IOException;
+import java.sql.Date;
+import java.sql.Timestamp;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.carbondata.core.cache.Cache;
+import org.apache.carbondata.core.cache.CacheProvider;
+import org.apache.carbondata.core.cache.CacheType;
+import org.apache.carbondata.core.cache.dictionary.Dictionary;
+import org.apache.carbondata.core.cache.dictionary.DictionaryColumnUniqueIdentifier;
+import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier;
+import org.apache.carbondata.core.metadata.datatype.DataType;
+import org.apache.carbondata.core.metadata.encoder.Encoding;
+import org.apache.carbondata.core.metadata.schema.table.column.CarbonColumn;
+import org.apache.carbondata.core.metadata.schema.table.column.CarbonDimension;
+import org.apache.carbondata.core.util.CarbonUtil;
+
+import org.apache.carbondata.hadoop.readsupport.CarbonReadSupport;
+
+import org.apache.hadoop.hive.common.type.HiveDecimal;
+import org.apache.hadoop.hive.serde2.io.DateWritable;
+import org.apache.hadoop.hive.serde2.io.DoubleWritable;
+import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;
+import org.apache.hadoop.hive.serde2.io.ShortWritable;
+import org.apache.hadoop.hive.serde2.io.TimestampWritable;
+
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.Writable;
+
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.util.GenericArrayData;
+
+/**
+ * This is the class to decode dictionary encoded column data back to its original value.
+ */
+public class DictionaryDecodeReadSupport<T> implements CarbonReadSupport<T> {
--- End diff --

I suggest renaming this class to "CarbonDictionaryDecodeReadSupport".

[GitHub] carbondata pull request #1142: [CARBONDATA-1271] Enhanced Performance for Hi...

In reply to this post by qiuchenjian-2

Github user cenyuhai commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1142#discussion_r126298624

--- Diff: integration/hive/src/main/java/org/apache/carbondata/hive/CarbonHiveRecordReader.java ---
@@ -88,17 +90,12 @@ private void initialize(InputSplit inputSplit, Configuration conf) throws IOExce
} catch (QueryExecutionException e) {
throw new IOException(e.getMessage(), e.getCause());
}
- if (valueObj == null) {
- valueObj =
- new ArrayWritable(Writable.class, new Writable[queryModel.getProjectionColumns().length]);
- }
-
final TypeInfo rowTypeInfo;
final List<String> columnNames;
List<TypeInfo> columnTypes;
// Get column names and sort order
final String colIds = conf.get("hive.io.file.readcolumn.ids");
- final String columnNameProperty = conf.get("hive.io.file.readcolumn.names");
+ final String columnNameProperty = conf.get(serdeConstants.LIST_COLUMNS);
--- End diff --

Why change to serdeConstants.LIST_COLUMNS? I originally read columnNameProperty from serdeConstants.LIST_COLUMNS, but I found that the value of serdeConstants.LIST_COLUMNS sometimes contains columns that don't belong to this table.

[GitHub] carbondata issue #1142: [CARBONDATA-1271] Enhanced Performance for Hive Inte...

In reply to this post by qiuchenjian-2

Github user cenyuhai commented on the issue:

https://github.com/apache/carbondata/pull/1142

Can you provide a performance benchmark?

[GitHub] carbondata pull request #1142: [CARBONDATA-1271] Enhanced Performance for Hi...

In reply to this post by qiuchenjian-2

Github user bhavya411 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1142#discussion_r126392125

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/CarbonInputFormat.java ---
@@ -444,9 +444,14 @@ protected Expression getFilterPredicates(Configuration configuration) {
}
}
}
+
+ // For Hive integration if we have to get the stats we have to fetch hive.query.id
+ String query_id = job.getConfiguration().get("query.id") != null ?
+ job.getConfiguration().get("query.id") :
+ job.getConfiguration().get("hive.query.id");
--- End diff --

Hive sets it internally in the job configuration.
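
The lookup order in the quoted diff can be sketched as follows. This is only an illustration, not the code from the PR: a plain Map<String, String> stands in for Hadoop's Configuration, and the names QueryIdLookup and resolveQueryId are hypothetical. It reads "query.id" once rather than twice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; a Map stands in for Hadoop's Configuration (assumption).
final class QueryIdLookup {

  // Prefer "query.id"; fall back to "hive.query.id", which Hive sets itself
  // according to the reply above. Returns null when neither key is present.
  static String resolveQueryId(Map<String, String> conf) {
    String queryId = conf.get("query.id");
    return queryId != null ? queryId : conf.get("hive.query.id");
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("hive.query.id", "demo-hive-query-id"); // made-up value for the demo
    System.out.println(resolveQueryId(conf));
  }
}
```

Callers would still need to handle a null result, since neither key is guaranteed to be set outside Hive.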

[GitHub] carbondata pull request #1142: [CARBONDATA-1271] Enhanced Performance for Hi...

In reply to this post by qiuchenjian-2

Github user bhavya411 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1142#discussion_r126392840

--- Diff: integration/hive/src/main/java/org/apache/carbondata/hive/DictionaryDecodeReadSupport.java ---
@@ -0,0 +1,288 @@
+/**
+ * This is the class to decode dictionary encoded column data back to its original value.
+ */
+public class DictionaryDecodeReadSupport<T> implements CarbonReadSupport<T> {
--- End diff --

Will change the file name to CarbonDictionaryDecodeReadSupport.

[GitHub] carbondata pull request #1142: [CARBONDATA-1271] Enhanced Performance for Hi...

In reply to this post by qiuchenjian-2

Github user bhavya411 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1142#discussion_r126392544

--- Diff: integration/hive/src/main/java/org/apache/carbondata/hive/CarbonHiveRecordReader.java ---
@@ -111,58 +108,46 @@ private void initialize(InputSplit inputSplit, Configuration conf) throws IOExce
} else {
columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
}
+
+ if (valueObj == null) {
+ valueObj = new ArrayWritable(Writable.class, new Writable[columnTypes.size()]);
+ }
--- End diff --

Actually, the data structure should be consistent. Initially we returned only the projected columns, so the ArrayWritable had a variable length, but in both the Parquet and ORC implementations the ArrayWritable length equals the number of columns in the table. The mismatch was causing issues in TPC-H queries, which is why I changed the ArrayWritable size to equal the number of table columns and now populate each value at its respective position.
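
The positional layout described here can be sketched roughly as below. It is an illustration under assumptions, not the code from the PR: plain Object[] arrays stand in for ArrayWritable and Writable[], the projected column ids are passed as a comma-separated string in the style of "hive.io.file.readcolumn.ids", and FullWidthRow and place are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical sketch; Object[] stands in for ArrayWritable/Writable[] (assumption).
final class FullWidthRow {

  // Builds a row with one slot per table column and copies each projected
  // value into the slot named by its column id. Unread columns stay null.
  static Object[] place(int tableColumnCount, String projectedIds, Object[] projectedValues) {
    Object[] row = new Object[tableColumnCount];
    if (projectedIds == null || projectedIds.trim().isEmpty()) {
      return row; // nothing was projected
    }
    String[] ids = projectedIds.split(",");
    if (ids.length != projectedValues.length) {
      throw new IllegalArgumentException("ids and values differ in length");
    }
    for (int i = 0; i < ids.length; i++) {
      row[Integer.parseInt(ids[i].trim())] = projectedValues[i];
    }
    return row;
  }

  public static void main(String[] args) {
    // Table has 4 columns; the query reads columns 0 and 2 only.
    Object[] row = place(4, "0,2", new Object[] {"alice", 42});
    System.out.println(Arrays.toString(row)); // [alice, null, 42, null]
  }
}
```

With this layout the row width no longer depends on the projection, which matches the behaviour the comment attributes to the Parquet and ORC readers.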

[GitHub] carbondata issue #1142: [CARBONDATA-1271] Enhanced Performance for Hive Inte...

In reply to this post by qiuchenjian-2

Github user bhavya411 commented on the issue:

https://github.com/apache/carbondata/pull/1142

@cenyuhai The performance improved a lot. I tested it with 5 million records; please see the attached results:
[Performance.txt](https://github.com/apache/carbondata/files/1135276/Performance.txt)

[GitHub] carbondata issue #1142: [CARBONDATA-1271] Enhanced Performance for Hive Inte...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on the issue:

https://github.com/apache/carbondata/pull/1142

LGTM

[GitHub] carbondata issue #1142: [CARBONDATA-1271] Enhanced Performance for Hive Inte...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on the issue:

https://github.com/apache/carbondata/pull/1142

retest this please

[GitHub] carbondata issue #1142: [CARBONDATA-1271] Enhanced Performance for Hive Inte...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1142

Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/442/

[GitHub] carbondata issue #1142: [CARBONDATA-1271] Enhanced Performance for Hive Inte...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1142

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3030/

[GitHub] carbondata issue #1142: [CARBONDATA-1271] Enhanced Performance for Hive Inte...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1142

Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/443/

[GitHub] carbondata issue #1142: [CARBONDATA-1271] Enhanced Performance for Hive Inte...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1142

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3031/

[GitHub] carbondata issue #1142: [CARBONDATA-1271] Enhanced Performance for Hive Inte...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on the issue:

https://github.com/apache/carbondata/pull/1142

LGTM

[GitHub] carbondata pull request #1142: [CARBONDATA-1271] Enhanced Performance for Hi...

In reply to this post by qiuchenjian-2

Github user asfgit closed the pull request at:

https://github.com/apache/carbondata/pull/1142
