GitHub user ravipesala opened a pull request:
    https://github.com/apache/carbondata/pull/1179

[WIP] Added the blocklet info to index file and make the datamap distributable with job

In this PR the following tasks are completed:

1. Added the blocklet info to the carbonindex file, so the datamap no longer needs to read each carbondata file footer to get the blocklet information. This makes datamap loading faster.
2. Made the datamap distributable and added the Spark job, so datamap pruning can run in a distributed fashion and the pruned blocklet list is sent back to the driver.

This PR cannot compile yet, as it depends on carbondata format changes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/incubator-carbondata datamap

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1179.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1179

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA.
---
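The distributed pruning in point 2 can be sketched in simplified form. The classes and method names below are hypothetical stand-ins for illustration, not the actual CarbonData API: the driver turns each segment's datamap into a distributable unit, executor tasks prune blocklets against the pushed-down filter, and the driver merges the pruned lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Simplified, self-contained sketch of distributable datamap pruning.
// All types here are illustrative stand-ins, not CarbonData classes.
public class DistributedPruningSketch {

  // A pruned unit of data, identified by carbondata file path and blocklet id.
  static class Blocklet {
    final String filePath;
    final int blockletId;
    Blocklet(String filePath, int blockletId) {
      this.filePath = filePath;
      this.blockletId = blockletId;
    }
  }

  // One distributable unit: the blocklet index of a single segment.
  static class Distributable {
    final String segmentId;
    final List<Blocklet> blocklets;
    Distributable(String segmentId, List<Blocklet> blocklets) {
      this.segmentId = segmentId;
      this.blocklets = blocklets;
    }
  }

  // Stands in for the work one executor task does: prune one segment's
  // blocklets against the filter shipped from the driver.
  static List<Blocklet> pruneTask(Distributable unit, Predicate<Blocklet> filter) {
    List<Blocklet> pruned = new ArrayList<>();
    for (Blocklet b : unit.blocklets) {
      if (filter.test(b)) {
        pruned.add(b);
      }
    }
    return pruned;
  }

  // Driver side: one task per distributable (run in parallel by the real
  // Spark job), then the pruned blocklet lists are merged at the driver.
  static List<Blocklet> distributedPrune(List<Distributable> units, Predicate<Blocklet> filter) {
    List<Blocklet> merged = new ArrayList<>();
    for (Distributable unit : units) {
      merged.addAll(pruneTask(unit, filter));
    }
    return merged;
  }
}
```

The real job serializes the filter into the Hadoop configuration and runs each per-distributable prune as a Spark task; the sketch only shows the shape of the data flow.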
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/1179

Build Failed with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/504/
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/1179

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3094/
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/1179

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3218/
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/1179

Build Failed with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/622/
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/1179

SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/38/
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/1179

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3391/
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/1179

Build Failed with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/794/
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/1179

SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/121/
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/1179

SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/122/
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/1179

Build Failed with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/795/
Github user CarbonDataQA commented on the issue:
    https://github.com/apache/carbondata/pull/1179

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3393/
Github user jackylk commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/1179#discussion_r131701680

--- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/DataMapDistributable.java ---

    @@ -53,4 +63,27 @@
       public void setDataMapName(String dataMapName) {
         this.dataMapName = dataMapName;
       }
    +
    +  public void setLocations(String[] locations) {
    +    this.locations = locations;
    +  }
    +
    +  public DataMapType getDataMapType() {
    +    return dataMapType;
    +  }
    +
    +  public void setDataMapType(DataMapType dataMapType) {
    +    this.dataMapType = dataMapType;
    +  }
    +
    +  @Override public String[] getLocations() throws IOException {

--- End diff --

move `@Override` to the previous line
Github user jackylk commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/1179#discussion_r131702068

--- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/DataMapStoreManager.java ---

    @@ -79,15 +79,15 @@ private TableDataMap createTableDataMap(AbsoluteTableIdentifier identifier,
           tableDataMaps = new ArrayList<>();
           dataMapMappping.put(identifier, tableDataMaps);
         }
    -    TableDataMap dataMap = getAbstractTableDataMap(dataMapName, tableDataMaps);
    +    TableDataMap dataMap = getTableDataMap(dataMapName, tableDataMaps);
         if (dataMap != null) {
           throw new RuntimeException("Already datamap exists in that path with type " + mapType);
         }
         try {
           DataMapFactory dataMapFactory = mapType.getClassObject().newInstance();
           dataMapFactory.init(identifier, dataMapName);
    -      dataMap = new TableDataMap(identifier, dataMapName, dataMapFactory);
    +      dataMap = new TableDataMap(identifier, dataMapName, mapType, dataMapFactory);

--- End diff --

rename `mapType` to `dataMapType`
Github user jackylk commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/1179#discussion_r131703180

--- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMap.java ---

    @@ -88,11 +87,14 @@
       private int[] columnCardinality;

    +  private String indexFilePath;
    +
       @Override public DataMapWriter getWriter() {
         return null;
       }

       @Override public void init(String path) {

--- End diff --

suggest changing `path` to `filePath`
Github user jackylk commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/1179#discussion_r131703475

--- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapDistributable.java ---

    @@ -0,0 +1,35 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.carbondata.core.indexstore.blockletindex;
    +
    +import org.apache.carbondata.core.indexstore.DataMapDistributable;
    +
    +/**
    + * Blocklet datamap distributable
    + */
    +public class BlockletDataMapDistributable extends DataMapDistributable {
    +
    +  private String indexFileName;

--- End diff --

path or name?
Github user jackylk commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/1179#discussion_r131705354

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/DistributableDataMapFormat.java ---

    @@ -0,0 +1,135 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.carbondata.hadoop.api;
    +
    +import java.io.IOException;
    +import java.io.Serializable;
    +import java.util.ArrayList;
    +import java.util.Iterator;
    +import java.util.List;
    +
    +import org.apache.carbondata.core.indexstore.Blocklet;
    +import org.apache.carbondata.core.indexstore.DataMapDistributable;
    +import org.apache.carbondata.core.indexstore.DataMapStoreManager;
    +import org.apache.carbondata.core.indexstore.DataMapType;
    +import org.apache.carbondata.core.indexstore.TableDataMap;
    +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier;
    +import org.apache.carbondata.core.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.carbondata.hadoop.util.ObjectSerializationUtil;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    +
    +/**
    + * Input format for datamaps, it makes the datamap pruning distributable.
    + */
    +public class DistributableDataMapFormat extends FileInputFormat<Void, Blocklet> implements
    +    Serializable {
    +
    +  private static final String FILTER_EXP = "mapreduce.input.distributed.datamap.filter";
    +
    +  private AbsoluteTableIdentifier identifier;
    +
    +  private String dataMapName;
    +
    +  private DataMapType dataMapType;
    +
    +  private List<String> validSegments;
    +
    +  public DistributableDataMapFormat(AbsoluteTableIdentifier identifier, String dataMapName,
    +      DataMapType dataMapType, List<String> validSegments) {
    +    this.identifier = identifier;
    +    this.dataMapName = dataMapName;
    +    this.dataMapType = dataMapType;
    +    this.validSegments = validSegments;
    +  }
    +
    +  public static void setFilterExp(Configuration configuration, FilterResolverIntf filterExp)
    +      throws IOException {
    +    if (filterExp != null) {
    +      String string = ObjectSerializationUtil.convertObjectToString(filterExp);
    +      configuration.set(FILTER_EXP, string);
    +    }
    +  }
    +
    +  private static FilterResolverIntf getFilterExp(Configuration configuration) throws IOException {
    +    String filterString = configuration.get(FILTER_EXP);
    +    if (filterString != null) {
    +      Object toObject = ObjectSerializationUtil.convertStringToObject(filterString);
    +      return (FilterResolverIntf) toObject;
    +    }
    +    return null;
    +  }
    +
    +  @Override public List<InputSplit> getSplits(JobContext job) throws IOException {
    +    TableDataMap dataMap =
    +        DataMapStoreManager.getInstance().getDataMap(identifier, dataMapName, dataMapType);
    +    List<DataMapDistributable> distributables = dataMap.toDistributable(validSegments);

--- End diff --

Can we first filter the segments based on min/max? If there are many segments, even the map reduce job will take time.
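The reviewer's suggestion above, filtering segments on min/max before launching the job, could look roughly like this. This is a hypothetical sketch, not CarbonData code: the driver keeps cheap segment-level min/max statistics for the filter column and only builds distributables for segments whose value range can overlap the filter's range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of driver-side segment-level min/max pre-filtering,
// done before any distributed pruning job is launched.
public class SegmentMinMaxFilter {

  // Per-segment statistics for the filter column (illustrative type).
  static class SegmentStats {
    final String segmentId;
    final long min;
    final long max;
    SegmentStats(String segmentId, long min, long max) {
      this.segmentId = segmentId;
      this.min = min;
      this.max = max;
    }
  }

  // Keep only segments whose [min, max] range can overlap the filter's
  // [filterMin, filterMax] range; the rest never reach the Spark job.
  static List<SegmentStats> preFilter(List<SegmentStats> segments, long filterMin, long filterMax) {
    List<SegmentStats> survivors = new ArrayList<>();
    for (SegmentStats s : segments) {
      if (s.max >= filterMin && s.min <= filterMax) {
        survivors.add(s);
      }
    }
    return survivors;
  }
}
```

With many segments, a cheap check like this keeps the number of tasks in the distributed prune job proportional to the segments that can actually match, instead of the total segment count.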
Github user jackylk commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/1179#discussion_r131705534

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/spark/rdd/CarbonScanRDD.scala ---

    @@ -249,6 +249,8 @@ class CarbonScanRDD(
       private def prepareInputFormatForDriver(conf: Configuration): CarbonTableInputFormat[Object] = {
         CarbonTableInputFormat.setTableInfo(conf, tableInfo)
    +    // to make the datamap pruning distributable
    +    // CarbonTableInputFormat.setDataMapJob(conf, new SparkDataMapJob)

--- End diff --

Is this required?
Github user jackylk commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/1179#discussion_r131706206

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/spark/rdd/SparkDataMapJob.scala ---

    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.carbondata.spark.rdd
    +
    +import java.text.SimpleDateFormat
    +import java.util
    +import java.util.Date
    +
    +import scala.collection.JavaConverters._
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.mapreduce.{InputSplit, Job, TaskAttemptID, TaskType}
    +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    +import org.apache.spark.{Partition, SparkContext, TaskContext, TaskKilledException}
    +
    +import org.apache.carbondata.common.logging.LogServiceFactory
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.indexstore.Blocklet
    +import org.apache.carbondata.core.scan.filter.resolver.FilterResolverIntf
    +import org.apache.carbondata.hadoop.api.{DataMapJob, DistributableDataMapFormat}
    +
    +/**
    + * Spark job to execute datamap job and prune all the datamaps distributable
    + */
    +class SparkDataMapJob extends DataMapJob {
    +
    +  override def execute(dataMapFormat: DistributableDataMapFormat,
    +      resolverIntf: FilterResolverIntf): util.List[Blocklet] = {
    +    new SparkDataMapRDD(SparkContext.getOrCreate(), dataMapFormat, resolverIntf).collect().toList
    +      .asJava
    +  }
    +}
    +
    +class SparkDataMapPartition(rddId: Int, idx: Int, val inputSplit: InputSplit) extends Partition {
    +  override def index: Int = idx
    +
    +  override def hashCode(): Int = 41 * (41 + rddId) + idx
    +}
    +
    +class SparkDataMapRDD(sc: SparkContext,
    +    dataMapFormat: DistributableDataMapFormat,
    +    resolverIntf: FilterResolverIntf)
    +  extends CarbonRDD[(Blocklet)](sc, Nil) {
    +
    +  private val jobTrackerId: String = {
    +    val formatter = new SimpleDateFormat("yyyyMMddHHmm")
    +    formatter.format(new Date())
    +  }
    +
    +  override def internalCompute(split: Partition,

--- End diff --

please add some comments
Github user jackylk commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/1179#discussion_r131706507

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/spark/rdd/SparkDataMapJob.scala ---

    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.carbondata.spark.rdd
    +
    +import java.text.SimpleDateFormat
    +import java.util
    +import java.util.Date
    +
    +import scala.collection.JavaConverters._
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.mapreduce.{InputSplit, Job, TaskAttemptID, TaskType}
    +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    +import org.apache.spark.{Partition, SparkContext, TaskContext, TaskKilledException}
    +
    +import org.apache.carbondata.common.logging.LogServiceFactory
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.indexstore.Blocklet
    +import org.apache.carbondata.core.scan.filter.resolver.FilterResolverIntf
    +import org.apache.carbondata.hadoop.api.{DataMapJob, DistributableDataMapFormat}
    +
    +/**
    + * Spark job to execute datamap job and prune all the datamaps distributable
    + */
    +class SparkDataMapJob extends DataMapJob {
    +
    +

--- End diff --

remove this