Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] carbondata pull request #2869: [WIP] Changes for improving carbon reader per...

Classic

List

116 messages Options

Options

123456

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2869#discussion_r230299553

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReaderBuilder.java ---
@@ -51,6 +54,7 @@
private Expression filterExpression;
private String tableName;
private Configuration hadoopConf;
+ private boolean useVectorReader;
--- End diff --

ok, And also make sure that complex datatypes should be switched to normal record reader automatically

---

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2869

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9497/

---

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

In reply to this post by qiuchenjian-2

Github user kunal642 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2869#discussion_r230305381

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReaderBuilder.java ---
@@ -158,14 +173,31 @@ public CarbonReaderBuilder withHadoopConf(String key, String value) {
}

try {
- final List<InputSplit> splits =
- format.getSplits(new JobContextImpl(job.getConfiguration(), new JobID()));
-
+ List<InputSplit> splits;
+ if (filterExpression == null) {
+ splits = format.getAllFileSplits(job);
+ } else {
+ splits = format.getSplits(new JobContextImpl(job.getConfiguration(), new JobID()));
+ }
List<RecordReader<Void, T>> readers = new ArrayList<>(splits.size());
for (InputSplit split : splits) {
TaskAttemptContextImpl attempt =
new TaskAttemptContextImpl(job.getConfiguration(), new TaskAttemptID());
- RecordReader reader = format.createRecordReader(split, attempt);
+ RecordReader reader;
+ QueryModel queryModel = format.createQueryModel(split, attempt);
+ boolean hasComplex = false;
+ for (ProjectionDimension projectionDimension : queryModel.getProjectionDimensions()) {
+ if (projectionDimension.getDimension().isComplex()) {
--- End diff --

sure

---

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2869

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1237/

---

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2869

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1239/

---

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2869

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1454/

---

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2869

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9503/

---

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2869

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1246/

---

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2869

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1462/

---

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2869

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9511/

---

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2869

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1248/

---

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

In reply to this post by qiuchenjian-2

Github user manishgupta88 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2869#discussion_r230383031

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java ---
@@ -524,6 +524,28 @@ public DataOutputStream getDataOutputStreamUsingAppend(String path, FileFactory.
return getFiles(listStatus);
}

+ /**
+ * Method used to list files recursively and apply file filter on the result.
+ *
+ */
+ @Override
+
+ public List<CarbonFile> listFiles(Boolean recursive, CarbonFileFilter fileFilter)
--- End diff --

`Boolean` is a wrapper class...can you make it primitive and use

---

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

In reply to this post by qiuchenjian-2

Github user manishgupta88 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2869#discussion_r230384381

--- Diff: docs/sdk-guide.md ---
@@ -442,6 +442,16 @@ public CarbonWriterBuilder withJsonInput(Schema carbonSchema);
public CarbonWriter build() throws IOException, InvalidLoadOptionException;
```

+```
+ /**
+ * Configure Vector Reader for carbonReader.
+ *
+ * @param useVectorReader true will enable vector reader, false will enable record reader.
+ *
+ */
+ public CarbonReaderBuilder withVectorReader(boolean useVectorReader)
--- End diff --

I think the method signature can be only `public CarbonReaderBuilder withVectorReader`...adding parameter to method is not required

---

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

In reply to this post by qiuchenjian-2

Github user manishgupta88 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2869#discussion_r230384734

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/CarbonRecordReader.java ---
@@ -78,6 +81,15 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext context)
List<CarbonInputSplit> splitList;
if (inputSplit instanceof CarbonInputSplit) {
splitList = new ArrayList<>(1);
+ String splitPath = ((CarbonInputSplit) inputSplit).getPath().toString();
+ if (((CarbonInputSplit) inputSplit).getDetailInfo().getBlockFooterOffset() == 0L) {
--- End diff --

Add a comment to explain the scenario when blockFooterOffset will 0L

---

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

In reply to this post by qiuchenjian-2

Github user manishgupta88 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2869#discussion_r230387678

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/util/CarbonVectorizedRecordReader.java ---
@@ -0,0 +1,211 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.hadoop.util;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.CarbonV3DataFormatConstants;
+import org.apache.carbondata.core.datastore.FileReader;
+import org.apache.carbondata.core.datastore.block.TableBlockInfo;
+import org.apache.carbondata.core.datastore.impl.FileFactory;
+import org.apache.carbondata.core.metadata.datatype.DataType;
+import org.apache.carbondata.core.metadata.datatype.DataTypes;
+import org.apache.carbondata.core.metadata.datatype.DecimalType;
+import org.apache.carbondata.core.metadata.datatype.StructField;
+import org.apache.carbondata.core.scan.executor.QueryExecutor;
+import org.apache.carbondata.core.scan.executor.QueryExecutorFactory;
+import org.apache.carbondata.core.scan.executor.exception.QueryExecutionException;
+import org.apache.carbondata.core.scan.model.ProjectionDimension;
+import org.apache.carbondata.core.scan.model.ProjectionMeasure;
+import org.apache.carbondata.core.scan.model.QueryModel;
+import org.apache.carbondata.core.scan.result.iterator.AbstractDetailQueryResultIterator;
+import org.apache.carbondata.core.scan.result.vector.CarbonColumnVector;
+import org.apache.carbondata.core.scan.result.vector.CarbonColumnarBatch;
+import org.apache.carbondata.core.scan.result.vector.impl.CarbonColumnVectorImpl;
+import org.apache.carbondata.core.util.ByteUtil;
+import org.apache.carbondata.hadoop.AbstractRecordReader;
+import org.apache.carbondata.hadoop.CarbonInputSplit;
+
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.log4j.Logger;
+
+/**
+ * A specialized RecordReader that reads into CarbonColumnarBatches directly using the
+ * carbondata column APIs and fills the data directly into columns.
+ */
+public class CarbonVectorizedRecordReader extends AbstractRecordReader<Object> {
--- End diff --

If feasible try to merge the common code from `CarbonVectorizedRecordReader` and `VectorizedCarbonRecordReader`. This will be helpful for maintainability of code

---

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

In reply to this post by qiuchenjian-2

Github user manishgupta88 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2869#discussion_r230389069

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReaderBuilder.java ---
@@ -158,14 +172,31 @@ public CarbonReaderBuilder withHadoopConf(String key, String value) {
}

try {
- final List<InputSplit> splits =
- format.getSplits(new JobContextImpl(job.getConfiguration(), new JobID()));

+ if (filterExpression == null) {
+ job.getConfiguration().set("filter_blocks", "false");
+ }
+ List<InputSplit> splits =
+ format.getSplits(new JobContextImpl(job.getConfiguration(), new JobID()));
List<RecordReader<Void, T>> readers = new ArrayList<>(splits.size());
for (InputSplit split : splits) {
TaskAttemptContextImpl attempt =
new TaskAttemptContextImpl(job.getConfiguration(), new TaskAttemptID());
- RecordReader reader = format.createRecordReader(split, attempt);
+ RecordReader reader;
+ QueryModel queryModel = format.createQueryModel(split, attempt);
+ boolean hasComplex = false;
+ for (ProjectionDimension projectionDimension : queryModel.getProjectionDimensions()) {
+ if (projectionDimension.getDimension().isComplex()) {
+ hasComplex = true;
+ break;
+ }
+ }
+ if (useVectorReader && !hasComplex) {
--- End diff --

as a test scenario..test a query which has a schema of more than 100 columns and see if it works fine

---

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

In reply to this post by qiuchenjian-2

Github user manishgupta88 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2869#discussion_r230383113

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java ---
@@ -524,6 +524,28 @@ public DataOutputStream getDataOutputStreamUsingAppend(String path, FileFactory.
return getFiles(listStatus);
}

+ /**
+ * Method used to list files recursively and apply file filter on the result.
+ *
+ */
+ @Override
+
--- End diff --

Remove this extra line

---

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

In reply to this post by qiuchenjian-2

Github user manishgupta88 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2869#discussion_r230388489

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReaderBuilder.java ---
@@ -118,6 +122,16 @@ public CarbonReaderBuilder withHadoopConf(String key, String value) {
return this;
}

+ /**
+ * Configure Vector Reader for carbonReader.
+ *
+ */
+ @Deprecated
+ public CarbonReaderBuilder withVectorReader(boolean useVectorReader) {
+ this.useVectorReader = useVectorReader;
+ return this;
--- End diff --

As per the comment given earlier modify the method and set the flag to true when this method is invoked

---

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

In reply to this post by qiuchenjian-2

Github user manishgupta88 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2869#discussion_r230386178

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonFileInputFormat.java ---
@@ -145,9 +154,30 @@ public CarbonTable getOrCreateCarbonTable(Configuration configuration) throws IO
externalTableSegments.add(seg);
}
}
- // do block filtering and get split
- List<InputSplit> splits =
- getSplits(job, filter, externalTableSegments, null, partitionInfo, null);
+ List<InputSplit> splits = new ArrayList<>();
+ boolean useBlockDataMap = job.getConfiguration().getBoolean("filter_blocks", true);
+ if (useBlockDataMap) {
+ // do block filtering and get split
+ splits = getSplits(job, filter, externalTableSegments, null, partitionInfo, null);
+ } else {
+ for (CarbonFile carbonFile : getAllCarbonDataFiles(carbonTable.getTablePath())) {
+ CarbonInputSplit split =
+ new CarbonInputSplit("null", new Path(carbonFile.getAbsolutePath()), 0,
+ carbonFile.getLength(), carbonFile.getLocations(), FileFormat.COLUMNAR_V3);
--- End diff --

Add a comment why "null" is passed. Better to use it from a constant. Also explain the scenarios when if and else block will be executed

---

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2869

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9513/

---

123456