[GitHub] carbondata pull request #2869: [WIP] Changes for improving carbon reader per...

classic Classic list List threaded Threaded
116 messages Options
123456
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2869#discussion_r230299553
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReaderBuilder.java ---
    @@ -51,6 +54,7 @@
       private Expression filterExpression;
       private String tableName;
       private Configuration hadoopConf;
    +  private boolean useVectorReader;
    --- End diff --
   
    ok, And also make sure that complex datatypes should be switched to normal record reader automatically


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2869
 
    Build Failed  with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9497/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kunal642 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2869#discussion_r230305381
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReaderBuilder.java ---
    @@ -158,14 +173,31 @@ public CarbonReaderBuilder withHadoopConf(String key, String value) {
         }
     
         try {
    -      final List<InputSplit> splits =
    -          format.getSplits(new JobContextImpl(job.getConfiguration(), new JobID()));
    -
    +      List<InputSplit> splits;
    +      if (filterExpression == null) {
    +        splits = format.getAllFileSplits(job);
    +      } else {
    +        splits = format.getSplits(new JobContextImpl(job.getConfiguration(), new JobID()));
    +      }
           List<RecordReader<Void, T>> readers = new ArrayList<>(splits.size());
           for (InputSplit split : splits) {
             TaskAttemptContextImpl attempt =
                 new TaskAttemptContextImpl(job.getConfiguration(), new TaskAttemptID());
    -        RecordReader reader = format.createRecordReader(split, attempt);
    +        RecordReader reader;
    +        QueryModel queryModel = format.createQueryModel(split, attempt);
    +        boolean hasComplex = false;
    +        for (ProjectionDimension projectionDimension : queryModel.getProjectionDimensions()) {
    +          if (projectionDimension.getDimension().isComplex()) {
    --- End diff --
   
    sure


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2869
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1237/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2869
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1239/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2869
 
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1454/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2869
 
    Build Failed  with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9503/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2869
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1246/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2869
 
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1462/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2869
 
    Build Failed  with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9511/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2869
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1248/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2869#discussion_r230383031
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java ---
    @@ -524,6 +524,28 @@ public DataOutputStream getDataOutputStreamUsingAppend(String path, FileFactory.
         return getFiles(listStatus);
       }
     
    +  /**
    +   * Method used to list files recursively and apply file filter on the result.
    +   *
    +   */
    +  @Override
    +
    +  public List<CarbonFile> listFiles(Boolean recursive, CarbonFileFilter fileFilter)
    --- End diff --
   
    `Boolean` is a wrapper class...can you make it primitive and use


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2869#discussion_r230384381
 
    --- Diff: docs/sdk-guide.md ---
    @@ -442,6 +442,16 @@ public CarbonWriterBuilder withJsonInput(Schema carbonSchema);
     public CarbonWriter build() throws IOException, InvalidLoadOptionException;
     ```
     
    +```
    + /**
    +   * Configure Vector Reader for carbonReader.
    +   *
    +   * @param useVectorReader true will enable vector reader, false will enable record reader.
    +   *
    +   */
    +  public CarbonReaderBuilder withVectorReader(boolean useVectorReader)
    --- End diff --
   
    I think the method signature can be only `public CarbonReaderBuilder withVectorReader`...adding parameter to method is not required


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2869#discussion_r230384734
 
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/CarbonRecordReader.java ---
    @@ -78,6 +81,15 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext context)
         List<CarbonInputSplit> splitList;
         if (inputSplit instanceof CarbonInputSplit) {
           splitList = new ArrayList<>(1);
    +      String splitPath = ((CarbonInputSplit) inputSplit).getPath().toString();
    +      if (((CarbonInputSplit) inputSplit).getDetailInfo().getBlockFooterOffset() == 0L) {
    --- End diff --
   
    Add a comment to explain the scenario when blockFooterOffset will  0L


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2869#discussion_r230387678
 
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/util/CarbonVectorizedRecordReader.java ---
    @@ -0,0 +1,211 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.util;
    +
    +import java.io.IOException;
    +import java.nio.ByteBuffer;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.carbondata.common.logging.LogServiceFactory;
    +import org.apache.carbondata.core.constants.CarbonV3DataFormatConstants;
    +import org.apache.carbondata.core.datastore.FileReader;
    +import org.apache.carbondata.core.datastore.block.TableBlockInfo;
    +import org.apache.carbondata.core.datastore.impl.FileFactory;
    +import org.apache.carbondata.core.metadata.datatype.DataType;
    +import org.apache.carbondata.core.metadata.datatype.DataTypes;
    +import org.apache.carbondata.core.metadata.datatype.DecimalType;
    +import org.apache.carbondata.core.metadata.datatype.StructField;
    +import org.apache.carbondata.core.scan.executor.QueryExecutor;
    +import org.apache.carbondata.core.scan.executor.QueryExecutorFactory;
    +import org.apache.carbondata.core.scan.executor.exception.QueryExecutionException;
    +import org.apache.carbondata.core.scan.model.ProjectionDimension;
    +import org.apache.carbondata.core.scan.model.ProjectionMeasure;
    +import org.apache.carbondata.core.scan.model.QueryModel;
    +import org.apache.carbondata.core.scan.result.iterator.AbstractDetailQueryResultIterator;
    +import org.apache.carbondata.core.scan.result.vector.CarbonColumnVector;
    +import org.apache.carbondata.core.scan.result.vector.CarbonColumnarBatch;
    +import org.apache.carbondata.core.scan.result.vector.impl.CarbonColumnVectorImpl;
    +import org.apache.carbondata.core.util.ByteUtil;
    +import org.apache.carbondata.hadoop.AbstractRecordReader;
    +import org.apache.carbondata.hadoop.CarbonInputSplit;
    +
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.log4j.Logger;
    +
    +/**
    + * A specialized RecordReader that reads into CarbonColumnarBatches directly using the
    + * carbondata column APIs and fills the data directly into columns.
    + */
    +public class CarbonVectorizedRecordReader extends AbstractRecordReader<Object> {
    --- End diff --
   
    If feasible try to merge the common code from `CarbonVectorizedRecordReader` and `VectorizedCarbonRecordReader`. This will be helpful for maintainability of code


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2869#discussion_r230389069
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReaderBuilder.java ---
    @@ -158,14 +172,31 @@ public CarbonReaderBuilder withHadoopConf(String key, String value) {
         }
     
         try {
    -      final List<InputSplit> splits =
    -          format.getSplits(new JobContextImpl(job.getConfiguration(), new JobID()));
     
    +      if (filterExpression == null) {
    +        job.getConfiguration().set("filter_blocks", "false");
    +      }
    +      List<InputSplit> splits =
    +          format.getSplits(new JobContextImpl(job.getConfiguration(), new JobID()));
           List<RecordReader<Void, T>> readers = new ArrayList<>(splits.size());
           for (InputSplit split : splits) {
             TaskAttemptContextImpl attempt =
                 new TaskAttemptContextImpl(job.getConfiguration(), new TaskAttemptID());
    -        RecordReader reader = format.createRecordReader(split, attempt);
    +        RecordReader reader;
    +        QueryModel queryModel = format.createQueryModel(split, attempt);
    +        boolean hasComplex = false;
    +        for (ProjectionDimension projectionDimension : queryModel.getProjectionDimensions()) {
    +          if (projectionDimension.getDimension().isComplex()) {
    +            hasComplex = true;
    +            break;
    +          }
    +        }
    +        if (useVectorReader && !hasComplex) {
    --- End diff --
   
    as a test scenario..test a query which has a schema of more than 100 columns and see if it works fine


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2869#discussion_r230383113
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AbstractDFSCarbonFile.java ---
    @@ -524,6 +524,28 @@ public DataOutputStream getDataOutputStreamUsingAppend(String path, FileFactory.
         return getFiles(listStatus);
       }
     
    +  /**
    +   * Method used to list files recursively and apply file filter on the result.
    +   *
    +   */
    +  @Override
    +
    --- End diff --
   
    Remove this extra line


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2869#discussion_r230388489
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonReaderBuilder.java ---
    @@ -118,6 +122,16 @@ public CarbonReaderBuilder withHadoopConf(String key, String value) {
         return this;
       }
     
    +  /**
    +   * Configure Vector Reader for carbonReader.
    +   *
    +   */
    +  @Deprecated
    +  public CarbonReaderBuilder withVectorReader(boolean useVectorReader) {
    +    this.useVectorReader = useVectorReader;
    +    return this;
    --- End diff --
   
    As per the comment given earlier modify the method and set the flag to true when this method is invoked


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2869: [CARBONDATA-3057] Implement VectorizedReader ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2869#discussion_r230386178
 
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonFileInputFormat.java ---
    @@ -145,9 +154,30 @@ public CarbonTable getOrCreateCarbonTable(Configuration configuration) throws IO
               externalTableSegments.add(seg);
             }
           }
    -      // do block filtering and get split
    -      List<InputSplit> splits =
    -          getSplits(job, filter, externalTableSegments, null, partitionInfo, null);
    +      List<InputSplit> splits = new ArrayList<>();
    +      boolean useBlockDataMap = job.getConfiguration().getBoolean("filter_blocks", true);
    +      if (useBlockDataMap) {
    +        // do block filtering and get split
    +        splits = getSplits(job, filter, externalTableSegments, null, partitionInfo, null);
    +      } else {
    +        for (CarbonFile carbonFile : getAllCarbonDataFiles(carbonTable.getTablePath())) {
    +          CarbonInputSplit split =
    +              new CarbonInputSplit("null", new Path(carbonFile.getAbsolutePath()), 0,
    +                  carbonFile.getLength(), carbonFile.getLocations(), FileFormat.COLUMNAR_V3);
    --- End diff --
   
    Add a comment why "null" is passed. Better to use it from a constant. Also explain the scenarios when if and else block will be executed


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2869: [CARBONDATA-3057] Implement VectorizedReader for SDK...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2869
 
    Build Failed  with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9513/



---
123456