[GitHub] carbondata pull request #2683: [WIP] Print data folder information


[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/110/



---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/278/



---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Failed  with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8348/



---

[GitHub] carbondata pull request #2683: [CARBONDATA-2916] Add CarbonCli tool for data...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user chenliang613 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2683#discussion_r215669915
 
    --- Diff: pom.xml ---
    @@ -706,6 +706,12 @@
             <module>datamap/mv/core</module>
           </modules>
         </profile>
    +    <profile>
    +      <id>tool</id>
    --- End diff --
   
    Suggest using "tools" as the profile id.


---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    @jackylk It would be better to create another folder under tools.


---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    ```
    ## Summary
    1 blocks, 1 shards, 1 blocklets, 9 pages, 259,304 rows, 9.95MB
   
    ## Column Statistics (column 'L_DISCOUNT')
    Shard #1 (72636812283890_batchno0-0-null-1536219825841)
    BLK  BLKLT  Meta Size  Data Size  Card           Min/Max range (total width is 80 characters)                                      
    0    0      1.06KB     9.75MB     2,147,483,647  --------------------------------------------------------------------------------  
    ```
    1. In the output above, my actual file size is 10.4 MB, but the summary shows only 9.95 MB.
    2. The column statistics for L_DISCOUNT show 9.95 MB. Does that mean the tool is not showing the size of that column alone?
    3. What is `Card` here? Is it cardinality? It does not make sense to print Integer max, I guess.
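
One possible reading of point 1, offered only as an assumption since the thread does not say how either size was measured: if the tool labels binary units (1 MiB = 1,048,576 bytes) as "MB" while the file browser reports decimal megabytes (1 MB = 1,000,000 bytes), the two figures describe roughly the same byte count. The class and method names below are hypothetical and are not part of CarbonCli.

```java
// Hypothetical helper: converts a size expressed in binary mebibytes
// into decimal megabytes. Not taken from the pull request.
final class SizeUnits {
  static final double BYTES_PER_MIB = 1024d * 1024d;
  static final double BYTES_PER_MB = 1000d * 1000d;

  // Assumption: the input was computed as bytes / (1024 * 1024).
  static double mibToMb(double mib) {
    return mib * BYTES_PER_MIB / BYTES_PER_MB;
  }
}
```

Under that assumption, `SizeUnits.mibToMb(9.95)` comes out at about 10.43, which is close to the 10.4 MB reported for the file. Whether this is the actual cause would need to be checked against the tool's formatting code.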
   



---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    @jackylk It would be better to print whether local_dictionary is enabled in the schema.
    Also, if possible, please print the local dictionary size of each column in the column details.


---

[GitHub] carbondata pull request #2683: [CARBONDATA-2916] Add CarbonCli tool for data...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2683#discussion_r216316121
 
    --- Diff: pom.xml ---
    @@ -706,6 +706,12 @@
             <module>datamap/mv/core</module>
           </modules>
         </profile>
    +    <profile>
    +      <id>tool</id>
    --- End diff --
   
    ok, fixed


---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/209/



---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Failed  with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8448/



---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/378/



---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/235/



---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Failed  with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8475/



---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/405/



---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/236/



---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/239/



---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Failed  with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8478/



---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2683
 
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/408/



---

[GitHub] carbondata pull request #2683: [CARBONDATA-2916] Add CarbonCli tool for data...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2683#discussion_r216968250
 
    --- Diff: tools/cli/src/main/java/org/apache/carbondata/tool/DataSummary.java ---
    @@ -0,0 +1,360 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.tool;
    +
    +import java.io.IOException;
    +import java.io.PrintStream;
    +import java.nio.charset.Charset;
    +import java.util.ArrayList;
    +import java.util.Collection;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.LinkedList;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Set;
    +
    +import org.apache.carbondata.common.Strings;
    +import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
    +import org.apache.carbondata.core.datastore.impl.FileFactory;
    +import org.apache.carbondata.core.memory.MemoryException;
    +import org.apache.carbondata.core.metadata.datatype.DataTypes;
    +import org.apache.carbondata.core.metadata.schema.table.column.ColumnSchema;
    +import org.apache.carbondata.core.reader.CarbonHeaderReader;
    +import org.apache.carbondata.core.statusmanager.LoadMetadataDetails;
    +import org.apache.carbondata.core.statusmanager.SegmentStatusManager;
    +import org.apache.carbondata.core.util.CarbonUtil;
    +import org.apache.carbondata.core.util.path.CarbonTablePath;
    +import org.apache.carbondata.format.BlockletInfo3;
    +import org.apache.carbondata.format.FileFooter3;
    +import org.apache.carbondata.format.FileHeader;
    +import org.apache.carbondata.format.TableInfo;
    +
    +import static org.apache.carbondata.core.constants.CarbonCommonConstants.DEFAULT_CHARSET;
    +
    +/**
    + * Data Summary command implementation for {@link CarbonCli}
    + */
    +class DataSummary {
    +  private String dataFolder;
    +  private PrintStream out;
    +
    +  private long numBlock;
    +  private long numShard;
    +  private long numBlocklet;
    +  private long numPage;
    +  private long numRow;
    +  private long totalDataSize;
    +
    +  // file path mapping to file object
    +  private LinkedHashMap<String, DataFile> dataFiles = new LinkedHashMap<>();
    +  private CarbonFile tableStatusFile;
    +  private CarbonFile schemaFile;
    +
    +  DataSummary(String dataFolder, PrintStream out) throws IOException {
    +    this.dataFolder = dataFolder;
    +    this.out = out;
    +    collectDataFiles();
    +  }
    +
    +  private boolean isColumnarFile(String fileName) {
    +    // if the timestamp in file name is "0", it is a streaming file
    +    return fileName.endsWith(CarbonTablePath.CARBON_DATA_EXT) &&
    +        !CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName).equals("0");
    +  }
    +
    +  private boolean isStreamFile(String fileName) {
    +    // if the timestamp in file name is "0", it is a streaming file
    +    return fileName.endsWith(CarbonTablePath.CARBON_DATA_EXT) &&
    +        CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName).equals("0");
    +  }
    +
    +  private void collectDataFiles() throws IOException {
    +    Set<String> shards = new HashSet<>();
    +    CarbonFile folder = FileFactory.getCarbonFile(dataFolder);
    +    List<CarbonFile> files = folder.listFiles(true);
    +    List<DataFile> unsortedFiles = new ArrayList<>();
    +    for (CarbonFile file : files) {
    +      if (isColumnarFile(file.getName())) {
    +        DataFile dataFile = new DataFile(file);
    +        unsortedFiles.add(dataFile);
    +        collectNum(dataFile.getFooter());
    +        shards.add(dataFile.getShardName());
    +        totalDataSize += file.getSize();
    +      } else if (file.getName().endsWith(CarbonTablePath.TABLE_STATUS_FILE)) {
    +        tableStatusFile = file;
    +      } else if (file.getName().startsWith(CarbonTablePath.SCHEMA_FILE)) {
    +        schemaFile = file;
    +      } else if (isStreamFile(file.getName())) {
    +        out.println("WARN: input path contains streaming file, this tool does not support it yet, "
    +            + "skipping it...");
    +      }
    +    }
    +    unsortedFiles.sort((o1, o2) -> {
    +      if (o1.getShardName().equalsIgnoreCase(o2.getShardName())) {
    +        return Integer.parseInt(o1.getPartNo()) - Integer.parseInt(o2.getPartNo());
    +      } else {
    +        return o1.getShardName().hashCode() - o2.getShardName().hashCode();
    --- End diff --
   
    Why not sort directly by the alphabetical order of the shardName?
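
A minimal sketch of what this suggestion could look like, assuming a stand-in class with only the two accessors used by the comparator in the diff. `ShardFile` and `ShardFileOrder` are hypothetical names, not classes from the pull request. Ordering by the shard name itself also avoids subtracting two `hashCode()` values, which can overflow `int` and gives an order unrelated to the names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the PR's DataFile; only the accessors
// needed for ordering are modelled here.
final class ShardFile {
  private final String shardName;
  private final String partNo;

  ShardFile(String shardName, String partNo) {
    this.shardName = shardName;
    this.partNo = partNo;
  }

  String getShardName() {
    return shardName;
  }

  String getPartNo() {
    return partNo;
  }
}

final class ShardFileOrder {
  // Natural String order on the shard name first, then the part number
  // compared as an integer so that "9" sorts before "10".
  static final Comparator<ShardFile> BY_SHARD_THEN_PART =
      Comparator.comparing(ShardFile::getShardName)
          .thenComparingInt(f -> Integer.parseInt(f.getPartNo()));

  // Returns a sorted copy and leaves the input list untouched.
  static List<ShardFile> sorted(List<ShardFile> files) {
    List<ShardFile> copy = new ArrayList<>(files);
    copy.sort(BY_SHARD_THEN_PART);
    return copy;
  }
}
```

Note that the diff treats shard names as equal with `equalsIgnoreCase`, while this sketch compares them case-sensitively; `String.CASE_INSENSITIVE_ORDER` could be passed to `Comparator.comparing` if case-insensitive grouping is intended.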


---

[GitHub] carbondata pull request #2683: [CARBONDATA-2916] Add CarbonCli tool for data...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2683#discussion_r216965853
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/util/DataTypeUtil.java ---
    @@ -168,6 +168,65 @@ public static Object getMeasureObjectBasedOnDataType(ColumnPage measurePage, int
         }
       }
     
    +  /**
    +   * Calculate data percentage in [min, max] scope based on data type
    +   * @param data data to calculate the percentage
    +   * @param min min value
    +   * @param max max value
    +   * @param column column schema including data type
    +   * @return result
    +   */
    +  public static double computePercentage(byte[] data, byte[] min, byte[] max, ColumnSchema column) {
    +    if (column.getDataType() == DataTypes.STRING) {
    +      // for string, we do not calculate
    +      return 0;
    +    } else if (DataTypes.isDecimal(column.getDataType())) {
    +      BigDecimal minValue = DataTypeUtil.byteToBigDecimal(min);
    +      BigDecimal dataValue = DataTypeUtil.byteToBigDecimal(data).subtract(minValue);
    +      BigDecimal factorValue = DataTypeUtil.byteToBigDecimal(max).subtract(minValue);
    +      return dataValue.divide(factorValue).doubleValue();
    +    }
    +    double dataValue, minValue, factorValue;
    +    if (column.getDataType() == DataTypes.SHORT) {
    +      minValue = ByteUtil.toShort(min, 0);
    +      dataValue = ByteUtil.toShort(data, 0) - minValue;
    +      factorValue = ByteUtil.toShort(max, 0) - ByteUtil.toShort(min, 0);
    +    } else if (column.getDataType() == DataTypes.INT) {
    +      if (column.isSortColumn()) {
    +        minValue = ByteUtil.toXorInt(min, 0, min.length);
    +        dataValue = ByteUtil.toXorInt(data, 0, data.length) - minValue;
    +        factorValue = ByteUtil.toXorInt(max, 0, max.length) - ByteUtil.toXorInt(min, 0, min.length);
    +      } else {
    +        minValue = ByteUtil.toLong(min, 0, min.length);
    +        dataValue = ByteUtil.toLong(data, 0, data.length) - minValue;
    +        factorValue = ByteUtil.toLong(max, 0, max.length) - ByteUtil.toLong(min, 0, min.length);
    +      }
    +    } else if (column.getDataType() == DataTypes.LONG) {
    +      minValue = ByteUtil.toLong(min, 0, min.length);
    +      dataValue = ByteUtil.toLong(data, 0, data.length) - minValue;
    +      factorValue = ByteUtil.toLong(max, 0, max.length) - ByteUtil.toLong(min, 0, min.length);
    +    } else if (column.getDataType() == DataTypes.DATE) {
    +      minValue = ByteUtil.toInt(min, 0, min.length);
    +      dataValue = ByteUtil.toInt(data, 0, data.length) - minValue;
    +      factorValue = ByteUtil.toInt(max, 0, max.length) - ByteUtil.toInt(min, 0, min.length);
    +    } else if (column.getDataType() == DataTypes.TIMESTAMP) {
    +      minValue = ByteUtil.toLong(min, 0, min.length);
    +      dataValue = ByteUtil.toLong(data, 0, data.length) - minValue;
    +      factorValue = ByteUtil.toLong(max, 0, max.length) - ByteUtil.toLong(min, 0, min.length);
    +    } else if (column.getDataType() == DataTypes.DOUBLE) {
    +      minValue = ByteUtil.toDouble(min, 0, min.length);
    +      dataValue = ByteUtil.toDouble(data, 0, data.length) - minValue;
    +      factorValue = ByteUtil.toDouble(max, 0, max.length) - ByteUtil.toDouble(min, 0, min.length);
    +    } else {
    +      throw new UnsupportedOperationException("data type: " + column.getDataType());
    +    }
    +
    +    if (factorValue == 0d) {
    +      return Double.MIN_VALUE;
    --- End diff --
   
    If the column's value is constant, 'factorValue' here will be '0'. In that case I think the percentage should be '1' instead of 'Double.MIN_VALUE'.
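
The behaviour proposed here can be sketched as follows, assuming the values have already been decoded from their byte arrays. `RangePercentage` is a hypothetical helper, not the `DataTypeUtil.computePercentage` method from the diff. For context, `Double.MIN_VALUE` is the smallest positive double (about 4.9e-324), so it is effectively zero rather than a sentinel for "full width".

```java
import java.math.BigDecimal;
import java.math.MathContext;

// Hypothetical helper illustrating the reviewer's suggestion; the
// byte[] decoding done in the diff is deliberately left out.
final class RangePercentage {
  // Position of value inside [min, max] as a fraction.
  // A zero-width range (constant column) is reported as 1.0.
  static double of(double value, double min, double max) {
    double width = max - min;
    if (width == 0d) {
      return 1.0d;
    }
    return (value - min) / width;
  }

  // Decimal variant. The zero-width check comes before the division,
  // and a MathContext bounds the precision so that quotients such as
  // 1/3 do not throw ArithmeticException.
  static double of(BigDecimal value, BigDecimal min, BigDecimal max) {
    BigDecimal width = max.subtract(min);
    if (width.signum() == 0) {
      return 1.0d;
    }
    return value.subtract(min).divide(width, MathContext.DECIMAL64).doubleValue();
  }
}
```

The decimal overload is included because, in the diff as quoted, the decimal branch returns before the `factorValue == 0d` check is reached, so a constant decimal column would divide by zero there. Choosing `MathContext.DECIMAL64` is an assumption made for this sketch.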


---