Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Success with Spark 2.1.0. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/110/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Success with Spark 2.2.1. Please check CI: http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/278/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Failed with Spark 2.3.1. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8348/ ---
Github user chenliang613 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2683#discussion_r215669915

--- Diff: pom.xml ---
@@ -706,6 +706,12 @@
       <module>datamap/mv/core</module>
     </modules>
   </profile>
+  <profile>
+    <id>tool</id>
--- End diff --

Suggest using "tools".
---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2683 @jackylk Better to create another folder under tools. ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2683
```
## Summary
1 blocks, 1 shards, 1 blocklets, 9 pages, 259,304 rows, 9.95MB

## Column Statistics (column 'L_DISCOUNT')
Shard #1 (72636812283890_batchno0-0-null-1536219825841)
BLK  BLKLT  Meta Size  Data Size  Card           Min/Max range (total width is 80 characters)
0    0      1.06KB     9.75MB     2,147,483,647  --------------------------------------------------------------------------------
```
1. In the above, my actual file size is 10.4 MB but it shows only 9.95 MB.
2. In the column statistics of L_DISCOUNT it shows 9.95 MB; does that mean it is not showing only that column's size?
3. What is `Card` here? Is it cardinality? It does not make sense to print Integer max, I guess.
---
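One possible explanation for the 10.4 MB vs 9.95 MB gap (an assumption, not confirmed in the thread) is a decimal-vs-binary unit mismatch: the OS reports megabytes as 10^6 bytes, while a tool dividing by 2^20 gets a smaller number for the same file. A minimal sketch (helper names are illustrative, not CarbonData APIs):

```java
public class SizeUnits {
    // Convert a byte count to decimal megabytes (1 MB = 1,000,000 bytes),
    // the convention most file managers use.
    static double toDecimalMB(long bytes) {
        return bytes / 1_000_000.0;
    }

    // Convert a byte count to binary mebibytes (1 MiB = 1,048,576 bytes),
    // which many tools still label "MB".
    static double toBinaryMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long fileBytes = 10_400_000L; // roughly a "10.4 MB" file as the OS reports it
        // The same byte count is about 9.9 when divided by 2^20, so the two
        // reported sizes can describe one file.
        System.out.printf("decimal: %.2f MB, binary: %.2f MiB%n",
                toDecimalMB(fileBytes), toBinaryMiB(fileBytes));
    }
}
```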
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2683 @jackylk Better to print whether local_dictionary is enabled in the schema output. Also, if possible, please print the local dictionary size of each column in the column details. ---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2683#discussion_r216316121

--- Diff: pom.xml ---
@@ -706,6 +706,12 @@
       <module>datamap/mv/core</module>
     </modules>
   </profile>
+  <profile>
+    <id>tool</id>
--- End diff --

OK, fixed.
---
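For reference, the renamed profile would presumably look something like the following sketch (the `tools/cli` module path is an assumption based on the file path that appears later in this thread; it is not quoted from the PR):

```xml
<!-- Hypothetical sketch of the renamed Maven profile in pom.xml -->
<profile>
  <id>tools</id>
  <modules>
    <module>tools/cli</module>
  </modules>
</profile>
```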
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Success with Spark 2.1.0. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/209/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Failed with Spark 2.3.1. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8448/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Success with Spark 2.2.1. Please check CI: http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/378/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Success with Spark 2.1.0. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/235/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Failed with Spark 2.3.1. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8475/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Failed with Spark 2.2.1. Please check CI: http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/405/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Failed with Spark 2.1.0. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/236/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Success with Spark 2.1.0. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/239/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Failed with Spark 2.3.1. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8478/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2683 Build Success with Spark 2.2.1. Please check CI: http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/408/ ---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2683#discussion_r216968250

--- Diff: tools/cli/src/main/java/org/apache/carbondata/tool/DataSummary.java ---
@@ -0,0 +1,360 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.tool;
+
+import java.io.IOException;
+import java.io.PrintStream;
+import java.nio.charset.Charset;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedHashMap;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.carbondata.common.Strings;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
+import org.apache.carbondata.core.datastore.impl.FileFactory;
+import org.apache.carbondata.core.memory.MemoryException;
+import org.apache.carbondata.core.metadata.datatype.DataTypes;
+import org.apache.carbondata.core.metadata.schema.table.column.ColumnSchema;
+import org.apache.carbondata.core.reader.CarbonHeaderReader;
+import org.apache.carbondata.core.statusmanager.LoadMetadataDetails;
+import org.apache.carbondata.core.statusmanager.SegmentStatusManager;
+import org.apache.carbondata.core.util.CarbonUtil;
+import org.apache.carbondata.core.util.path.CarbonTablePath;
+import org.apache.carbondata.format.BlockletInfo3;
+import org.apache.carbondata.format.FileFooter3;
+import org.apache.carbondata.format.FileHeader;
+import org.apache.carbondata.format.TableInfo;
+
+import static org.apache.carbondata.core.constants.CarbonCommonConstants.DEFAULT_CHARSET;
+
+/**
+ * Data Summary command implementation for {@link CarbonCli}
+ */
+class DataSummary {
+  private String dataFolder;
+  private PrintStream out;
+
+  private long numBlock;
+  private long numShard;
+  private long numBlocklet;
+  private long numPage;
+  private long numRow;
+  private long totalDataSize;
+
+  // file path mapping to file object
+  private LinkedHashMap<String, DataFile> dataFiles = new LinkedHashMap<>();
+  private CarbonFile tableStatusFile;
+  private CarbonFile schemaFile;
+
+  DataSummary(String dataFolder, PrintStream out) throws IOException {
+    this.dataFolder = dataFolder;
+    this.out = out;
+    collectDataFiles();
+  }
+
+  private boolean isColumnarFile(String fileName) {
+    // if the timestamp in file name is "0", it is a streaming file
+    return fileName.endsWith(CarbonTablePath.CARBON_DATA_EXT) &&
+        !CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName).equals("0");
+  }
+
+  private boolean isStreamFile(String fileName) {
+    // if the timestamp in file name is "0", it is a streaming file
+    return fileName.endsWith(CarbonTablePath.CARBON_DATA_EXT) &&
+        CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName).equals("0");
+  }
+
+  private void collectDataFiles() throws IOException {
+    Set<String> shards = new HashSet<>();
+    CarbonFile folder = FileFactory.getCarbonFile(dataFolder);
+    List<CarbonFile> files = folder.listFiles(true);
+    List<DataFile> unsortedFiles = new ArrayList<>();
+    for (CarbonFile file : files) {
+      if (isColumnarFile(file.getName())) {
+        DataFile dataFile = new DataFile(file);
+        unsortedFiles.add(dataFile);
+        collectNum(dataFile.getFooter());
+        shards.add(dataFile.getShardName());
+        totalDataSize += file.getSize();
+      } else if (file.getName().endsWith(CarbonTablePath.TABLE_STATUS_FILE)) {
+        tableStatusFile = file;
+      } else if (file.getName().startsWith(CarbonTablePath.SCHEMA_FILE)) {
+        schemaFile = file;
+      } else if (isStreamFile(file.getName())) {
+        out.println("WARN: input path contains streaming file, this tool does not support it yet, "
+            + "skipping it...");
+      }
+    }
+    unsortedFiles.sort((o1, o2) -> {
+      if (o1.getShardName().equalsIgnoreCase(o2.getShardName())) {
+        return Integer.parseInt(o1.getPartNo()) - Integer.parseInt(o2.getPartNo());
+      } else {
+        return o1.getShardName().hashCode() - o2.getShardName().hashCode();
--- End diff --

Why not sort by the alphabetical order of the shardName directly?
---
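The reviewer's suggestion could look like the sketch below (the nested `DataFile` class is a minimal stand-in for the tool's real class, which is not fully shown here). Comparing shard names lexicographically is deterministic and collision-free, whereas `hashCode` subtraction can collide for different names and can overflow:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ShardSort {
    // Minimal stand-in for the tool's DataFile (illustrative only).
    static class DataFile {
        private final String shardName;
        private final String partNo;
        DataFile(String shardName, String partNo) {
            this.shardName = shardName;
            this.partNo = partNo;
        }
        String getShardName() { return shardName; }
        String getPartNo() { return partNo; }
    }

    // Sort shards alphabetically (case-insensitive), then parts numerically.
    // Comparator.comparing avoids both hashCode collisions and the
    // integer-overflow risk of subtraction-based comparators.
    static void sortFiles(List<DataFile> files) {
        files.sort(Comparator
            .comparing(DataFile::getShardName, String.CASE_INSENSITIVE_ORDER)
            .thenComparingInt(f -> Integer.parseInt(f.getPartNo())));
    }

    public static void main(String[] args) {
        List<DataFile> files = new ArrayList<>();
        files.add(new DataFile("shard_2", "0"));
        files.add(new DataFile("shard_1", "10"));
        files.add(new DataFile("shard_1", "2"));
        sortFiles(files);
        // Prints shard_1/2, shard_1/10, shard_2/0: numeric part order
        // within a shard, alphabetical order across shards.
        for (DataFile f : files) {
            System.out.println(f.getShardName() + "/" + f.getPartNo());
        }
    }
}
```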
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2683#discussion_r216965853

--- Diff: core/src/main/java/org/apache/carbondata/core/util/DataTypeUtil.java ---
@@ -168,6 +168,65 @@ public static Object getMeasureObjectBasedOnDataType(ColumnPage measurePage, int
     }
   }
+
+  /**
+   * Calculate data percentage in [min, max] scope based on data type
+   * @param data data to calculate the percentage
+   * @param min min value
+   * @param max max value
+   * @param column column schema including data type
+   * @return result
+   */
+  public static double computePercentage(byte[] data, byte[] min, byte[] max, ColumnSchema column) {
+    if (column.getDataType() == DataTypes.STRING) {
+      // for string, we do not calculate
+      return 0;
+    } else if (DataTypes.isDecimal(column.getDataType())) {
+      BigDecimal minValue = DataTypeUtil.byteToBigDecimal(min);
+      BigDecimal dataValue = DataTypeUtil.byteToBigDecimal(data).subtract(minValue);
+      BigDecimal factorValue = DataTypeUtil.byteToBigDecimal(max).subtract(minValue);
+      return dataValue.divide(factorValue).doubleValue();
+    }
+    double dataValue, minValue, factorValue;
+    if (column.getDataType() == DataTypes.SHORT) {
+      minValue = ByteUtil.toShort(min, 0);
+      dataValue = ByteUtil.toShort(data, 0) - minValue;
+      factorValue = ByteUtil.toShort(max, 0) - ByteUtil.toShort(min, 0);
+    } else if (column.getDataType() == DataTypes.INT) {
+      if (column.isSortColumn()) {
+        minValue = ByteUtil.toXorInt(min, 0, min.length);
+        dataValue = ByteUtil.toXorInt(data, 0, data.length) - minValue;
+        factorValue = ByteUtil.toXorInt(max, 0, max.length) - ByteUtil.toXorInt(min, 0, min.length);
+      } else {
+        minValue = ByteUtil.toLong(min, 0, min.length);
+        dataValue = ByteUtil.toLong(data, 0, data.length) - minValue;
+        factorValue = ByteUtil.toLong(max, 0, max.length) - ByteUtil.toLong(min, 0, min.length);
+      }
+    } else if (column.getDataType() == DataTypes.LONG) {
+      minValue = ByteUtil.toLong(min, 0, min.length);
+      dataValue = ByteUtil.toLong(data, 0, data.length) - minValue;
+      factorValue = ByteUtil.toLong(max, 0, max.length) - ByteUtil.toLong(min, 0, min.length);
+    } else if (column.getDataType() == DataTypes.DATE) {
+      minValue = ByteUtil.toInt(min, 0, min.length);
+      dataValue = ByteUtil.toInt(data, 0, data.length) - minValue;
+      factorValue = ByteUtil.toInt(max, 0, max.length) - ByteUtil.toInt(min, 0, min.length);
+    } else if (column.getDataType() == DataTypes.TIMESTAMP) {
+      minValue = ByteUtil.toLong(min, 0, min.length);
+      dataValue = ByteUtil.toLong(data, 0, data.length) - minValue;
+      factorValue = ByteUtil.toLong(max, 0, max.length) - ByteUtil.toLong(min, 0, min.length);
+    } else if (column.getDataType() == DataTypes.DOUBLE) {
+      minValue = ByteUtil.toDouble(min, 0, min.length);
+      dataValue = ByteUtil.toDouble(data, 0, data.length) - minValue;
+      factorValue = ByteUtil.toDouble(max, 0, max.length) - ByteUtil.toDouble(min, 0, min.length);
+    } else {
+      throw new UnsupportedOperationException("data type: " + column.getDataType());
+    }
+
+    if (factorValue == 0d) {
+      return Double.MIN_VALUE;
--- End diff --

If the value of the column is constant, the `factorValue` here will be 0, and I think the percentage should be 1 instead of `Double.MIN_VALUE`.
---
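The reviewer's point can be illustrated with a simplified numeric version of the computation (plain doubles instead of CarbonData's byte[] encodings; the method name mirrors the diff but this is a sketch, not the PR's code):

```java
public class PercentageSketch {
    // Simplified percentage computation for numeric values:
    // returns where `data` falls inside [min, max] as a fraction in [0, 1].
    static double computePercentage(double data, double min, double max) {
        double factor = max - min;
        if (factor == 0d) {
            // Constant column: every value equals both min and max, so the
            // value spans the whole (degenerate) range. Returning 1 follows
            // the reviewer's suggestion; returning Double.MIN_VALUE would
            // wrongly report "almost 0%".
            return 1d;
        }
        return (data - min) / factor;
    }

    public static void main(String[] args) {
        System.out.println(computePercentage(5, 0, 10)); // 0.5
        System.out.println(computePercentage(7, 7, 7));  // 1.0 for a constant column
    }
}
```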