Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] carbondata pull request #2683: [WIP] Print data folder information


qiuchenjian-2

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/110/

---

qiuchenjian-2

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/278/

---

qiuchenjian-2

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8348/

---

qiuchenjian-2

[GitHub] carbondata pull request #2683: [CARBONDATA-2916] Add CarbonCli tool for data...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2683#discussion_r215669915

--- Diff: pom.xml ---
@@ -706,6 +706,12 @@
<module>datamap/mv/core</module>
</modules>
</profile>
+ <profile>
+ <id>tool</id>
--- End diff --

I suggest using "tools" as the profile id.

---

qiuchenjian-2

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2683

@jackylk It would be better to create another folder under tools.

---

qiuchenjian-2

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2683

```
## Summary
1 blocks, 1 shards, 1 blocklets, 9 pages, 259,304 rows, 9.95MB

## Column Statistics (column 'L_DISCOUNT')
Shard #1 (72636812283890_batchno0-0-null-1536219825841)
BLK BLKLT Meta Size Data Size Card Min/Max range (total width is 80 characters)
0 0 1.06KB 9.75MB 2,147,483,647 --------------------------------------------------------------------------------
```
1. In the above, my actual file size is 10.4 MB, but the summary shows only 9.95 MB.
2. The column statistics for L_DISCOUNT also show 9.95 MB. Does that mean the tool does not report the size of that column alone?
3. What is `Card` here? Is it cardinality? It does not make sense to print the Integer max value, I guess.
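
One possible reading of question 1, offered as an assumption and not confirmed anywhere in this thread: the two figures may describe the same byte count in different unit bases. The sketch below formats a hypothetical byte count once with 1024-based units and once with 1000-based units; the class name, method names, and the byte count itself are illustrative and are not taken from the PR.

```java
import java.util.Locale;

// Hypothetical helper, not part of the CarbonCli code in this PR.
class SizeUnitSketch {
  // Formats a byte count using 1024 * 1024 bytes per "MB" (binary base).
  static String binaryMb(long bytes) {
    return String.format(Locale.ROOT, "%.2fMB", bytes / (1024.0 * 1024.0));
  }

  // Formats a byte count using 1,000,000 bytes per "MB" (decimal base).
  static String decimalMb(long bytes) {
    return String.format(Locale.ROOT, "%.1fMB", bytes / 1_000_000.0);
  }

  public static void main(String[] args) {
    long bytes = 10_433_331L; // assumed file size, chosen for illustration only
    System.out.println(binaryMb(bytes));
    System.out.println(decimalMb(bytes));
  }
}
```

If the tool divides by 1024 while the file browser divides by 1000, a gap of this size would be expected; whether that is what happens here would need to be checked against the actual file length in bytes.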

---

qiuchenjian-2

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/2683

@jackylk It would be better to print whether local_dictionary is enabled in the schema.
Also, if possible, please print the local dictionary size of each column in the column details.

---

qiuchenjian-2

[GitHub] carbondata pull request #2683: [CARBONDATA-2916] Add CarbonCli tool for data...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2683#discussion_r216316121

--- Diff: pom.xml ---
@@ -706,6 +706,12 @@
<module>datamap/mv/core</module>
</modules>
</profile>
+ <profile>
+ <id>tool</id>
--- End diff --

ok, fixed

---

qiuchenjian-2

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/405/

---

qiuchenjian-2

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/236/

---

qiuchenjian-2

[GitHub] carbondata pull request #2683: [CARBONDATA-2916] Add CarbonCli tool for data...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2683#discussion_r216968250

--- Diff: tools/cli/src/main/java/org/apache/carbondata/tool/DataSummary.java ---
@@ -0,0 +1,360 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.tool;
+
+import java.io.IOException;
+import java.io.PrintStream;
+import java.nio.charset.Charset;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedHashMap;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.carbondata.common.Strings;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
+import org.apache.carbondata.core.datastore.impl.FileFactory;
+import org.apache.carbondata.core.memory.MemoryException;
+import org.apache.carbondata.core.metadata.datatype.DataTypes;
+import org.apache.carbondata.core.metadata.schema.table.column.ColumnSchema;
+import org.apache.carbondata.core.reader.CarbonHeaderReader;
+import org.apache.carbondata.core.statusmanager.LoadMetadataDetails;
+import org.apache.carbondata.core.statusmanager.SegmentStatusManager;
+import org.apache.carbondata.core.util.CarbonUtil;
+import org.apache.carbondata.core.util.path.CarbonTablePath;
+import org.apache.carbondata.format.BlockletInfo3;
+import org.apache.carbondata.format.FileFooter3;
+import org.apache.carbondata.format.FileHeader;
+import org.apache.carbondata.format.TableInfo;
+
+import static org.apache.carbondata.core.constants.CarbonCommonConstants.DEFAULT_CHARSET;
+
+/**
+ * Data Summary command implementation for {@link CarbonCli}
+ */
+class DataSummary {
+ private String dataFolder;
+ private PrintStream out;
+
+ private long numBlock;
+ private long numShard;
+ private long numBlocklet;
+ private long numPage;
+ private long numRow;
+ private long totalDataSize;
+
+ // file path mapping to file object
+ private LinkedHashMap<String, DataFile> dataFiles = new LinkedHashMap<>();
+ private CarbonFile tableStatusFile;
+ private CarbonFile schemaFile;
+
+ DataSummary(String dataFolder, PrintStream out) throws IOException {
+ this.dataFolder = dataFolder;
+ this.out = out;
+ collectDataFiles();
+ }
+
+ private boolean isColumnarFile(String fileName) {
+ // if the timestamp in file name is "0", it is a streaming file
+ return fileName.endsWith(CarbonTablePath.CARBON_DATA_EXT) &&
+ !CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName).equals("0");
+ }
+
+ private boolean isStreamFile(String fileName) {
+ // if the timestamp in file name is "0", it is a streaming file
+ return fileName.endsWith(CarbonTablePath.CARBON_DATA_EXT) &&
+ CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName).equals("0");
+ }
+
+ private void collectDataFiles() throws IOException {
+ Set<String> shards = new HashSet<>();
+ CarbonFile folder = FileFactory.getCarbonFile(dataFolder);
+ List<CarbonFile> files = folder.listFiles(true);
+ List<DataFile> unsortedFiles = new ArrayList<>();
+ for (CarbonFile file : files) {
+ if (isColumnarFile(file.getName())) {
+ DataFile dataFile = new DataFile(file);
+ unsortedFiles.add(dataFile);
+ collectNum(dataFile.getFooter());
+ shards.add(dataFile.getShardName());
+ totalDataSize += file.getSize();
+ } else if (file.getName().endsWith(CarbonTablePath.TABLE_STATUS_FILE)) {
+ tableStatusFile = file;
+ } else if (file.getName().startsWith(CarbonTablePath.SCHEMA_FILE)) {
+ schemaFile = file;
+ } else if (isStreamFile(file.getName())) {
+ out.println("WARN: input path contains streaming file, this tool does not support it yet, "
+ + "skipping it...");
+ }
+ }
+ unsortedFiles.sort((o1, o2) -> {
+ if (o1.getShardName().equalsIgnoreCase(o2.getShardName())) {
+ return Integer.parseInt(o1.getPartNo()) - Integer.parseInt(o2.getPartNo());
+ } else {
+ return o1.getShardName().hashCode() - o2.getShardName().hashCode();
--- End diff --

Why not sort by shardName alphabetically instead?
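
The suggestion above could be sketched as follows. This is a minimal illustration, not the PR's code: `FileEntry` is a hypothetical stand-in for the `DataFile` class in the diff, and the comparator uses case-sensitive natural `String` order, whereas the diff compares shard names with `equalsIgnoreCase`. Subtracting two `hashCode()` values can overflow and returns 0 for distinct names whose hashes collide, which a name comparison avoids.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ShardSortSketch {
  // Hypothetical stand-in for DataFile: only the two fields used for ordering.
  static final class FileEntry {
    final String shardName;
    final String partNo;

    FileEntry(String shardName, String partNo) {
      this.shardName = shardName;
      this.partNo = partNo;
    }
  }

  // Order by shard name alphabetically, then by numeric part number.
  static final Comparator<FileEntry> BY_SHARD_THEN_PART =
      Comparator.comparing((FileEntry f) -> f.shardName)
          .thenComparingInt(f -> Integer.parseInt(f.partNo));

  // Returns "shard:part" keys in sorted order, leaving the input list untouched.
  static List<String> sortedKeys(List<FileEntry> files) {
    List<FileEntry> copy = new ArrayList<>(files);
    copy.sort(BY_SHARD_THEN_PART);
    List<String> keys = new ArrayList<>();
    for (FileEntry f : copy) {
      keys.add(f.shardName + ":" + f.partNo);
    }
    return keys;
  }

  public static void main(String[] args) {
    List<FileEntry> files = new ArrayList<>();
    files.add(new FileEntry("b", "1"));
    files.add(new FileEntry("a", "10"));
    files.add(new FileEntry("a", "2"));
    System.out.println(sortedKeys(files));
  }
}
```

Parsing the part number as an integer keeps part 2 ahead of part 10, which a plain string comparison would not.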

---

qiuchenjian-2

[GitHub] carbondata pull request #2683: [CARBONDATA-2916] Add CarbonCli tool for data...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2683#discussion_r216965853

--- Diff: core/src/main/java/org/apache/carbondata/core/util/DataTypeUtil.java ---
@@ -168,6 +168,65 @@ public static Object getMeasureObjectBasedOnDataType(ColumnPage measurePage, int
}
}

+ /**
+ * Calculate data percentage in [min, max] scope based on data type
+ * @param data data to calculate the percentage
+ * @param min min value
+ * @param max max value
+ * @param column column schema including data type
+ * @return result
+ */
+ public static double computePercentage(byte[] data, byte[] min, byte[] max, ColumnSchema column) {
+ if (column.getDataType() == DataTypes.STRING) {
+ // for string, we do not calculate
+ return 0;
+ } else if (DataTypes.isDecimal(column.getDataType())) {
+ BigDecimal minValue = DataTypeUtil.byteToBigDecimal(min);
+ BigDecimal dataValue = DataTypeUtil.byteToBigDecimal(data).subtract(minValue);
+ BigDecimal factorValue = DataTypeUtil.byteToBigDecimal(max).subtract(minValue);
+ return dataValue.divide(factorValue).doubleValue();
+ }
+ double dataValue, minValue, factorValue;
+ if (column.getDataType() == DataTypes.SHORT) {
+ minValue = ByteUtil.toShort(min, 0);
+ dataValue = ByteUtil.toShort(data, 0) - minValue;
+ factorValue = ByteUtil.toShort(max, 0) - ByteUtil.toShort(min, 0);
+ } else if (column.getDataType() == DataTypes.INT) {
+ if (column.isSortColumn()) {
+ minValue = ByteUtil.toXorInt(min, 0, min.length);
+ dataValue = ByteUtil.toXorInt(data, 0, data.length) - minValue;
+ factorValue = ByteUtil.toXorInt(max, 0, max.length) - ByteUtil.toXorInt(min, 0, min.length);
+ } else {
+ minValue = ByteUtil.toLong(min, 0, min.length);
+ dataValue = ByteUtil.toLong(data, 0, data.length) - minValue;
+ factorValue = ByteUtil.toLong(max, 0, max.length) - ByteUtil.toLong(min, 0, min.length);
+ }
+ } else if (column.getDataType() == DataTypes.LONG) {
+ minValue = ByteUtil.toLong(min, 0, min.length);
+ dataValue = ByteUtil.toLong(data, 0, data.length) - minValue;
+ factorValue = ByteUtil.toLong(max, 0, max.length) - ByteUtil.toLong(min, 0, min.length);
+ } else if (column.getDataType() == DataTypes.DATE) {
+ minValue = ByteUtil.toInt(min, 0, min.length);
+ dataValue = ByteUtil.toInt(data, 0, data.length) - minValue;
+ factorValue = ByteUtil.toInt(max, 0, max.length) - ByteUtil.toInt(min, 0, min.length);
+ } else if (column.getDataType() == DataTypes.TIMESTAMP) {
+ minValue = ByteUtil.toLong(min, 0, min.length);
+ dataValue = ByteUtil.toLong(data, 0, data.length) - minValue;
+ factorValue = ByteUtil.toLong(max, 0, max.length) - ByteUtil.toLong(min, 0, min.length);
+ } else if (column.getDataType() == DataTypes.DOUBLE) {
+ minValue = ByteUtil.toDouble(min, 0, min.length);
+ dataValue = ByteUtil.toDouble(data, 0, data.length) - minValue;
+ factorValue = ByteUtil.toDouble(max, 0, max.length) - ByteUtil.toDouble(min, 0, min.length);
+ } else {
+ throw new UnsupportedOperationException("data type: " + column.getDataType());
+ }
+
+ if (factorValue == 0d) {
+ return Double.MIN_VALUE;
--- End diff --

If the value of the column is constant, the 'factorValue' here will be '0'. In that case I think the percentage should be '1' instead of 'Double.MIN_VALUE'.
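
A minimal sketch of the zero-range handling proposed above, assuming the inputs have already been decoded from bytes. `PercentageSketch` and its methods are hypothetical names and do not reproduce `DataTypeUtil.computePercentage`. The decimal overload adds a further guard that is an observation beyond the comment: in the diff the decimal branch returns before the `factorValue == 0d` check, and `BigDecimal.divide(BigDecimal)` throws `ArithmeticException` for a zero divisor or a non-terminating quotient, so the sketch passes a `MathContext`.

```java
import java.math.BigDecimal;
import java.math.MathContext;

class PercentageSketch {
  // Position of data within [min, max]; a constant column (max == min) maps to 1.
  static double percentage(double data, double min, double max) {
    double range = max - min;
    if (range == 0d) {
      return 1d; // value suggested in the review comment, replacing Double.MIN_VALUE
    }
    return (data - min) / range;
  }

  // Decimal variant with the same zero-range rule and bounded precision.
  static double percentage(BigDecimal data, BigDecimal min, BigDecimal max) {
    BigDecimal range = max.subtract(min);
    if (range.signum() == 0) {
      return 1d;
    }
    // The MathContext avoids ArithmeticException on quotients such as 1/3.
    return data.subtract(min).divide(range, MathContext.DECIMAL64).doubleValue();
  }

  public static void main(String[] args) {
    System.out.println(percentage(5d, 5d, 5d));
    System.out.println(percentage(5d, 0d, 10d));
    System.out.println(percentage(BigDecimal.ONE, BigDecimal.ZERO, new BigDecimal("3")));
  }
}
```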

---
