Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] carbondata pull request #2683: [WIP] Print data folder information

[GitHub] carbondata pull request #2683: [CARBONDATA-2916] Add CarbonCli tool for data...

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2683#discussion_r216966619

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/TestUtil.java ---
@@ -136,33 +136,40 @@ public static void writeFilesAndVerify(int rows, Schema schema, String path, Str
CarbonWriter writer = builder.buildWriterForCSVInput(schema, configuration);

for (int i = 0; i < rows; i++) {
- writer.write(new String[]{"robot" + (i % 10), String.valueOf(i), String.valueOf((double) i / 2)});
+ writer.write(new String[]{
+ "robot" + (i % 10), String.valueOf(i % 3000000), String.valueOf((double) i / 2)});
}
writer.close();
- } catch (IOException e) {
+ } catch (Exception e) {
e.printStackTrace();
--- End diff --

It's not recommended to print the stack trace like this.
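
For illustration only, two common alternatives to `e.printStackTrace()` are sketched below: logging the exception with context, or rethrowing it so that a test fails with the original cause. The thread does not say which approach the PR adopted. The class `WriteFailureHandling` and the method `writeRows` are hypothetical stand-ins, and `java.util.logging` is used only to keep the sketch dependency-free; it is not necessarily the logger CarbonData uses.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of CarbonData.
class WriteFailureHandling {
  private static final Logger LOGGER =
      Logger.getLogger(WriteFailureHandling.class.getName());

  // Stand-in for the writer calls in the diff; always fails so the handlers run.
  static void writeRows(int rows) throws IOException {
    throw new IOException("simulated write failure for " + rows + " rows");
  }

  // Option 1: log the exception with context and report failure to the caller.
  static boolean writeAndLog(int rows) {
    try {
      writeRows(rows);
      return true;
    } catch (IOException e) {
      LOGGER.log(Level.SEVERE, "Failed to write " + rows + " rows", e);
      return false;
    }
  }

  // Option 2: rethrow, keeping the original exception as the cause,
  // so a calling test fails instead of silently continuing.
  static void writeOrFail(int rows) {
    try {
      writeRows(rows);
    } catch (IOException e) {
      throw new IllegalStateException("Failed to write " + rows + " rows", e);
    }
  }
}
```

In a test utility such as the one in the diff, the second option is usually the safer choice, because a swallowed exception lets the verification step run against incomplete data.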

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/2683

@jackylk I have a proposal for this.

The current implementation of the data summary builds too much customized output with string concatenation, which I think is not convenient for extraction and parsing if someone provides the output of DataSummary to me.

I would like the output to be better formatted, for example exported in JSON format, so that I can parse it and do more analysis on it.
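
As a rough sketch of what this proposal could look like, the counters can be collected into a map and rendered as one JSON object. The key names below are borrowed from the fields in the `DataSummary` diff in this thread (`numBlock`, `numShard`, and so on); the class `SummaryJson`, the JSON layout, and the sample values are hypothetical, and nothing in the thread indicates that CarbonCli emits JSON.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of CarbonCli.
class SummaryJson {

  // Render numeric counters as a flat JSON object, keeping insertion order.
  // Keys are assumed to be plain identifiers, so no string escaping is done.
  static String toJson(Map<String, Long> counters) {
    StringBuilder sb = new StringBuilder("{");
    boolean first = true;
    for (Map.Entry<String, Long> entry : counters.entrySet()) {
      if (!first) {
        sb.append(',');
      }
      sb.append('"').append(entry.getKey()).append("\":").append(entry.getValue());
      first = false;
    }
    return sb.append('}').toString();
  }

  // Made-up sample values, for demonstration only.
  static Map<String, Long> sampleCounters() {
    Map<String, Long> counters = new LinkedHashMap<>();
    counters.put("numBlock", 2L);
    counters.put("numShard", 1L);
    counters.put("numRow", 100L);
    return counters;
  }
}
```

The hand-written serializer only keeps the sketch free of dependencies; a real tool would more likely rely on a JSON library that handles escaping and nested structures.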

---

[GitHub] carbondata pull request #2683: [CARBONDATA-2916] Add CarbonCli tool for data...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2683#discussion_r216981072

--- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/TestUtil.java ---
@@ -136,33 +136,40 @@ public static void writeFilesAndVerify(int rows, Schema schema, String path, Str
CarbonWriter writer = builder.buildWriterForCSVInput(schema, configuration);

for (int i = 0; i < rows; i++) {
- writer.write(new String[]{"robot" + (i % 10), String.valueOf(i), String.valueOf((double) i / 2)});
+ writer.write(new String[]{
+ "robot" + (i % 10), String.valueOf(i % 3000000), String.valueOf((double) i / 2)});
}
writer.close();
- } catch (IOException e) {
+ } catch (Exception e) {
e.printStackTrace();
--- End diff --

fixed

---

[GitHub] carbondata pull request #2683: [CARBONDATA-2916] Add CarbonCli tool for data...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2683#discussion_r216981868

--- Diff: tools/cli/src/main/java/org/apache/carbondata/tool/DataSummary.java ---
@@ -0,0 +1,360 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.tool;
+
+import java.io.IOException;
+import java.io.PrintStream;
+import java.nio.charset.Charset;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedHashMap;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.carbondata.common.Strings;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
+import org.apache.carbondata.core.datastore.impl.FileFactory;
+import org.apache.carbondata.core.memory.MemoryException;
+import org.apache.carbondata.core.metadata.datatype.DataTypes;
+import org.apache.carbondata.core.metadata.schema.table.column.ColumnSchema;
+import org.apache.carbondata.core.reader.CarbonHeaderReader;
+import org.apache.carbondata.core.statusmanager.LoadMetadataDetails;
+import org.apache.carbondata.core.statusmanager.SegmentStatusManager;
+import org.apache.carbondata.core.util.CarbonUtil;
+import org.apache.carbondata.core.util.path.CarbonTablePath;
+import org.apache.carbondata.format.BlockletInfo3;
+import org.apache.carbondata.format.FileFooter3;
+import org.apache.carbondata.format.FileHeader;
+import org.apache.carbondata.format.TableInfo;
+
+import static org.apache.carbondata.core.constants.CarbonCommonConstants.DEFAULT_CHARSET;
+
+/**
+ * Data Summary command implementation for {@link CarbonCli}
+ */
+class DataSummary {
+ private String dataFolder;
+ private PrintStream out;
+
+ private long numBlock;
+ private long numShard;
+ private long numBlocklet;
+ private long numPage;
+ private long numRow;
+ private long totalDataSize;
+
+ // file path mapping to file object
+ private LinkedHashMap<String, DataFile> dataFiles = new LinkedHashMap<>();
+ private CarbonFile tableStatusFile;
+ private CarbonFile schemaFile;
+
+ DataSummary(String dataFolder, PrintStream out) throws IOException {
+ this.dataFolder = dataFolder;
+ this.out = out;
+ collectDataFiles();
+ }
+
+ private boolean isColumnarFile(String fileName) {
+ // if the timestamp in file name is "0", it is a streaming file
+ return fileName.endsWith(CarbonTablePath.CARBON_DATA_EXT) &&
+ !CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName).equals("0");
+ }
+
+ private boolean isStreamFile(String fileName) {
+ // if the timestamp in file name is "0", it is a streaming file
+ return fileName.endsWith(CarbonTablePath.CARBON_DATA_EXT) &&
+ CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName).equals("0");
+ }
+
+ private void collectDataFiles() throws IOException {
+ Set<String> shards = new HashSet<>();
+ CarbonFile folder = FileFactory.getCarbonFile(dataFolder);
+ List<CarbonFile> files = folder.listFiles(true);
+ List<DataFile> unsortedFiles = new ArrayList<>();
+ for (CarbonFile file : files) {
+ if (isColumnarFile(file.getName())) {
+ DataFile dataFile = new DataFile(file);
+ unsortedFiles.add(dataFile);
+ collectNum(dataFile.getFooter());
+ shards.add(dataFile.getShardName());
+ totalDataSize += file.getSize();
+ } else if (file.getName().endsWith(CarbonTablePath.TABLE_STATUS_FILE)) {
+ tableStatusFile = file;
+ } else if (file.getName().startsWith(CarbonTablePath.SCHEMA_FILE)) {
+ schemaFile = file;
+ } else if (isStreamFile(file.getName())) {
+ out.println("WARN: input path contains streaming file, this tool does not support it yet, "
+ + "skipping it...");
+ }
+ }
+ unsortedFiles.sort((o1, o2) -> {
+ if (o1.getShardName().equalsIgnoreCase(o2.getShardName())) {
+ return Integer.parseInt(o1.getPartNo()) - Integer.parseInt(o2.getPartNo());
+ } else {
+ return o1.getShardName().hashCode() - o2.getShardName().hashCode();
--- End diff --

fixed, will use `o1.getShardName().compareTo(o2.getShardName())`
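
Subtracting hash codes is unreliable as a comparator: the subtraction can overflow and flip the sign, and two different shard names can share a hash code. A minimal sketch of the ordering described above, by shard name and then by numeric part number, follows. `ShardFile` is a hypothetical stand-in for the `DataFile` class in the diff, exposing only the two getters used for sorting.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for DataFile; only the getters used for sorting.
class ShardFile {
  private final String shardName;
  private final String partNo;

  ShardFile(String shardName, String partNo) {
    this.shardName = shardName;
    this.partNo = partNo;
  }

  String getShardName() {
    return shardName;
  }

  String getPartNo() {
    return partNo;
  }
}

class ShardFileOrder {
  // Compare shard names with String.compareTo, then part numbers as integers.
  static final Comparator<ShardFile> BY_SHARD_THEN_PART =
      Comparator.comparing(ShardFile::getShardName)
          .thenComparingInt(f -> Integer.parseInt(f.getPartNo()));

  // Return "shard/part" keys in sorted order, leaving the input list untouched.
  static List<String> sortedKeys(List<ShardFile> files) {
    List<ShardFile> copy = new ArrayList<>(files);
    copy.sort(BY_SHARD_THEN_PART);
    List<String> keys = new ArrayList<>();
    for (ShardFile f : copy) {
      keys.add(f.getShardName() + "/" + f.getPartNo());
    }
    return keys;
  }
}
```

Note that `compareTo` is case-sensitive, whereas the diff tests equality with `equalsIgnoreCase`; shard names that differ only in case would be ordered by name here rather than by part number.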

---

[GitHub] carbondata pull request #2683: [CARBONDATA-2916] Add CarbonCli tool for data...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2683#discussion_r216982870

--- Diff: core/src/main/java/org/apache/carbondata/core/util/DataTypeUtil.java ---
@@ -168,6 +168,65 @@ public static Object getMeasureObjectBasedOnDataType(ColumnPage measurePage, int
}
}

+ /**
+ * Calculate data percentage in [min, max] scope based on data type
+ * @param data data to calculate the percentage
+ * @param min min value
+ * @param max max value
+ * @param column column schema including data type
+ * @return result
+ */
+ public static double computePercentage(byte[] data, byte[] min, byte[] max, ColumnSchema column) {
+ if (column.getDataType() == DataTypes.STRING) {
+ // for string, we do not calculate
+ return 0;
+ } else if (DataTypes.isDecimal(column.getDataType())) {
+ BigDecimal minValue = DataTypeUtil.byteToBigDecimal(min);
+ BigDecimal dataValue = DataTypeUtil.byteToBigDecimal(data).subtract(minValue);
+ BigDecimal factorValue = DataTypeUtil.byteToBigDecimal(max).subtract(minValue);
+ return dataValue.divide(factorValue).doubleValue();
+ }
+ double dataValue, minValue, factorValue;
+ if (column.getDataType() == DataTypes.SHORT) {
+ minValue = ByteUtil.toShort(min, 0);
+ dataValue = ByteUtil.toShort(data, 0) - minValue;
+ factorValue = ByteUtil.toShort(max, 0) - ByteUtil.toShort(min, 0);
+ } else if (column.getDataType() == DataTypes.INT) {
+ if (column.isSortColumn()) {
+ minValue = ByteUtil.toXorInt(min, 0, min.length);
+ dataValue = ByteUtil.toXorInt(data, 0, data.length) - minValue;
+ factorValue = ByteUtil.toXorInt(max, 0, max.length) - ByteUtil.toXorInt(min, 0, min.length);
+ } else {
+ minValue = ByteUtil.toLong(min, 0, min.length);
+ dataValue = ByteUtil.toLong(data, 0, data.length) - minValue;
+ factorValue = ByteUtil.toLong(max, 0, max.length) - ByteUtil.toLong(min, 0, min.length);
+ }
+ } else if (column.getDataType() == DataTypes.LONG) {
+ minValue = ByteUtil.toLong(min, 0, min.length);
+ dataValue = ByteUtil.toLong(data, 0, data.length) - minValue;
+ factorValue = ByteUtil.toLong(max, 0, max.length) - ByteUtil.toLong(min, 0, min.length);
+ } else if (column.getDataType() == DataTypes.DATE) {
+ minValue = ByteUtil.toInt(min, 0, min.length);
+ dataValue = ByteUtil.toInt(data, 0, data.length) - minValue;
+ factorValue = ByteUtil.toInt(max, 0, max.length) - ByteUtil.toInt(min, 0, min.length);
+ } else if (column.getDataType() == DataTypes.TIMESTAMP) {
+ minValue = ByteUtil.toLong(min, 0, min.length);
+ dataValue = ByteUtil.toLong(data, 0, data.length) - minValue;
+ factorValue = ByteUtil.toLong(max, 0, max.length) - ByteUtil.toLong(min, 0, min.length);
+ } else if (column.getDataType() == DataTypes.DOUBLE) {
+ minValue = ByteUtil.toDouble(min, 0, min.length);
+ dataValue = ByteUtil.toDouble(data, 0, data.length) - minValue;
+ factorValue = ByteUtil.toDouble(max, 0, max.length) - ByteUtil.toDouble(min, 0, min.length);
+ } else {
+ throw new UnsupportedOperationException("data type: " + column.getDataType());
+ }
+
+ if (factorValue == 0d) {
+ return Double.MIN_VALUE;
--- End diff --

fixed
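
The thread does not record what the fix was. For context, `Double.MIN_VALUE` is the smallest positive double (about 4.9e-324), not a negative sentinel, and `BigDecimal.divide` without a rounding context throws `ArithmeticException` when the quotient has a non-terminating decimal expansion. The sketch below shows one way to guard both cases; the class `PercentageSketch` and the choice of returning 0 for an empty range are assumptions, not the behavior of `DataTypeUtil.computePercentage`.

```java
import java.math.BigDecimal;
import java.math.MathContext;

// Hypothetical sketch, not the CarbonData implementation.
class PercentageSketch {

  // Position of value within [min, max] as a fraction.
  // Assumption: an empty range (min == max) yields 0 instead of dividing by zero.
  static double fraction(double value, double min, double max) {
    double range = max - min;
    if (range == 0d) {
      return 0d;
    }
    return (value - min) / range;
  }

  // Decimal variant: a MathContext bounds the precision, so quotients such
  // as 1/3 are rounded instead of raising ArithmeticException.
  static double fraction(BigDecimal value, BigDecimal min, BigDecimal max) {
    BigDecimal range = max.subtract(min);
    if (range.signum() == 0) {
      return 0d;
    }
    return value.subtract(min).divide(range, MathContext.DECIMAL64).doubleValue();
  }
}
```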

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/254/

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user jackylk commented on the issue:

https://github.com/apache/carbondata/pull/2683

@xuchuanyin Currently the data summary command is designed for human readers. If output for machine processing is required, I want to first understand what information is needed; it may be better to add it as another command in this tool. But first we can discuss on the mailing list what information is needed.

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/255/

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/424/

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8494/

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user jackylk commented on the issue:

https://github.com/apache/carbondata/pull/2683

retest this please

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/258/

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8497/

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/427/

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/264/

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/433/

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8503/

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/269/

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8508/

---

[GitHub] carbondata issue #2683: [CARBONDATA-2916] Add CarbonCli tool for data summar...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/2683

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/438/

---
