GitHub user xubo245 opened a pull request:
https://github.com/apache/carbondata/pull/2804 [CARBONDATA-2996] CarbonSchemaReader support read schema from folder path 1. readSchemaInDataFile supports reading the schema from a folder path 2. readSchemaInIndexFile supports reading the schema from a folder path Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? No - [ ] Any backward compatibility impacted? No - [ ] Document update required? No - [ ] Testing done? Added a test case - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. https://issues.apache.org/jira/browse/CARBONDATA-2951 You can merge this pull request into a Git repository by running: $ git pull https://github.com/xubo245/carbondata CARBONDATA-2996_SchemaSupportFolder Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2804.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2804 ---- commit 4246f5c720e77e31b898119d1499e412af06d810 Author: xubo245 <xubo29@...> Date: 2018-10-09T07:16:09Z [CARBONDATA-2996] CarbonSchemaReader support read schema from folder path 1. readSchemaInDataFile supports reading the schema from a folder path 2. readSchemaInIndexFile supports reading the schema from a folder path commit b486fec8eaea1954c2a35590e5738af873ab4eaa Author: xubo245 <xubo29@...> Date: 2018-10-09T07:24:53Z support S3 ---- --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2804 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/746/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2804 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9012/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2804 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/944/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2804 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/757/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2804 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/758/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2804 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/956/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2804 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9024/ --- |
Github user xubo245 commented on the issue:
https://github.com/apache/carbondata/pull/2804 @KanakaKumar @kunal642 @jackylk Please review it. --- |
Github user ajantha-bhat commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2804#discussion_r228413841 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonSchemaReader.java --- @@ -59,11 +60,30 @@ public static Schema readSchemaInSchemaFile(String schemaFilePath) throws IOExce /** * Read carbondata file and return the schema * - * @param dataFilePath complete path including carbondata file name + * @param path complete path including carbondata file name * @return Schema object * @throws IOException */ - public static Schema readSchemaInDataFile(String dataFilePath) throws IOException { + public static Schema readSchemaInDataFile(String path) throws IOException { + String dataFilePath = path; + if (!(dataFilePath.contains(".carbondata"))) { + CarbonFile[] carbonFiles = FileFactory + .getCarbonFile(path) + .listFiles(new CarbonFileFilter() { + @Override + public boolean accept(CarbonFile file) { + if (file == null) { + return false; + } + return file.getName().endsWith(".carbondata"); + } + }); + if (carbonFiles == null || carbonFiles.length < 1) { + throw new RuntimeException("Carbon data file not exists."); + } + dataFilePath = carbonFiles[0].getAbsolutePath(); --- End diff -- Taking only one data file (the first file)? What if this folder has multiple files with different schemas? What if the user wants the schema info from a specific file? Supporting schema read from a folder is not required, since this API is exposed to the user and the user already has the list of files. a) To read one file, the user passes a single file to this API -- already supported. b) To read multiple files, the user can list the files and call our API for each file whose schema is wanted -- already supported. Just reading the first file from the folder doesn't make sense. This PR is not required, as the existing API already supports all user scenarios. --- |
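The reviewer's option (b) — the caller lists the files and invokes the existing single-file API per file — can be sketched with plain java.io. The class name `ListCarbonFiles` and the helper `listDataFiles` are illustrative, not part of the SDK; the commented-out line marks where the existing `CarbonSchemaReader.readSchemaInDataFile` call would go for each file.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch of option (b): the caller lists the .carbondata files under a folder
// and passes each one to the existing single-file API. Only java.io is used;
// listDataFiles is an illustrative helper, not part of the CarbonData SDK.
public class ListCarbonFiles {
  public static List<String> listDataFiles(String folderPath) {
    List<String> result = new ArrayList<>();
    File[] children = new File(folderPath).listFiles();
    if (children == null) {
      return result; // path is not a folder, or listing failed
    }
    for (File child : children) {
      if (child.isFile() && child.getName().endsWith(".carbondata")) {
        result.add(child.getAbsolutePath());
        // Schema schema = CarbonSchemaReader.readSchemaInDataFile(child.getAbsolutePath());
      }
    }
    return result;
  }
}
```

This keeps the SDK API single-purpose: the caller decides which files matter, then reads each schema with the method that already exists.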
Github user ajantha-bhat commented on the issue:
https://github.com/apache/carbondata/pull/2804 @xubo245 : Just reading the first file from the folder doesn't make sense. This PR is not required, as the existing API already supports all user scenarios. Please check my comment for more details. --- |
Github user xubo245 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2804#discussion_r228506535 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonSchemaReader.java --- @@ -59,11 +60,30 @@ public static Schema readSchemaInSchemaFile(String schemaFilePath) throws IOExce /** * Read carbondata file and return the schema * - * @param dataFilePath complete path including carbondata file name + * @param path complete path including carbondata file name * @return Schema object * @throws IOException */ - public static Schema readSchemaInDataFile(String dataFilePath) throws IOException { + public static Schema readSchemaInDataFile(String path) throws IOException { + String dataFilePath = path; + if (!(dataFilePath.contains(".carbondata"))) { + CarbonFile[] carbonFiles = FileFactory + .getCarbonFile(path) + .listFiles(new CarbonFileFilter() { + @Override + public boolean accept(CarbonFile file) { + if (file == null) { + return false; + } + return file.getName().endsWith(".carbondata"); + } + }); + if (carbonFiles == null || carbonFiles.length < 1) { + throw new RuntimeException("Carbon data file not exists."); + } + dataFilePath = carbonFiles[0].getAbsolutePath(); --- End diff -- Yes, it takes only one data file. It's more convenient for the user to give a path to read the schema, and the folder may have sub-folders, so the user would need to list them iteratively. Some customers have this problem. We can compare the schemas of the different files if necessary; the SDK can throw an exception if multiple files have different schemas. --- |
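The consistency check proposed here — throw an exception if multiple files carry different schemas — could look like the following sketch. Schemas are represented as plain field-name arrays for illustration; a real implementation would compare the SDK's `Schema` objects read from each file instead.

```java
import java.util.Arrays;

// Illustrative consistency check: given the schema read from each file
// (represented here as one field-name array per file), return the common
// schema, or throw if any file disagrees with the first one.
public class SchemaConsistency {
  public static String[] requireSameSchema(String[][] schemaPerFile) {
    if (schemaPerFile.length == 0) {
      throw new RuntimeException("No carbondata files found, no schema to read.");
    }
    String[] first = schemaPerFile[0];
    for (int i = 1; i < schemaPerFile.length; i++) {
      if (!Arrays.equals(first, schemaPerFile[i])) {
        throw new RuntimeException("File " + i + " has a different schema than file 0.");
      }
    }
    return first;
  }
}
```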
Github user xubo245 commented on the issue:
https://github.com/apache/carbondata/pull/2804 @ajantha-bhat There are already some users who have this problem. Between different services, they only pass the path to each other, so the user needs to list the index/data files, and may even need to list sub-folders iteratively to find the carbon index/data files, which is not convenient for the user. We can make this a public function for all users. --- |
Github user xubo245 commented on the issue:
https://github.com/apache/carbondata/pull/2804 @ajantha-bhat @KanakaKumar Please review again. --- |
Github user ajantha-bhat commented on the issue:
https://github.com/apache/carbondata/pull/2804 @xubo245 : In that case you can implement String getFirstCarbonFile(path, extensionType) and pass it to the existing method. readSchemaFromFile() must only read; it should not do any extra work. --- |
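The helper suggested above could be sketched as below. The name `getFirstFileWithExtension` mirrors the proposed `getFirstCarbonFile(path, extensionType)` signature but uses only java.io; both the class name and the error message are assumptions, not the CarbonData SDK implementation.

```java
import java.io.File;

// Illustrative version of the suggested helper: find the first file under
// folderPath whose name ends with the given extension (e.g. ".carbondata").
// The result can then be passed to the existing single-file read API,
// which stays free of any folder-handling logic.
public class FirstFileFinder {
  public static String getFirstFileWithExtension(String folderPath, String extension) {
    File[] matches = new File(folderPath).listFiles(
        (dir, name) -> name.endsWith(extension));
    if (matches == null || matches.length == 0) {
      throw new RuntimeException("No " + extension + " file found under " + folderPath);
    }
    return matches[0].getAbsolutePath();
  }
}
```

With this split, the single-file reader API remains read-only, and the folder-to-file resolution is an explicit, separate step the caller opts into.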
Github user ajantha-bhat commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2804#discussion_r229173821 --- Diff: store/sdk/src/test/java/org/apache/carbondata/sdk/file/CarbonSchemaReaderTest.java --- @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.sdk.file; + +import java.io.File; +import java.util.HashMap; +import java.util.Map; + +import junit.framework.TestCase; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.commons.io.FileUtils; +import org.junit.Test; + +public class CarbonSchemaReaderTest extends TestCase { + + @Test + public void testReadSchemaFromDataFile() { + String path = "./testWriteFiles"; + try { + FileUtils.deleteDirectory(new File(path)); + + Field[] fields = new Field[11]; + fields[0] = new Field("stringField", DataTypes.STRING); --- End diff -- The writer setup can be moved into the setUp() step. --- |
Github user ajantha-bhat commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2804#discussion_r229173861 --- Diff: store/sdk/src/test/java/org/apache/carbondata/sdk/file/CarbonSchemaReaderTest.java --- @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.sdk.file; + +import java.io.File; +import java.util.HashMap; +import java.util.Map; + +import junit.framework.TestCase; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.commons.io.FileUtils; +import org.junit.Test; + +public class CarbonSchemaReaderTest extends TestCase { + + @Test + public void testReadSchemaFromDataFile() { + String path = "./testWriteFiles"; + try { + FileUtils.deleteDirectory(new File(path)); + + Field[] fields = new Field[11]; + fields[0] = new Field("stringField", DataTypes.STRING); + fields[1] = new Field("shortField", DataTypes.SHORT); + fields[2] = new Field("intField", DataTypes.INT); + fields[3] = new Field("longField", DataTypes.LONG); + fields[4] = new Field("doubleField", DataTypes.DOUBLE); + fields[5] = new Field("boolField", DataTypes.BOOLEAN); + fields[6] = new Field("dateField", DataTypes.DATE); + fields[7] = new Field("timeField", DataTypes.TIMESTAMP); + fields[8] = new Field("decimalField", DataTypes.createDecimalType(8, 2)); + fields[9] = new Field("varcharField", DataTypes.VARCHAR); + fields[10] = new Field("arrayField", DataTypes.createArrayType(DataTypes.STRING)); + Map<String, String> map = new HashMap<>(); + map.put("complex_delimiter_level_1", "#"); + CarbonWriter writer = CarbonWriter.builder() + .outputPath(path) + .withLoadOptions(map) + .withCsvInput(new Schema(fields)).build(); + + for (int i = 0; i < 10; i++) { + String[] row2 = new String[]{ + "robot" + (i % 10), + String.valueOf(i % 10000), + String.valueOf(i), + String.valueOf(Long.MAX_VALUE - i), + String.valueOf((double) i / 2), + String.valueOf(true), + "2019-03-02", + "2019-02-12 03:03:34", + "12.345", + "varchar", + "Hello#World#From#Carbon" + }; + writer.write(row2); + } + writer.close(); + + Schema schema = CarbonSchemaReader + .readSchemaInDataFile(path) + .asOriginOrder(); + // Transform the schema + assertEquals(schema.getFields().length, 11); + String[] strings = new String[schema.getFields().length]; + for (int i = 0; i < schema.getFields().length; i++) { + strings[i] = (schema.getFields())[i].getFieldName(); + } + assert (strings[0].equalsIgnoreCase("stringField")); + assert (strings[1].equalsIgnoreCase("shortField")); + assert (strings[2].equalsIgnoreCase("intField")); + assert (strings[3].equalsIgnoreCase("longField")); + assert (strings[4].equalsIgnoreCase("doubleField")); --- End diff -- This can be moved into a helper method and used for both test cases. --- |
Github user ajantha-bhat commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2804#discussion_r229173916 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonSchemaReader.java --- @@ -59,11 +60,30 @@ public static Schema readSchemaInSchemaFile(String schemaFilePath) throws IOExce /** * Read carbondata file and return the schema * - * @param dataFilePath complete path including carbondata file name + * @param path complete path including carbondata file name * @return Schema object * @throws IOException */ - public static Schema readSchemaInDataFile(String dataFilePath) throws IOException { + public static Schema readSchemaInDataFile(String path) throws IOException { + String dataFilePath = path; + if (!(dataFilePath.contains(".carbondata"))) { + CarbonFile[] carbonFiles = FileFactory + .getCarbonFile(path) + .listFiles(new CarbonFileFilter() { + @Override + public boolean accept(CarbonFile file) { + if (file == null) { + return false; + } + return file.getName().endsWith(".carbondata"); + } + }); + if (carbonFiles == null || carbonFiles.length < 1) { + throw new RuntimeException("Carbon data file not exists."); + } + dataFilePath = carbonFiles[0].getAbsolutePath(); --- End diff -- In that case you can implement String getFirstCarbonFile(path, extensionType) and pass it to the existing method. readSchemaFromFile() must only read; it should not do any extra work. --- |
Github user xubo245 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2804#discussion_r229194343 --- Diff: store/sdk/src/test/java/org/apache/carbondata/sdk/file/CarbonSchemaReaderTest.java --- @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.sdk.file; + +import java.io.File; +import java.util.HashMap; +import java.util.Map; + +import junit.framework.TestCase; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.commons.io.FileUtils; +import org.junit.Test; + +public class CarbonSchemaReaderTest extends TestCase { + + @Test + public void testReadSchemaFromDataFile() { + String path = "./testWriteFiles"; + try { + FileUtils.deleteDirectory(new File(path)); + + Field[] fields = new Field[11]; + fields[0] = new Field("stringField", DataTypes.STRING); --- End diff -- ok, done --- |
Github user xubo245 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2804#discussion_r229194361 --- Diff: store/sdk/src/test/java/org/apache/carbondata/sdk/file/CarbonSchemaReaderTest.java --- @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.sdk.file; + +import java.io.File; +import java.util.HashMap; +import java.util.Map; + +import junit.framework.TestCase; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.commons.io.FileUtils; +import org.junit.Test; + +public class CarbonSchemaReaderTest extends TestCase { + + @Test + public void testReadSchemaFromDataFile() { + String path = "./testWriteFiles"; + try { + FileUtils.deleteDirectory(new File(path)); + + Field[] fields = new Field[11]; + fields[0] = new Field("stringField", DataTypes.STRING); + fields[1] = new Field("shortField", DataTypes.SHORT); + fields[2] = new Field("intField", DataTypes.INT); + fields[3] = new Field("longField", DataTypes.LONG); + fields[4] = new Field("doubleField", DataTypes.DOUBLE); + fields[5] = new Field("boolField", DataTypes.BOOLEAN); + fields[6] = new Field("dateField", DataTypes.DATE); + fields[7] = new Field("timeField", DataTypes.TIMESTAMP); + fields[8] = new Field("decimalField", DataTypes.createDecimalType(8, 2)); + fields[9] = new Field("varcharField", DataTypes.VARCHAR); + fields[10] = new Field("arrayField", DataTypes.createArrayType(DataTypes.STRING)); + Map<String, String> map = new HashMap<>(); + map.put("complex_delimiter_level_1", "#"); + CarbonWriter writer = CarbonWriter.builder() + .outputPath(path) + .withLoadOptions(map) + .withCsvInput(new Schema(fields)).build(); + + for (int i = 0; i < 10; i++) { + String[] row2 = new String[]{ + "robot" + (i % 10), + String.valueOf(i % 10000), + String.valueOf(i), + String.valueOf(Long.MAX_VALUE - i), + String.valueOf((double) i / 2), + String.valueOf(true), + "2019-03-02", + "2019-02-12 03:03:34", + "12.345", + "varchar", + "Hello#World#From#Carbon" + }; + writer.write(row2); + } + writer.close(); + + Schema schema = CarbonSchemaReader + .readSchemaInDataFile(path) + .asOriginOrder(); + // Transform the schema + assertEquals(schema.getFields().length, 11); + String[] strings = new String[schema.getFields().length]; + for (int i = 0; i < schema.getFields().length; i++) { + strings[i] = (schema.getFields())[i].getFieldName(); + } + assert (strings[0].equalsIgnoreCase("stringField")); + assert (strings[1].equalsIgnoreCase("shortField")); + assert (strings[2].equalsIgnoreCase("intField")); + assert (strings[3].equalsIgnoreCase("longField")); + assert (strings[4].equalsIgnoreCase("doubleField")); --- End diff -- ok, done --- |