GitHub user ravipesala opened a pull request:
https://github.com/apache/carbondata/pull/2661

    [CARBONDATA-2888] Support multi level subfolder for SDK read and fileformat read

    This PR supports multi-level subfolder reads for the SDK reader and Spark's carbon file format reader.

    Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

    - [ ] Any interfaces changed?
    - [ ] Any backward compatibility impacted?
    - [ ] Document update required?
    - [ ] Testing done
          Please provide details on
          - Whether new unit test cases have been added or why no new tests are required?
          - How is it tested? Please attach the test report.
          - Is it a performance-related change? Please attach the performance test report.
          - Any additional information to help reviewers in testing this change.
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/incubator-carbondata sdk-multi-folder-support

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2661.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2661

----
commit a71189330b223a76c0b40ee8abf28ce2fd9733c1
Author: ravipesala <ravi.pesala@...>
Date: 2018-08-26T17:17:14Z

    Support multi level subfolder for SDK read and fileformat read
----
---
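For context, a minimal sketch of how a store with nested subfolders could be consumed through the SDK once this change is in. It assumes the builder-style CarbonReader API of the CarbonData SDK around the time of this PR (builder/projection/build/readNextRow); the store path and projection column names are illustrative assumptions, not values from the PR:

    import org.apache.carbondata.sdk.file.CarbonReader

    // Hypothetical store root whose carbon data/index files sit in
    // nested subfolders rather than directly under the root.
    val path = "/tmp/carbon-store/nested"

    // Build a reader over the root folder; with this PR the reader also
    // discovers files in subfolders below the root.
    val reader: CarbonReader[AnyRef] = CarbonReader.builder(path)
      .projection(Array("name", "age"))
      .build()

    // Iterate the rows until exhausted, then release the reader.
    while (reader.hasNext) {
      val row = reader.readNextRow.asInstanceOf[Array[AnyRef]]
      println(row.mkString(","))
    }
    reader.close()

---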
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2661 SDV Build Failed, please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6413/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2661 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8072/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2661 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/14/ ---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2661#discussion_r212897009

    --- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/SegmentIndexFileStore.java ---
    @@ -338,6 +338,24 @@ private MergedBlockIndex readMergeBlockIndex(ThriftReader thriftReader) throws I
         });
       }
    +  /**
    +   * List all the index files of the segment.
    +   *
    +   * @param carbonFile directory
    +   */
    +  public static void getCarbonIndexFilesRecursively(CarbonFile carbonFile,
    --- End diff --

    Why not return the list instead of passing it as a parameter?

---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2661#discussion_r212897508

    --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/table/CarbonTable.java ---
    @@ -268,6 +257,18 @@ public boolean accept(CarbonFile file) {
         return CarbonTable.buildFromTableInfo(tableInfoInfer);
       }
    +  private static CarbonFile getFirstIndexFile(CarbonFile tablePath) {
    +    CarbonFile[] carbonFiles = tablePath.listFiles();
    +    for (CarbonFile carbonFile : carbonFiles) {
    +      if (carbonFile.isDirectory()) {
    +        return getFirstIndexFile(carbonFile);
    +      } else if (carbonFile.getName().endsWith(CarbonTablePath.INDEX_FILE_EXT)) {
    +        return carbonFile;
    +      }
    +    }
    +    return null;
    --- End diff --

    Why not just throw an exception if the file does not exist?

---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2661#discussion_r213570439

    --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonFileIndexReplaceRule.scala ---
    @@ -82,4 +82,23 @@ class CarbonFileIndexReplaceRule extends Rule[LogicalPlan] {
           fileIndex
         }
       }
    +
    +  /**
    +   * Get datafolders recursively
    +   */
    +  private def getDataFolders(carbonFile: CarbonFile): Seq[CarbonFile] = {
    --- End diff --

    Is the input CarbonFile a table folder? If so, please rename it to tableFolder.

---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2661#discussion_r213574978

    --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonFileIndexReplaceRule.scala ---
    @@ -82,4 +82,23 @@ class CarbonFileIndexReplaceRule extends Rule[LogicalPlan] {
           fileIndex
         }
       }
    +
    +  /**
    +   * Get datafolders recursively
    +   */
    +  private def getDataFolders(carbonFile: CarbonFile): Seq[CarbonFile] = {
    +    val files = carbonFile.listFiles()
    +    var folders: Seq[CarbonFile] = Seq()
    +    files.foreach { f =>
    +      if (f.isDirectory) {
    +        val files = f.listFiles()
    +        if (files.nonEmpty && !files(0).isDirectory) {
    +          folders = Seq(f) ++ folders
    --- End diff --

    Will this create too many immutable objects? Would an ArrayBuffer be better?

---
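For illustration, the ArrayBuffer variant suggested here might look like the sketch below (also renaming the parameter to tableFolder, per the earlier review comment). This is a hypothetical rewrite against the same CarbonFile listing API as the diff above, not code from the PR:

    import scala.collection.mutable.ArrayBuffer
    import org.apache.carbondata.core.datastore.filesystem.CarbonFile

    // One mutable buffer is threaded through the recursion instead of
    // prepending a fresh immutable Seq at every level.
    def getDataFolders(
        tableFolder: CarbonFile,
        folders: ArrayBuffer[CarbonFile] = ArrayBuffer()): Seq[CarbonFile] = {
      tableFolder.listFiles().foreach { f =>
        if (f.isDirectory) {
          val children = f.listFiles()
          if (children.nonEmpty && !children(0).isDirectory) {
            folders += f               // leaf folder holding data files
          } else {
            getDataFolders(f, folders) // descend into nested subfolders
          }
        }
      }
      folders
    }

---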
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2661#discussion_r214273342

    --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonFileIndexReplaceRule.scala ---
    @@ -82,4 +82,23 @@ class CarbonFileIndexReplaceRule extends Rule[LogicalPlan] {
           fileIndex
         }
       }
    +
    +  /**
    +   * Get datafolders recursively
    +   */
    +  private def getDataFolders(carbonFile: CarbonFile): Seq[CarbonFile] = {
    +    val files = carbonFile.listFiles()
    +    var folders: Seq[CarbonFile] = Seq()
    +    files.foreach { f =>
    +      if (f.isDirectory) {
    +        val files = f.listFiles()
    +        if (files.nonEmpty && !files(0).isDirectory) {
    +          folders = Seq(f) ++ folders
    +        } else {
    +          folders = getDataFolders(f) ++ folders
    --- End diff --

    This statement can be moved under the files.nonEmpty check.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2661#discussion_r214737914

    --- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/SegmentIndexFileStore.java ---
    @@ -338,6 +338,24 @@ private MergedBlockIndex readMergeBlockIndex(ThriftReader thriftReader) throws I
         });
       }
    +  /**
    +   * List all the index files of the segment.
    +   *
    +   * @param carbonFile directory
    +   */
    +  public static void getCarbonIndexFilesRecursively(CarbonFile carbonFile,
    --- End diff --

    It's a recursive call, and it adds files to the passed list as it finds them.

---
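In other words, the accumulator pattern described here looks roughly as follows. This is a Scala paraphrase of the Java method in the diff, under the same CarbonFile/CarbonTablePath APIs, not the exact PR code:

    import java.util
    import org.apache.carbondata.core.datastore.filesystem.CarbonFile
    import org.apache.carbondata.core.util.path.CarbonTablePath

    // The caller allocates a single list and every recursive call appends
    // into it, so no per-level list creation or merging is needed.
    def getCarbonIndexFilesRecursively(
        carbonFile: CarbonFile,
        indexFiles: util.List[CarbonFile]): Unit = {
      carbonFile.listFiles().foreach { f =>
        if (f.isDirectory) {
          getCarbonIndexFilesRecursively(f, indexFiles)
        } else if (f.getName.endsWith(CarbonTablePath.INDEX_FILE_EXT)) {
          indexFiles.add(f)
        }
      }
    }

---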
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2661#discussion_r214737987

    --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/table/CarbonTable.java ---
    @@ -268,6 +257,18 @@ public boolean accept(CarbonFile file) {
         return CarbonTable.buildFromTableInfo(tableInfoInfer);
       }
    +  private static CarbonFile getFirstIndexFile(CarbonFile tablePath) {
    +    CarbonFile[] carbonFiles = tablePath.listFiles();
    +    for (CarbonFile carbonFile : carbonFiles) {
    +      if (carbonFile.isDirectory()) {
    +        return getFirstIndexFile(carbonFile);
    +      } else if (carbonFile.getName().endsWith(CarbonTablePath.INDEX_FILE_EXT)) {
    +        return carbonFile;
    +      }
    +    }
    +    return null;
    --- End diff --

    It is handled in the caller method.

---
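For completeness, the caller-side handling referred to here could look like the sketch below: the lookup itself returns null, and the caller converts a missing index file into an explicit error. This is a hedged Scala paraphrase, not the literal CarbonTable code; the variant below also keeps scanning sibling entries when a subfolder yields nothing:

    import org.apache.carbondata.core.datastore.filesystem.CarbonFile
    import org.apache.carbondata.core.util.path.CarbonTablePath

    // Depth-first search for the first index file under the table path.
    def getFirstIndexFile(folder: CarbonFile): CarbonFile = {
      folder.listFiles().foreach { f =>
        if (f.isDirectory) {
          val found = getFirstIndexFile(f)
          if (found != null) return found
        } else if (f.getName.endsWith(CarbonTablePath.INDEX_FILE_EXT)) {
          return f
        }
      }
      null
    }

    // Hypothetical caller: turn the null into an explicit error instead of
    // letting it propagate.
    def requireFirstIndexFile(tablePath: CarbonFile): CarbonFile = {
      val indexFile = getFirstIndexFile(tablePath)
      if (indexFile == null) {
        throw new IllegalArgumentException(
          "No index file found under " + tablePath.getAbsolutePath)
      }
      indexFile
    }

---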
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2661#discussion_r214738056

    --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonFileIndexReplaceRule.scala ---
    @@ -82,4 +82,23 @@ class CarbonFileIndexReplaceRule extends Rule[LogicalPlan] {
           fileIndex
         }
       }
    +
    +  /**
    +   * Get datafolders recursively
    +   */
    +  private def getDataFolders(carbonFile: CarbonFile): Seq[CarbonFile] = {
    --- End diff --

    ok

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2661#discussion_r214738731

    --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonFileIndexReplaceRule.scala ---
    @@ -82,4 +82,23 @@ class CarbonFileIndexReplaceRule extends Rule[LogicalPlan] {
           fileIndex
         }
       }
    +
    +  /**
    +   * Get datafolders recursively
    +   */
    +  private def getDataFolders(carbonFile: CarbonFile): Seq[CarbonFile] = {
    +    val files = carbonFile.listFiles()
    +    var folders: Seq[CarbonFile] = Seq()
    +    files.foreach { f =>
    +      if (f.isDirectory) {
    +        val files = f.listFiles()
    +        if (files.nonEmpty && !files(0).isDirectory) {
    +          folders = Seq(f) ++ folders
    --- End diff --

    ok

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2661 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8273/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2661 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/202/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2661 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/207/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2661 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8278/ ---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2661#discussion_r214899510

    --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonFileIndexReplaceRule.scala ---
    @@ -82,4 +82,23 @@ class CarbonFileIndexReplaceRule extends Rule[LogicalPlan] {
           fileIndex
         }
       }
    +
    +  /**
    +   * Get datafolders recursively
    +   */
    +  private def getDataFolders(carbonFile: CarbonFile): Seq[CarbonFile] = {
    +    val files = carbonFile.listFiles()
    +    var folders: Seq[CarbonFile] = Seq()
    +    files.foreach { f =>
    +      if (f.isDirectory) {
    +        val files = f.listFiles()
    +        if (files.nonEmpty && !files(0).isDirectory) {
    +          folders = Seq(f) ++ folders
    +        } else {
    +          folders = getDataFolders(f) ++ folders
    --- End diff --

    In the next call to getDataFolders() for an empty folder, the iteration is skipped, so there is no problem with the logic. You can ignore the comment.

---
Github user KanakaKumar commented on the issue:
https://github.com/apache/carbondata/pull/2661 LGTM ---