Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] carbondata pull request #1245: [CARBONDATA-1366]Change rdd storage level to ...

Classic

List

26 messages Options

Options

12

[GitHub] carbondata pull request #1245: [CARBONDATA-1366]Change rdd storage level to ...

GitHub user zzcclp opened a pull request:

https://github.com/apache/carbondata/pull/1245

[CARBONDATA-1366]Change rdd storage level to 'MEMORY_AND_DISK_SER' to improve loading performance when sort_scope=global_sort

My testing env and configs are as followings:

**Env:**
6 executors, 9G mem + 6 cores per executor

**Configs:**
SINGLE_PASS=true
SORT_SCOPE=GLOBAL_SORT
spark.memory.fraction=0.5

if using 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)' in method 'org.apache.carbondata.spark.load.DataLoadProcessBuilderOnSpark.loadDataUsingGlobalSort', it takes about **7.2** min to load 144136697 lines (10.9 G parquet files), and if using 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK)', it takes about **9.5** min to load 144136697 lines.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zzcclp/carbondata CARBONDATA-1366

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/carbondata/pull/1245.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1245

----
commit b50a3bb590d53e7749665e2f20b6a78ca963561f
Author: Zhang Zhichao <[hidden email]>
Date: 2017-08-08T09:42:40Z

[CARBONDATA-1366]Change rdd storage level to 'MEMORY_AND_DISK_SER' to improve loading performance when sort_scope=global_sort

Change rdd storage level to 'MEMORY_AND_DISK_SER' to improve loading performance when sort_scope=global_sort

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]Change rdd storage level to 'MEMORY...

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1245

Can one of the admins verify this patch?

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]Change rdd storage level to 'MEMORY...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/1245

Would it be better to make a configuration (or read configuration from spark) for thisï¼cc @jackylk

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]Change rdd storage level to 'MEMORY...

In reply to this post by qiuchenjian-2

Github user watermen commented on the issue:

https://github.com/apache/carbondata/pull/1245

@jackylk Agreed with @xuchuanyin, Sparkâs storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. So different environment correspond to different storage level. So here we'd better make a conf named 'storage_level', and the default value of it is MEMORY_ONLY(the same to default value in spark). We can get more info about storage level in spark website.(http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence)

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]Change rdd storage level to 'MEMORY...

In reply to this post by qiuchenjian-2

Github user zzcclp commented on the issue:

https://github.com/apache/carbondata/pull/1245

@jackylk @watermen @xuchuanyin thanks for your suggestion, I have added option 'carbon.global.sort.rdd.storage.level', please review.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

In reply to this post by qiuchenjian-2

Github user zzcclp commented on the issue:

https://github.com/apache/carbondata/pull/1245

retest this please

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1245

SDV Build Success with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/157/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on the issue:

https://github.com/apache/carbondata/pull/1245

retest this please

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on the issue:

https://github.com/apache/carbondata/pull/1245

retest this please

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1245

SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/163/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1245

SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/166/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

In reply to this post by qiuchenjian-2

Github user zzcclp commented on the issue:

https://github.com/apache/carbondata/pull/1245

retest this please

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

In reply to this post by qiuchenjian-2

Github user zzcclp commented on the issue:

https://github.com/apache/carbondata/pull/1245

retest this please

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

In reply to this post by qiuchenjian-2

Github user zzcclp commented on the issue:

https://github.com/apache/carbondata/pull/1245

retest this please

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1245

SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/171/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1245

SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/170/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1245: [CARBONDATA-1366]add an option 'carbon.global.sort.r...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1245

SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/172/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1245: [CARBONDATA-1366]add an option 'carbon.global...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1245#discussion_r132364478

--- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java ---
@@ -876,6 +876,24 @@ public boolean isUseMultiTempDir() {
}

/**
+ * Return valid storage level
+ * @return String
+ */
+ public String getGlobalSortRddStorageLevel() {
+ String storageLevel = getProperty(CarbonCommonConstants.CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL,
+ CarbonCommonConstants.CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL_DEFAULT);
+ boolean validateStorageLevel = CarbonUtil.isValidStorageLevel(storageLevel);
+ if (!validateStorageLevel) {
+ LOGGER.info("The " + CarbonCommonConstants.CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL
--- End diff --

change to error log

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1245: [CARBONDATA-1366]add an option 'carbon.global...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1245#discussion_r132364666

--- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonUtil.java ---
@@ -1714,6 +1714,45 @@ public static boolean isValidSortOption(String sortScope) {
}

/**
+ * validate the storage level
+ * @param storageLevel
+ * @return boolean
+ */
+ public static boolean isValidStorageLevel(String storageLevel) {
+ if (null == storageLevel || storageLevel.trim().equals("")) {
+ return false;
+ }
+ switch (storageLevel.toUpperCase()) {
+ case "DISK_ONLY":
+ return true;
--- End diff --

I think for all cases you can keep one `return true` is enough

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1245: [CARBONDATA-1366]add an option 'carbon.global...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1245#discussion_r132364955

--- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
@@ -1304,6 +1304,18 @@
*/
public static final String CARBON_USE_MULTI_TEMP_DIR_DEFAULT = "false";

+ /**
+ * Which storage level to persist rdd when sort_scope=global_sort
+ */
+ @CarbonProperty
+ public static final String CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL =
+ "carbon.global.sort.rdd.storage.level";
+
+ /**
+ * default value for carbon.global.sort.rdd.storage.level
--- End diff --

In the comment, mention that the default value is designed for machine with big memory, if user environment has less memory, set the `CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL` according to `MEMORY_AND_DISK` (see spark recommendation) and give a URL in the comment

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

12