GitHub user zzcclp opened a pull request:
https://github.com/apache/carbondata/pull/1245 [CARBONDATA-1366]Change rdd storage level to 'MEMORY_AND_DISK_SER' to improve loading performance when sort_scope=global_sort My testing env and configs are as followings: **Env:** 6 executors, 9G mem + 6 cores per executor **Configs:** SINGLE_PASS=true SORT_SCOPE=GLOBAL_SORT spark.memory.fraction=0.5 if using 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)' in method 'org.apache.carbondata.spark.load.DataLoadProcessBuilderOnSpark.loadDataUsingGlobalSort', it takes about **7.2** min to load 144136697 lines (10.9 G parquet files), and if using 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK)', it takes about **9.5** min to load 144136697 lines. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zzcclp/carbondata CARBONDATA-1366 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/1245.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1245 ---- commit b50a3bb590d53e7749665e2f20b6a78ca963561f Author: Zhang Zhichao <[hidden email]> Date: 2017-08-08T09:42:40Z [CARBONDATA-1366]Change rdd storage level to 'MEMORY_AND_DISK_SER' to improve loading performance when sort_scope=global_sort Change rdd storage level to 'MEMORY_AND_DISK_SER' to improve loading performance when sort_scope=global_sort ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/1245 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/1245 Would it be better to make a configuration (or read configuration from spark) for thisï¼cc @jackylk --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/1245 @jackylk Agreed with @xuchuanyin, Sparkâs storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. So different environment correspond to different storage level. So here we'd better make a conf named 'storage_level', and the default value of it is MEMORY_ONLY(the same to default value in spark). We can get more info about storage level in spark website.(http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user zzcclp commented on the issue:
https://github.com/apache/carbondata/pull/1245 @jackylk @watermen @xuchuanyin thanks for your suggestion, I have added option 'carbon.global.sort.rdd.storage.level', please review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user zzcclp commented on the issue:
https://github.com/apache/carbondata/pull/1245 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1245 SDV Build Success with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/157/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user chenliang613 commented on the issue:
https://github.com/apache/carbondata/pull/1245 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user chenliang613 commented on the issue:
https://github.com/apache/carbondata/pull/1245 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1245 SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/163/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1245 SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/166/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user zzcclp commented on the issue:
https://github.com/apache/carbondata/pull/1245 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user zzcclp commented on the issue:
https://github.com/apache/carbondata/pull/1245 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user zzcclp commented on the issue:
https://github.com/apache/carbondata/pull/1245 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/1245 SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/171/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1245 SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/170/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/1245 SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/172/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1245#discussion_r132364478 --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java --- @@ -876,6 +876,24 @@ public boolean isUseMultiTempDir() { } /** + * Return valid storage level + * @return String + */ + public String getGlobalSortRddStorageLevel() { + String storageLevel = getProperty(CarbonCommonConstants.CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL, + CarbonCommonConstants.CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL_DEFAULT); + boolean validateStorageLevel = CarbonUtil.isValidStorageLevel(storageLevel); + if (!validateStorageLevel) { + LOGGER.info("The " + CarbonCommonConstants.CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL --- End diff -- change to error log --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1245#discussion_r132364666 --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonUtil.java --- @@ -1714,6 +1714,45 @@ public static boolean isValidSortOption(String sortScope) { } /** + * validate the storage level + * @param storageLevel + * @return boolean + */ + public static boolean isValidStorageLevel(String storageLevel) { + if (null == storageLevel || storageLevel.trim().equals("")) { + return false; + } + switch (storageLevel.toUpperCase()) { + case "DISK_ONLY": + return true; --- End diff -- I think for all cases you can keep one `return true` is enough --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
In reply to this post by qiuchenjian-2
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1245#discussion_r132364955 --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java --- @@ -1304,6 +1304,18 @@ */ public static final String CARBON_USE_MULTI_TEMP_DIR_DEFAULT = "false"; + /** + * Which storage level to persist rdd when sort_scope=global_sort + */ + @CarbonProperty + public static final String CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL = + "carbon.global.sort.rdd.storage.level"; + + /** + * default value for carbon.global.sort.rdd.storage.level --- End diff -- In the comment, mention that the default value is designed for machine with big memory, if user environment has less memory, set the `CARBON_GLOBAL_SORT_RDD_STORAGE_LEVEL` according to `MEMORY_AND_DISK` (see spark recommendation) and give a URL in the comment --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. --- |
Free forum by Nabble | Edit this page |