Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] carbondata pull request #1177: [CARBONDATA-1281] Support multiple temp dirs ...

Classic

List

59 messages Options

Options

123

[GitHub] carbondata pull request #1177: [CARBONDATA-1281] Support multiple temp dirs ...

GitHub user xuchuanyin opened a pull request:

https://github.com/apache/carbondata/pull/1177

[CARBONDATA-1281] Support multiple temp dirs for writing files while loading data

# Modifications
This feature mainly focus on avoiding disk hot-spot in single massive data loading, changes are made in two parts:

1. randomly choose a yarn local folder to write sort temp file in sort-process;

2.randomly choose a yarn local folder to write carbondata file in write-process.

# Usage

To enable this feature, user should enable `carbon.using.multi.temp.dir=true` and `carbon.use.local.dir=true`.

# Performance
In my case, this feature improves the loading performance from 35M/s/node to 70+M/s/node

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuchuanyin/carbondata local_multiple_temp_dirs_4_load

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/carbondata/pull/1177.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1177

----
commit ee5e00cdae5c64fcfad7d7cb1b1127dc5bb79bb4
Author: xuchuanyin <[hidden email]>
Date: 2017-07-17T13:44:49Z

Support multiple temp dirs for writing files while loading data

randomly choose a dir to write sort temp files

randomly choose a dir to write carbondata files

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1177: [CARBONDATA-1281] Support multiple temp dirs for wri...

Github user asfgit commented on the issue:

https://github.com/apache/carbondata/pull/1177

Can one of the admins verify this patch?

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1177: [CARBONDATA-1281] Support multiple temp dirs for wri...

In reply to this post by qiuchenjian-2

Github user asfgit commented on the issue:

https://github.com/apache/carbondata/pull/1177

Can one of the admins verify this patch?

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1177: [CARBONDATA-1281] Support multiple temp dirs for wri...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1177

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3092/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1177: [CARBONDATA-1281] Support multiple temp dirs for wri...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1177

Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/502/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1177: [CARBONDATA-1281] Support multiple temp dirs ...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1177#discussion_r128147381

--- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
@@ -1308,6 +1308,18 @@
public static final String CARBON_LEASE_RECOVERY_RETRY_INTERVAL =
"carbon.lease.recovery.retry.interval";

+ /**
+ * whether to use multi directories when loading data,
+ * the main purpose is to avoid single-disk-hot-spot
+ */
+ @CarbonProperty
+ public static final String CARBON_USING_MULTI_TEMP_DIR = "carbon.using.multi.temp.dir";
+
+ /**
+ * default value for multi temp dir
+ */
+ public static final String CARBON_SUING_MULTI_TEMP_DIR_DEFAULT = "false";
--- End diff --

CARBON_SUING_MULTI_TEMP_DIR_DEFAULT or CARBON_UING_MULTI_TEMP_DIR_DEFAULT ?

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1177: [CARBONDATA-1281] Support multiple temp dirs for wri...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on the issue:

https://github.com/apache/carbondata/pull/1177

@xuchuanyin Thanks for your very good PR.
A couple comments:
1. How to enable this feature(how to use), please consider updating the related document accordingly
2. Please consider adding new test cases for this feature (multiple temp dirs)

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1177: [CARBONDATA-1281] Support multiple temp dirs for wri...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1177

Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/531/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1177: [CARBONDATA-1281] Support multiple temp dirs for wri...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1177

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3122/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1177: [CARBONDATA-1281] Support multiple temp dirs for wri...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/1177

The tests are in another local branch in my repo, I'll test it tomorrow.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1177: [CARBONDATA-1281] Support multiple temp dirs for wri...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1177

Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/556/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1177: [CARBONDATA-1281] Support multiple temp dirs for wri...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1177

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3149/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1177: [CARBONDATA-1281] Support multiple temp dirs for wri...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/1177

@chenliang613 Related document are updated and tests are added. Please review.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1177: [CARBONDATA-1281] Support multiple temp dirs ...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1177#discussion_r128555999

--- Diff: docs/useful-tips-on-carbondata.md ---
@@ -231,5 +231,6 @@ scenarios. After the completion of POC, some of the configurations impacting the
| spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In the bank scenario, we provide the 4 CPUs cores and 15 GB for each executor which can get good performance. This 2 value does not mean more the better. It needs to be configured properly in case of limited resources. For example, In the bank scenario, it has enough CPU 32 cores each node but less memory 64 GB each node. So we cannot give more CPU but less memory. For example, when 4 cores and 12GB for each executor. It sometimes happens GC during the query which impact the query performance very much from the 3 second to more than 15 seconds. In this scenario need to increase the memory or decrease the CPU cores. |
| carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records, returned from the block scan. | In limit scenario this parameter is very important. For example your query limit is 1000. But if we set this value to 3000 that means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the performance increase about 2 times in comparison to if we set this value to 12000. |
| carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether use YARN local directories for multi-table load disk load balance | If this is set it to true CarbonData will use YARN local directories for multi-table load disk load balance, that will improve the data load performance. |
+| carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use YARN local directories during data load for disk load balance, that will improve the data load performance. Please enable this property especially when you encounter disk hotspot problem during data loading. |
--- End diff --

So , by default, you propose to set "true" or "false"?

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1177: [CARBONDATA-1281] Support multiple temp dirs ...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1177#discussion_r128556528

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/sort/unsafe/UnsafeSortDataRows.java ---
@@ -345,9 +350,12 @@ public DataSorterAndWriter(UnsafeCarbonRowPage rowPage) {
}
if (rowPage.isSaveToDisk()) {
// create a new file every time
+ // create a new file and pick a temp directory randomly every time
--- End diff --

please merge the two comments into one

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1177: [CARBONDATA-1281] Support multiple temp dirs ...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1177#discussion_r128558208

--- Diff: integration/spark-common-test/src/test/resources/multi_dir/data.csv ---
@@ -0,0 +1,10 @@
+1,name1
+2,name2
--- End diff --

Can you reuse the current test csv file, don't suggest adding new csv file.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1177: [CARBONDATA-1281] Support multiple temp dirs ...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1177#discussion_r128560327

--- Diff: integration/spark-common-test/src/test/scala/org/apache/spark/util/SparkUtil4Test.scala ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import org.apache.spark.SparkConf
+
+/**
+ * This class is for accessing utils in spark package for tests
+ */
+object SparkUtil4Test {
--- End diff --

Whether can merge "SparkUtil4Test.scala's functions " into TestLoadDataWithYarnLocalDirs.scala,or not ? because only be used by TestLoadDataWithYarnLocalDirs.scala.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1177: [CARBONDATA-1281] Support multiple temp dirs ...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1177#discussion_r128664527

--- Diff: integration/spark-common-test/src/test/resources/multi_dir/data.csv ---
@@ -0,0 +1,10 @@
+1,name1
+2,name2
--- End diff --

OK, I'll find suitable one

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata pull request #1177: [CARBONDATA-1281] Support multiple temp dirs ...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1177#discussion_r128665149

--- Diff: docs/useful-tips-on-carbondata.md ---
@@ -231,5 +231,6 @@ scenarios. After the completion of POC, some of the configurations impacting the
| spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In the bank scenario, we provide the 4 CPUs cores and 15 GB for each executor which can get good performance. This 2 value does not mean more the better. It needs to be configured properly in case of limited resources. For example, In the bank scenario, it has enough CPU 32 cores each node but less memory 64 GB each node. So we cannot give more CPU but less memory. For example, when 4 cores and 12GB for each executor. It sometimes happens GC during the query which impact the query performance very much from the 3 second to more than 15 seconds. In this scenario need to increase the memory or decrease the CPU cores. |
| carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records, returned from the block scan. | In limit scenario this parameter is very important. For example your query limit is 1000. But if we set this value to 3000 that means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the performance increase about 2 times in comparison to if we set this value to 12000. |
| carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether use YARN local directories for multi-table load disk load balance | If this is set it to true CarbonData will use YARN local directories for multi-table load disk load balance, that will improve the data load performance. |
+| carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use YARN local directories during data load for disk load balance, that will improve the data load performance. Please enable this property especially when you encounter disk hotspot problem during data loading. |
--- End diff --

1. The default value is false, just the same as the property `carbon.use.local.dir`

2. Currently it only works in `LOAD DATA INPATH` (that's my use case). In other use case, suck like update/merge/load-using-global-sort, I haven't studied and changed the related code, so leave this property `default` will not change the existing behavior.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] carbondata issue #1177: [CARBONDATA-1281] Support multiple temp dirs for wri...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1177

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3160/

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

123