[GitHub] incubator-carbondata pull request #620: [WIP]Added batch sort to improve the...

GitHub user ravipesala opened a pull request:

https://github.com/apache/incubator-carbondata/pull/620

[WIP]Added batch sort to improve the loading performance

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ravipesala/incubator-carbondata batch-sort

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/620.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #620

----
commit e440bce45913ea3a643ac647d245b130f73db3dd
Author: ravipesala <[hidden email]>
Date: 2017-03-01T16:27:32Z

Added batch sort to improve the loading performance

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #620: [WIP]Added batch sort to improve the loadin...

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

Build Failed with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/987/

[GitHub] incubator-carbondata issue #620: [WIP]Added batch sort to improve the loadin...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

Build Failed with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/988/

[GitHub] incubator-carbondata issue #620: [WIP]Added batch sort to improve the loadin...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

Build Failed with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/989/

[GitHub] incubator-carbondata issue #620: [WIP]Added batch sort to improve the loadin...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

Build Failed with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/990/

[GitHub] incubator-carbondata issue #620: [WIP]Added batch sort to improve the loadin...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

retest this please

[GitHub] incubator-carbondata issue #620: [WIP]Added batch sort to improve the loadin...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

Build Failed with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/991/

[GitHub] incubator-carbondata issue #620: [WIP]Added batch sort to improve the loadin...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/994/

[GitHub] incubator-carbondata issue #620: [WIP]Added batch sort to improve the loadin...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

Build Failed with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/995/

[GitHub] incubator-carbondata issue #620: [WIP]Added batch sort to improve the loadin...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/998/

[GitHub] incubator-carbondata issue #620: [WIP]Added batch sort to improve the loadin...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/999/

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742]Added batch sort to ...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r104316373

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/sort/impl/UnsafeBatchParallelReadMergeSorterImpl.java ---
@@ -0,0 +1,270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.processing.newflow.sort.impl;
+
+import java.util.Iterator;
+import java.util.List;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.Callable;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.LinkedBlockingQueue;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.concurrent.atomic.AtomicLong;
+
+import org.apache.carbondata.common.CarbonIterator;
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.util.CarbonProperties;
+import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
+import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+import org.apache.carbondata.processing.newflow.row.CarbonRow;
+import org.apache.carbondata.processing.newflow.row.CarbonRowBatch;
+import org.apache.carbondata.processing.newflow.row.CarbonSortBatch;
+import org.apache.carbondata.processing.newflow.sort.Sorter;
+import org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage;
+import org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeSortDataRows;
+import org.apache.carbondata.processing.newflow.sort.unsafe.merger.UnsafeIntermediateMerger;
+import org.apache.carbondata.processing.newflow.sort.unsafe.merger.UnsafeSingleThreadFinalSortFilesMerger;
+import org.apache.carbondata.processing.sortandgroupby.exception.CarbonSortKeyAndGroupByException;
+import org.apache.carbondata.processing.sortandgroupby.sortdata.SortParameters;
+import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException;
+
+/**
+ * It parallely reads data from array of iterates and do merge sort.
+ * It sorts data in batches and send to the next step.
+ */
+public class UnsafeBatchParallelReadMergeSorterImpl implements Sorter {
--- End diff --

Is this sorter still doing a merge? I thought it should do an in-memory sort only.

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742]Added batch sort to ...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r104385171

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/sort/impl/UnsafeBatchParallelReadMergeSorterImpl.java ---
@@ -0,0 +1,270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.processing.newflow.sort.impl;
+
+import java.util.Iterator;
+import java.util.List;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.Callable;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.LinkedBlockingQueue;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.concurrent.atomic.AtomicLong;
+
+import org.apache.carbondata.common.CarbonIterator;
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.util.CarbonProperties;
+import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
+import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+import org.apache.carbondata.processing.newflow.row.CarbonRow;
+import org.apache.carbondata.processing.newflow.row.CarbonRowBatch;
+import org.apache.carbondata.processing.newflow.row.CarbonSortBatch;
+import org.apache.carbondata.processing.newflow.sort.Sorter;
+import org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage;
+import org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeSortDataRows;
+import org.apache.carbondata.processing.newflow.sort.unsafe.merger.UnsafeIntermediateMerger;
+import org.apache.carbondata.processing.newflow.sort.unsafe.merger.UnsafeSingleThreadFinalSortFilesMerger;
+import org.apache.carbondata.processing.sortandgroupby.exception.CarbonSortKeyAndGroupByException;
+import org.apache.carbondata.processing.sortandgroupby.sortdata.SortParameters;
+import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException;
+
+/**
+ * It parallely reads data from array of iterates and do merge sort.
+ * It sorts data in batches and send to the next step.
+ */
+public class UnsafeBatchParallelReadMergeSorterImpl implements Sorter {
--- End diff --

Yes, we do sort in memory. It sorts the data chunk by chunk (default chunk size 64 MB) and keeps the sorted chunks in memory. Once the batch memory limit is reached, it merge-sorts the chunks and hands the result to the data writer. This approach is faster than sorting one big batch of records in a single pass.
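
The chunk-then-merge idea described in this reply can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: plain `int` arrays stand in for rows, a tiny chunk size stands in for the 64 MB default, and the class and method names (`ChunkedMergeSortSketch`, `sortChunks`, `mergeChunks`) are hypothetical. It does not use CarbonData's unsafe row pages or mergers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: sort fixed-size chunks in memory, then k-way merge them. */
final class ChunkedMergeSortSketch {

  /** Cursor into one sorted chunk, ordered by the value it currently points at. */
  private static final class Cursor implements Comparable<Cursor> {
    final int[] chunk;
    int pos;

    Cursor(int[] chunk) {
      this.chunk = chunk;
    }

    int current() {
      return chunk[pos];
    }

    @Override
    public int compareTo(Cursor other) {
      return Integer.compare(current(), other.current());
    }
  }

  /** Splits rows into chunks of at most chunkSize and sorts each chunk independently. */
  static List<int[]> sortChunks(int[] rows, int chunkSize) {
    List<int[]> chunks = new ArrayList<>();
    for (int start = 0; start < rows.length; start += chunkSize) {
      int end = Math.min(start + chunkSize, rows.length);
      int[] chunk = Arrays.copyOfRange(rows, start, end);
      Arrays.sort(chunk);
      chunks.add(chunk);
    }
    return chunks;
  }

  /** Merges already-sorted chunks into one sorted array using a min-heap of cursors. */
  static int[] mergeChunks(List<int[]> chunks) {
    PriorityQueue<Cursor> heap = new PriorityQueue<>();
    int total = 0;
    for (int[] chunk : chunks) {
      total += chunk.length;
      if (chunk.length > 0) {
        heap.add(new Cursor(chunk));
      }
    }
    int[] merged = new int[total];
    int out = 0;
    while (!heap.isEmpty()) {
      Cursor smallest = heap.poll();
      merged[out++] = smallest.current();
      smallest.pos++;
      if (smallest.pos < smallest.chunk.length) {
        heap.add(smallest); // re-insert so the cursor is keyed by its next value
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    int[] rows = {9, 3, 7, 1, 8, 2, 6, 4, 5};
    int[] sorted = mergeChunks(sortChunks(rows, 4));
    System.out.println(Arrays.toString(sorted));
  }
}
```

In this sketch each chunk is sorted on its own and the final pass only compares the heads of the chunks, which is the general shape of the approach the reply describes.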

[GitHub] incubator-carbondata issue #620: [CARBONDATA-742]Added batch sort to improve...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

Data size -> 100 million records
**DDL and Queries for test**
CREATE TABLE perftesta (c1 string,c2 string,c3 string,c4 string,c5 string,c6 bigint,c7 double,c8 int,c9 double,c10 double) STORED BY 'carbondata';

Q1 -> select count(*) from perftesta;
Q2 -> SELECT c3, c4, sum(c8) FROM perftesta WHERE c1 = 'P1_24521' GROUP BY c3, c4;
Q3 -> SELECT c2, c5, count(distinct c1), sum(c7) FROM perftesta WHERE c4="P4_4" and c5="P5_7" and c8>4 GROUP BY c2, c5;
Q4 -> SELECT c2, c5, count(distinct c1), sum(c7) FROM perftesta WHERE c4="P4_4" and c5="P5_7" GROUP BY c2, c5;
Q5 -> SELECT c4 FROM perftesta WHERE c1="P1_24521";
Q6 -> SELECT * FROM perftesta WHERE c2="P2_43";
Q7 -> SELECT sum(c7), sum(c8), avg(c9), max(c10) FROM perftesta;
Q8 -> SELECT sum(c7), sum(c8), sum(9), sum(c10) FROM perftesta WHERE c2="P2_75" and c6<5;
Q9 -> SELECT sum(c7), sum(c8), sum(9), sum(c10) FROM perftesta WHERE c2="P2_75";
Q10 -> SELECT count(c1),count(c2),count(c3),count(c4),count(c5),count(c6),count(c7),count(c8),count(c9),count(c10) FROM perftesta;

**With Batch Sort**
Load with in-memory size 1 GB (with unsafe sort), so the batch size is ~450 MB --> Time: 324 seconds
Total blocks created: 14 files of 105 MB each

Query(first reading, second reading)
Q1 (6.577, 3.404)
Q2 (3.414, 1.639)
Q3 (8.552, 7.572)
Q4 (5.033, 3.875)
Q5 (0.616, 0.456)
Q6 (7.978, 7.682)
Q7 (3.985, 2.909)
Q8 (8.93, 8.697)
Q9 (3.606, 3.305)
Q10 (8.51, 8.367)

**With complete sort (old flow)**
Load with in-memory size 1 GB (with unsafe sort) --> Time: 430 seconds
Total blocks created: 2 files, of 920 MB and 560 MB

Query(first reading, second reading)
Q1 (7.473, 2.254)
Q2 (2.635, 0.678)
Q3 (11.411, 9.322)
Q4 (4.422, 3.883)
Q5 (0.332, 0.22)
Q6 (8.580, 8.187)
Q7 (4.364, 3.617)
Q8 (12.033, 12.138)
Q9 (3.622, 3.695)
Q10 (8.39, 8.941)
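
For readers who want to compare the two runs, the arithmetic on the figures above can be sketched like this. The class and method names are hypothetical, and the only inputs are the load times and the second-reading query times copied from this comment (the comment does not state the unit of the query times).

```java
/** Hypothetical helper that compares the two benchmark runs reported above. */
final class BatchSortBenchmarkSketch {

  // Second-reading times for Q1..Q10, copied from the comment (unit not stated there).
  static final double[] BATCH_SORT =
      {3.404, 1.639, 7.572, 3.875, 0.456, 7.682, 2.909, 8.697, 3.305, 8.367};
  static final double[] COMPLETE_SORT =
      {2.254, 0.678, 9.322, 3.883, 0.22, 8.187, 3.617, 12.138, 3.695, 8.941};

  /** Relative reduction in load time, as a percentage of the old time. */
  static double loadTimeReductionPercent(double oldSeconds, double newSeconds) {
    return (oldSeconds - newSeconds) / oldSeconds * 100.0;
  }

  /** Counts the queries whose second reading was faster with batch sort. */
  static int queriesFasterWithBatchSort() {
    int count = 0;
    for (int i = 0; i < BATCH_SORT.length; i++) {
      if (BATCH_SORT[i] < COMPLETE_SORT[i]) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.printf("Load time reduction: %.1f%%%n", loadTimeReductionPercent(430, 324));
    System.out.println("Queries faster with batch sort (second reading): "
        + queriesFasterWithBatchSort() + " of " + BATCH_SORT.length);
  }
}
```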

[GitHub] incubator-carbondata issue #620: [CARBONDATA-742]Added batch sort to improve...

In reply to this post by qiuchenjian-2

Github user jackylk commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

With Batch Sort:
How many batches are processed within the 14 files? Basically, I want to know how many more B-trees are created compared to the complete sort approach.
It is a bit strange that Q3 and Q8 are faster with Batch Sort.

[GitHub] incubator-carbondata issue #620: [CARBONDATA-742]Added batch sort to improve...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

One file/B-tree is created per batch. Since I set the batch size to 450 MB, it runs the batch process for every 450 MB of collected data. If I set the batch size to 900 MB, it creates only 7 files. The number of files/B-trees goes down as you increase the batch size.

Yes, I have also noticed that Spark sometimes processes data better when the blocks are small. That might be the reason Q3 and Q8 are faster.
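
The relationship between batch size and file count described here amounts to a ceiling division. A minimal sketch follows, assuming one file per batch as stated above. The total of 6300 MB is an assumption inferred from 14 batches of 450 MB and is not a figure reported in the thread, and the names `BatchCountSketch` and `batchCount` are hypothetical.

```java
/** Hypothetical sketch of how batch size determines the number of files/B-trees. */
final class BatchCountSketch {

  /** Number of batches for a load, rounding the final partial batch up. */
  static long batchCount(long totalSortDataMb, long batchSizeMb) {
    if (batchSizeMb <= 0) {
      throw new IllegalArgumentException("batchSizeMb must be positive");
    }
    return (totalSortDataMb + batchSizeMb - 1) / batchSizeMb;
  }

  public static void main(String[] args) {
    long assumedTotalMb = 14 * 450; // assumption: inferred from 14 batches of 450 MB
    System.out.println("450 MB batches: " + batchCount(assumedTotalMb, 450));
    System.out.println("900 MB batches: " + batchCount(assumedTotalMb, 900));
  }
}
```

Under that assumption the sketch gives 14 batches at 450 MB and 7 at 900 MB, matching the numbers quoted in the reply.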

[GitHub] incubator-carbondata issue #620: [CARBONDATA-742]Added batch sort to improve...

In reply to this post by qiuchenjian-2

Github user chenliang613 commented on the issue:

https://github.com/apache/incubator-carbondata/pull/620

Please change the title to follow the format: [CARBONDATA-<issue number>] Title of the pull request (a blank space needs to be added after the bracket)

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105838407

--- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
@@ -1149,6 +1149,19 @@

public static final String USE_KETTLE_DEFAULT = "false";

+ /**
+ * Sorts the data in batches and writes the batch data to store with index file.
+ */
+ public static final String LOAD_USE_BATCH_SORT = "carbon.load.use.batch.sort";
+
+ public static final String LOAD_USE_BATCH_SORT_DEFAULT = "true";
+
+ /**
+ * Size of batch data to keep in memory, as a thumb rule it supposed
+ * to be less than 45% of sort.inmemory.size.inmb otherwise it may spill intermediate data to disk
+ */
+ public static final String LOAD_BATCH_SORT_SIZE_INMB = "carbon.load.batch.sort.size.inmb";
--- End diff --

I think it is better to move it up, near where `IN_MEMORY_FOR_SORT_DATA_IN_MB` is defined.
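
The rule of thumb in the Javadoc quoted above (batch size below 45% of `sort.inmemory.size.inmb`) can be checked with a few lines of code. The sketch below uses plain `java.util.Properties` rather than CarbonData's own configuration classes. The two property keys are taken from the diff, while the class name, method name, and example values are hypothetical. The values mirror the 1 GB in-memory size and ~450 MB batch size reported in the benchmark earlier in this thread.

```java
import java.util.Properties;

/** Hypothetical check of the "batch size below 45% of sort memory" rule of thumb. */
final class BatchSortSizeCheck {

  static final String BATCH_SIZE_KEY = "carbon.load.batch.sort.size.inmb";
  static final String SORT_MEMORY_KEY = "sort.inmemory.size.inmb";

  /** Returns true when the batch size is strictly below 45% of the sort memory. */
  static boolean withinRuleOfThumb(Properties props) {
    long batchMb = Long.parseLong(props.getProperty(BATCH_SIZE_KEY));
    long sortMemoryMb = Long.parseLong(props.getProperty(SORT_MEMORY_KEY));
    // Integer arithmetic avoids floating-point rounding: batch / memory < 45 / 100.
    return batchMb * 100 < sortMemoryMb * 45;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(SORT_MEMORY_KEY, "1024"); // example value: 1 GB of sort memory
    props.setProperty(BATCH_SIZE_KEY, "450");   // example value: ~450 MB batch
    System.out.println("Within rule of thumb: " + withinRuleOfThumb(props));
  }
}
```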

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105838661

--- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
@@ -1149,6 +1149,19 @@

public static final String USE_KETTLE_DEFAULT = "false";

+ /**
+ * Sorts the data in batches and writes the batch data to store with index file.
+ */
+ public static final String LOAD_USE_BATCH_SORT = "carbon.load.use.batch.sort";
+
+ public static final String LOAD_USE_BATCH_SORT_DEFAULT = "true";
--- End diff --

Better to make it `false` for now, and revisit it after checking the performance of loading and reading.
And please add a comment saying something like "if fast loading is favored, set it to true. If fast query performance is favored, set it to false".

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105839493

--- Diff: core/src/main/java/org/apache/carbondata/core/memory/CarbonUnsafe.java ---
@@ -34,6 +35,9 @@

public static final int DOUBLE_ARRAY_OFFSET;

+ public static final boolean LITTLEENDIAN =
--- End diff --

rename to `isLittleEndian`
