GitHub user watermen opened a pull request:
https://github.com/apache/carbondata/pull/978 Cover the case when the last page is not consumed at the end

First, we use a producer-consumer model in the write step: we have n producers (the default is 2, and it can be configured) and one consumer. The task that generates the last page (fewer than 32000 rows) is added to the thread pool at the end, but it is not guaranteed to finish and be added to BlockletDataHolder last, because we have n tasks running concurrently.

Second, `writeDataToFile` is invoked in two ways: when the size of `DataWriterHolder` reaches the blocklet size, and when the page is the last page. So if the last page is not consumed at the end, we lose the pages that are consumed after it. This PR adds a flag named `isLastPageWrited` to make sure every page is written.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/watermen/incubator-carbondata CARBONDATA-1109

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/carbondata/pull/978.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #978

----
commit 6e34389cbb011734078d8c2431065d1f04fc891f
Author: Yadong Qi <[hidden email]>
Date: 2017-05-31T09:35:45Z

Cover the case when last page is not be consumed at the end.
----

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. ---
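The race described above can be sketched with a minimal, hypothetical Java example (illustrative names only, not CarbonData code): with more than one producer thread, the task submitted last is not guaranteed to complete last, so the "last page" can reach the holder before an earlier, slower page.

```java
import java.util.concurrent.*;

// Hypothetical illustration, not CarbonData code: with 2 producer threads,
// a task submitted later can complete before an earlier, slower task,
// so the "last page" is not necessarily the last one handed to the consumer.
public class OutOfOrderDemo {
    static final StringBuilder order = new StringBuilder();

    public static void main(String[] args) throws Exception {
        ExecutorService producers = Executors.newFixedThreadPool(2);
        CountDownLatch firstStarted = new CountDownLatch(1);

        producers.submit(() -> {   // an earlier page that is slow to produce
            firstStarted.countDown();
            try { Thread.sleep(200); } catch (InterruptedException ignored) { }
            synchronized (order) { order.append("page0 "); }
        });
        firstStarted.await();      // ensure the slow task is already running
        producers.submit(() -> {   // the "last page", submitted later but fast
            synchronized (order) { order.append("lastPage "); }
        });

        producers.shutdown();
        producers.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(order.toString().trim()); // prints "lastPage page0"
    }
}
```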
Github user kumarvishal09 commented on the issue:
https://github.com/apache/carbondata/pull/978 @watermen Can you please add a testcase?
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 test it please
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 retest this please
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 retest this please
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 @kumarvishal09 It is very hard to write a testcase because this issue only happens occasionally.
Github user QiangCai commented on the issue:
https://github.com/apache/carbondata/pull/978 We need to make sure all pages of a blocklet are written to the carbon data file in the right order.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/978 @watermen This fix contradicts the producer/consumer design. The producer processes the data and puts it in the holder in sequence, and the consumer takes the data in the same sequence. For the last set of data there is already a boolean flag `writeAll` inside `NodeHolder` that is enabled to write all the remaining data. So this fix should not solve the issue; if it does, then there is an issue in the producer/consumer design. Please provide the steps to reproduce this issue in the JIRA.
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 @ravipesala You can reproduce this case by adding some logging and running the sample.csv loading testcase (TestDataLoadWithFileName). It's hard to reproduce, so I added a sleep in the Producer to simulate the time spent processing data.

![image](https://cloud.githubusercontent.com/assets/1400819/26669092/19e9535e-46df-11e7-9d02-c12b05bb4a90.png)
![image](https://cloud.githubusercontent.com/assets/1400819/26669073/0c56c64a-46df-11e7-8071-6ba2e7fa5785.png)

The log showing the lost page is below:
```
00:15:42 [Thread-65]###addDataToStore
00:15:42 [Thread-65]###addDataToStore
00:15:43 [Thread-73]###Put ---> isWriteAll:false index:1
00:15:44 [Thread-72]###Put ---> isWriteAll:false index:0
00:15:44 [Thread-71]###Get ---> isWriteAll:false index:0
00:15:44 [Thread-71]###Get ---> isWriteAll:false index:1
00:15:44 [Thread-65]###addDataToStore
00:15:44 [Thread-65]###addDataToStore
00:15:44 [Thread-65]###finish
00:15:46 [Thread-72]###Put ---> isWriteAll:false index:1
00:15:46 [Thread-72]###Put ---> isWriteAll:true index:0
00:15:46 [Thread-71]###Get ---> isWriteAll:true index:0 // The last page is not consumed at the end.
00:15:46 [Thread-71]###Get ---> isWriteAll:false index:1
00:15:47 [Thread-73]###Put ---> isWriteAll:false index:0
00:15:47 [Thread-71]###Get ---> isWriteAll:false index:0
```
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/978 @watermen Thank you for giving the reproduce steps; I can reproduce it. But the fix you have given is not right. This issue happens because of a missing `semaphore.acquire()` in the `finish` method, so the solution is simply to add `semaphore.acquire()` at the start of `finish()`. Please update the PR.
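A hedged sketch of the pattern behind that fix (class and method names here are invented for illustration, not CarbonData's actual classes): if the final submission made from `finish()` acquires the same semaphore that throttles the normal page submissions, the last page cannot be submitted while all producer slots are still occupied by in-flight pages.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch, not CarbonData code: every submission, including the
// final one made from finish(), goes through the same semaphore acquire as
// the normal write path, so at most PERMITS pages are in flight at once and
// the last page waits for a free producer slot like any other page.
public class LastPageSketch {
    static final int PERMITS = 2;  // mirrors the default of 2 producers
    static final Semaphore semaphore = new Semaphore(PERMITS);
    static final Queue<Integer> holder = new ConcurrentLinkedQueue<>();

    static void submitPage(ExecutorService pool, int pageIndex) throws InterruptedException {
        semaphore.acquire();               // at most PERMITS pages in flight
        pool.submit(() -> {
            holder.add(pageIndex);         // "produce" the page
            semaphore.release();
        });
    }

    static void finish(ExecutorService pool, int lastPage) throws InterruptedException {
        submitPage(pool, lastPage);        // the fix: same acquire as any other page
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(PERMITS);
        for (int i = 0; i < 5; i++) submitPage(pool, i);
        finish(pool, 5);
        System.out.println(holder.size()); // all 6 pages reached the holder
    }
}
```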
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 @ravipesala Thanks for your solution, PR updated.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/978 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/2140/
Github user manishgupta88 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/978#discussion_r119785140

Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java

```java
@@ -488,13 +488,19 @@ private NodeHolder processDataRows(List<CarbonRow> dataRows)
   public void finish() throws CarbonDataWriterException {
     // still some data is present in stores if entryCount is more
     // than 0
-    producerExecutorServiceTaskList.add(producerExecutorService
-        .submit(new Producer(blockletDataHolder, dataRows, ++writerTaskSequenceCounter, true)));
-    blockletProcessingCount.incrementAndGet();
-    processedDataCount += entryCount;
-    closeWriterExecutionService(producerExecutorService);
-    processWriteTaskSubmitList(producerExecutorServiceTaskList);
-    processingComplete = true;
+    try {
+      semaphore.acquire();
```

@watermen Here a check for entryCount > 0 is required, because we need to acquire the semaphore lock only if the total number of rows in the raw data is not exactly divisible by the page size; only in that case will there be extra rows left for the finish method to process, otherwise the addDataToStore method handles all the rows. Please refer to the code snippet below:

```java
public void finish() throws CarbonDataWriterException {
  // still some data is present in stores if entryCount is more
  // than 0
  if (this.entryCount > 0) {
    try {
      semaphore.acquire();
      producerExecutorServiceTaskList.add(producerExecutorService
          .submit(new Producer(blockletDataHolder, dataRows, ++writerTaskSequenceCounter, true)));
      blockletProcessingCount.incrementAndGet();
      processedDataCount += entryCount;
    } catch (InterruptedException e) {
      LOGGER.error(e, e.getMessage());
      throw new CarbonDataWriterException(e.getMessage(), e);
    }
  }
  closeWriterExecutionService(producerExecutorService);
  processWriteTaskSubmitList(producerExecutorServiceTaskList);
  processingComplete = true;
}
```
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/978 No check is required; the entry count check is handled inside.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/978 retest this please
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/978 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/2146/
Github user manishgupta88 commented on the issue:
https://github.com/apache/carbondata/pull/978 @ravipesala Please correct me if I am wrong: do we need to acquire a semaphore lock if we have no records to process? I think if the rows are exactly divisible by the page size we will not have any extra rows to process, and hence we do not need to acquire the semaphore lock and create a new Producer object.
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 retest this please
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/978 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/2147/
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/978 @manishgupta88 That was the case in the V1 and V2 formats, but in the V3 format we should launch a producer at finish to flush the older pages it was holding.