GitHub user watermen opened a pull request:
https://github.com/apache/carbondata/pull/978 Cover the case when the last page is not consumed at the end

First, we use a producer-consumer model in the write step: we have n producers (the default is 2, and it can be configured) and one consumer. The task that generates the last page (fewer than 32000 rows) is added to the thread pool at the end, but it is not guaranteed to finish and be added to BlockletDataHolder last, because we have n tasks running concurrently.

Second, `writeDataToFile` is invoked in two ways: when the size of `DataWriterHolder` reaches the blocklet size, and when the page is the last page. So if the last page is not consumed at the end, we lose the pages that are consumed after it. This PR adds a flag named `isLastPageWrited` to make sure every page is written.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/watermen/incubator-carbondata CARBONDATA-1109

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/carbondata/pull/978.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #978

----
commit 6e34389cbb011734078d8c2431065d1f04fc891f
Author: Yadong Qi <[hidden email]>
Date: 2017-05-31T09:35:45Z

Cover the case when last page is not be consumed at the end.
----

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA. ---
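The race described above can be sketched with a minimal, hypothetical Java example (illustrative names only, not CarbonData code): with more than one producer thread, the task submitted last is not guaranteed to complete last, so the "last page" can reach the holder before an earlier, slower page.

```java
import java.util.concurrent.*;

// Hypothetical illustration, not CarbonData code: with 2 producer threads,
// a task submitted later can complete before an earlier, slower task,
// so the "last page" is not necessarily the last one handed to the consumer.
public class OutOfOrderDemo {
    static final StringBuilder order = new StringBuilder();

    public static void main(String[] args) throws Exception {
        ExecutorService producers = Executors.newFixedThreadPool(2);
        CountDownLatch firstStarted = new CountDownLatch(1);

        producers.submit(() -> {   // an earlier page that is slow to produce
            firstStarted.countDown();
            try { Thread.sleep(200); } catch (InterruptedException ignored) { }
            synchronized (order) { order.append("page0 "); }
        });
        firstStarted.await();      // ensure the slow task is already running
        producers.submit(() -> {   // the "last page", submitted later but fast
            synchronized (order) { order.append("lastPage "); }
        });

        producers.shutdown();
        producers.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(order.toString().trim()); // prints "lastPage page0"
    }
}
```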
Github user kumarvishal09 commented on the issue:
https://github.com/apache/carbondata/pull/978 @watermen Can you please add a testcase?
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 test it please
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 retest this please
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 retest this please
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 @kumarvishal09 It is very hard to write a testcase because this issue only happens occasionally.
Github user QiangCai commented on the issue:
https://github.com/apache/carbondata/pull/978 We need to make sure all pages of a blocklet are written to the carbon data file in the right order.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/978 @watermen This fix contradicts the producer/consumer design. The producer processes the data and puts it in the holder in sequence, and the consumer takes the data in the same sequence. For the last set of data there is already a boolean flag `writeAll` inside `NodeHolder` that is enabled to write all the remaining data. So this fix should not solve the issue; if it does, then there is an issue in the producer/consumer design. Please provide the steps to reproduce this issue in the JIRA.
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 @ravipesala You can reproduce this case by adding some logging and running the sample.csv loading testcase (TestDataLoadWithFileName). It's hard to reproduce, so I added a sleep in the Producer to simulate the time spent processing data.

![image](https://cloud.githubusercontent.com/assets/1400819/26669092/19e9535e-46df-11e7-9d02-c12b05bb4a90.png)
![image](https://cloud.githubusercontent.com/assets/1400819/26669073/0c56c64a-46df-11e7-8071-6ba2e7fa5785.png)

The log showing the lost page is below:
```
00:15:42 [Thread-65]###addDataToStore
00:15:42 [Thread-65]###addDataToStore
00:15:43 [Thread-73]###Put ---> isWriteAll:false index:1
00:15:44 [Thread-72]###Put ---> isWriteAll:false index:0
00:15:44 [Thread-71]###Get ---> isWriteAll:false index:0
00:15:44 [Thread-71]###Get ---> isWriteAll:false index:1
00:15:44 [Thread-65]###addDataToStore
00:15:44 [Thread-65]###addDataToStore
00:15:44 [Thread-65]###finish
00:15:46 [Thread-72]###Put ---> isWriteAll:false index:1
00:15:46 [Thread-72]###Put ---> isWriteAll:true index:0
00:15:46 [Thread-71]###Get ---> isWriteAll:true index:0 // The last page is not consumed at the end.
00:15:46 [Thread-71]###Get ---> isWriteAll:false index:1
00:15:47 [Thread-73]###Put ---> isWriteAll:false index:0
00:15:47 [Thread-71]###Get ---> isWriteAll:false index:0
```
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/978 @watermen Thank you for giving the reproduce steps; I can reproduce it. But the fix you have given is not right. This issue happens because of a missing `semaphore.acquire()` in the `finish` method, so the solution is simply to add `semaphore.acquire()` at the start of `finish()`. Please update the PR.
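A hedged sketch of the pattern behind that fix (class and method names here are invented for illustration, not CarbonData's actual classes): if the final submission made from `finish()` acquires the same semaphore that throttles the normal page submissions, the last page cannot be submitted while all producer slots are still occupied by in-flight pages.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch, not CarbonData code: every submission, including the
// final one made from finish(), goes through the same semaphore acquire as
// the normal write path, so at most PERMITS pages are in flight at once and
// the last page waits for a free producer slot like any other page.
public class LastPageSketch {
    static final int PERMITS = 2;  // mirrors the default of 2 producers
    static final Semaphore semaphore = new Semaphore(PERMITS);
    static final Queue<Integer> holder = new ConcurrentLinkedQueue<>();

    static void submitPage(ExecutorService pool, int pageIndex) throws InterruptedException {
        semaphore.acquire();               // at most PERMITS pages in flight
        pool.submit(() -> {
            holder.add(pageIndex);         // "produce" the page
            semaphore.release();
        });
    }

    static void finish(ExecutorService pool, int lastPage) throws InterruptedException {
        submitPage(pool, lastPage);        // the fix: same acquire as any other page
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(PERMITS);
        for (int i = 0; i < 5; i++) submitPage(pool, i);
        finish(pool, 5);
        System.out.println(holder.size()); // all 6 pages reached the holder
    }
}
```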
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 @ravipesala Thanks for your solution, PR updated.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/978 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/2140/
Github user manishgupta88 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/978#discussion_r119785140

Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java

```java
@@ -488,13 +488,19 @@ private NodeHolder processDataRows(List<CarbonRow> dataRows)
   public void finish() throws CarbonDataWriterException {
     // still some data is present in stores if entryCount is more
     // than 0
-    producerExecutorServiceTaskList.add(producerExecutorService
-        .submit(new Producer(blockletDataHolder, dataRows, ++writerTaskSequenceCounter, true)));
-    blockletProcessingCount.incrementAndGet();
-    processedDataCount += entryCount;
-    closeWriterExecutionService(producerExecutorService);
-    processWriteTaskSubmitList(producerExecutorServiceTaskList);
-    processingComplete = true;
+    try {
+      semaphore.acquire();
```

@watermen Here a check for entryCount > 0 is required, because we need to acquire the semaphore lock only if the total number of rows in the raw data is not exactly divisible by the page size; only in that case will there be extra rows left for the finish method to process, otherwise the addDataToStore method handles all the rows. Please refer to the code snippet below:

```java
public void finish() throws CarbonDataWriterException {
  // still some data is present in stores if entryCount is more
  // than 0
  if (this.entryCount > 0) {
    try {
      semaphore.acquire();
      producerExecutorServiceTaskList.add(producerExecutorService
          .submit(new Producer(blockletDataHolder, dataRows, ++writerTaskSequenceCounter, true)));
      blockletProcessingCount.incrementAndGet();
      processedDataCount += entryCount;
    } catch (InterruptedException e) {
      LOGGER.error(e, e.getMessage());
      throw new CarbonDataWriterException(e.getMessage(), e);
    }
  }
  closeWriterExecutionService(producerExecutorService);
  processWriteTaskSubmitList(producerExecutorServiceTaskList);
  processingComplete = true;
}
```
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/978 No check is required; the entry count check is handled inside.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/978 retest this please
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/978 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/2146/
Github user manishgupta88 commented on the issue:
https://github.com/apache/carbondata/pull/978 @ravipesala Please correct me if I am wrong: do we need to acquire a semaphore lock if we have no records to process? I think if the rows are exactly divisible by the page size we will not have any extra rows to process, and hence we do not need to acquire the semaphore lock and create a new Producer object.
Github user watermen commented on the issue:
https://github.com/apache/carbondata/pull/978 retest this please
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/978 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/2147/
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/978 @manishgupta88 That was the case in the V1 and V2 formats, but in the V3 format we should launch a producer at finish to flush the older pages it was holding.