Apache CarbonData Dev Mailing List archive

RE: carbondata and idempotence

Posted by vincent on Sep 27, 2016; 2:11pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/carbondata-and-idempotence-tp1416p1518.html

Hi,
Thanks for your answer. My question is about both streaming and batch. Even in batch, if a worker crashes or if speculative execution is enabled, the failed worker's task will be relaunched on another worker. For example, suppose the worker crashed after ingesting 20 000 of the task's 100 000 lines. The new worker will then write all 100 000 lines, resulting in 20 000 duplicated entries in the storage layer.
This issue is generally handled with a primary key or with transactions: either the new task overwrites the 20 000 lines already written, or the transaction containing the first 20 000 lines is rolled back.
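
To make the scenario concrete, here is a minimal sketch in plain Python. It does not use any CarbonData or Spark API: the stores are in-memory stand-ins, and every name in it (run_task, append_store, keyed_store, run_transactional) is hypothetical. It only illustrates how a retried task duplicates rows in an append-only store, and how a primary key or a transaction would avoid that.

```python
# Illustrative only: in-memory stand-ins, not CarbonData or Spark APIs.

def run_task(rows, write, crash_after=None):
    """Write rows one by one; optionally simulate a crash after N rows."""
    for i, (key, value) in enumerate(rows):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("worker crashed")
        write(key, value)

rows = [(i, f"line-{i}") for i in range(100_000)]

# 1) Append-only store: every write adds a new entry.
append_store = []
def append_write(key, value):
    append_store.append((key, value))

# 2) Keyed store: a write with an existing primary key overwrites it.
keyed_store = {}
def upsert_write(key, value):
    keyed_store[key] = value

for write in (append_write, upsert_write):
    try:
        # First attempt crashes after 20 000 rows have been written.
        run_task(rows, write, crash_after=20_000)
    except RuntimeError:
        pass
    # The task is relaunched on another worker and writes all rows.
    run_task(rows, write)

duplicates = len(append_store) - len(set(append_store))

# 3) Transactional store: rows are staged and only committed on success.
committed = []
def run_transactional(rows, crash_after=None):
    staged = []
    run_task(rows, lambda k, v: staged.append((k, v)), crash_after)
    committed.extend(staged)  # reached only if the task finished

try:
    run_transactional(rows, crash_after=20_000)  # staged rows are discarded
except RuntimeError:
    pass
run_transactional(rows)  # the retry commits the full task

print(len(append_store), duplicates, len(keyed_store), len(committed))
```

In this sketch the append-only store ends up with 120 000 entries, 20 000 of them duplicates, while the keyed store and the transactional store each hold exactly 100 000.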