
RE: carbondata and idempotence

Posted by Jihong Ma on Sep 27, 2016; 6:26pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/carbondata-and-idempotence-tp1416p1521.html

Hi Vincent,

In batch mode with the overwrite SaveMode, we can achieve exactly-once semantics, because any existing files are simply overwritten. Beyond that there is no guarantee, since DF/DS/RDD does not maintain any checkpoint or write-ahead log (WAL) to record where it left off before a crash.
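
To illustrate why overwrite is safe to rerun while append is not, here is a minimal sketch in plain Python. It is not CarbonData or Spark code; the helper names write_overwrite, write_append and read_rows are hypothetical and only model the behaviour with local files.

```python
import os
import tempfile


def write_overwrite(path, rows):
    """Replace the whole output atomically, so a rerun leaves the same content."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        for row in rows:
            f.write(row + "\n")
    # os.replace swaps the new file in; any previous output is discarded.
    os.replace(tmp_path, path)


def write_append(path, rows):
    """Append to existing output; a rerun adds the same rows a second time."""
    with open(path, "a") as f:
        for row in rows:
            f.write(row + "\n")


def read_rows(path):
    with open(path) as f:
        return f.read().splitlines()
```

Running write_overwrite twice with the same rows leaves one copy of the data, whereas running write_append twice leaves two.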

In streaming mode, we will consider going further to guarantee exactly-once semantics by checkpointing the offset/WAL and introducing a 'transactional' state that uniquely identifies the current batch of data, so that each batch is written out only once (and ignored if it already exists).
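
A rough sketch of that idea, again in plain Python and not an actual CarbonData design: the IdempotentSink class, the run_stream function and the dict used as a checkpoint are all hypothetical. The batch id is assumed to be derived from the offset range, so a replayed batch gets the same id and is skipped.

```python
class IdempotentSink:
    """Remembers committed batch ids and ignores a batch it has already seen."""

    def __init__(self):
        self.committed = {}  # batch_id -> rows written for that batch

    def write_batch(self, batch_id, rows):
        if batch_id in self.committed:
            return False  # already written once, ignore the replay
        self.committed[batch_id] = list(rows)
        return True

    def all_rows(self):
        return [row for bid in sorted(self.committed) for row in self.committed[bid]]


def run_stream(source, sink, checkpoint, batch_size, crash_at_offset=None):
    """Read from the checkpointed offset, write each batch, then advance the offset."""
    while checkpoint["offset"] < len(source):
        start = checkpoint["offset"]
        end = min(start + batch_size, len(source))
        batch_id = (start, end)  # deterministic id for this offset range
        sink.write_batch(batch_id, source[start:end])
        if crash_at_offset == start:
            # Simulated failure after the write but before the checkpoint update.
            raise RuntimeError("simulated crash")
        checkpoint["offset"] = end
```

If the job crashes after writing a batch but before advancing the checkpoint, the restart replays the same offset range, the sink sees a batch id it has already committed, and nothing is written twice.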

Jihong

-----Original Message-----
From: vincent [mailto:[hidden email]]
Sent: Tuesday, September 27, 2016 7:11 AM
To: [hidden email]
Subject: RE: carbondata and idempotence

Hi,
thanks for your answer. My question is about both streaming and batch. Even
in batch, if a worker crashes or if speculation is enabled, the task that
failed will be relaunched on another worker. For example, if the worker
crashed after ingesting 20 000 of the task's 100 000 lines, the new worker
will write all 100 000 lines, resulting in 20 000 duplicated entries in the
storage layer.
This issue is generally handled with a primary key or with transactions, so
that the new task overwrites the 20 000 lines, or the transaction for the
first 20 000 lines is rolled back.
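
The scenario above can be sketched in plain Python. This is a toy model, not CarbonData code: run_task and the two stores are hypothetical, with a list standing in for an append-only storage layer and a dict standing in for a store keyed by primary key.

```python
def run_task(rows, append_store, keyed_store, fail_after=None):
    """Write (key, value) rows to both stores, optionally crashing part-way."""
    for i, (key, value) in enumerate(rows):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated worker crash")
        append_store.append((key, value))  # append-only: a retry duplicates rows
        keyed_store[key] = value           # keyed: a retry overwrites the same key


rows = [(i, "line-%d" % i) for i in range(100000)]
append_store, keyed_store = [], {}

try:
    # First attempt crashes after ingesting 20 000 lines.
    run_task(rows, append_store, keyed_store, fail_after=20000)
except RuntimeError:
    # The task is relaunched and writes all 100 000 lines.
    run_task(rows, append_store, keyed_store)
```

After the retry the append-only store holds 120 000 entries (20 000 of them duplicates), while the keyed store holds exactly 100 000.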



--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/carbondata-and-idempotence-tp1416p1518.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.