RE: carbondata and idempotence
Posted by Jihong Ma on Sep 23, 2016; 9:02pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/carbondata-and-idempotence-tp1416p1419.html
Hi Vincent,
Are you referring to writing out Spark streaming data to Carbon files? We don't support that yet, but adding the integration is in our near-term plan. We will start a discussion on the dev list soon and would love to hear your input. We will take into account both the old DStream interface and Spark 2.0 Structured Streaming, and we want to guarantee exactly-once semantics by designing Carbon as an idempotent sink.
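To make the idempotent-sink idea concrete, here is a rough sketch of the batch-id based deduplication that Structured Streaming's Sink interface allows. The class name, the in-memory commit log, and the "carbondata" data source name are placeholders rather than the actual design, since the integration does not exist yet:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

import scala.collection.mutable

// Rough sketch only: class name, commit log and "carbondata" format are placeholders.
class IdempotentCarbonSink(storePath: String) extends Sink {

  // Batch ids that have already been written. A real sink would persist this
  // commit log (e.g. alongside the Carbon store) so it survives a driver restart.
  private val committed = mutable.Set[Long]()

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (committed.contains(batchId)) {
      // Structured Streaming replays the last batch after a failure; skipping
      // an already-committed batchId is what prevents duplicate rows.
    } else {
      data.write.format("carbondata").mode("append").save(storePath)
      committed += batchId
    }
  }
}

The tricky part is persisting that commit log atomically with the data so it survives a driver restart, which is exactly the kind of detail we want to settle in the design discussion.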
At the moment we are fully integrated with Spark SQL, with both SQL and API interfaces. With the help of multi-level indexes, we have seen dramatic performance boosts compared to other columnar file formats in the Hadoop ecosystem. You are welcome to try it out for your batch-processing workloads; streaming ingest will come out a little later.
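For the batch path, a minimal example with the Spark 1.x CarbonContext looks roughly like this; the store location, table name, and CSV path are placeholders, and it assumes an existing SparkContext named sc (e.g. from the spark-shell):

import org.apache.spark.sql.CarbonContext

// Store location, table name and CSV path below are placeholders.
val cc = new CarbonContext(sc, "hdfs://namenode/carbon/store")

cc.sql("CREATE TABLE IF NOT EXISTS sales (id STRING, city STRING, amount INT) " +
  "STORED BY 'carbondata'")
cc.sql("LOAD DATA INPATH 'hdfs://namenode/input/sales.csv' INTO TABLE sales")

// Filters on indexed columns let Carbon prune blocks instead of scanning the whole table.
cc.sql("SELECT city, SUM(amount) FROM sales WHERE city = 'Bangalore' GROUP BY city").show()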
Regards,
Jenny
-----Original Message-----
From: vincent gromakowski [mailto:[hidden email]]
Sent: Friday, September 23, 2016 7:33 AM
To: [hidden email]
Subject: carbondata and idempotence
Hi Carbondata community,
I am evaluating various file formats right now and found CarbonData to be
interesting, especially the multiple indexes used to avoid full scans, but I
am wondering whether there is any way to achieve idempotence when writing to
CarbonData from Spark (or an alternative)?
A strong requirement is that a Spark worker crash must not produce duplicated
entries in Carbon...
Tx
Vincent