http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/carbondata-and-idempotence-tp1416p1420.html
I second Jenny here. It's not yet supported but definitely a good feature.
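The "idempotent sink" design mentioned in the quoted reply below can be sketched in a few lines. This is a toy illustration of the general idea, not CarbonData code; the class and method names are invented. The key point is that every write is keyed by a batch id, so replaying a batch after a worker crash replaces the same data instead of appending duplicates:

```python
# Minimal sketch of an idempotent sink (hypothetical, not CarbonData's API).
# Each micro-batch writes one "segment" keyed by its batch id; a replayed
# batch overwrites its own segment, so re-execution never duplicates rows.

class IdempotentSink:
    def __init__(self):
        # One segment of rows per batch id.
        self._segments = {}

    def write_batch(self, batch_id, rows):
        # Overwriting by batch_id makes the write safe to replay:
        # running the same batch twice leaves exactly one copy of the rows.
        self._segments[batch_id] = list(rows)

    def all_rows(self):
        # Return rows in batch order.
        return [row for _, seg in sorted(self._segments.items()) for row in seg]


sink = IdempotentSink()
sink.write_batch(0, ["a", "b"])
sink.write_batch(1, ["c"])
sink.write_batch(1, ["c"])  # simulated replay after a crash
assert sink.all_rows() == ["a", "b", "c"]  # no duplicates
```

With at-least-once delivery from the streaming engine, an idempotent sink like this is what upgrades the end-to-end guarantee to effectively exactly-once.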
>Hi Vincent,
>
>Are you referring to writing out Spark streaming data to Carbon files?
>We don't support that yet, but adding the integration is in our
>near-term plan. We will start the discussion on the dev list soon and
>would love to hear your input. We will take into account the old
>DStream interface as well as Spark 2.0 Structured Streaming, and we
>would like to ensure exactly-once semantics by designing Carbon as an
>idempotent sink.
>
>At the moment, we are fully integrated with Spark SQL through both SQL
>and API interfaces. With the help of multi-level indexes, we have seen
>a dramatic performance boost compared to other columnar file formats
>in the Hadoop ecosystem. You are welcome to try it out for your batch
>processing workloads; the streaming ingest will come out a little later.
>
>
>Regards.
>
>Jenny
>
>-----Original Message-----
>From: vincent gromakowski [mailto:[hidden email]]
>Sent: Friday, September 23, 2016 7:33 AM
>To: [hidden email]
>Subject: carbondata and idempotence
>
>Hi CarbonData community,
>I am evaluating various file formats right now and found CarbonData
>interesting, especially the multiple indexes used to avoid full
>scans. Is there any way to achieve idempotence when writing to
>CarbonData from Spark (or an alternative)?
>A strong requirement is that a Spark worker crash must not result in
>duplicated entries being written to Carbon...
>Tx
>
>Vincent