Hi CarbonData community,

I am evaluating various file formats right now and found CarbonData interesting, especially the multiple indexes used to avoid full scans. Is there any way to achieve idempotence when writing to CarbonData from Spark (or an alternative)? A strong requirement is that a Spark worker crash must not result in duplicated entries being written to Carbon...

Tx
Vincent
Hi Vincent
Happy to hear you are interested in Apache CarbonData. To write CarbonData files from Spark, please refer to the example DataFrameAPIExample. Can you explain more about this requirement: "A strong requirement is that a Spark worker crash must not result in duplicated entries being written to Carbon"?

Regards
Liang
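A minimal sketch of the DataFrame-based write and read, loosely following the DataFrameAPIExample pattern. The "carbondata" datasource name, the "tableName" option, and the SparkSession entry point are assumptions about the integration of that era; check the example shipped with your CarbonData release for the exact API:

import org.apache.spark.sql.{SaveMode, SparkSession}

object DataFrameWriteReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameWriteReadSketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data standing in for a real source.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

    // Hypothetical write: the datasource name and option are assumptions.
    df.write
      .format("carbondata")
      .option("tableName", "demo_table")
      .mode(SaveMode.Overwrite)
      .save()

    // Read the table back through the same datasource.
    spark.read
      .format("carbondata")
      .option("tableName", "demo_table")
      .load()
      .show()

    spark.stop()
  }
}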
Hi Vincent,
Are you referring to writing Spark Streaming data out to Carbon files? We don't support that yet, but it is in our near-term plan to add the integration. We will start the discussion on the dev list soon and would love to hear your input. We will take into account both the old DStream interface and Spark 2.0 Structured Streaming; we would like to ensure exactly-once semantics and design Carbon as an idempotent sink.

At the moment we are fully integrated with Spark SQL, with both a SQL and an API interface. With the help of multi-level indexes, we have seen a dramatic performance boost compared to other columnar file formats in the Hadoop ecosystem. You are welcome to try it out for your batch processing workload; streaming ingest will come out a little later.

Regards,
Jenny
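To illustrate what an idempotent sink means in this context (a generic sketch of the concept, not CarbonData's actual design, which was still under discussion at the time): the sink remembers which batch IDs it has already committed, so a batch replayed after a crash is simply skipped. A minimal self-contained example, with an in-memory commit log standing in for durable state:

import scala.collection.mutable

// Toy idempotent sink: writes for a batchId that has already been
// committed are ignored, so replays after a crash do not duplicate data.
class IdempotentSink[T] {
  private val committed = mutable.Set[Long]()   // stand-in for a durable commit log
  private val storage   = mutable.Buffer[T]()   // stand-in for the real file/table

  def write(batchId: Long, rows: Seq[T]): Unit = synchronized {
    if (committed.contains(batchId)) {
      // Batch was already written before the crash/retry: do nothing.
      return
    }
    storage ++= rows       // write the data
    committed += batchId   // then record the commit
  }

  def contents: Seq[T] = storage.toSeq
}

object IdempotentSinkDemo extends App {
  val sink = new IdempotentSink[String]()
  sink.write(0L, Seq("a", "b"))
  sink.write(0L, Seq("a", "b"))   // replay of the same batch: no duplicates
  sink.write(1L, Seq("c"))
  println(sink.contents)          // List(a, b, c)
}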
Hi
I second Jenny here. It's not yet supported, but it would definitely be a good feature.

Regards
JB
Hi
Thanks for your answer. My question is about both streaming and batch. Even in batch, if a worker crashes or if speculation is activated, the failed worker's task will be relaunched on another worker. For example, if the worker crashed after ingesting 20,000 of the task's 100,000 lines, the new worker will write the entire 100,000 lines again, resulting in 20,000 duplicated entries in the storage layer. This issue is generally managed with a primary key or with transactions: either the new task overwrites the first 20,000 lines, or the transaction covering those first 20,000 lines is rolled back.
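To make the primary-key approach concrete, here is a generic Spark sketch (not a CarbonData feature at the time): if each row carries a unique key, the relaunched task's output can be filtered against what is already stored with an anti-join, so only the rows that were not persisted before the crash get appended:

import org.apache.spark.sql.{DataFrame, SparkSession}

object DedupByKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DedupByKeySketch").getOrCreate()
    import spark.implicits._

    // "existing" stands for rows already persisted by the task that crashed
    // mid-way; "retried" is the full output of the relaunched task.
    val existing = Seq((1, "a"), (2, "b")).toDF("id", "value")
    val retried  = Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")).toDF("id", "value")

    // Keep only rows whose primary key is not present yet, then append those.
    val newRows: DataFrame = retried.join(existing, Seq("id"), "left_anti")

    newRows.show()  // only ids 3 and 4 remain, so no duplicates are appended

    spark.stop()
  }
}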
Hi Vincent,
In batch mode, with the overwrite SaveMode, we can achieve exactly-once, as we simply overwrite any existing files. Beyond that there is no guarantee, since a DataFrame/Dataset/RDD does not maintain checkpoints or a WAL to know where it left off before a crash. In streaming mode, we will consider going further to guarantee exactly-once semantics with the help of checkpointing the offsets/WAL, introducing a 'transactional' state that uniquely identifies the current batch of data so it is written out only once (and ignored if it already exists).

Jihong
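A minimal sketch of the overwrite-based batch pattern described above: because a retried job rewrites the whole output instead of appending to whatever a failed attempt left behind, the end state is the same whether the job ran once or was retried. As before, the "carbondata" format name and "tableName" option are assumptions; adapt them to the write API of your CarbonData version:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object OverwriteRetrySketch {
  // Overwrite mode is idempotent at the job level: a retried run replaces
  // whatever a failed run left behind instead of appending to it.
  // SaveMode.Append, by contrast, would duplicate rows on retry.
  def writeBatch(df: DataFrame): Unit = {
    df.write
      .format("carbondata")          // assumed datasource name
      .option("tableName", "events") // assumed option name
      .mode(SaveMode.Overwrite)
      .save()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("OverwriteRetrySketch").getOrCreate()
    import spark.implicits._

    val batch = Seq((1, "a"), (2, "b")).toDF("id", "value")
    writeBatch(batch) // first attempt
    writeBatch(batch) // a retry of the same job leaves the table unchanged
    spark.stop()
  }
}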