Apache CarbonData Dev Mailing List archive - Re: Should CarbonData need to integrate with Spark Streaming too?

Re: Should CarbonData need to integrate with Spark Streaming too?

Posted by Liang Chen on Jan 17, 2018; 4:14am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Should-CarbonData-need-to-integrate-with-Spark-Streaming-too-tp35341p35415.html

Hi

Thanks for starting this discussion about adding Spark Streaming support.
1. Please try to reuse the current code (Structured Streaming) rather than
adding separate logic for Spark Streaming.
2. I suggest using Structured Streaming by default; please consider how to
provide a configuration option for enabling/switching to Spark Streaming.

Regards
Liang

xm_zzc wrote

> Hi dev:
> Currently CarbonData 1.3 (to be released soon) only supports integration
> with Spark Structured Streaming, which requires Kafka version >= 0.10.
> I think there are still many users integrating Spark Streaming with
> Kafka 0.8; at least our cluster does, and the cost of upgrading Kafka is
> too high. So should CarbonData integrate with Spark Streaming too?
>
> I think there are two ways to integrate with Spark Streaming, as
> follows:
> 1). CarbonData batch data loading + auto compaction
> Use CarbonSession.createDataFrame to convert the RDD to a DataFrame in
> InputDStream.foreachRDD, and then save the data into a CarbonData table
> that supports auto compaction. This way, pre-aggregate tables can also be
> created on this main table (a streaming table does not support creating
> pre-aggregate tables on it).
>
> I can test this approach in our QA env and add an example to CarbonData.
>
> 2). The same as the integration with Structured Streaming
> In this approach, Structured Streaming appends each mini-batch of data to a
> stream segment, which is in row format. When the size of the stream
> segment exceeds 'carbon.streaming.segment.max.size', it automatically
> converts the stream segment to a batch segment (columnar format) at the
> beginning of each batch and creates a new stream segment to append data to.
> However, I have no idea how to integrate this with Spark Streaming yet.
> *Any suggestions on this*?
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
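
For illustration only, the segment rollover described in option 2 of the quoted message could be modelled roughly as below. This is a toy sketch in plain Python, not CarbonData code: the names ToyTable, StreamSegment and to_columns are hypothetical, row count stands in for the byte size that 'carbon.streaming.segment.max.size' refers to, and all rows are assumed to share the same columns.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Row = Dict[str, object]


def to_columns(rows: List[Row]) -> Dict[str, list]:
    """Pivot row dicts into column lists (assumes every row has the same keys)."""
    columns: Dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


@dataclass
class StreamSegment:
    """Hypothetical row-format segment that mini-batches are appended to."""
    rows: List[Row] = field(default_factory=list)

    def size(self) -> int:
        # Assumption: row count is used here in place of an on-disk byte size.
        return len(self.rows)


@dataclass
class ToyTable:
    # Plays the role of 'carbon.streaming.segment.max.size' in this toy model.
    max_stream_segment_size: int
    stream: StreamSegment = field(default_factory=StreamSegment)
    batch_segments: List[Dict[str, list]] = field(default_factory=list)

    def append_mini_batch(self, rows: List[Row]) -> None:
        # The check runs at the beginning of each batch, so a stream segment
        # may exceed the limit by up to one mini-batch before it is converted.
        if self.stream.size() > self.max_stream_segment_size:
            self.batch_segments.append(to_columns(self.stream.rows))
            self.stream = StreamSegment()
        self.stream.rows.extend(rows)


# Usage: five mini-batches of two rows each, with a limit of three rows.
table = ToyTable(max_stream_segment_size=3)
for start in range(0, 10, 2):
    table.append_mini_batch([{"id": i, "value": i * 10} for i in (start, start + 1)])

print(len(table.batch_segments), table.stream.size())
```

Because the check happens before the append, each converted segment in this run holds four rows even though the limit is three; whether the real implementation behaves the same way at the boundary is not stated in the thread.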
