Posted by xm_zzc on Jan 16, 2018, 5:38pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Should-CarbonData-need-to-integrate-with-Spark-Streaming-too-tp35341.html
Hi dev:
Currently CarbonData 1.3 (to be released soon) only supports integration
with Spark Structured Streaming, which requires Kafka version >= 0.10. I
think there are still many users running Spark Streaming with Kafka 0.8 (at
least our cluster does), and the cost of upgrading Kafka is too high. So
should CarbonData integrate with Spark Streaming too?
I think there are two ways to integrate with Spark Streaming, as follows:
1). CarbonData batch data loading + auto compaction
Use CarbonSession.createDataFrame to convert the RDD into a DataFrame inside
InputDStream.foreachRDD, and then save the data into a CarbonData table with
auto compaction enabled. This way also supports creating pre-aggregate
tables on the main table (a streaming table does not support creating
pre-aggregate tables on it).
I can test this approach in our QA env and add an example to CarbonData; a
rough sketch follows.
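Something like the sketch below is what I have in mind, assuming CarbonData
1.3's CarbonSession API with auto compaction turned on. The store path,
table name, and record schema are just placeholders, a socket source stands
in for our Kafka 0.8 DStream, and the table 'main_table' is assumed to have
been created beforehand:

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.CarbonSession._
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical record schema for the incoming messages.
case class Record(id: Int, name: String)

object CarbonSparkStreamingExample {
  def main(args: Array[String]): Unit = {
    // Enable auto compaction so the small segments produced by each
    // mini-batch load get merged automatically.
    CarbonProperties.getInstance()
      .addProperty("carbon.enable.auto.load.merge", "true")

    // CarbonData 1.3 style session; the store path is a placeholder.
    val carbon = SparkSession.builder()
      .master("local[2]")
      .appName("CarbonSparkStreamingExample")
      .getOrCreateCarbonSession("/tmp/carbonstore")

    val ssc = new StreamingContext(carbon.sparkContext, Seconds(10))

    // A socket source stands in for a Kafka 0.8 DStream to keep the
    // sketch self-contained; messages are "id,name" lines.
    val lines = ssc.socketTextStream("localhost", 9099)

    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val records = rdd.map { line =>
          val fields = line.split(",")
          Record(fields(0).trim.toInt, fields(1).trim)
        }
        // Convert the RDD to a DataFrame via the CarbonSession and
        // append it to the pre-created main table as a batch load.
        carbon.createDataFrame(records)
          .write
          .format("carbondata")
          .option("tableName", "main_table")
          .mode(SaveMode.Append)
          .save()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}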
2). The same as the integration with Structured Streaming
In this way, Structured Streaming appends every mini-batch of data to a
stream segment, which is in row format; then, when the size of the stream
segment exceeds 'carbon.streaming.segment.max.size', it automatically
converts the stream segment to a batch segment (column format) at the
beginning of the next batch and creates a new stream segment to append data
to.
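For reference, this is roughly what the existing Structured Streaming path
looks like (following the streaming guide; the host, port, checkpoint path,
and table name are placeholders, and the CarbonSession `carbon` is the same
as in the sketch above). This handoff behavior is what a Spark Streaming
integration would need to reproduce:

import org.apache.spark.sql.streaming.Trigger

// Assumes a streaming table created beforehand, e.g.:
//   CREATE TABLE stream_table(id INT, name STRING)
//   STORED BY 'carbondata' TBLPROPERTIES('streaming'='true')
// Handoff from row-format stream segments to column-format batch
// segments is controlled by 'carbon.streaming.segment.max.size'.
val readSocketDF = carbon.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9099")
  .load()

// Append each mini-batch into the streaming table's stream segment.
val query = readSocketDF.writeStream
  .format("carbondata")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .option("checkpointLocation", "/tmp/stream_table_cp")
  .option("dbName", "default")
  .option("tableName", "stream_table")
  .start()

query.awaitTermination()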
However, I have no idea yet how to implement this for Spark Streaming.
*Any suggestions?*