
Re: Should CarbonData need to integrate with Spark Streaming too?

Posted by xm_zzc on Jan 17, 2018; 7:26am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Should-CarbonData-need-to-integrate-with-Spark-Streaming-too-tp35341p35458.html

Hi Jacky:

>>  1). CarbonData batch data loading + Auto compaction
>>  Use CarbonSession.createDataFrame to convert rdd to DataFrame in
>> InputDStream.foreachRDD, and then save rdd data into CarbonData table
>> which
>> support auto compaction. In this way, it can support to create
>> pre-aggregate
>> tables on this main table too (Streaming table does not support to create
>> pre-aggregate tables on it).
>>
>>  I can test with this way in our QA env and add example to CarbonData.
>
>This approach is doable, but the loading interval should be relatively
>longer, since it still uses columnar files in this approach. I am not sure
>how frequently you do one batch load?

Agreed. The loading interval should be relatively longer, maybe 15s, 30s, or
even 1 min, but it also depends on the data size of each mini-batch.
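A minimal sketch of this first approach, assuming a CarbonSession named `carbon`, an input DStream of (name, value) pairs, and a CarbonData table "main_table" created with auto compaction enabled (all names here are illustrative, not the actual deployment). It would need a running Spark Streaming context to execute:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical wiring: `carbon` is a CarbonSession (a SparkSession subclass),
// `stream` is the DStream produced by the streaming source.
def saveToCarbon(carbon: SparkSession, stream: DStream[(String, Int)]): Unit = {
  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // Convert each mini-batch RDD to a DataFrame and append it as one
      // batch load; auto compaction then merges small segments in the
      // background, and pre-aggregate tables on "main_table" stay usable.
      val df = carbon.createDataFrame(rdd).toDF("name", "value")
      df.write
        .format("carbondata")
        .option("tableName", "main_table")
        .mode(SaveMode.Append)
        .save()
    }
  }
}
```

With a 15s-1min batch interval as discussed above, each foreachRDD call produces one columnar segment, which is why shorter intervals would create too many small files before compaction catches up.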

>>  2). The same as integration with Structured Streaming
>>  With this way, Structured Streaming append every mini-batch data into
>> stream segment which is row format, and then when the size of stream
>> segment
>> is greater than 'carbon.streaming.segment.max.size', it will auto convert
>> stream segment to batch segment(column format) at the begin of each batch
>> and create a new stream segment to append data.
>>  However, I have no idea how to integrate with Spark Streaming yet, *any
>> suggestion for this*?
>>
>
>You can refer to the logic in CarbonAppendableStreamSink.addBatch;
>basically it launches a job to do appending to row-format files in the
>streaming segment by invoking CarbonAppendableStreamSink.writeDataFileJob.
>At the beginning, you can invoke checkOrHandOffSegment to create the
>streaming segment.
>I think integrating with Spark Streaming is a good feature to have; it
>enables more users to use the carbon streaming ingest feature on existing
>cluster setups with old Spark and Kafka versions.
>Please feel free to create a JIRA ticket and discuss in the community.

OK, I have read the code of the streaming module and discussed with David
offline; I will implement this feature ASAP.
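The stream-segment handoff described above (append row-format mini-batches until the segment exceeds 'carbon.streaming.segment.max.size', then convert it to a columnar batch segment at the start of the next batch) can be sketched as plain logic. This is a simulation of the decision flow only; the class and method names are illustrative, not the actual CarbonData API:

```scala
// Models a row-format stream segment identified by id and its size in bytes.
case class StreamSegment(id: Int, bytes: Long)

// `maxSegmentSize` stands in for 'carbon.streaming.segment.max.size'.
class HandoffSimulator(maxSegmentSize: Long) {
  private var current = StreamSegment(0, 0L)
  // Segments that have been "handed off", i.e. converted to columnar format.
  var handedOff: List[StreamSegment] = Nil

  // Mirrors the checkOrHandOffSegment idea: at the beginning of each batch,
  // if the current stream segment has grown past the threshold, hand it off
  // and open a fresh stream segment for subsequent appends.
  def checkOrHandOff(): Unit = {
    if (current.bytes >= maxSegmentSize) {
      handedOff = handedOff :+ current
      current = StreamSegment(current.id + 1, 0L)
    }
  }

  // One mini-batch: check/hand off first, then append the batch's rows.
  def appendBatch(batchBytes: Long): Unit = {
    checkOrHandOff()
    current = current.copy(bytes = current.bytes + batchBytes)
  }

  def currentSize: Long = current.bytes
}
```

Note that the handoff happens lazily at the start of the following batch, so a segment can temporarily exceed the configured maximum, which matches the "at the begin of each batch" behavior described for Structured Streaming.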


