http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-New-Feature-Streaming-Ingestion-into-CarbonData-tp9724p9784.html
1. In this phase, is it still using columnar format? Save to a file for every
mini batch? Closing of the file may not be needed, as HDFS allows a single
writer with multiple readers. But yes, it will require us to maintain a ...
2. How to map the partition concept in Spark to files in the streaming segment?
3. Phase-2: Add append support if not done in phase 1. Maintain append offsets ...
> Hi Aniket,
>
> This feature looks great, the overall plan also seems fine to me. Thanks
> for proposing it.
> And I have some doubts inline.
>
> > On 27 March 2017, at 6:34 PM, Aniket Adnaik <[hidden email]> wrote:
> >
> > Hi All,
> >
> > I would like to open up a discussion for a new feature to support streaming
> > ingestion in CarbonData.
> >
> > Please refer to the design document (draft) in the link below.
> >
> > https://drive.google.com/file/d/0B71_EuXTdDi8MlFDU2tqZU9BZ3M/view?usp=sharing
> >
> > Your comments/suggestions are welcome.
> > Here are some high level points.
> >
> > Rationale:
> > The current ways of adding user data to a CarbonData table are the LOAD
> > statement and the INSERT INTO statement with a SELECT query. These methods
> > add data in bulk to the CarbonData table as a new segment; essentially, they
> > are batch insertions of large chunks of data. However, with the increasing
> > demand for real-time data analytics with streaming frameworks, CarbonData
> > needs a way to insert streaming data continuously into a CarbonData table.
> > CarbonData needs support for continuous and faster ingestion into a
> > CarbonData table, and it must make that data available for querying.
> >
> > CarbonData can leverage our newly introduced V3 format to append streaming
> > data to an existing carbon table.
> >
> >
> > Requirements:
> >
> > Following are some high-level requirements:
> > 1. CarbonData shall create a new segment (Streaming Segment) for each
> > streaming session. Concurrent streaming ingestions into the same table will
> > create separate streaming segments.
> >
> > 2. CarbonData shall use a write-optimized format (instead of the multi-layered
> > indexed columnar format) to support ingestion of streaming data into a
> > CarbonData table.
> >
> > 3. CarbonData shall create a streaming segment folder and open a streaming
> > data file in append mode to write data. CarbonData should avoid creating
> > multiple small files by appending to an existing file.
> >
> > 4. The data stored in the new streaming segment shall be available for query
> > after it is written to disk (hflush/hsync). In other words, CarbonData
> > readers should be able to query the data written to the streaming segment so
> > far.
> >
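To make requirements 2-4 concrete, here is a minimal sketch of a write-optimized
append path on HDFS: one data file per streaming segment, simple length-prefixed
rows instead of the indexed columnar layout, and an hflush after every mini batch
so readers can see the data without the file being closed. The paths, file names,
and row layout are illustrative assumptions, not CarbonData's actual streaming
format. (Sketch in Scala, against the Hadoop FileSystem API.)

    import java.nio.charset.StandardCharsets

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object StreamingSegmentWriterSketch {
      def main(args: Array[String]): Unit = {
        val fs = FileSystem.get(new Configuration())

        // Hypothetical layout: one streaming-segment folder with one data file.
        val segmentDir =
          new Path("/user/carbon/store/default/t1/Fact/Part0/Segment_stream_0")
        val dataFile = new Path(segmentDir, "part-0")
        if (!fs.exists(segmentDir)) fs.mkdirs(segmentDir)

        // Requirement 3: append to the existing file when possible instead of
        // creating a new small file for every mini batch.
        val out = if (fs.exists(dataFile)) fs.append(dataFile) else fs.create(dataFile)

        // Requirement 2: write-optimized layout -- simple length-prefixed rows,
        // no columnar encoding and no multi-level index.
        val miniBatch = Seq("1,click,2017-03-27 18:34:00", "2,view,2017-03-27 18:34:01")
        miniBatch.foreach { row =>
          val bytes = row.getBytes(StandardCharsets.UTF_8)
          out.writeInt(bytes.length)
          out.write(bytes)
        }

        // Requirement 4: hflush makes the appended bytes visible to readers while
        // the file stays open for further appends (HDFS single writer, many readers).
        out.hflush()

        out.close()
      }
    }

hflush makes the data visible to new readers; hsync additionally forces it to
disk, trading some latency for stronger durability.
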
> > 5. CarbonData should acknowledge the write operation status back to the
> > output sink/upper-layer streaming engine, so that in the case of a write
> > failure the streaming engine can restart the operation and maintain
> > exactly-once delivery semantics.
> >
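Requirement 5 lines up with the Sink contract in Spark 2.x Structured Streaming,
where addBatch(batchId, data) either returns (acknowledging the batch) or throws
(causing the engine to retry the same batch id). A rough sketch of the idempotence
check; the write and metadata helpers are hypothetical placeholders:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Sink

    class CarbonStreamSinkSketch extends Sink {

      // A real implementation would persist this with the segment metadata so
      // that it survives a driver restart.
      @volatile private var lastCommittedBatchId: Long = -1L

      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        if (batchId <= lastCommittedBatchId) {
          // The engine re-delivered a batch after a failure, but its data is
          // already persisted; doing nothing here keeps delivery exactly-once.
        } else {
          // Append the mini batch to the streaming segment and flush it
          // (see the append/hflush sketch above).
          writeToStreamingSegment(data)
          // Record the commit only after the data is durable; if the write throws,
          // Spark retries the same batchId and the operation restarts.
          lastCommittedBatchId = batchId
        }
      }

      private def writeToStreamingSegment(data: DataFrame): Unit = {
        // Placeholder for the real append + hflush + metadata-update logic.
      }
    }
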
> > 6. The CarbonData compaction process shall support compacting data from the
> > write-optimized streaming segment into the regular read-optimized columnar
> > CarbonData format.
> >
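For requirement 6, one possible shape of the hand-off is to re-read the streaming
segment and rewrite it through the normal batch insert path. The sketch below
assumes the write-optimized rows can be parsed as delimited text; the data source
name, options, and paths are made up for illustration:

    import org.apache.spark.sql.SparkSession

    object StreamingSegmentCompactionSketch {
      // Re-reads a streaming segment and rewrites it as a read-optimized segment.
      def compact(spark: SparkSession): Unit = {
        // Assumption: the streaming rows can be parsed as CSV text.
        val streamed = spark.read
          .schema("id INT, event STRING, ts TIMESTAMP")
          .csv("/user/carbon/store/default/t1/Fact/Part0/Segment_stream_0")

        // Rewrite as a regular read-optimized (indexed, columnar) segment; the
        // streaming segment would then be marked as compacted in the table status.
        streamed.write
          .format("carbondata")        // hypothetical data source name
          .option("tableName", "t1")   // hypothetical option
          .mode("append")
          .save()
      }
    }
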
> > 7. CarbonData readers should maintain read consistency by using timestamps.
> >
> > 8. Maintain durability - in case of a write failure, CarbonData should be
> > able to recover to the latest committed state. This may require maintaining
> > the source and destination offsets of the last commit in metadata.
> >
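Requirements 7 and 8 both hinge on a small commit record written after every
durable flush. Below is a sketch of such a record and how a reader and a
recovering writer might use it; the file layout and field names are assumptions
for illustration only:

    import java.nio.charset.StandardCharsets

    import org.apache.hadoop.fs.{FileSystem, Path}

    // One record per successfully committed mini batch.
    case class StreamCommit(
        commitTimeMs: Long,      // used by readers for timestamp-based snapshots
        dataFileEndOffset: Long, // number of valid bytes in the streaming data file
        sourceOffsets: String)   // e.g. serialized source (Kafka) offsets

    object StreamCommitLogSketch {
      private val commitFile =
        new Path("/user/carbon/store/default/t1/Fact/Part0/Segment_stream_0/commit_meta")

      // Writer side (requirement 8): after a durable flush, persist the commit.
      // A real implementation would write to a temp file and rename it atomically.
      def recordCommit(fs: FileSystem, c: StreamCommit): Unit = {
        val out = fs.create(commitFile, true)
        out.write(s"${c.commitTimeMs},${c.dataFileEndOffset},${c.sourceOffsets}"
          .getBytes(StandardCharsets.UTF_8))
        out.close()
      }

      // Reader side (requirement 7): take a snapshot timestamp at query start and
      // read only data committed at or before it, up to the recorded end offset.
      def visibleEndOffset(fs: FileSystem, snapshotTimeMs: Long): Long = {
        val in = fs.open(commitFile)
        val line = scala.io.Source.fromInputStream(in, "UTF-8").mkString.trim
        in.close()
        val Array(commitTime, endOffset, _) = line.split(",", 3)
        if (commitTime.toLong <= snapshotTimeMs) endOffset.toLong else 0L
      }
    }

On restart after a failure, the writer would ignore (or truncate) any bytes in the
data file beyond dataFileEndOffset and resume its source from sourceOffsets, which
brings the table back to the latest committed state.
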
> > This feature can be done in phases:
> >
> > Phase-1: Add the basic framework and writer support to allow Spark Structured
> > Streaming into CarbonData. This phase may or may not have append support.
> > Add reader support to read streaming data files.
> >
> 1. In this phase, is it still using columnar format? Save to a file for
> every mini batch? If so, it is only readable after the file has been closed,
> and some metadata needs to be kept to indicate the availability of the new
> file.
> 2. How to map the partition concept in Spark to files in the streaming
> segment? I guess some small files will be created, right?
>
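From the user's side, Phase-1 would presumably look like an ordinary Spark
Structured Streaming query writing to a CarbonData sink. The sketch below uses
only standard Spark APIs; the "carbondata" format name and its options are
assumptions, since they are exactly what this proposal would introduce. It also
hints at one possible answer to question 2: each task of a mini batch (one per
Spark partition) could append to its own file inside the streaming segment.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object StreamIntoCarbonSketch {
      def main(args: Array[String]): Unit = {
        val spark =
          SparkSession.builder().appName("carbon-stream-ingest").getOrCreate()

        // Any streaming source works; a socket source keeps the sketch small.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", "9099")
          .load()

        // Each trigger produces one mini batch; each of its tasks (one per Spark
        // partition) could append to its own file inside the streaming segment.
        val query = lines.writeStream
          .format("carbondata")                         // hypothetical sink name
          .option("dbName", "default")                  // hypothetical option
          .option("tableName", "t1")                    // hypothetical option
          .option("checkpointLocation", "/tmp/carbon_ckpt")
          .trigger(Trigger.ProcessingTime("10 seconds"))
          .start()

        query.awaitTermination()
      }
    }
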
> > Phase-2: Add append support if not done in phase 1. Maintain append
> > offsets and metadata information.
> >
> Is the streaming data file format implemented in this phase?
>
> > Phase-3: Add support for external streaming frameworks such as Kafka
> > streaming using Spark Structured Streaming; maintain
> > topics/partitions/offsets and support fault tolerance.
> >
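For Phase-3, Spark's built-in Kafka source already tracks topics, partitions, and
offsets in its checkpoint, so the CarbonData sink mainly needs to stay idempotent
per batch (as in the sink sketch above). A minimal sketch with made-up broker and
topic names:

    import org.apache.spark.sql.SparkSession

    object KafkaIntoCarbonSketch {
      def main(args: Array[String]): Unit = {
        val spark =
          SparkSession.builder().appName("kafka-into-carbon").getOrCreate()

        // The Kafka source records topic/partition/offset state in the checkpoint.
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "carbon_events")
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(value AS STRING) AS line")

        events.writeStream
          .format("carbondata")                         // hypothetical sink, as above
          .option("tableName", "t1")
          .option("checkpointLocation", "/tmp/carbon_kafka_ckpt")
          .start()
          .awaitTermination()
      }
    }
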
> > Phase-4: Add support for other streaming frameworks, such as Flink, Beam,
> > etc.
> >
> > Phase-5: Future support for an in-memory cache for buffering streaming data,
> > and support for union with Spark Structured Streaming so that data can be
> > served directly from the stream. Also add support for time-series data.
> >
> > Best Regards,
> > Aniket
> >
>
>
>
>