http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-New-Feature-Streaming-Ingestion-into-CarbonData-tp9724p9784.html
1. In this phase, is it still using columnar format? Save to a file for every
mini batch? Closing of the file may not be needed, as HDFS allows a single
writer with multiple readers. But yes, it will require us to maintain a ...
2. How to map the partition concept in Spark to files in the streaming segment?
3. Phase-2: Add append support if not done in phase 1. Maintain append offsets ...
> Hi Aniket,
>
> This feature looks great, the overall plan also seems fine to me. Thanks
> for proposing it.
> And I have some doubts inline.
>
> > On 27 March 2017, at 6:34 PM, Aniket Adnaik <[hidden email]> wrote:
> >
> > Hi All,
> >
> > I would like to open up a discussion for a new feature to support streaming
> > ingestion in CarbonData.
> >
> > Please refer to the design document (draft) in the link below.
> >
> > https://drive.google.com/file/d/0B71_EuXTdDi8MlFDU2tqZU9BZ3M/view?usp=sharing
> >
> > Your comments/suggestions are welcome.
> > Here are some high level points.
> >
> > Rationale:
> > The current ways of adding user data to a CarbonData table are the LOAD
> > statement and the INSERT INTO statement with a SELECT query. These methods
> > add data in bulk to the CarbonData table as a new segment; essentially, they
> > are batch insertions of large chunks of data. However, with the increasing
> > demand for real-time data analytics with streaming frameworks, CarbonData
> > needs a way to insert streaming data continuously into a CarbonData table.
> > CarbonData needs support for continuous and faster ingestion into a
> > CarbonData table, and it must make that data available for querying.
> >
> > CarbonData can leverage our newly introduced V3 format to append streaming
> > data to an existing carbon table.
> >
> >
> > Requirements:
> >
> > Following are some high-level requirements:
> > 1. CarbonData shall create a new segment (Streaming Segment) for each
> > streaming session. Concurrent streaming ingestions into the same table will
> > create separate streaming segments.
> >
> > 2. CarbonData shall use a write-optimized format (instead of the multi-layered
> > indexed columnar format) to support ingestion of streaming data into a
> > CarbonData table.
> >
> > 3. CarbonData shall create a streaming segment folder and open a streaming
> > data file in append mode to write data. CarbonData should avoid creating
> > multiple small files by appending to an existing file.
> >
> > 4. The data stored in the new streaming segment shall be available for query
> > after it is written to disk (hflush/hsync). In other words, CarbonData
> > readers should be able to query the data written to the streaming segment so
> > far.
> >
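To make requirements 2-4 concrete, here is a minimal sketch of a write-optimized
append path on HDFS: one data file per streaming segment, simple length-prefixed
rows instead of the indexed columnar layout, and an hflush after every mini batch
so readers can see the data without the file being closed. The paths, file names,
and row layout are illustrative assumptions, not CarbonData's actual streaming
format. (Sketch in Scala, against the Hadoop FileSystem API.)

    import java.nio.charset.StandardCharsets

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object StreamingSegmentWriterSketch {
      def main(args: Array[String]): Unit = {
        val fs = FileSystem.get(new Configuration())

        // Hypothetical layout: one streaming-segment folder with one data file.
        val segmentDir =
          new Path("/user/carbon/store/default/t1/Fact/Part0/Segment_stream_0")
        val dataFile = new Path(segmentDir, "part-0")
        if (!fs.exists(segmentDir)) fs.mkdirs(segmentDir)

        // Requirement 3: append to the existing file when possible instead of
        // creating a new small file for every mini batch.
        val out = if (fs.exists(dataFile)) fs.append(dataFile) else fs.create(dataFile)

        // Requirement 2: write-optimized layout -- simple length-prefixed rows,
        // no columnar encoding and no multi-level index.
        val miniBatch = Seq("1,click,2017-03-27 18:34:00", "2,view,2017-03-27 18:34:01")
        miniBatch.foreach { row =>
          val bytes = row.getBytes(StandardCharsets.UTF_8)
          out.writeInt(bytes.length)
          out.write(bytes)
        }

        // Requirement 4: hflush makes the appended bytes visible to readers while
        // the file stays open for further appends (HDFS single writer, many readers).
        out.hflush()

        out.close()
      }
    }

hflush makes the data visible to new readers; hsync additionally forces it to
disk, trading some latency for stronger durability.
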
> > 5. CarbonData should acknowledge the write operation status back to the
> > output sink/upper-layer streaming engine, so that in the case of a write
> > failure the streaming engine can restart the operation and maintain
> > exactly-once delivery semantics.
> >
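Requirement 5 lines up with the Sink contract in Spark 2.x Structured Streaming,
where addBatch(batchId, data) either returns (acknowledging the batch) or throws
(causing the engine to retry the same batch id). A rough sketch of the idempotence
check; the write and metadata helpers are hypothetical placeholders:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Sink

    class CarbonStreamSinkSketch extends Sink {

      // A real implementation would persist this with the segment metadata so
      // that it survives a driver restart.
      @volatile private var lastCommittedBatchId: Long = -1L

      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        if (batchId <= lastCommittedBatchId) {
          // The engine re-delivered a batch after a failure, but its data is
          // already persisted; doing nothing here keeps delivery exactly-once.
        } else {
          // Append the mini batch to the streaming segment and flush it
          // (see the append/hflush sketch above).
          writeToStreamingSegment(data)
          // Record the commit only after the data is durable; if the write throws,
          // Spark retries the same batchId and the operation restarts.
          lastCommittedBatchId = batchId
        }
      }

      private def writeToStreamingSegment(data: DataFrame): Unit = {
        // Placeholder for the real append + hflush + metadata-update logic.
      }
    }
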
> > 6. The CarbonData compaction process shall support compacting data from the
> > write-optimized streaming segment into the regular read-optimized columnar
> > CarbonData format.
> >
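For requirement 6, one possible shape of the hand-off is to re-read the streaming
segment and rewrite it through the normal batch insert path. The sketch below
assumes the write-optimized rows can be parsed as delimited text; the data source
name, options, and paths are made up for illustration:

    import org.apache.spark.sql.SparkSession

    object StreamingSegmentCompactionSketch {
      // Re-reads a streaming segment and rewrites it as a read-optimized segment.
      def compact(spark: SparkSession): Unit = {
        // Assumption: the streaming rows can be parsed as CSV text.
        val streamed = spark.read
          .schema("id INT, event STRING, ts TIMESTAMP")
          .csv("/user/carbon/store/default/t1/Fact/Part0/Segment_stream_0")

        // Rewrite as a regular read-optimized (indexed, columnar) segment; the
        // streaming segment would then be marked as compacted in the table status.
        streamed.write
          .format("carbondata")        // hypothetical data source name
          .option("tableName", "t1")   // hypothetical option
          .mode("append")
          .save()
      }
    }
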
> > 7. CarbonData readers should maintain read consistency by using timestamps.
> >
> > 8. Maintain durability - in case of a write failure, CarbonData should be
> > able to recover to the latest committed state. This may require maintaining
> > the source and destination offsets of the last commit in metadata.
> >
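Requirements 7 and 8 both hinge on a small commit record written after every
durable flush. Below is a sketch of such a record and how a reader and a
recovering writer might use it; the file layout and field names are assumptions
for illustration only:

    import java.nio.charset.StandardCharsets

    import org.apache.hadoop.fs.{FileSystem, Path}

    // One record per successfully committed mini batch.
    case class StreamCommit(
        commitTimeMs: Long,      // used by readers for timestamp-based snapshots
        dataFileEndOffset: Long, // number of valid bytes in the streaming data file
        sourceOffsets: String)   // e.g. serialized source (Kafka) offsets

    object StreamCommitLogSketch {
      private val commitFile =
        new Path("/user/carbon/store/default/t1/Fact/Part0/Segment_stream_0/commit_meta")

      // Writer side (requirement 8): after a durable flush, persist the commit.
      // A real implementation would write to a temp file and rename it atomically.
      def recordCommit(fs: FileSystem, c: StreamCommit): Unit = {
        val out = fs.create(commitFile, true)
        out.write(s"${c.commitTimeMs},${c.dataFileEndOffset},${c.sourceOffsets}"
          .getBytes(StandardCharsets.UTF_8))
        out.close()
      }

      // Reader side (requirement 7): take a snapshot timestamp at query start and
      // read only data committed at or before it, up to the recorded end offset.
      def visibleEndOffset(fs: FileSystem, snapshotTimeMs: Long): Long = {
        val in = fs.open(commitFile)
        val line = scala.io.Source.fromInputStream(in, "UTF-8").mkString.trim
        in.close()
        val Array(commitTime, endOffset, _) = line.split(",", 3)
        if (commitTime.toLong <= snapshotTimeMs) endOffset.toLong else 0L
      }
    }

On restart after a failure, the writer would ignore (or truncate) any bytes in the
data file beyond dataFileEndOffset and resume its source from sourceOffsets, which
brings the table back to the latest committed state.
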
> > This feature can be done in phases:
> >
> > Phase-1: Add the basic framework and writer support to allow Spark Structured
> > Streaming into CarbonData. This phase may or may not have append support.
> > Add reader support to read streaming data files.
> >
> 1. In this phase, is it still using columnar format? Save to a file for
> every mini batch? If so, it is only readable after the file has been closed,
> and some metadata needs to be kept to indicate the availability of the new
> file.
> 2. How to map the partition concept in Spark to files in the streaming
> segment? I guess some small files will be created, right?
>
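From the user's side, Phase-1 would presumably look like an ordinary Spark
Structured Streaming query writing to a CarbonData sink. The sketch below uses
only standard Spark APIs; the "carbondata" format name and its options are
assumptions, since they are exactly what this proposal would introduce. It also
hints at one possible answer to question 2: each task of a mini batch (one per
Spark partition) could append to its own file inside the streaming segment.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object StreamIntoCarbonSketch {
      def main(args: Array[String]): Unit = {
        val spark =
          SparkSession.builder().appName("carbon-stream-ingest").getOrCreate()

        // Any streaming source works; a socket source keeps the sketch small.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", "9099")
          .load()

        // Each trigger produces one mini batch; each of its tasks (one per Spark
        // partition) could append to its own file inside the streaming segment.
        val query = lines.writeStream
          .format("carbondata")                         // hypothetical sink name
          .option("dbName", "default")                  // hypothetical option
          .option("tableName", "t1")                    // hypothetical option
          .option("checkpointLocation", "/tmp/carbon_ckpt")
          .trigger(Trigger.ProcessingTime("10 seconds"))
          .start()

        query.awaitTermination()
      }
    }
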
> > Phase-2: Add append support if not done in phase 1. Maintain append
> > offsets and metadata information.
> >
> Is the streaming data file format implemented in this phase?
>
> > Phase-3: Add support for external streaming frameworks such as Kafka
> > streaming using Spark Structured Streaming; maintain
> > topics/partitions/offsets and support fault tolerance.
> >
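For Phase-3, Spark's built-in Kafka source already tracks topics, partitions, and
offsets in its checkpoint, so the CarbonData sink mainly needs to stay idempotent
per batch (as in the sink sketch above). A minimal sketch with made-up broker and
topic names:

    import org.apache.spark.sql.SparkSession

    object KafkaIntoCarbonSketch {
      def main(args: Array[String]): Unit = {
        val spark =
          SparkSession.builder().appName("kafka-into-carbon").getOrCreate()

        // The Kafka source records topic/partition/offset state in the checkpoint.
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "carbon_events")
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(value AS STRING) AS line")

        events.writeStream
          .format("carbondata")                         // hypothetical sink, as above
          .option("tableName", "t1")
          .option("checkpointLocation", "/tmp/carbon_kafka_ckpt")
          .start()
          .awaitTermination()
      }
    }
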
> > Phase-4: Add support for other streaming frameworks, such as Flink, Beam,
> > etc.
> >
> > Phase-5: Future support for an in-memory cache for buffering streaming data,
> > and support for union with Spark Structured Streaming so that data can be
> > served directly from the stream. Also add support for time-series data.
> >
> > Best Regards,
> > Aniket
> >
>
>
>
>