Posted by Akash R Nilugal (Jira) on May 20, 2017, 3:52am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/jira-Created-CARBONDATA-1072-Streaming-Ingestion-Feature-tp13024.html
Aniket Adnaik created CARBONDATA-1072:
-----------------------------------------
Summary: Streaming Ingestion Feature
Key: CARBONDATA-1072
URL: https://issues.apache.org/jira/browse/CARBONDATA-1072
Project: CarbonData
Issue Type: New Feature
Components: core, data-load, data-query, examples, file-format, spark-integration, sql
Affects Versions: 1.1.0
Reporter: Aniket Adnaik
Fix For: 1.2.0
High-level breakdown of work items / implementation phases:
A design document will be attached soon.
Phase 1: Spark Structured Streaming with regular CarbonData format
----------------------------
This phase will mainly focus on supporting streaming ingestion using
Spark Structured Streaming.
1. Write Path Implementation (see the sketch after this item)
   - Integration with Spark's Structured Streaming framework
     (FileStreamSink etc.)
   - StreamingOutputWriter (StreamingOutputWriterFactory)
   - Prepare write (schema validation, segment creation,
     streaming file creation, etc.)
   - StreamingRecordWriter (data conversion from Catalyst InternalRow
     to a CarbonData-compatible format, making use of the new load path)
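From the user's side, the write path could look roughly like the
following. This is a minimal sketch, assuming a hypothetical
"carbondata" streaming sink is registered through Spark's data source
interface; the format name, options, and paths are illustrative, not a
finalized API:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    object StreamingIngestSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("CarbonStreamingIngest")
          .getOrCreate()

        // Schema of the incoming stream (illustrative columns).
        val schema = StructType(Seq(
          StructField("id", LongType, nullable = false),
          StructField("name", StringType, nullable = true)))

        // Any Structured Streaming source works; a CSV file source keeps it simple.
        val input = spark.readStream
          .schema(schema)
          .csv("/tmp/stream_input")

        // Hypothetical "carbondata" sink: Spark would resolve the format name to
        // the StreamingOutputWriterFactory described above, which validates the
        // schema, creates the streaming segment, and feeds StreamingRecordWriter.
        val query = input.writeStream
          .format("carbondata")
          .option("checkpointLocation", "/tmp/carbon_checkpoint")
          .option("dbName", "default")
          .option("tableName", "stream_table")
          .start()

        query.awaitTermination()
      }
    }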
2. Read Path Implementation (some overlap with Phase 2; pruning sketch below)
   - Modify getSplits() to read from the streaming segment
   - Read committed info from the metadata to get the correct offsets
   - Make use of the min-max index if available
   - Use a sequential scan; the data is unsorted, so the B-tree index
     cannot be used
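To illustrate the min-max pruning mentioned above: since rows inside a
streaming segment are unsorted, the reader can only skip whole blocklets
whose statistics rule them out, and must scan the survivors
sequentially. A rough sketch with hypothetical names:

    // Per-blocklet statistics kept in the streaming segment's metadata
    // (names and types are hypothetical).
    case class BlockletStats(offset: Long, length: Long, min: Long, max: Long)

    // Keep only blocklets whose [min, max] range can satisfy a filter such as
    // "col BETWEEN lo AND hi"; every kept blocklet still needs a full
    // sequential scan because rows within it are unsorted.
    def pruneBlocklets(stats: Seq[BlockletStats], lo: Long, hi: Long): Seq[BlockletStats] =
      stats.filter(b => b.max >= lo && b.min <= hi)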
3. Compaction (DDL example below)
   - Minor Compaction
   - Major Compaction
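For reference, CarbonData already exposes compaction through DDL, and
streaming segments would presumably be handed over to the same
mechanism once they are closed. Using the SparkSession from the earlier
sketch (table name illustrative):

    // Minor compaction merges small (e.g. freshly ingested) segments;
    // major compaction merges segments up to a configured size threshold.
    spark.sql("ALTER TABLE stream_table COMPACT 'MINOR'")
    spark.sql("ALTER TABLE stream_table COMPACT 'MAJOR'")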
4. Metadata Management (record sketch below)
   - Streaming metadata store (e.g. offsets, timestamps, etc.)
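A sketch of the kind of record the streaming metadata store might keep
per committed batch; all field names here are hypothetical:

    // One entry per successfully committed micro-batch. Readers use
    // committedOffset as the durable watermark: getSplits() must never
    // hand out bytes beyond it.
    case class StreamingBatchMeta(
        batchId: Long,         // Structured Streaming batch/epoch id
        segmentId: String,     // streaming segment that received the batch
        committedOffset: Long, // last durable byte offset in the streaming file
        commitTimestamp: Long) // wall-clock commit time, for monitoring/recovery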
5. Failure Recovery (hflush sketch below)
   - Rollback on failure
   - Handle asynchronous writes to CarbonData (using hflush)
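On HDFS, the hflush-based durability could look roughly like this.
FSDataOutputStream.hflush() is a real HDFS API; the path and the
commit helper are hypothetical:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    class StreamingFileWriter(path: String) {
      private val fs = FileSystem.get(new Configuration())
      // The file stays open across batches; hflush makes each batch visible.
      private val out = fs.append(new Path(path))

      def appendBatch(encodedRows: Array[Byte], newOffset: Long): Unit = {
        out.write(encodedRows)
        // hflush() pushes the bytes to the datanodes without closing the
        // file, so concurrent readers can see everything up to newOffset.
        out.hflush()
        // Record the durable watermark only after a successful hflush; on
        // failure the metadata still points at the previous offset, which is
        // effectively the rollback.
        commitOffset(newOffset)
      }

      def close(): Unit = out.close()

      // Hypothetical stand-in for the streaming metadata store update.
      private def commitOffset(offset: Long): Unit = ()
    }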
----------------------------
Phase 2: Spark Structured Streaming with appendable CarbonData format
1. Streaming File Format (illustrative layout below)
   - Writers use the V3 file format for appending columnar, unsorted
     data blocklets
   - Modify readers to read from the appendable streaming file format
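One way to picture the appendable format (illustrative only, not the
actual V3 specification): the file grows as a sequence of
self-describing columnar blocklets, and the footer is written only when
the segment is finalized:

    // Illustrative on-disk layout of an appendable streaming file:
    //
    //   [blocklet 0: header | column chunks | min/max stats]
    //   [blocklet 1: header | column chunks | min/max stats]
    //   ...                                   <- appended and hflush-ed
    //   [footer: blocklet index]              <- written only on finalize
    //
    // Each blocklet is self-describing, so a reader that stops at the last
    // committed offset can decode everything before it without the footer.
    case class BlockletHeader(rowCount: Int, byteLength: Int) // hypothetical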
-----------------------------
Phase 3:
1. Interoperability Support
   - Functionality with other features/components
   - Concurrent queries with streaming ingestion
   - Concurrent operations with streaming ingestion (e.g. compaction,
     alter table, secondary index, etc.)
2. Kafka Connect Ingestion / CarbonData connector (SinkTask skeleton below)
   - Direct ingestion from Kafka Connect without Spark Structured
     Streaming
   - Separate Kafka connector to receive data through a network port
   - Data commit and offset management
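A skeleton of what the Kafka Connect side could look like, using the
standard Kafka Connect SinkTask API; the class itself and the writer it
would delegate to are hypothetical:

    import java.util
    import org.apache.kafka.connect.sink.{SinkRecord, SinkTask}

    // Hypothetical CarbonData sink task: Kafka Connect drives put() with
    // batches of records and manages the consumer offsets, which lines up
    // with the "data commit and offset management" item above.
    class CarbonSinkTask extends SinkTask {
      override def version(): String = "0.1.0-SNAPSHOT"

      override def start(props: util.Map[String, String]): Unit = {
        // Open the streaming segment writer using the connector configuration.
      }

      override def put(records: util.Collection[SinkRecord]): Unit = {
        // Convert each SinkRecord's value to CarbonData's row format and
        // append it to the streaming file (conversion/writer omitted here).
      }

      override def stop(): Unit = {
        // Flush and close the streaming segment writer.
      }
    }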
-----------------------------
Phase 4: Support for other streaming engines
   - Analysis of streaming APIs/interfaces of other streaming engines
   - Implementation of connectors for different streaming engines (Storm,
     Flink, Flume, etc.)
-----------------------------
Phase 5: In-memory streaming table (probable feature)
-----------------------------
1. In-memory cache for streaming data (WAL sketch below)
   - Fault-tolerant in-memory buffering / checkpointing with a WAL
   - Readers read from in-memory tables if available
   - Background threads for writing streaming data, etc.
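A rough sketch of the fault-tolerant buffering idea: every row is
logged to a write-ahead log before it becomes visible in memory, so the
buffer can be replayed after a crash. All names are hypothetical:

    import java.io.DataOutputStream
    import scala.collection.mutable.ArrayBuffer

    class InMemoryStreamBuffer(wal: DataOutputStream) {
      private val rows = ArrayBuffer.empty[Array[Byte]]

      def append(row: Array[Byte]): Unit = synchronized {
        // Durability first: length-prefixed record into the WAL...
        wal.writeInt(row.length)
        wal.write(row)
        wal.flush()
        // ...then visibility: readers see the row only once it is recoverable.
        rows += row
      }

      // Readers (queries) scan the in-memory copy when it is available.
      def snapshot(): Seq[Array[Byte]] = synchronized { rows.toSeq }
    }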