[jira] [Updated] (CARBONDATA-1072) Streaming Ingestion Feature


Akash R Nilugal (Jira)

     [ https://issues.apache.org/jira/browse/CARBONDATA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Adnaik updated CARBONDATA-1072:
--------------------------------------
    Request participants:   (was: )
             Description:
High-level breakdown of work items / implementation phases:
A design document will be attached soon.
 
Phase 1: Spark Structured Streaming with the regular CarbonData format
----------------------------
    This phase mainly focuses on supporting streaming ingestion through
    Spark Structured Streaming.
    1. Write Path Implementation (illustrative sketch below)
       - Integration with Spark's Structured Streaming framework
         (FileStreamSink etc.)
       - StreamingOutputWriter (StreamingOutputWriterFactory)
       - Prepare write (schema validation, segment creation,
         streaming file creation, etc.)
       - StreamingRecordWriter (conversion from Catalyst InternalRow
         to a CarbonData-compatible format, using the new load path)
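
    As an illustration of the intended write path, here is a minimal sketch
    of what the user-facing side could look like once a Structured Streaming
    sink for CarbonData exists. Only the Spark APIs are real here; the
    "carbondata" sink name and the dbName/tableName options are assumptions
    for this sketch, not a committed interface.

      // Hypothetical usage sketch: stream rows from a socket source into a
      // CarbonData table via Spark Structured Streaming.
      import org.apache.spark.sql.SparkSession

      object StreamingIngestSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("CarbonDataStreamingIngest")
            .getOrCreate()

          // Any streaming source works; a socket source keeps the example small.
          val lines = spark.readStream
            .format("socket")
            .option("host", "localhost")
            .option("port", "9099")
            .load()

          // The sink would route rows through the StreamingOutputWriter and
          // StreamingRecordWriter listed above.
          val query = lines.writeStream
            .format("carbondata")                                    // assumed sink name
            .option("checkpointLocation", "/tmp/carbon_stream_ckpt")
            .option("dbName", "default")                             // assumed option
            .option("tableName", "stream_table")                     // assumed option
            .start()

          query.awaitTermination()
        }
      }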

     2. Read Path Implementation (some overlap with Phase 2)
      - Modify getSplits() to read from the streaming segment
      - Read committed info from the metadata to get the correct offsets
      - Make use of the min-max index if available
      - Use a sequential scan - the data is unsorted, so the B-tree index
        cannot be used (see the pruning sketch below)
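
    To make the min-max point concrete, the sketch below isolates the
    block-pruning idea. The StreamingBlock type and its fields are
    hypothetical, not CarbonData internals: a block is skipped only when its
    min/max range cannot intersect the filter range, and everything else
    falls back to a sequential scan.

      // Hypothetical types for illustration: per-block min/max statistics let
      // the reader skip blocks whose value range cannot satisfy a range filter.
      case class StreamingBlock(path: String, committedOffset: Long, min: Long, max: Long)

      object MinMaxPruning {
        // Keep only blocks whose [min, max] range can contain values in [lo, hi].
        def prune(blocks: Seq[StreamingBlock], lo: Long, hi: Long): Seq[StreamingBlock] =
          blocks.filter(b => b.max >= lo && b.min <= hi)

        def main(args: Array[String]): Unit = {
          val blocks = Seq(
            StreamingBlock("part-0", committedOffset = 1000, min = 1,   max = 50),
            StreamingBlock("part-1", committedOffset = 2000, min = 40,  max = 120),
            StreamingBlock("part-2", committedOffset = 3000, min = 500, max = 900))

          // A filter such as "col BETWEEN 100 AND 200" only needs part-1.
          println(prune(blocks, 100, 200).map(_.path))   // List(part-1)
        }
      }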

    3. Compaction
     - Minor Compaction
     - Major Compaction
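
    CarbonData already exposes compaction through DDL; whether the same
    statements will also pick up streaming segments is part of this work
    item, so the lines below are illustrative rather than final.

      // Existing CarbonData compaction DDL issued through spark.sql (assumes a
      // CarbonData-enabled SparkSession); coverage of streaming segments is TBD.
      import org.apache.spark.sql.SparkSession

      object CompactionSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("CompactionSketch").getOrCreate()

          // Merge a configured number of recent small segments.
          spark.sql("ALTER TABLE default.stream_table COMPACT 'MINOR'")

          // Merge all eligible segments into one.
          spark.sql("ALTER TABLE default.stream_table COMPACT 'MAJOR'")
        }
      }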

   4. Metadata Management
     - Streaming metadata store (offsets, timestamps, etc.; sketch below)
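
    A rough sketch of what one record in such a store could hold; the field
    names are assumptions, and the CSV persistence is only to keep the
    example self-contained.

      // Hypothetical shape of a streaming metadata record: one entry per
      // streaming segment, with the committed offsets and timestamps.
      import java.io.{File, PrintWriter}

      case class StreamingSegmentStatus(
          segmentId: String,
          batchId: Long,          // Structured Streaming batch that produced the data
          committedOffset: Long,  // last source offset durably written
          commitTimeMs: Long)     // wall-clock commit time

      object StreamingMetadataStore {
        // Persist status records as simple CSV lines (illustrative format only).
        def save(statuses: Seq[StreamingSegmentStatus], file: File): Unit = {
          val out = new PrintWriter(file)
          try statuses.foreach { s =>
            out.println(s"${s.segmentId},${s.batchId},${s.committedOffset},${s.commitTimeMs}")
          } finally out.close()
        }
      }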
   
   5. Failure Recovery
      - Rollback on failure
      - Handle asynchronous writes to CarbonData (using hflush)
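
    The hflush idea relies on the standard Hadoop FileSystem API: bytes
    appended to the streaming file are flushed to the datanodes and become
    readable before the commit offset is recorded, and recovery rolls back
    to the last committed offset. A minimal sketch with placeholder paths:

      // hflush-based durability using the Hadoop FileSystem API.
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      object HflushSketch {
        def main(args: Array[String]): Unit = {
          val fs  = FileSystem.get(new Configuration())
          val out = fs.create(new Path("/tmp/streaming_segment/part-0"))
          try {
            out.write("row-1\n".getBytes("UTF-8"))
            // hflush makes the bytes visible to readers without closing the
            // file; the commit offset is recorded in metadata only after this.
            out.hflush()
          } finally {
            out.close()
          }
        }
      }
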
-----------------------------
Phase 2: Spark Structured Streaming with an appendable CarbonData format
     1. Streaming File Format
      - Writers use the V3 file format to append columnar, unsorted
        data blocklets
      - Modify readers to read from the appendable streaming file format
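
    Purely as an assumption about the eventual user-facing switch, a table
    backed by the appendable streaming format might be declared with a table
    property along the following lines; the 'streaming' property name is a
    guess for illustration, not a decided syntax.

      // Assumed DDL for a table using the appendable streaming format.
      import org.apache.spark.sql.SparkSession

      object StreamingTableDDL {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("StreamingTableDDL").getOrCreate()

          spark.sql(
            """CREATE TABLE IF NOT EXISTS default.stream_table (
              |  id INT,
              |  name STRING,
              |  event_time TIMESTAMP)
              |STORED BY 'carbondata'
              |TBLPROPERTIES ('streaming' = 'true')""".stripMargin)
        }
      }
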
-----------------------------
Phase 3:
    1. Interoperability Support
     - Functionality with other features/components
     - Concurrent queries with streaming ingestion
     - Concurrent operations with streaming ingestion (e.g. compaction,
       alter table, secondary index, etc.)
    2. Kafka Connect Ingestion / CarbonData connector (skeleton below)
     - Direct ingestion from Kafka Connect without Spark Structured
       Streaming
     - Separate Kafka connector to receive data through a network port
     - Data commit and offset management
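
    The skeleton below shows where a Kafka Connect based path would plug in.
    Only the Kafka Connect SinkTask API is real; the CarbonData-facing work
    is left as comments and the class name is hypothetical.

      // Hypothetical CarbonData sink task for Kafka Connect.
      import java.util
      import org.apache.kafka.clients.consumer.OffsetAndMetadata
      import org.apache.kafka.common.TopicPartition
      import org.apache.kafka.connect.sink.{SinkRecord, SinkTask}

      class CarbonDataSinkTask extends SinkTask {
        override def version(): String = "0.1-sketch"

        override def start(props: util.Map[String, String]): Unit = {
          // Read target table name, store location, batch size, etc. from props.
        }

        override def put(records: util.Collection[SinkRecord]): Unit = {
          // Convert SinkRecord values to CarbonData rows and buffer them; a real
          // implementation would hand them to the Phase 1 streaming write path.
          val it = records.iterator()
          while (it.hasNext) {
            val r = it.next()
            println(s"buffering ${r.topic()}-${r.kafkaPartition()}@${r.kafkaOffset()}")
          }
        }

        override def flush(offsets: util.Map[TopicPartition, OffsetAndMetadata]): Unit = {
          // Durably commit buffered data (hflush + metadata update) before Kafka
          // Connect records these offsets, so a restart never loses acked data.
        }

        override def stop(): Unit = {
          // Release writers and file handles.
        }
      }
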
-----------------------------
Phase 4: Support for other streaming engines
    - Analysis of the streaming APIs/interfaces of other streaming engines
    - Implementation of connectors for other streaming engines (Storm,
      Flink, Flume, etc.)
----------------------------
Phase 5: In-memory streaming table (probable feature)
   1. In-memory cache for streaming data (rough sketch below)
     - Fault-tolerant in-memory buffering / checkpointing with a WAL
     - Readers read from in-memory tables if available
     - Background threads for writing streaming data, etc.
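
    A rough illustration of the Phase 5 idea (all names hypothetical): rows
    are appended to a write-ahead log before being added to an in-memory
    buffer, readers serve from memory, and recovery replays the log.

      // WAL-backed in-memory buffer sketch; illustration only.
      import java.io.{BufferedWriter, FileWriter}
      import scala.collection.mutable.ArrayBuffer
      import scala.io.Source

      class WalBackedBuffer(walPath: String) {
        private val buffer = ArrayBuffer.empty[String]
        private val wal = new BufferedWriter(new FileWriter(walPath, /* append = */ true))

        // Durably log the row first, then make it visible to in-memory readers.
        def append(row: String): Unit = synchronized {
          wal.write(row); wal.newLine(); wal.flush()
          buffer += row
        }

        // Readers consume from memory when the table is cached.
        def snapshot(): Seq[String] = synchronized { buffer.toList }

        // Recovery after failure: rebuild the in-memory state from the log.
        def recover(): Unit = synchronized {
          buffer.clear()
          val src = Source.fromFile(walPath)
          try buffer ++= src.getLines() finally src.close()
        }

        def close(): Unit = wal.close()
      }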


> Streaming Ingestion Feature
> ----------------------------
>
>                 Key: CARBONDATA-1072
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1072
>             Project: CarbonData
>          Issue Type: New Feature
>          Components: core, data-load, data-query, examples, file-format, spark-integration, sql
>    Affects Versions: 1.1.0
>            Reporter: Aniket Adnaik
>             Fix For: 1.2.0
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)