[ https://issues.apache.org/jira/browse/CARBONDATA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang Chen updated CARBONDATA-1072:
-----------------------------------
    Fix Version/s:     (was: 1.2.0)
                       NONE

> Streaming Ingestion Feature
> ----------------------------
>
>                 Key: CARBONDATA-1072
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1072
>             Project: CarbonData
>          Issue Type: New Feature
>          Components: core, data-load, data-query, examples, file-format, spark-integration, sql
>    Affects Versions: NONE
>            Reporter: Aniket Adnaik
>             Fix For: NONE
>
> High-level breakdown of work items / implementation phases
> (design document to be attached soon):
>
> Phase 1 - Spark Structured Streaming with the regular CarbonData format
> -----------------------------
> This phase focuses on supporting streaming ingestion through Spark
> Structured Streaming.
> 1. Write path implementation (see the first sketch after this message)
>    - Integration with Spark's Structured Streaming framework
>      (FileStreamSink etc.)
>    - StreamingOutputWriter (StreamingOutputWriterFactory)
>    - Prepare write (schema validation, segment creation,
>      streaming file creation, etc.)
>    - StreamingRecordWriter (conversion from Catalyst InternalRow to a
>      CarbonData-compatible format, reusing the new load path)
> 2. Read path implementation (some overlap with Phase 2)
>    - Modify getSplits() to read from the streaming segment
>    - Read committed info from the metadata to get the correct offsets
>    - Use the min-max index where available
>    - Use a sequential scan: streaming data is unsorted, so the B-tree
>      index cannot be used
> 3. Compaction
>    - Minor compaction
>    - Major compaction
> 4. Metadata management
>    - Streaming metadata store (e.g., offsets, timestamps)
> 5. Failure recovery
>    - Rollback on failure
>    - Handle asynchronous writes to CarbonData (using hflush; see the
>      second sketch after this message)
> -----------------------------
> Phase 2 - Spark Structured Streaming with an appendable CarbonData format
> 1. Streaming file format
>    - Writers use the V3 file format to append columnar, unsorted
>      data blocklets
>    - Modify readers to read from the appendable streaming file format
> -----------------------------
> Phase 3
> 1. Interoperability support
>    - Functionality with other features/components
>    - Concurrent queries with streaming ingestion
>    - Concurrent operations with streaming ingestion (e.g., compaction,
>      ALTER TABLE, secondary index)
> 2. Kafka Connect ingestion / CarbonData connector
>    - Direct ingestion from Kafka Connect without Spark Structured
>      Streaming
>    - Separate Kafka connector to receive data through a network port
>    - Data commit and offset management
> -----------------------------
> Phase 4 - Support for other streaming engines
>    - Analysis of the streaming APIs/interfaces of other streaming engines
>    - Implementation of connectors for different streaming engines:
>      Storm, Flink, Flume, etc.
> -----------------------------
> Phase 5 - In-memory streaming table (probable feature)
> 1. In-memory cache for streaming data
>    - Fault-tolerant in-memory buffering / checkpointing with a WAL
>    - Readers read from in-memory tables when available
>    - Background threads for writing streaming data, etc.
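To make the Phase 1 write path concrete, here is a minimal Scala sketch of what application-side usage might look like, assuming a hypothetical "carbondata" Structured Streaming sink wired up through FileStreamSink and a StreamingOutputWriterFactory as the issue proposes. The format name, the dbName/tableName options, and the target path are illustrative assumptions, not a shipped API; only the Spark Structured Streaming calls themselves are standard.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StreamingIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CarbonStreamingIngest")
      .getOrCreate()
    import spark.implicits._

    // Source: a socket stream of CSV lines "id,name" (a stand-in for any
    // Structured Streaming source).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9099)
      .load()

    val rows = lines.as[String]
      .map(_.split(","))
      .map(a => (a(0).toInt, a(1)))
      .toDF("id", "name")

    // Sink: the hypothetical "carbondata" streaming format from Phase 1.
    // Each micro-batch would go through schema validation, streaming
    // segment/file creation, and InternalRow conversion inside the
    // StreamingRecordWriter described in the issue.
    val query = rows.writeStream
      .format("carbondata")                       // assumed sink name
      .option("checkpointLocation", "/tmp/carbon_ckpt")
      .option("dbName", "default")                // hypothetical option
      .option("tableName", "stream_table")        // hypothetical option
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start("/tmp/carbon/default/stream_table")  // assumed segment root

    query.awaitTermination()
  }
}
```

The checkpointLocation option is standard Spark Structured Streaming; the streaming metadata store in item 4 (offsets, timestamps) is presumably what such a sink would reconcile against the checkpoint to decide which offsets are committed and readable.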
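The failure-recovery item in Phase 1 mentions handling asynchronous writes using hflush. A minimal sketch of the underlying HDFS pattern, assuming the streaming segment is an append-style file (the path below is made up for illustration): hflush() makes buffered bytes visible to concurrent readers without closing the file, which is what would let queries see committed streaming rows mid-ingest and let recovery roll back to the last flushed offset.

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HflushSketch {
  def main(args: Array[String]): Unit = {
    val fs   = FileSystem.get(new Configuration())
    val file = new Path("/tmp/carbon/stream_table/segment_streaming/part-0") // hypothetical

    val out = fs.create(file, true)
    try {
      // Write one mini-batch of rows, then hflush(): the bytes become
      // visible to readers that reopen the file, even though it is still
      // open for writing.
      out.write("1,alice\n2,bob\n".getBytes(StandardCharsets.UTF_8))
      out.hflush()

      // On failure, recovery can truncate back to the last hflush()ed
      // offset recorded in the streaming metadata store (the "rollback
      // on failure" item above).
    } finally {
      out.close()
    }
  }
}
```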