Aniket Adnaik created CARBONDATA-1072:
-----------------------------------------

             Summary: Streaming Ingestion Feature
                 Key: CARBONDATA-1072
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1072
             Project: CarbonData
          Issue Type: New Feature
          Components: core, data-load, data-query, examples, file-format, spark-integration, sql
    Affects Versions: 1.1.0
            Reporter: Aniket Adnaik
             Fix For: 1.2.0

High-level breakdown of work items / implementation phases (a design document will be attached soon):

Phase 1 - Spark Structured Streaming with regular CarbonData format
----------------------------
This phase will mainly focus on supporting streaming ingestion using Spark Structured Streaming.

1. Write Path Implementation
   - Integration with Spark's Structured Streaming framework (FileStreamSink etc.); see Sketch 1 below
   - StreamingOutputWriter (StreamingOutputWriterFactory)
   - Prepare write (schema validation, segment creation, streaming file creation, etc.)
   - StreamingRecordWriter (data conversion from Catalyst InternalRow to a CarbonData-compatible format, reusing the new load path)

2. Read Path Implementation (some overlap with Phase 2)
   - Modify getSplits() to read from the streaming segment; see Sketch 2 below
   - Read committed info from metadata to get the correct offsets
   - Make use of the min-max index if available
   - Use sequential scan; the data is unsorted, so the B-tree index cannot be used

3. Compaction
   - Minor compaction
   - Major compaction

4. Metadata Management
   - Streaming metadata store (e.g. offsets, timestamps, etc.)

5. Failure Recovery
   - Rollback on failure
   - Handle asynchronous writes to CarbonData (using hflush); see Sketch 3 below

----------------------------
Phase 2 - Spark Structured Streaming with appendable CarbonData format

1. Streaming File Format
   - Writers use the V3 file format to append columnar, unsorted data blocklets
   - Modify readers to read from the appendable streaming file format

-----------------------------
Phase 3:

1. Interoperability Support
   - Functionality with other features/components
   - Concurrent queries with streaming ingestion
   - Concurrent operations with streaming ingestion (e.g. compaction, alter table, secondary index, etc.)

2. Kafka Connect Ingestion / CarbonData Connector
   - Direct ingestion from Kafka Connect without Spark Structured Streaming; see Sketch 4 below
   - Separate Kafka connector to receive data through a network port
   - Data commit and offset management

-----------------------------
Phase 4: Support for other streaming engines
   - Analysis of the streaming APIs/interfaces of other streaming engines
   - Implementation of connectors for different streaming engines: Storm, Flink, Flume, etc.

-----------------------------
Phase 5: In-memory streaming table (probable feature)

1. In-memory Cache for Streaming Data
   - Fault-tolerant in-memory buffering / checkpointing with WAL
   - Readers read from in-memory tables if available
   - Background threads for writing streaming data, etc.
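-----------------------------
Illustrative sketches (non-normative):

Sketch 1 - How Phase 1 ingestion might look from the user side once the write path is wired into Spark Structured Streaming. This is a minimal sketch: the "carbondata" sink short name and the "dbName"/"tableName" options are assumptions for illustration, not a committed API; only the Spark Structured Streaming calls themselves are real.

{code:scala}
import org.apache.spark.sql.SparkSession

object StreamingIngestExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CarbonStreamingIngest")
      .getOrCreate()

    // Source: any structured streaming source works; a socket source keeps
    // the sketch small.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Sink: hypothetical CarbonData streaming sink backed by the
    // FileStreamSink-style integration; checkpointLocation drives Spark's
    // offset tracking and recovery.
    val query = lines.writeStream
      .format("carbondata")                          // assumed sink short name
      .option("checkpointLocation", "/tmp/carbon_ckpt")
      .option("dbName", "default")                   // hypothetical option
      .option("tableName", "stream_table")           // hypothetical option
      .start()

    query.awaitTermination()
  }
}
{code}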
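Sketch 2 - The shape of a getSplits() that exposes only committed data from a streaming segment. StreamingSegmentMeta, its committedLength field, and the splitter class are hypothetical names (the real change would land in CarbonData's input format); the idea is simply that readers never see bytes past the last committed offset, and scan sequentially since the data is unsorted.

{code:scala}
import java.util.{Collections, List => JList}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{InputSplit, JobContext}
import org.apache.hadoop.mapreduce.lib.input.FileSplit

// Hypothetical metadata read from the streaming segment's status file.
case class StreamingSegmentMeta(filePath: String, committedLength: Long)

class StreamingSegmentSplitter(meta: StreamingSegmentMeta) {
  def getSplits(context: JobContext): JList[InputSplit] = {
    // One sequential split from offset 0 up to the committed length. There is
    // no B-tree index to prune with; min-max stats (not shown) could still be
    // used for coarse filtering.
    val split: InputSplit =
      new FileSplit(new Path(meta.filePath), 0L, meta.committedLength,
        Array.empty[String])
    Collections.singletonList(split)
  }
}
{code}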
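Sketch 3 - The append-then-commit discipline behind "handle asynchronous writes using hflush". Batch bytes are hflushed so they survive writer failure, and only then is the new committed length recorded in a (hypothetical) segment status file; recovery rolls readers back to the last recorded length. commitLength() is an assumed helper, not an existing CarbonData API; the HDFS calls are real.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object StreamingAppendSketch {
  def appendBatch(dataFile: Path, statusFile: Path, batch: Array[Byte]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val out = fs.append(dataFile)
    try {
      out.write(batch)
      // hflush guarantees the bytes are durable to new readers before we
      // advance the committed offset.
      out.hflush()
      // Position of an append stream starts at the prior file length, so
      // getPos is the total committed length after this batch.
      commitLength(fs, statusFile, out.getPos)
    } finally {
      out.close()
    }
  }

  // Hypothetical: persist the committed length atomically (write a temp file,
  // then rename) so a crashed writer can be rolled back to this offset.
  private def commitLength(fs: FileSystem, statusFile: Path, len: Long): Unit = {
    val tmp = new Path(statusFile.toString + ".tmp")
    val o = fs.create(tmp, true)
    try o.writeLong(len) finally o.close()
    fs.delete(statusFile, false)
    fs.rename(tmp, statusFile)
  }
}
{code}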
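Sketch 4 - Skeleton of the Phase 3 Kafka Connect path, ingesting into CarbonData without Spark. Only the Kafka Connect SinkTask API is real here; CarbonBatchWriter and the "carbondata.table" property are hypothetical stand-ins for a wrapper over the streaming write path above. Committing data and Kafka offsets together in flush() is what gives the consistent restart point mentioned under "data commit and offset management".

{code:scala}
import java.util.{Collection => JCollection, Map => JMap}
import org.apache.kafka.clients.consumer.OffsetAndMetadata
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.connect.sink.{SinkRecord, SinkTask}

class CarbonDataSinkTask extends SinkTask {
  // Hypothetical writer handle; a real implementation would open a streaming
  // segment for the target table.
  private var writer: CarbonBatchWriter = _

  override def start(props: JMap[String, String]): Unit = {
    writer = new CarbonBatchWriter(props.get("carbondata.table"))
  }

  override def put(records: JCollection[SinkRecord]): Unit = {
    // Buffer/convert records into CarbonData's row format.
    val it = records.iterator()
    while (it.hasNext) writer.buffer(it.next().value())
  }

  override def flush(offsets: JMap[TopicPartition, OffsetAndMetadata]): Unit = {
    // hflush buffered data and record the committed Kafka offsets together.
    writer.commit(offsets)
  }

  override def stop(): Unit = if (writer != null) writer.close()

  override def version(): String = "0.1-sketch"
}

// Hypothetical helper, shown only to make the sketch self-contained.
class CarbonBatchWriter(table: String) {
  def buffer(value: AnyRef): Unit = ()
  def commit(offsets: JMap[TopicPartition, OffsetAndMetadata]): Unit = ()
  def close(): Unit = ()
}
{code}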