[ https://issues.apache.org/jira/browse/CARBONDATA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang Chen updated CARBONDATA-1072:
-----------------------------------
    Fix Version/s:     (was: 1.2.0)
                       NONE

> Streaming Ingestion Feature
> ----------------------------
>
>                 Key: CARBONDATA-1072
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1072
>             Project: CarbonData
>          Issue Type: New Feature
>          Components: core, data-load, data-query, examples, file-format, spark-integration, sql
>    Affects Versions: NONE
>            Reporter: Aniket Adnaik
>             Fix For: NONE
>
> High-level breakdown of work items / implementation phases
> (design document to be attached soon):
>
> Phase 1 - Spark Structured Streaming with the regular CarbonData format
> -----------------------------
> This phase focuses on supporting streaming ingestion through Spark
> Structured Streaming.
> 1. Write path implementation (see the first sketch after this message)
>    - Integration with Spark's Structured Streaming framework
>      (FileStreamSink etc.)
>    - StreamingOutputWriter (StreamingOutputWriterFactory)
>    - Prepare write (schema validation, segment creation,
>      streaming file creation, etc.)
>    - StreamingRecordWriter (conversion from Catalyst InternalRow to a
>      CarbonData-compatible format, reusing the new load path)
> 2. Read path implementation (some overlap with Phase 2)
>    - Modify getSplits() to read from the streaming segment
>    - Read committed info from the metadata to get the correct offsets
>    - Use the min-max index where available
>    - Use a sequential scan: streaming data is unsorted, so the B-tree
>      index cannot be used
> 3. Compaction
>    - Minor compaction
>    - Major compaction
> 4. Metadata management
>    - Streaming metadata store (e.g., offsets, timestamps)
> 5. Failure recovery
>    - Rollback on failure
>    - Handle asynchronous writes to CarbonData (using hflush; see the
>      second sketch after this message)
> -----------------------------
> Phase 2 - Spark Structured Streaming with an appendable CarbonData format
> 1. Streaming file format
>    - Writers use the V3 file format to append columnar, unsorted
>      data blocklets
>    - Modify readers to read from the appendable streaming file format
> -----------------------------
> Phase 3
> 1. Interoperability support
>    - Functionality with other features/components
>    - Concurrent queries with streaming ingestion
>    - Concurrent operations with streaming ingestion (e.g., compaction,
>      ALTER TABLE, secondary index)
> 2. Kafka Connect ingestion / CarbonData connector
>    - Direct ingestion from Kafka Connect without Spark Structured
>      Streaming
>    - Separate Kafka connector to receive data through a network port
>    - Data commit and offset management
> -----------------------------
> Phase 4 - Support for other streaming engines
>    - Analysis of the streaming APIs/interfaces of other streaming engines
>    - Implementation of connectors for different streaming engines:
>      Storm, Flink, Flume, etc.
> -----------------------------
> Phase 5 - In-memory streaming table (probable feature)
> 1. In-memory cache for streaming data
>    - Fault-tolerant in-memory buffering / checkpointing with a WAL
>    - Readers read from in-memory tables when available
>    - Background threads for writing streaming data, etc.
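To make the Phase 1 write path concrete, here is a minimal Scala sketch of what application-side usage might look like, assuming a hypothetical "carbondata" Structured Streaming sink wired up through FileStreamSink and a StreamingOutputWriterFactory as the issue proposes. The format name, the dbName/tableName options, and the target path are illustrative assumptions, not a shipped API; only the Spark Structured Streaming calls themselves are standard.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StreamingIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CarbonStreamingIngest")
      .getOrCreate()
    import spark.implicits._

    // Source: a socket stream of CSV lines "id,name" (a stand-in for any
    // Structured Streaming source).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9099)
      .load()

    val rows = lines.as[String]
      .map(_.split(","))
      .map(a => (a(0).toInt, a(1)))
      .toDF("id", "name")

    // Sink: the hypothetical "carbondata" streaming format from Phase 1.
    // Each micro-batch would go through schema validation, streaming
    // segment/file creation, and InternalRow conversion inside the
    // StreamingRecordWriter described in the issue.
    val query = rows.writeStream
      .format("carbondata")                       // assumed sink name
      .option("checkpointLocation", "/tmp/carbon_ckpt")
      .option("dbName", "default")                // hypothetical option
      .option("tableName", "stream_table")        // hypothetical option
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start("/tmp/carbon/default/stream_table")  // assumed segment root

    query.awaitTermination()
  }
}
```

The checkpointLocation option is standard Spark Structured Streaming; the streaming metadata store in item 4 (offsets, timestamps) is presumably what such a sink would reconcile against the checkpoint to decide which offsets are committed and readable.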
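The failure-recovery item in Phase 1 mentions handling asynchronous writes using hflush. A minimal sketch of the underlying HDFS pattern, assuming the streaming segment is an append-style file (the path below is made up for illustration): hflush() makes buffered bytes visible to concurrent readers without closing the file, which is what would let queries see committed streaming rows mid-ingest and let recovery roll back to the last flushed offset.

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HflushSketch {
  def main(args: Array[String]): Unit = {
    val fs   = FileSystem.get(new Configuration())
    val file = new Path("/tmp/carbon/stream_table/segment_streaming/part-0") // hypothetical

    val out = fs.create(file, true)
    try {
      // Write one mini-batch of rows, then hflush(): the bytes become
      // visible to readers that reopen the file, even though it is still
      // open for writing.
      out.write("1,alice\n2,bob\n".getBytes(StandardCharsets.UTF_8))
      out.hflush()

      // On failure, recovery can truncate back to the last hflush()ed
      // offset recorded in the streaming metadata store (the "rollback
      // on failure" item above).
    } finally {
      out.close()
    }
  }
}
```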