[ https://issues.apache.org/jira/browse/CARBONDATA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Adnaik updated CARBONDATA-1072:
--------------------------------------
    Request participants:   (was: )
              Description:

High-level breakdown of work items / implementation phases:
Design document will be attached soon.

Phase 1: Spark Structured Streaming with regular CarbonData format
----------------------------
This phase mainly focuses on supporting streaming ingestion using
Spark Structured Streaming.

1. Write Path Implementation
   - Integration with Spark's Structured Streaming framework
     (FileStreamSink etc.; see the sketch after this list)
   - StreamingOutputWriter (StreamingOutputWriterFactory)
   - Prepare write (schema validation, segment creation,
     streaming file creation, etc.)
   - StreamingRecordWriter (data conversion from Catalyst InternalRow
     to a CarbonData-compatible format, making use of the new load path)

2. Read Path Implementation (some overlap with Phase 2)
   - Modify getSplits() to read from the streaming segment
   - Read committed info from metadata to get correct offsets
   - Make use of the min-max index if available
   - Use sequential scan; data is unsorted, so the B-tree index cannot be used

3. Compaction
   - Minor compaction
   - Major compaction

4. Metadata Management
   - Streaming metadata store (e.g. offsets, timestamps, etc.)

5. Failure Recovery
   - Rollback on failure
   - Handle asynchronous writes to CarbonData (using hflush)
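As a rough illustration of the Phase 1 write path from the user's side, the sketch below uses Spark's standard Structured Streaming writeStream API. The "carbondata" sink name and the table path layout are assumptions of this proposal, not a shipped API; the "rate" source merely stands in for a real feed.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StreamingIngestExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CarbonStreamingIngest")
      .getOrCreate()

    // Source: a rate stream stands in for a real feed such as Kafka.
    val input = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "100")
      .load()

    // Sink: "carbondata" is the format name this proposal would register
    // via the FileStreamSink integration; it is an assumption here, not
    // an existing API. Each micro-batch would land in a streaming segment.
    val query = input.writeStream
      .format("carbondata")                        // hypothetical sink name
      .option("checkpointLocation", "/tmp/carbon/_checkpoint")
      .option("path", "/tmp/carbon/store/default/stream_table")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    query.awaitTermination()
  }
}
{code}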
-----------------------------
Phase 2: Spark Structured Streaming with appendable CarbonData format

1. Streaming File Format
   - Writers use the V3 file format for appending columnar, unsorted
     data blocklets
   - Modify readers to read from the appendable streaming file format
-----------------------------
Phase 3:

1. Interoperability Support
   - Functionality with other features/components
   - Concurrent queries with streaming ingestion
   - Concurrent operations with streaming ingestion (e.g. compaction,
     alter table, secondary index, etc.)

2. Kafka Connect Ingestion / CarbonData connector (see the sketch after this list)
   - Direct ingestion from Kafka Connect without Spark Structured
     Streaming
   - Separate Kafka connector to receive data through a network port
   - Data commit and offset management
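For the Phase 3 Kafka Connect path, a minimal sketch of what a CarbonData SinkTask could look like. Only the Connect lifecycle methods (start/put/flush/stop) are real framework API; CarbonStreamingWriter and the "carbon.table.path" property are hypothetical stand-ins for what this phase would have to provide.

{code}
import java.util.{Collection => JCollection, Map => JMap}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.OffsetAndMetadata
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.connect.sink.{SinkRecord, SinkTask}

// Hypothetical stand-in for the streaming writer this phase would add;
// not an existing CarbonData API.
class CarbonStreamingWriter(tablePath: String) {
  def append(value: AnyRef): Unit = println(s"append to $tablePath: $value")
  def commit(): Unit = println("commit data and offsets together")
  def close(): Unit = ()
}

class CarbonSinkTask extends SinkTask {
  private var writer: CarbonStreamingWriter = _

  override def version(): String = "0.1-SNAPSHOT"

  override def start(props: JMap[String, String]): Unit = {
    // Open a streaming segment for the configured table path.
    writer = new CarbonStreamingWriter(props.get("carbon.table.path"))
  }

  override def put(records: JCollection[SinkRecord]): Unit = {
    // Convert each record to CarbonData's row format and append it.
    records.asScala.foreach(r => writer.append(r.value()))
  }

  override def flush(offsets: JMap[TopicPartition, OffsetAndMetadata]): Unit = {
    // Commit appended data and consumed offsets together, so recovery
    // after a failure can resume from a consistent point.
    writer.commit()
  }

  override def stop(): Unit = writer.close()
}
{code}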
-----------------------------
Phase 4: Support for other streaming engines
   - Analysis of the streaming APIs/interfaces of other streaming engines
   - Implementation of connectors for different streaming engines:
     Storm, Flink, Flume, etc.
-----------------------------
Phase 5: In-memory streaming table (probable feature)

1. In-memory Cache for Streaming Data
   - Fault-tolerant in-memory buffering / checkpoint with WAL
     (see the sketch after this list)
   - Readers read from in-memory tables if available
   - Background threads for writing streaming data, etc.
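To make the Phase 5 idea concrete, here is a purely illustrative sketch of a WAL-backed in-memory buffer. All names are hypothetical, and a real implementation would write the log to HDFS with hflush rather than to a local file.

{code}
import java.io.{BufferedWriter, File, FileWriter}
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Minimal WAL-backed in-memory buffer, illustrating the Phase 5 idea only.
class WalBackedBuffer(walPath: String) {
  private val rows = ArrayBuffer.empty[String]
  private val wal = new BufferedWriter(new FileWriter(walPath, /* append = */ true))

  /** Log first, then buffer, so a crash never loses an acknowledged row. */
  def append(row: String): Unit = {
    wal.write(row); wal.newLine(); wal.flush() // flush stands in for hflush
    rows += row
  }

  /** Readers serve queries from the in-memory buffer when it is available. */
  def snapshot(): Seq[String] = rows.toSeq

  /** On restart, rebuild the in-memory state by replaying the log. */
  def recover(): Unit = {
    rows.clear()
    val f = new File(walPath)
    if (f.exists()) Source.fromFile(f).getLines().foreach(rows += _)
  }
}
{code}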
> Streaming Ingestion Feature
> ----------------------------
>
>                 Key: CARBONDATA-1072
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1072
>             Project: CarbonData
>          Issue Type: New Feature
>          Components: core, data-load, data-query, examples, file-format, spark-integration, sql
>    Affects Versions: 1.1.0
>            Reporter: Aniket Adnaik
>             Fix For: 1.2.0

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)