[jira] [Updated] (CARBONDATA-1072) Streaming Ingestion Feature


Akash R Nilugal (Jira)

     [ https://issues.apache.org/jira/browse/CARBONDATA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Adnaik updated CARBONDATA-1072:
--------------------------------------
    Request participants:   (was: )
             Description:
High-level breakdown of work items / implementation phases:
A design document will be attached soon.
 
Phase 1: Spark Structured Streaming with the regular CarbonData format
----------------------------
    This phase mainly focuses on supporting streaming ingestion through
    Spark Structured Streaming.
    1. Write Path Implementation (illustrative sketch below)
       - Integration with Spark's Structured Streaming framework
         (FileStreamSink etc.)
       - StreamingOutputWriter (StreamingOutputWriterFactory)
       - Prepare write (schema validation, segment creation,
         streaming file creation, etc.)
       - StreamingRecordWriter (conversion from Catalyst InternalRow
         to a CarbonData-compatible format, using the new load path)
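
    As an illustration of the intended write path, here is a minimal sketch
    of what the user-facing side could look like once a Structured Streaming
    sink for CarbonData exists. Only the Spark APIs are real here; the
    "carbondata" sink name and the dbName/tableName options are assumptions
    for this sketch, not a committed interface.

      // Hypothetical usage sketch: stream rows from a socket source into a
      // CarbonData table via Spark Structured Streaming.
      import org.apache.spark.sql.SparkSession

      object StreamingIngestSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("CarbonDataStreamingIngest")
            .getOrCreate()

          // Any streaming source works; a socket source keeps the example small.
          val lines = spark.readStream
            .format("socket")
            .option("host", "localhost")
            .option("port", "9099")
            .load()

          // The sink would route rows through the StreamingOutputWriter and
          // StreamingRecordWriter listed above.
          val query = lines.writeStream
            .format("carbondata")                                    // assumed sink name
            .option("checkpointLocation", "/tmp/carbon_stream_ckpt")
            .option("dbName", "default")                             // assumed option
            .option("tableName", "stream_table")                     // assumed option
            .start()

          query.awaitTermination()
        }
      }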

     2. Read Path Implementation (some overlap with Phase 2)
      - Modify getSplits() to read from the streaming segment
      - Read committed info from the metadata to get the correct offsets
      - Make use of the min-max index if available
      - Use a sequential scan - the data is unsorted, so the B-tree index
        cannot be used (see the pruning sketch below)
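
    To make the min-max point concrete, the sketch below isolates the
    block-pruning idea. The StreamingBlock type and its fields are
    hypothetical, not CarbonData internals: a block is skipped only when its
    min/max range cannot intersect the filter range, and everything else
    falls back to a sequential scan.

      // Hypothetical types for illustration: per-block min/max statistics let
      // the reader skip blocks whose value range cannot satisfy a range filter.
      case class StreamingBlock(path: String, committedOffset: Long, min: Long, max: Long)

      object MinMaxPruning {
        // Keep only blocks whose [min, max] range can contain values in [lo, hi].
        def prune(blocks: Seq[StreamingBlock], lo: Long, hi: Long): Seq[StreamingBlock] =
          blocks.filter(b => b.max >= lo && b.min <= hi)

        def main(args: Array[String]): Unit = {
          val blocks = Seq(
            StreamingBlock("part-0", committedOffset = 1000, min = 1,   max = 50),
            StreamingBlock("part-1", committedOffset = 2000, min = 40,  max = 120),
            StreamingBlock("part-2", committedOffset = 3000, min = 500, max = 900))

          // A filter such as "col BETWEEN 100 AND 200" only needs part-1.
          println(prune(blocks, 100, 200).map(_.path))   // List(part-1)
        }
      }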

    3. Compaction
     - Minor Compaction
     - Major Compaction
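
    CarbonData already exposes compaction through DDL; whether the same
    statements will also pick up streaming segments is part of this work
    item, so the lines below are illustrative rather than final.

      // Existing CarbonData compaction DDL issued through spark.sql (assumes a
      // CarbonData-enabled SparkSession); coverage of streaming segments is TBD.
      import org.apache.spark.sql.SparkSession

      object CompactionSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("CompactionSketch").getOrCreate()

          // Merge a configured number of recent small segments.
          spark.sql("ALTER TABLE default.stream_table COMPACT 'MINOR'")

          // Merge all eligible segments into one.
          spark.sql("ALTER TABLE default.stream_table COMPACT 'MAJOR'")
        }
      }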

   4. Metadata Management
     - Streaming metadata store (offsets, timestamps, etc.; sketch below)
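
    A rough sketch of what one record in such a store could hold; the field
    names are assumptions, and the CSV persistence is only to keep the
    example self-contained.

      // Hypothetical shape of a streaming metadata record: one entry per
      // streaming segment, with the committed offsets and timestamps.
      import java.io.{File, PrintWriter}

      case class StreamingSegmentStatus(
          segmentId: String,
          batchId: Long,          // Structured Streaming batch that produced the data
          committedOffset: Long,  // last source offset durably written
          commitTimeMs: Long)     // wall-clock commit time

      object StreamingMetadataStore {
        // Persist status records as simple CSV lines (illustrative format only).
        def save(statuses: Seq[StreamingSegmentStatus], file: File): Unit = {
          val out = new PrintWriter(file)
          try statuses.foreach { s =>
            out.println(s"${s.segmentId},${s.batchId},${s.committedOffset},${s.commitTimeMs}")
          } finally out.close()
        }
      }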
   
   5. Failure Recovery
      - Rollback on failure
      - Handle asynchronous writes to CarbonData (using hflush)
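
    The hflush idea relies on the standard Hadoop FileSystem API: bytes
    appended to the streaming file are flushed to the datanodes and become
    readable before the commit offset is recorded, and recovery rolls back
    to the last committed offset. A minimal sketch with placeholder paths:

      // hflush-based durability using the Hadoop FileSystem API.
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      object HflushSketch {
        def main(args: Array[String]): Unit = {
          val fs  = FileSystem.get(new Configuration())
          val out = fs.create(new Path("/tmp/streaming_segment/part-0"))
          try {
            out.write("row-1\n".getBytes("UTF-8"))
            // hflush makes the bytes visible to readers without closing the
            // file; the commit offset is recorded in metadata only after this.
            out.hflush()
          } finally {
            out.close()
          }
        }
      }
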
-----------------------------
Phase 2: Spark Structured Streaming with an appendable CarbonData format
     1. Streaming File Format
      - Writers use the V3 file format to append columnar, unsorted
        data blocklets
      - Modify readers to read from the appendable streaming file format
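
    Purely as an assumption about the eventual user-facing switch, a table
    backed by the appendable streaming format might be declared with a table
    property along the following lines; the 'streaming' property name is a
    guess for illustration, not a decided syntax.

      // Assumed DDL for a table using the appendable streaming format.
      import org.apache.spark.sql.SparkSession

      object StreamingTableDDL {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("StreamingTableDDL").getOrCreate()

          spark.sql(
            """CREATE TABLE IF NOT EXISTS default.stream_table (
              |  id INT,
              |  name STRING,
              |  event_time TIMESTAMP)
              |STORED BY 'carbondata'
              |TBLPROPERTIES ('streaming' = 'true')""".stripMargin)
        }
      }
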
-----------------------------
Phase 3:
    1. Interoperability Support
     - Functionality with other features/components
     - Concurrent queries with streaming ingestion
     - Concurrent operations with streaming ingestion (e.g. compaction,
       alter table, secondary index, etc.)
    2. Kafka Connect Ingestion / CarbonData connector (skeleton below)
     - Direct ingestion from Kafka Connect without Spark Structured
       Streaming
     - Separate Kafka connector to receive data through a network port
     - Data commit and offset management
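
    The skeleton below shows where a Kafka Connect based path would plug in.
    Only the Kafka Connect SinkTask API is real; the CarbonData-facing work
    is left as comments and the class name is hypothetical.

      // Hypothetical CarbonData sink task for Kafka Connect.
      import java.util
      import org.apache.kafka.clients.consumer.OffsetAndMetadata
      import org.apache.kafka.common.TopicPartition
      import org.apache.kafka.connect.sink.{SinkRecord, SinkTask}

      class CarbonDataSinkTask extends SinkTask {
        override def version(): String = "0.1-sketch"

        override def start(props: util.Map[String, String]): Unit = {
          // Read target table name, store location, batch size, etc. from props.
        }

        override def put(records: util.Collection[SinkRecord]): Unit = {
          // Convert SinkRecord values to CarbonData rows and buffer them; a real
          // implementation would hand them to the Phase 1 streaming write path.
          val it = records.iterator()
          while (it.hasNext) {
            val r = it.next()
            println(s"buffering ${r.topic()}-${r.kafkaPartition()}@${r.kafkaOffset()}")
          }
        }

        override def flush(offsets: util.Map[TopicPartition, OffsetAndMetadata]): Unit = {
          // Durably commit buffered data (hflush + metadata update) before Kafka
          // Connect records these offsets, so a restart never loses acked data.
        }

        override def stop(): Unit = {
          // Release writers and file handles.
        }
      }
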
-----------------------------
Phase 4: Support for other streaming engines
    - Analysis of the streaming APIs/interfaces of other streaming engines
    - Implementation of connectors for other streaming engines (Storm,
      Flink, Flume, etc.)
----------------------------
Phase 5: In-memory streaming table (probable feature)
   1. In-memory cache for streaming data (rough sketch below)
     - Fault-tolerant in-memory buffering / checkpointing with a WAL
     - Readers read from in-memory tables if available
     - Background threads for writing streaming data, etc.
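
    A rough illustration of the Phase 5 idea (all names hypothetical): rows
    are appended to a write-ahead log before being added to an in-memory
    buffer, readers serve from memory, and recovery replays the log.

      // WAL-backed in-memory buffer sketch; illustration only.
      import java.io.{BufferedWriter, FileWriter}
      import scala.collection.mutable.ArrayBuffer
      import scala.io.Source

      class WalBackedBuffer(walPath: String) {
        private val buffer = ArrayBuffer.empty[String]
        private val wal = new BufferedWriter(new FileWriter(walPath, /* append = */ true))

        // Durably log the row first, then make it visible to in-memory readers.
        def append(row: String): Unit = synchronized {
          wal.write(row); wal.newLine(); wal.flush()
          buffer += row
        }

        // Readers consume from memory when the table is cached.
        def snapshot(): Seq[String] = synchronized { buffer.toList }

        // Recovery after failure: rebuild the in-memory state from the log.
        def recover(): Unit = synchronized {
          buffer.clear()
          val src = Source.fromFile(walPath)
          try buffer ++= src.getLines() finally src.close()
        }

        def close(): Unit = wal.close()
      }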


> Streaming Ingestion Feature
> ----------------------------
>
>                 Key: CARBONDATA-1072
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1072
>             Project: CarbonData
>          Issue Type: New Feature
>          Components: core, data-load, data-query, examples, file-format, spark-integration, sql
>    Affects Versions: 1.1.0
>            Reporter: Aniket Adnaik
>             Fix For: 1.2.0
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)