[jira] [Commented] (CARBONDATA-3130) CarbonData Support Flink

Akash R Nilugal (Jira)

[ https://issues.apache.org/jira/browse/CARBONDATA-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712447#comment-16712447 ]

Nicholas Jiang commented on CARBONDATA-3130:
--------------------------------------------

# Each Flink process uses the SDK to create its own online segment directory, named with a UUID.
# The process then continuously writes new data files to this directory. The directory also contains an index file that records which data files are valid.
# When the online segment directory reaches a certain size, a handoff is triggered, that is, the table status metadata is modified. The SDK then creates a new online segment directory, and the process repeats from step 2.
# To query an online segment, first read the index file to get the list of valid data files, and then read each of those files.
* The index file prevents a query from reading a half-flushed data file. It records the names of the valid data files in the current online segment directory, and can optionally carry min/max statistics. A reader first reads the index file to get the list of data file paths, and then reads those files.
* Each process has its own online segment directory, so all processes can write concurrently. This mechanism suits scenarios without a central coordinator, such as Flink, Kafka Streams, Cassandra, etc.
* "Reading while writing" means that one side ingests data with Flink while the other side queries with Spark or Presto. In this case, the query side must not be allowed to read a data file that has not been completely written.
* The essential difference between an online segment and a stream segment is that the former is process level (no scheduling, multiple active segments), while the latter is application level (with scheduling, only one active segment).
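
The steps above can be sketched roughly as follows. This is only an illustration of the index-file idea, not the CarbonData SDK: the class name OnlineSegmentWriter, the function read_online_segment, the file names (index.json, part-NNNNN.data), the "online_" directory prefix and the in-memory handed_off list (standing in for the table status metadata) are all assumptions made for the example.

```python
import json
import os
import tempfile
import uuid


class OnlineSegmentWriter:
    """Hypothetical per-process writer; NOT the CarbonData SDK API."""

    def __init__(self, table_path, handoff_bytes):
        self.table_path = table_path
        self.handoff_bytes = handoff_bytes
        self.handed_off = []  # assumption: stands in for table status metadata
        self._open_segment()

    def _open_segment(self):
        # Step 1: each process owns a directory named with a UUID.
        self.segment_dir = os.path.join(
            self.table_path, "online_" + uuid.uuid4().hex)
        os.makedirs(self.segment_dir)
        self.valid_files = []
        self.size = 0
        self._publish_index()

    def _publish_index(self):
        # Write the index to a temporary file and rename it into place, so a
        # reader sees either the old index or the new one, never a partial one.
        fd, tmp = tempfile.mkstemp(dir=self.segment_dir, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(self.valid_files, f)
        os.replace(tmp, os.path.join(self.segment_dir, "index.json"))

    def write(self, payload):
        # Step 2: finish writing the data file first, then list it in the index.
        name = "part-%05d.data" % len(self.valid_files)
        with open(os.path.join(self.segment_dir, name), "wb") as f:
            f.write(payload)
        self.valid_files.append(name)
        self.size += len(payload)
        self._publish_index()
        # Step 3: hand off once the segment reaches the size threshold,
        # then start a fresh online segment directory.
        if self.size >= self.handoff_bytes:
            self.handed_off.append(self.segment_dir)
            self._open_segment()


def read_online_segment(segment_dir):
    # Step 4: read the index first, then only the files it lists.
    with open(os.path.join(segment_dir, "index.json")) as f:
        names = json.load(f)
    contents = []
    for name in names:
        with open(os.path.join(segment_dir, name), "rb") as f:
            contents.append(f.read())
    return contents
```

Because a data file is added to the index only after it has been fully written, a reader that follows the index never picks up a file that is still being flushed. Any file present in the directory but absent from the index is simply ignored.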

> CarbonData Support Flink
> ------------------------
>
> Key: CARBONDATA-3130
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3130
> Project: CarbonData
> Issue Type: New Feature
> Components: flink-integration
> Reporter: Nicholas Jiang
> Assignee: Nicholas Jiang
> Priority: Minor
>
> For streaming warehousing scenarios, CarbonData should support Flink.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)