Apache CarbonData Dev Mailing List archive

RE: [DISCUSSION] CarbonData loading solution discussion

Posted by Jihong Ma on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-CarbonData-loading-solution-discussion-tp4490p4542.html

It is a great idea to have separate OutputFormats for regular Carbon data files, index files, and metadata files (for instance, the dictionary file, schema file, global index file, etc.) for writing Carbon-generated files laid out on HDFS. This is orthogonal to the actual data load process.

Regards.

Jihong

-----Original Message-----
From: Jacky Li [mailto:[hidden email]]
Sent: Thursday, December 15, 2016 12:55 AM
To: [hidden email]
Subject: [DISCUSSION] CarbonData loading solution discussion

Hi community,

Since CarbonData has a global dictionary feature, loading data into CarbonData currently requires two scans of the input data. The first scan generates the dictionary; the second scan does the actual data encoding and writes the carbon files. This approach is simple, but it has at least two problems:
1. It involves unnecessary IO reads.
2. It needs two jobs for a MapReduce application to write carbon files.
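
To make the cost concrete, here is a minimal sketch of the two-scan flow in plain Python. It is not CarbonData code: the names `build_dictionary`, `encode`, `two_pass_load`, and `CountingSource` are hypothetical, a single string column stands in for the input, and starting surrogate keys at 1 is an arbitrary choice for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def build_dictionary(rows: Iterable[str]) -> Dict[str, int]:
    """First scan: assign a surrogate key to each distinct value."""
    dictionary: Dict[str, int] = {}
    for value in rows:
        if value not in dictionary:
            # Keys start at 1 here; this numbering is an assumption.
            dictionary[value] = len(dictionary) + 1
    return dictionary


def encode(rows: Iterable[str], dictionary: Dict[str, int]) -> List[int]:
    """Second scan: replace each value with its dictionary key."""
    return [dictionary[value] for value in rows]


def two_pass_load(
    read_input: Callable[[], Iterator[str]],
) -> Tuple[Dict[str, int], List[int]]:
    """Run both scans; read_input is called once per scan."""
    dictionary = build_dictionary(read_input())
    encoded = encode(read_input(), dictionary)
    return dictionary, encoded


class CountingSource:
    """Hypothetical input source that counts how often it is scanned."""

    def __init__(self, values: Iterable[str]) -> None:
        self.values = list(values)
        self.scans = 0

    def __call__(self) -> Iterator[str]:
        self.scans += 1
        return iter(self.values)


source = CountingSource(["cn", "us", "cn", "in"])
dictionary, encoded = two_pass_load(source)
```

After this runs, `source.scans` is 2: the same input is read once to build the dictionary and once more to encode it, which is the redundant read that a single-pass load would avoid.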

To solve this, we need a single-pass data loading solution, as discussed earlier, and the community is now developing it (CARBONDATA-401, PR310).

In this post, I want to discuss the OutputFormat part. I think there will be two OutputFormats for CarbonData:
1. DictionaryOutputFormat, which is used for global dictionary generation. (This should be extracted from CarbonColumnDictGeneratRDD.)
2. TableOutputFormat, which is used for writing CarbonData files.

Once Carbon has these output formats, it will be easier to integrate with compute frameworks like Spark, Hive, and MapReduce.
To make data loading faster, users can choose a different solution based on their scenario, as follows:

Scenario 1: The first load is small (cannot cover most of the dictionary)

- In the first few loads, run two jobs that use DictionaryOutputFormat and TableOutputFormat respectively.
- After some loads, it becomes like Scenario 2: run one job that uses TableOutputFormat with single-pass.

Scenario 2: The first load is big (can cover most of the dictionary)

- For the first load:
  - If the biggest column cardinality > 10K, run two jobs using the two output formats.
  - Otherwise, run one job that uses TableOutputFormat with single-pass.
- For subsequent loads, run one job that uses TableOutputFormat with single-pass.

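
The two scenarios can be read as a small decision rule. The sketch below is only an illustration of that rule in Python, not part of CarbonData: `choose_load_plan` and its parameters are hypothetical names, and `warmup_loads` is an assumed stand-in for "the first few loads", which the proposal does not quantify. Only the 10K cardinality threshold comes from the text above.

```python
CARDINALITY_THRESHOLD = 10_000  # the "10K" from the proposal

TWO_JOBS = "two jobs: DictionaryOutputFormat, then TableOutputFormat"
SINGLE_PASS = "one job: TableOutputFormat with single-pass"


def choose_load_plan(
    first_load_is_big: bool,
    load_number: int,
    max_column_cardinality: int,
    warmup_loads: int = 3,  # assumption: how many loads count as "first few"
) -> str:
    """Pick a loading plan for the given load (load_number is 1-based)."""
    if not first_load_is_big:
        # Scenario 1: the dictionary is not yet well covered, so the
        # first few loads use two jobs before switching to single-pass.
        if load_number <= warmup_loads:
            return TWO_JOBS
        return SINGLE_PASS

    # Scenario 2: only a high-cardinality first load needs two jobs.
    if load_number == 1 and max_column_cardinality > CARDINALITY_THRESHOLD:
        return TWO_JOBS
    return SINGLE_PASS


# Example: a big first load with one 20K-cardinality column.
first_plan = choose_load_plan(True, 1, 20_000)
later_plan = choose_load_plan(True, 2, 20_000)
```

Here `first_plan` selects the two-job path and `later_plan` selects single-pass, matching Scenario 2. In practice the switch in Scenario 1 would more likely depend on how much of the dictionary is already covered than on a fixed load count.
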
What do you think about this idea?

Regards,
Jacky