http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-CarbonData-loading-solution-discussion-tp4490p4661.html
performance benefit if dictionary files are already generated.
> +1 to have separate output formats; now the user has the flexibility to
> choose as per the scenario.
>
> On Fri, Dec 16, 2016, 2:47 AM Jihong Ma <[hidden email]> wrote:
>
> >
> > It is a great idea to have separate OutputFormats for regular Carbon data
> > files, index files, as well as metadata files (for instance: the
> > dictionary file, schema file, global index file, etc.) for writing
> > Carbon-generated files laid out on HDFS, and it is orthogonal to the
> > actual data load process.
> >
> > Regards.
> >
> > Jihong
> >
> > -----Original Message-----
> > From: Jacky Li [mailto:[hidden email]]
> > Sent: Thursday, December 15, 2016 12:55 AM
> > To: [hidden email]
> > Subject: [DISCUSSION] CarbonData loading solution discussion
> >
> >
> > Hi community,
> >
> > Since CarbonData has a global dictionary feature, loading data into
> > CarbonData currently requires two scans of the input data. The first scan
> > generates the dictionary; the second scan does the actual data encoding
> > and writes the carbon files. Obviously, this approach is simple, but it
> > has at least two problems:
> > 1. it involves unnecessary IO reads.
> > 2. it needs two jobs for a MapReduce application to write carbon files.
> >
> > To solve this, we need a single-pass data loading solution, as discussed
> > earlier, and the community is now developing it (CARBONDATA-401, PR310).
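> >
> > To make the difference concrete, below is a minimal sketch in plain Java
> > using only local data structures. None of these helpers are actual
> > CarbonData APIs, and a real single-pass solution must build a *global*
> > dictionary across all loading nodes rather than a per-process map; this
> > only illustrates the one-scan-versus-two-scan idea.
> >
> > // Two-pass vs. single-pass dictionary encoding, as a local illustration.
> > import java.util.HashMap;
> > import java.util.Map;
> >
> > public class LoadFlowSketch {
> >
> >   // Current approach: scan #1 builds the dictionary, scan #2 encodes.
> >   static void twoPassLoad(Iterable<String> column) {
> >     Map<String, Integer> dict = new HashMap<>();
> >     for (String value : column) {          // scan #1: dictionary generation
> >       dict.putIfAbsent(value, dict.size());
> >     }
> >     for (String value : column) {          // scan #2: encode and write
> >       write(dict.get(value));
> >     }
> >   }
> >
> >   // Proposed approach: assign surrogate keys during the only scan.
> >   static void singlePassLoad(Iterable<String> column) {
> >     Map<String, Integer> dict = new HashMap<>();
> >     for (String value : column) {
> >       Integer key = dict.get(value);
> >       if (key == null) {                   // new value: grow the dictionary
> >         key = dict.size();
> >         dict.put(value, key);
> >       }
> >       write(key);                          // encode in the same scan
> >     }
> >   }
> >
> >   static void write(int key) { /* placeholder for the carbon writer */ }
> > }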
> >
> > In this post, I want to discuss the OutputFormat part. I think there
> > should be two OutputFormats for CarbonData, sketched below:
> > 1. DictionaryOutputFormat, which is used for global dictionary
> > generation. (This should be extracted from CarbonColumnDictGeneratRDD.)
> > 2. TableOutputFormat, which is used for writing CarbonData files.
> >
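> > As a rough sketch against the standard Hadoop OutputFormat contract (the
> > class name follows the proposal above; the key/value types and method
> > bodies are placeholders, not an actual implementation):
> >
> > import java.io.IOException;
> > import org.apache.hadoop.io.NullWritable;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapreduce.JobContext;
> > import org.apache.hadoop.mapreduce.OutputCommitter;
> > import org.apache.hadoop.mapreduce.OutputFormat;
> > import org.apache.hadoop.mapreduce.RecordWriter;
> > import org.apache.hadoop.mapreduce.TaskAttemptContext;
> >
> > // Skeleton of the proposed DictionaryOutputFormat.
> > public class DictionaryOutputFormat extends OutputFormat<NullWritable, Text> {
> >   @Override
> >   public RecordWriter<NullWritable, Text> getRecordWriter(TaskAttemptContext ctx)
> >       throws IOException, InterruptedException {
> >     // would return a writer that appends distinct values to dictionary files
> >     throw new UnsupportedOperationException("sketch only");
> >   }
> >
> >   @Override
> >   public void checkOutputSpecs(JobContext ctx)
> >       throws IOException, InterruptedException {
> >     // would validate the table path and schema before the job starts
> >   }
> >
> >   @Override
> >   public OutputCommitter getOutputCommitter(TaskAttemptContext ctx)
> >       throws IOException, InterruptedException {
> >     // would commit finished dictionary files on job success
> >     throw new UnsupportedOperationException("sketch only");
> >   }
> > }
> > // TableOutputFormat would follow the same contract, keyed by table rows,
> > // with a RecordWriter that encodes rows and writes carbon data files.
> >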
> > Once carbon has these output formats, it is much easier to integrate with
> > compute frameworks like Spark, Hive, and MapReduce.
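> >
> > For example, once TableOutputFormat exists, a plain MapReduce load job
> > reduces to standard Hadoop job wiring; in this hypothetical sketch,
> > CsvToCarbonRowMapper is an assumed mapper that parses CSV lines into
> > table rows:
> >
> > // Hypothetical MapReduce job using the proposed TableOutputFormat.
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.mapreduce.Job;
> > import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> >
> > public class CarbonLoadJob {
> >   public static void main(String[] args) throws Exception {
> >     Job job = Job.getInstance(new Configuration(), "carbon-load");
> >     job.setJarByClass(CarbonLoadJob.class);
> >     job.setInputFormatClass(TextInputFormat.class);
> >     FileInputFormat.addInputPath(job, new Path(args[0]));
> >     job.setMapperClass(CsvToCarbonRowMapper.class);    // assumed mapper
> >     job.setOutputFormatClass(TableOutputFormat.class); // proposed format
> >     System.exit(job.waitForCompletion(true) ? 0 : 1);
> >   }
> > }
> >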
> > And in order to make data loading faster, the user can choose a different
> > solution based on the scenario, as follows (a decision-rule sketch follows
> > the two scenarios):
> >
> > Scenario 1: First load is small (cannot cover most of the dictionary)
> > - for the first few loads, run two jobs that use DictionaryOutputFormat
> >   and TableOutputFormat accordingly
> > - after some loads it becomes like Scenario 2: run one job that uses
> >   TableOutputFormat with single-pass
> >
> > Scenario 2: First load is big (can cover most of the dictionary)
> > - for the first load:
> >   - if the biggest column cardinality is > 10K, run two jobs using the
> >     two output formats
> >   - otherwise, run one job that uses TableOutputFormat with single-pass
> > - for subsequent loads, run one job that uses TableOutputFormat with
> >   single-pass
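> >
> > The two scenarios boil down to a small decision rule. A sketch, where the
> > 10K threshold comes from above and every method name is illustrative:
> >
> > // Hypothetical strategy selection following the two scenarios above.
> > public class LoadStrategy {
> >   static final long CARDINALITY_THRESHOLD = 10_000;  // the "10K" above
> >
> >   static void load(boolean dictionaryCoversMostValues, long biggestCardinality) {
> >     if (!dictionaryCoversMostValues && biggestCardinality > CARDINALITY_THRESHOLD) {
> >       // dictionary not yet covered and high cardinality: two jobs
> >       runJob(DictionaryOutputFormat.class);
> >       runJob(TableOutputFormat.class);
> >     } else {
> >       // dictionary mostly covered, or cardinality is small: single pass
> >       runSinglePassJob(TableOutputFormat.class);
> >     }
> >   }
> >
> >   static void runJob(Class<?> fmt)           { /* submit a two-phase job */ }
> >   static void runSinglePassJob(Class<?> fmt) { /* submit single-pass job */ }
> > }
> >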
> > What do you think of this idea?
> >
> > Regards,
> > Jacky
> >
>