
Re: [DISCUSSION] CarbonData loading solution discussion

Posted by kumarvishal09 on Dec 19, 2016; 9:54am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-CarbonData-loading-solution-discussion-tp4490p4661.html

+1
Now users will have the flexibility to choose the output format, and will
get a performance benefit if dictionary files are already generated.

-Regards
Kumar Vishal


On Fri, Dec 16, 2016 at 10:19 AM, Ravindra Pesala <[hidden email]>
wrote:

> +1 to have separate output formats; now users have the flexibility to
> choose as per their scenario.
>
> On Fri, Dec 16, 2016, 2:47 AM Jihong Ma <[hidden email]> wrote:
>
> >
> > It is a great idea to have separate OutputFormats for regular Carbon data
> > files, index files, and metadata files (for instance: the dictionary
> > file, schema file, global index file, etc.) for writing Carbon-generated
> > files laid out on HDFS, and it is orthogonal to the actual data load
> > process.
> >
> > Regards.
> >
> > Jihong
> >
> > -----Original Message-----
> > From: Jacky Li [mailto:[hidden email]]
> > Sent: Thursday, December 15, 2016 12:55 AM
> > To: [hidden email]
> > Subject: [DISCUSSION] CarbonData loading solution discussion
> >
> >
> > Hi community,
> >
> > Since CarbonData has the global dictionary feature, loading data into
> > CarbonData currently requires scanning the input data twice. The first
> > scan generates the dictionary; the second scan does the actual data
> > encoding and writes the carbon files. Obviously, this approach is
> > simple, but it has at least two problems:
> > 1. it involves unnecessary IO reads.
> > 2. a MapReduce application needs two jobs to write carbon files.
> >
> > To solve this, we need a single-pass data loading solution, as discussed
> > earlier, and the community is now developing it (CARBONDATA-401, PR310).
> >
> > In this post, I want to discuss the OutputFormat part. I think there
> > should be two OutputFormats for CarbonData:
> > 1. DictionaryOutputFormat, which is used for global dictionary
> > generation. (This should be extracted from CarbonColumnDictGeneratRDD.)
> > 2. TableOutputFormat, which is used for writing CarbonData files.
> >
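The two OutputFormats described above could be sketched along these lines. This is a minimal, self-contained illustration only: the `SimpleRecordWriter` interface below is a stand-in mimicking Hadoop's `RecordWriter` contract so the example runs without Hadoop on the classpath, and every class and method name here is an assumption, not the actual CarbonData API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for Hadoop's RecordWriter contract, so the sketch is self-contained.
interface SimpleRecordWriter<K, V> {
    void write(K key, V value);
    void close();
}

// Sketch of what a DictionaryOutputFormat's writer would do: assign a
// surrogate integer key to each distinct value of each dictionary column.
class DictionaryWriter implements SimpleRecordWriter<String, String> {
    // columnName -> (value -> surrogate key)
    final Map<String, Map<String, Integer>> dictionaries = new HashMap<>();

    @Override
    public void write(String column, String value) {
        Map<String, Integer> dict =
                dictionaries.computeIfAbsent(column, c -> new LinkedHashMap<>());
        // Duplicate values keep their existing surrogate key.
        dict.computeIfAbsent(value, v -> dict.size() + 1);
    }

    @Override
    public void close() {
        // A real implementation would persist one dictionary file per column
        // to HDFS; a TableOutputFormat writer would then look keys up here
        // while encoding rows into carbon files.
    }
}

public class DictionaryOutputFormatSketch {
    public static void main(String[] args) {
        DictionaryWriter writer = new DictionaryWriter();
        writer.write("country", "CN");
        writer.write("country", "IN");
        writer.write("country", "CN"); // duplicate: no new surrogate key
        writer.close();
        System.out.println(writer.dictionaries.get("country")); // {CN=1, IN=2}
    }
}
```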
> > Once carbon has these output formats, it is easier to integrate with
> > compute frameworks like Spark, Hive, and MapReduce.
> > And in order to make data loading faster, users can choose different
> > solutions based on their scenario, as follows:
> > Scenario 1: First load is small (cannot cover most of the dictionary)
> >
> > - In the first few loads, run two jobs that use DictionaryOutputFormat
> > and TableOutputFormat respectively.
> > - After some loads, it becomes like Scenario 2: run one job that uses
> > TableOutputFormat with single-pass.
> >
> > Scenario 2: First load is big (can cover most of the dictionary)
> >
> > - For the first load:
> >   - if the biggest column cardinality > 10K, run two jobs using the two
> > output formats;
> >   - otherwise, run one job that uses TableOutputFormat with single-pass.
> > - For subsequent loads, run one job that uses TableOutputFormat with
> > single-pass.
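The load-planning rule above can be written as a small decision function. A hedged sketch: the `LoadPlanner` class, the `Plan` enum, the method name, and the parameter names are all illustrative assumptions; only the 10K cardinality threshold comes from the proposal itself.

```java
public class LoadPlanner {
    enum Plan { TWO_JOBS, SINGLE_PASS }

    // Threshold from the proposal: "biggest column cardinality > 10K".
    static final int CARDINALITY_THRESHOLD = 10_000;

    /**
     * Decide how to run a load, following the two scenarios above.
     *
     * dictionaryMostlyBuilt  - true once earlier loads have generated most
     *                          of the global dictionary (subsequent loads)
     * firstLoadCoversDict    - whether this early load is big enough to
     *                          cover most of the dictionary (Scenario 2)
     */
    static Plan choosePlan(boolean dictionaryMostlyBuilt,
                           boolean firstLoadCoversDict,
                           int biggestColumnCardinality) {
        if (dictionaryMostlyBuilt) {
            // Subsequent loads: one job, TableOutputFormat with single-pass.
            return Plan.SINGLE_PASS;
        }
        if (!firstLoadCoversDict) {
            // Scenario 1: small early loads run two jobs
            // (DictionaryOutputFormat + TableOutputFormat).
            return Plan.TWO_JOBS;
        }
        // Scenario 2, first load: two jobs only when cardinality is high.
        return biggestColumnCardinality > CARDINALITY_THRESHOLD
                ? Plan.TWO_JOBS : Plan.SINGLE_PASS;
    }

    public static void main(String[] args) {
        System.out.println(choosePlan(false, false, 500));    // TWO_JOBS
        System.out.println(choosePlan(false, true, 50_000));  // TWO_JOBS
        System.out.println(choosePlan(false, true, 5_000));   // SINGLE_PASS
        System.out.println(choosePlan(true, false, 50_000));  // SINGLE_PASS
    }
}
```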
> > What do you think of this idea?
> >
> > Regards,
> > Jacky
> >
>