Re: Discussion regarding design of data load after kettle removal.

Posted by ravipesala on Oct 10, 2016; 3:54am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-regrading-design-of-data-load-after-kettle-removal-tp1672p1719.html

Hi Vishal,

You are right, but that is possible only if the dictionary is already
generated and the cardinality of each column is already known, which is
the case in the present solution. If we want a single-pass data loading
solution, however, we need to generate the global dictionary online (by
using a KV store or a distributed map), and in that case generating the
mdk key before the sort step is not possible.
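
To illustrate the dependency, here is a minimal sketch (not CarbonData's
actual key generator; the class MdKeySketch and its methods are
hypothetical) that packs dictionary surrogates into a single long. The bit
width of every column is fixed from its cardinality up front, which is why
the cardinality must be known before any key is generated. With an online
global dictionary the cardinality keeps growing during the load, so these
widths cannot be fixed before the sort step.

```java
// Hypothetical illustration only; assumes all columns fit into 63 bits.
final class MdKeySketch {
  // Bit width reserved for each column, derived from its cardinality.
  private final int[] bits;

  MdKeySketch(int[] cardinalities) {
    bits = new int[cardinalities.length];
    int total = 0;
    for (int i = 0; i < cardinalities.length; i++) {
      if (cardinalities[i] < 0) {
        throw new IllegalArgumentException("negative cardinality");
      }
      // Bits needed to store surrogates 0..cardinality (at least one bit).
      bits[i] = Math.max(1, 32 - Integer.numberOfLeadingZeros(cardinalities[i]));
      total += bits[i];
    }
    if (total > 63) {
      throw new IllegalArgumentException("this sketch packs into one long only");
    }
  }

  // Concatenates the surrogates, first column in the most significant bits,
  // so that comparing keys compares rows column by column.
  long pack(int[] surrogates) {
    if (surrogates.length != bits.length) {
      throw new IllegalArgumentException("column count mismatch");
    }
    long key = 0L;
    for (int i = 0; i < surrogates.length; i++) {
      if (surrogates[i] < 0 || surrogates[i] >= (1L << bits[i])) {
        throw new IllegalArgumentException("surrogate exceeds reserved width");
      }
      key = (key << bits[i]) | surrogates[i];
    }
    return key;
  }

  // Splits a key back into the per-column surrogates.
  int[] unpack(long key) {
    int[] out = new int[bits.length];
    for (int i = bits.length - 1; i >= 0; i--) {
      out[i] = (int) (key & ((1L << bits[i]) - 1));
      key >>>= bits[i];
    }
    return out;
  }
}
```

If a surrogate larger than the reserved width arrives later, pack() has to
reject it, which is exactly the situation an online dictionary would create.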

Regards,
Ravi

On 8 October 2016 at 21:02, Kumar Vishal <[hidden email]> wrote:

> Hi Ravi,
> We can move the mdkey generation step before sorting; this will compress the
> dictionary data and reduce the IO.
> -Regards
> Kumar Vishal
>
> On Sat, Oct 8, 2016 at 3:30 PM, Ravindra Pesala <[hidden email]>
> wrote:
>
> > Hi All,
> >
> >
> > Removing kettle from carbondata is necessary, as this legacy kettle
> > framework has become an overhead for carbondata. This discussion is about
> > the design of the carbon data load without kettle.
> >
> > The main interface for data loading here is DataLoadProcessorStep.
> >
> > /**
> >  * This is the base interface for data loading. It can do transformation
> >  * jobs as per the implementation.
> >  */
> > public interface DataLoadProcessorStep {
> >
> >   /**
> >    * The output meta for this step. The data returned from this step is as
> >    * per this meta.
> >    * @return
> >    */
> >   DataField[] getOutput();
> >
> >   /**
> >    * Initialization process for this step.
> >    * @param configuration
> >    * @param child
> >    * @throws CarbonDataLoadingException
> >    */
> >   void intialize(CarbonDataLoadConfiguration configuration,
> >       DataLoadProcessorStep child) throws CarbonDataLoadingException;
> >
> >   /**
> >    * Transform the data as per the implementation.
> >    * @return Iterator of data
> >    * @throws CarbonDataLoadingException
> >    */
> >   Iterator<Object[]> execute() throws CarbonDataLoadingException;
> >
> >   /**
> >    * Any closing of resources after step execution can be done here.
> >    */
> >   void finish();
> > }
> >
> > The implementation classes for DataLoadProcessorStep are
> > InputProcessorStep, EncoderProcessorStep, SortProcessorStep and
> > DataWriterProcessorStep.
> >
> > The following picture depicts the loading process with implementation
> > classes.
> >
> > [image: Inline images 2]
> >
> > *InputProcessorStep*: It does two jobs: 1. it reads data from the
> > RecordReader of the InputFormat; 2. it parses each field as per the data
> > type of its column.
> > *EncoderProcessorStep*: It encodes each field with the dictionary if
> > required, and combines all no-dictionary columns into a single byte array.
> > *SortProcessorStep*: It sorts the data on the dimension columns and writes
> > it to intermediate files.
> > *DataWriterProcessorStep*: It merge-sorts the data from the intermediate
> > temp files, generates the mdk key, and writes the data in carbondata
> > format to the store.
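
The chaining of the four steps can be sketched roughly as follows. This is
a simplified, hypothetical stand-in and not the code from the PR: the Step
interface keeps only execute(), the child is passed through the constructor
instead of an initialization call, sorting happens in memory rather than
through intermediate files, the writer just collects rows into a list, and
mdk key generation is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, reduced version of the step interface (execute() only).
interface Step {
  Iterator<Object[]> execute();
}

// Reads "name,measure" lines and parses each field to its data type.
final class InputStep implements Step {
  private final List<String> lines;
  InputStep(List<String> lines) { this.lines = lines; }
  public Iterator<Object[]> execute() {
    List<Object[]> rows = new ArrayList<>();
    for (String line : lines) {
      String[] f = line.split(",");
      rows.add(new Object[] {f[0], Integer.parseInt(f[1])});
    }
    return rows.iterator();
  }
}

// Replaces the string dimension with a dictionary surrogate, row by row.
final class EncoderStep implements Step {
  private final Step child;
  private final Map<String, Integer> dict = new HashMap<>();
  EncoderStep(Step child) { this.child = child; }
  public Iterator<Object[]> execute() {
    final Iterator<Object[]> in = child.execute();
    return new Iterator<Object[]>() {
      public boolean hasNext() { return in.hasNext(); }
      public Object[] next() {
        Object[] row = in.next();
        String value = (String) row[0];
        Integer surrogate = dict.get(value);
        if (surrogate == null) {
          surrogate = dict.size() + 1;  // surrogates start at 1 (assumption)
          dict.put(value, surrogate);
        }
        return new Object[] {surrogate, row[1]};
      }
    };
  }
}

// Sorts all rows on the encoded dimension (in memory, for brevity).
final class SortStep implements Step {
  private final Step child;
  SortStep(Step child) { this.child = child; }
  public Iterator<Object[]> execute() {
    List<Object[]> rows = new ArrayList<>();
    child.execute().forEachRemaining(rows::add);
    rows.sort(Comparator.comparingInt((Object[] r) -> (Integer) r[0]));
    return rows.iterator();
  }
}

// Drains the child and "writes" the rows to a list standing in for the store.
final class WriterStep implements Step {
  final List<Object[]> written = new ArrayList<>();
  private final Step child;
  WriterStep(Step child) { this.child = child; }
  public Iterator<Object[]> execute() {
    child.execute().forEachRemaining(written::add);
    return new ArrayList<Object[]>().iterator();
  }
}
```

Each step only pulls rows from its child, so a pipeline is built by nesting
the constructors, for example
new WriterStep(new SortStep(new EncoderStep(new InputStep(lines)))), and
started by calling execute() on the outermost step.
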
> >
> >
> >
> > The following interface is for dictionary generation.
> >
> > /**
> >  * Generates dictionary for the column. The implementation classes can be
> >  * pre-defined or local or global dictionary generations.
> >  */
> > public interface ColumnDictionaryGenerator {
> >
> >   /**
> >    * Generates dictionary value for the column data
> >    * @param data
> >    * @return dictionary value
> >    */
> >   int generateDictionaryValue(Object data);
> >
> >   /**
> >    * Returns the actual value associated with dictionary value.
> >    * @param dictionary
> >    * @return actual value.
> >    */
> >   Object getValueFromDictionary(int dictionary);
> >
> >   /**
> >    * Returns the maximum value among the dictionary values. It is used
> >    * for generating mdk key.
> >    * @return max dictionary value.
> >    */
> >   int getMaxDictionaryValue();
> >
> > }
> >
> > This ColumnDictionaryGenerator interface can have 3 implementations, 1.
> > PreGeneratedColumnDictionaryGenerator 2. GlobalColumnDictionaryGenerator
> > 3. LocalColumnDictionaryGenerator
> >
> > [image: Inline images 3]
> >
> > *PreGeneratedColumnDictionaryGenerator*: It gets the dictionary values
> > from the already generated and loaded dictionary.
> > *GlobalColumnDictionaryGenerator*: It generates the global dictionary
> > online by using a KV store or a distributed map.
> > *LocalColumnDictionaryGenerator*: It generates a local dictionary only
> > for that executor.
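
As a rough idea of what the local variant could look like, here is a
minimal in-memory sketch. The class name LocalDictionarySketch, the use of
a HashMap plus a list, and surrogates starting at 1 are all assumptions for
illustration; they are not taken from the PR. The interface is repeated
without comments only to keep the snippet self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface ColumnDictionaryGenerator {
  int generateDictionaryValue(Object data);
  Object getValueFromDictionary(int dictionary);
  int getMaxDictionaryValue();
}

// Hypothetical local dictionary: surrogates are assigned in arrival order
// and are only valid inside the process that created them.
final class LocalDictionarySketch implements ColumnDictionaryGenerator {
  private final Map<Object, Integer> valueToKey = new HashMap<>();
  private final List<Object> keyToValue = new ArrayList<>();

  // Returns the existing surrogate, or assigns the next free one.
  public synchronized int generateDictionaryValue(Object data) {
    Integer existing = valueToKey.get(data);
    if (existing != null) {
      return existing;
    }
    keyToValue.add(data);
    int key = keyToValue.size();  // first value gets surrogate 1
    valueToKey.put(data, key);
    return key;
  }

  // Reverse lookup from surrogate to the original value.
  public synchronized Object getValueFromDictionary(int dictionary) {
    if (dictionary < 1 || dictionary > keyToValue.size()) {
      throw new IllegalArgumentException("unknown dictionary value: " + dictionary);
    }
    return keyToValue.get(dictionary - 1);
  }

  // The largest surrogate handed out so far; it grows as new values arrive.
  public synchronized int getMaxDictionaryValue() {
    return keyToValue.size();
  }
}
```

Note that getMaxDictionaryValue() changes while data is still being
encoded, so in this sketch it is only stable once the step feeding the
generator has finished.
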
> >
> >
> > For more information on the loading, please check the PR
> > https://github.com/apache/incubator-carbondata/pull/215
> >
> > Please let me know if any changes are required in these interfaces.
> >
> > --
> > Thanks & Regards,
> > Ravi
> >
>



--
Thanks & Regards,
Ravi