http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-regrading-design-of-data-load-after-kettle-removal-tp1672p1719.html
possible in present solution. But if we want single-pass data loading, it
is not possible.
> Hi Ravi,
> We can move the mdkey generation step before sorting; this will compress the
> dictionary data and reduce the IO.
> -Regards
> Kumar Vishal
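
To illustrate why generating the mdkey before sorting can shrink the data, here is a minimal sketch of fixed-width bit packing. It is not CarbonData's actual key generator: the class name, the helper names and the sample values are hypothetical, and it assumes non-negative dictionary values whose combined bit widths fit in 63 bits.

```java
public class MdKeySketch {

    /** Number of bits needed to represent values 0..maxValue (at least 1). */
    public static int bitsFor(int maxValue) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(maxValue));
    }

    /**
     * Packs one dictionary value per column into a single long.
     * The first column ends up in the highest bits, so comparing two packed
     * keys gives the same order as comparing the rows column by column.
     */
    public static long pack(int[] dictionaryValues, int[] maxValues) {
        long key = 0L;
        for (int i = 0; i < dictionaryValues.length; i++) {
            int bits = bitsFor(maxValues[i]);
            key = (key << bits) | dictionaryValues[i];
        }
        return key;
    }

    public static void main(String[] args) {
        int[] maxValues = {3, 100, 1}; // hypothetical max dictionary value per column
        int[] row = {2, 57, 1};        // hypothetical dictionary-encoded row
        // 2 + 7 + 1 = 10 bits are used instead of three 32-bit ints.
        System.out.println(Long.toBinaryString(pack(row, maxValues)));
    }
}
```

With a packed key the sort step would compare and spill one compact value per row rather than one int per dimension, which is where the IO reduction would come from under these assumptions.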
>
> On Sat, Oct 8, 2016 at 3:30 PM, Ravindra Pesala <[hidden email]> wrote:
>
> > Hi All,
> >
> >
> > Removing kettle from carbondata is necessary because the legacy kettle
> > framework has become an overhead for carbondata. This discussion is about
> > the design of carbondata loading without kettle.
> >
> > The main interface for data loading here is DataLoadProcessorStep.
> >
> > /**
> >  * This is the base interface for data loading. It can do transformation jobs
> >  * as per the implementation.
> >  */
> > public interface DataLoadProcessorStep {
> >
> >   /**
> >    * The output meta for this step. The data returned from this step is as
> >    * per this meta.
> >    * @return
> >    */
> >   DataField[] getOutput();
> >
> >   /**
> >    * Initialization process for this step.
> >    * @param configuration
> >    * @param child
> >    * @throws CarbonDataLoadingException
> >    */
> >   void intialize(CarbonDataLoadConfiguration configuration,
> >       DataLoadProcessorStep child) throws CarbonDataLoadingException;
> >
> >   /**
> >    * Transform the data as per the implementation.
> >    * @return Iterator of data
> >    * @throws CarbonDataLoadingException
> >    */
> >   Iterator<Object[]> execute() throws CarbonDataLoadingException;
> >
> >   /**
> >    * Any closing of resources after step execution can be done here.
> >    */
> >   void finish();
> > }
> >
> > The implementation classes for DataLoadProcessorStep are
> > InputProcessorStep, EncoderProcessorStep, SortProcessorStep and
> > DataWriterProcessorStep.
> >
> > The following picture depicts the loading process with implementation
> > classes.
> >
> > [image: Inline images 2]
> >
> > *InputProcessorStep* : It does two jobs: 1. it reads data from the
> > RecordReader of the InputFormat; 2. it parses each field of a column as per
> > its data type.
> > *EncoderProcessorStep* : It encodes each field with a dictionary if
> > required, and combines all no-dictionary columns into a single byte array.
> > *SortProcessorStep* : It sorts the data on the dimension columns and writes
> > it to intermediate files.
> > *DataWriterProcessorStep* : It merge-sorts the data from the intermediate temp
> > files, generates the mdkey, and writes the data in carbondata format to the
> > store.
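
The chaining of steps described above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `Step` is a hypothetical stand-in that models only `execute()`, the rows are two-column comma-separated strings, the dictionary is a pre-built map, and the writer only formats rows instead of producing carbondata files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class StepChainSketch {

    /** Simplified stand-in for DataLoadProcessorStep; only execute() is modeled. */
    public interface Step {
        Iterator<Object[]> execute();
    }

    /** Input: splits each comma-separated line into a row of fields. */
    public static Step input(List<String> lines) {
        return () -> {
            List<Object[]> rows = new ArrayList<>();
            for (String line : lines) {
                rows.add(line.split(","));
            }
            return rows.iterator();
        };
    }

    /** Encoder: lazily replaces column 0 with its value from a pre-built dictionary. */
    public static Step encode(Step child, Map<String, Integer> dictionary) {
        return () -> {
            Iterator<Object[]> in = child.execute();
            return new Iterator<Object[]>() {
                @Override public boolean hasNext() { return in.hasNext(); }
                @Override public Object[] next() {
                    Object[] row = in.next();
                    return new Object[] {dictionary.get((String) row[0]), row[1]};
                }
            };
        };
    }

    /** Sort: materializes the child's rows and orders them by the encoded column. */
    public static Step sort(Step child) {
        return () -> {
            List<Object[]> rows = new ArrayList<>();
            child.execute().forEachRemaining(rows::add);
            rows.sort(Comparator.comparing((Object[] r) -> (Integer) r[0]));
            return rows.iterator();
        };
    }

    /** Writer: drains the chain; here it only formats rows instead of writing files. */
    public static List<String> write(Step child) {
        List<String> out = new ArrayList<>();
        child.execute().forEachRemaining(r -> out.add(r[0] + "," + r[1]));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("a", 1);
        dictionary.put("b", 2);
        Step chain = sort(encode(input(Arrays.asList("b,10", "a,20", "b,30")), dictionary));
        System.out.println(write(chain));
    }
}
```

Each step pulls rows from its child through an iterator, so only the sort step has to hold (or, in the real design, spill) the full data set.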
> >
> >
> >
> > The following interface is for dictionary generation.
> >
> > /**
> >  * Generates dictionary for the column. The implementation classes can be
> >  * pre-defined or local or global dictionary generations.
> >  */
> > public interface ColumnDictionaryGenerator {
> >
> >   /**
> >    * Generates dictionary value for the column data
> >    * @param data
> >    * @return dictionary value
> >    */
> >   int generateDictionaryValue(Object data);
> >
> >   /**
> >    * Returns the actual value associated with dictionary value.
> >    * @param dictionary
> >    * @return actual value.
> >    */
> >   Object getValueFromDictionary(int dictionary);
> >
> >   /**
> >    * Returns the maximum value among the dictionary values. It is used
> >    * for generating mdk key.
> >    * @return max dictionary value.
> >    */
> >   int getMaxDictionaryValue();
> >
> > }
> >
> > This ColumnDictionaryGenerator interface can have 3 implementations, 1.
> > PreGeneratedColumnDictionaryGenerator 2. GlobalColumnDictionaryGenerator
> > 3. LocalColumnDictionaryGenerator
> >
> > [image: Inline images 3]
> >
> > *PreGeneratedColumnDictionaryGenerator* : It gets the dictionary values
> > from an already generated and loaded dictionary.
> > *GlobalColumnDictionaryGenerator* : It generates the global dictionary
> > online by using a KV store or a distributed map.
> > *LocalColumnDictionaryGenerator* : It generates a local dictionary only
> > for that executor.
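
As an illustration of the local variant, a minimal in-memory generator could look like the sketch below. It is an assumption-laden example rather than the implementation in the PR: the interface is re-declared locally so the block is self-contained, the class names are hypothetical, and starting dictionary values at 1 is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalDictionarySketch {

    /** Same three methods as the ColumnDictionaryGenerator interface proposed in the mail. */
    public interface ColumnDictionaryGenerator {
        int generateDictionaryValue(Object data);
        Object getValueFromDictionary(int dictionary);
        int getMaxDictionaryValue();
    }

    /** In-memory dictionary for one executor; values are assigned 1, 2, 3, ... (assumption). */
    public static class LocalGenerator implements ColumnDictionaryGenerator {
        private final Map<Object, Integer> forward = new HashMap<>();
        private final List<Object> reverse = new ArrayList<>();

        @Override
        public synchronized int generateDictionaryValue(Object data) {
            Integer existing = forward.get(data);
            if (existing != null) {
                return existing;          // value already seen: reuse its dictionary value
            }
            reverse.add(data);
            int value = reverse.size();   // next unused dictionary value
            forward.put(data, value);
            return value;
        }

        @Override
        public synchronized Object getValueFromDictionary(int dictionary) {
            return reverse.get(dictionary - 1);
        }

        @Override
        public synchronized int getMaxDictionaryValue() {
            return reverse.size();
        }
    }

    public static void main(String[] args) {
        LocalGenerator generator = new LocalGenerator();
        System.out.println(generator.generateDictionaryValue("x"));
        System.out.println(generator.generateDictionaryValue("y"));
        System.out.println(generator.generateDictionaryValue("x"));
        System.out.println(generator.getMaxDictionaryValue());
    }
}
```

A global variant would presumably keep the same interface and replace the two in-memory collections with calls to the KV store or distributed map.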
> >
> >
> > For more information on the loading, please check the PR:
> > https://github.com/apache/incubator-carbondata/pull/215
> >
> > Please let me know if any changes are required in these interfaces.
> >
> > --
> > Thanks & Regards,
> > Ravi
> >
>