Re: Optimize and refactor insert into command

Posted by sujith chacko on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Optimize-and-refactor-insert-into-command-tp88449p89408.html

@ajantha

Even for carbon-to-carbon tables, the scenarios I mentioned may still be
applicable. As I said above, even if the schemas are the same in every
aspect, how are you going to handle a difference in column properties?

If the destination table needs the bad record feature enabled, I feel you
should perform this step. Alternatively, for better performance, you could
recommend that the user explicitly disable the unwanted steps, such as bad
record handling, if he/she doesn't care about bad records while inserting.
Hope you got my point.
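
For example, something like the snippet below, assuming the session-level
bad-record options from the data-loading documentation are also honoured by
insert into (table names are placeholders):

    import org.apache.spark.sql.SparkSession

    // Assumes `spark` is a CarbonData-enabled SparkSession (extension setup omitted).
    val spark = SparkSession.builder().appName("insert-without-bad-records").getOrCreate()

    // Session-level load options as documented for data loading; whether the
    // new insert flow honours them is exactly the point under discussion here.
    spark.sql("SET carbon.options.bad.records.logger.enable=false")
    spark.sql("SET carbon.options.bad.records.action=FORCE")

    // Placeholder table names, for illustration only.
    spark.sql("INSERT INTO target_table SELECT * FROM source_table")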

It will be a risk to implicitly decide whether bad record handling is
required or not by assuming the source and destination tables possess
exactly the same schema under all conditions.
I think you should take another look at this part.

Could you share a design document in JIRA or by mail?

On Thu, Jan 2, 2020 at 7:24 AM Ajantha Bhat <[hidden email]> wrote:

> Hi sujith,
>
> I still keep the converter step for some scenarios, like insert from parquet
> to carbon. There we need an optimized converter to convert the timestamp long
> value (divide by 1000) and to convert null values of direct-dictionary
> columns to 1. So, for the scenarios you mentioned, I will be using this flow
> with the optimized converter.
>
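
If I understand the optimized converter correctly, per value it boils down to
something like the sketch below (illustrative names only, not the actual
CarbonData converter classes):

    object OptimizedConverterSketch {
      // Reserved direct-dictionary surrogate key for null values, per the mail above.
      val NullDirectDictionaryKey: Int = 1

      // Spark keeps timestamps as microseconds since epoch in its internal row;
      // dividing by 1000 gives the millisecond value the carbon write step expects.
      def convertTimestamp(micros: Long): Long = micros / 1000

      // Null values of a direct-dictionary column map to the reserved key 1;
      // non-null values keep the surrogate already generated for them.
      def fixDirectDictionary(surrogate: java.lang.Integer): Int =
        if (surrogate == null) NullDirectDictionaryKey else surrogate.intValue()
    }
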
> For carbon-to-carbon insert with the same source and destination properties
> (a common scenario in cloud migration), it goes through the no-converter path
> and uses the Spark internal row directly until the write step.
> Compaction can also use this no-converter path.
>
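
So, if I read this right, the flow selection is roughly the following
(illustrative only, not the actual command code):

    object InsertFlowSelection {
      sealed trait InsertFlow
      // Pass the Spark internal row straight to the write step.
      case object NoConverterFlow extends InsertFlow
      // Apply the timestamp / direct-dictionary fix-ups described above.
      case object OptimizedConverterFlow extends InsertFlow

      def chooseFlow(sourceIsCarbon: Boolean, samePropertiesAsTarget: Boolean): InsertFlow =
        if (sourceIsCarbon && samePropertiesAsTarget) NoConverterFlow
        else OptimizedConverterFlow
    }
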
> Thanks,
> Ajantha
>
> On Thu, 2 Jan, 2020, 12:18 am sujith chacko, <[hidden email]>
> wrote:
>
> > Hi Ajantha,
> >
> >    Thanks for your initiative. I have a couple of questions, though.
> >
> > a) As per your explanation, the dataset validation is already done as part
> > of the source table. Is this what you mean? What I understand is that the
> > insert select queries are going to get some benefit since we skip some
> > additional steps.
> >
> > What if your destination table has some different table properties, for
> > example a few columns with non-null constraints, or a different date
> > format, or a different decimal precision or scale?
> > You may need bad record support then; how are you going to handle
> > such scenarios? Correct me if I misinterpreted your points.
> >
> > Regards,
> > Sujith
> >
> >
> > On Fri, 20 Dec 2019 at 5:25 AM, Ajantha Bhat <[hidden email]>
> > wrote:
> >
> > > Currently carbondata "insert into" uses CarbonLoadDataCommand itself.
> > > The load process has steps like a parsing step and a converter step with
> > > bad record support.
> > > Insert into doesn't require these steps, as the data is already validated
> > > and converted from the source table or dataframe.
> > >
> > > Some identified changes are below.
> > >
> > > 1. Need to refactor and separate load and insert at the driver side, to
> > > skip the converter step and unify the flow for no-sort and global-sort
> > > insert.
> > > 2. Need to avoid reordering each row, by changing the select dataframe's
> > > projection order itself during the insert into.
> > > 3. For carbon-to-carbon insert, need to provide the ReadSupport and use
> > > the RecordReader (the vector reader currently doesn't support ReadSupport)
> > > to handle null values and the timestamp cutoff (direct dictionary) from
> > > the scanRDD result.
> > > 4. Need to handle insert into partition/non-partition tables in the local
> > > sort, global sort, no sort, range column and compaction flows.
> > >
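
Regarding point 2 above, I assume the reorder happens once on the dataframe
projection rather than per row; roughly like this (hypothetical helper, not
the actual implementation):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Reorder the select projection once to match the target table's column
    // order, so no per-row reordering is needed later (hypothetical helper).
    def reorderToTargetSchema(source: DataFrame, targetColumnOrder: Seq[String]): DataFrame =
      source.select(targetColumnOrder.map(col): _*)
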
> > > The final goal is to improve insert performance by keeping only the
> > > required logic, and also to decrease the memory footprint.
> > >
> > > If you have any other suggestions or optimizations related to this, let
> > > me know.
> > >
> > > Thanks,
> > > Ajantha
> > >
> >
>