Currently CarbonData "insert into" uses the CarbonLoadDataCommand itself.
The load process has steps such as parsing and a converter step with bad-record support. Insert into doesn't require these steps, as the data is already validated and converted from the source table or dataframe.

Some identified changes are below.

1. Need to refactor and separate load and insert at the driver side to skip the converter step, and unify the flow for no-sort and global-sort insert.
2. Need to avoid reordering each row, by changing the select dataframe's projection order itself during the insert into (see the sketch after this message).
3. For carbon-to-carbon insert, need to provide ReadSupport and use RecordReader (the vector reader currently doesn't support ReadSupport) to handle null values and the timestamp cutoff (direct dictionary) from the scanRDD result.
4. Need to handle insert into partition and non-partition tables in the local-sort, global-sort, no-sort, range-column, and compaction flows.

The final goal is to improve insert performance by keeping only the required logic, and also to decrease the memory footprint.

If you have any other suggestions or optimizations related to this, let me know.

Thanks,
Ajantha
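A minimal sketch of what point 2 could look like on the Spark side, assuming hypothetical names (sourceDf, target_t, and the column list are illustrative, not CarbonData internals): the projection order is fixed once in the plan, so no per-row rearrangement is needed later in the write path.

```scala
import org.apache.spark.sql.DataFrame

// Reorder the source dataframe's projection to match the target table's
// column order, so each downstream row already arrives correctly ordered
// and the write path does not have to rearrange fields per row.
def alignProjection(sourceDf: DataFrame, targetColumnOrder: Seq[String]): DataFrame = {
  sourceDf.select(targetColumnOrder.map(sourceDf.col): _*)
}

// Usage sketch (illustrative names only):
// val aligned = alignProjection(spark.table("source_t"), Seq("id", "name", "event_ts"))
// aligned.write.insertInto("target_t")
```

Doing this at the plan level keeps the change in the driver-side query rewrite rather than in the per-row executor path.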
Definitely +1, please feel free to create a JIRA issue and PR.
Regards,
Jacky
Hi Ajantha,
Thanks for your initiative. I have a couple of questions, though.

a) As per your explanation, the dataset validation is already done as part of the source table; is that what you mean? What I understand is that insert-select queries are going to get some benefit since we don't do some additional steps.

What about if your destination table has some different table properties: a few columns may have non-null properties, or the date format, decimal precision, or scale may be different? You may then need bad-record support; how are you going to handle such scenarios? Correct me if I misinterpreted your points.

Regards,
Sujith
Hi Sujith,
I still keep the converter step for some scenarios, like insert from Parquet to carbon. There we need an optimized converter to convert the timestamp long value (divide by 1000) and to convert null values of direct-dictionary columns to 1. So, for the scenarios you mentioned, I will be using this flow with the optimized converter.

For carbon-to-carbon insert with the same source and destination properties (this is a common scenario in cloud migration), it goes to the no-converter step and uses the Spark internal row directly till the write step. Compaction can also use this no-converter step.

Thanks,
Ajantha
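To make the "optimized converter" idea concrete, here is a rough sketch, not CarbonData's actual converter API: it assumes timestamps arrive as microsecond longs in the Spark internal row and uses 1 as the direct-dictionary surrogate for null, matching the conversions described above.

```scala
// Sketch only: field types and the surrogate constant are assumptions for
// illustration, not CarbonData internals.
object OptimizedInsertConverter {
  // Surrogate key assumed for null direct-dictionary (date/timestamp) values.
  private val NullDirectDictSurrogate = 1L

  // Divide the incoming timestamp long by 1000 (e.g. microseconds to
  // milliseconds), as described in the proposal.
  def convertTimestamp(tsMicros: Long): Long = tsMicros / 1000L

  // Map a possibly-null direct-dictionary value to its surrogate key.
  def convertDirectDictionary(value: java.lang.Long): Long =
    if (value == null) NullDirectDictSurrogate else value.longValue()
}
```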
@Ajantha
Even from a carbon to a carbon table, the scenarios I mentioned may be applicable. As I said above, even if the schemas are the same in all aspects, how are you going to handle a difference in column properties? If the destination table needs the bad-record feature enabled, I feel you should perform it. Or, for better performance, you could recommend that the user explicitly disable the unwanted steps, like the bad-record feature, if they don't care about bad records while inserting. Hope you got my point.

Implicitly, it will be a risk to determine whether bad-record handling is required by assuming the source and destination tables possess exactly the same schema in all conditions. I think you should relook into this part.

Could you share a design document in JIRA or by mail?
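As an illustration of Sujith's "let the user explicitly opt out" suggestion, something like the following could work from the user's side. The carbon.options.* session property names are assumptions based on CarbonData's dynamically configurable load options, and the table names are hypothetical; verify both against the version in use.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: explicitly disable bad-record handling before a fast insert path.
// Property names are assumed, not confirmed against a specific release.
def fastInsert(spark: SparkSession): Unit = {
  spark.sql("SET carbon.options.bad.records.logger.enable=false")
  spark.sql("SET carbon.options.bad.records.action=FAIL") // fail fast instead of converting rows
  spark.sql("INSERT INTO target_carbon_table SELECT * FROM source_carbon_table")
}
```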
Hi,
+1. It's long-pending work; most welcome.

Regards,
Ravindra.