Apache CarbonData Dev Mailing List archive

[Discussion]read latest schema in case of external table and file format

akashrn5

[Discussion]read latest schema in case of external table and file format

Hi dev,

Currently we have a validation: if two carbondata files in a location have
different schemas, we fail the query. I think there is no need to fail. The
Parquet behavior for the same case is a useful reference.

I think failing here is not good. We can read the latest schema from the
latest carbondata file in the given location, read all the files based on
that schema, and return the query output. Data files that do not contain a
column from that schema will return null values for that column.

Note that we still do not merge schemas. We can keep that behavior as it is;
the only change is that we take the latest schema.

For example:
1. One data file has columns a, b and c, and a second file has columns
a, b, c, d and e. We can then read the files and create the table with 5
columns or 3 columns, whichever schema is the latest (this applies when the
user does not specify a schema). If the user specifies a schema, the table
is created with the specified schema.
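
The example above can be sketched roughly as follows. This is an illustrative model only, not CarbonData code: files are plain Python objects, and the names `DataFile`, `latest_schema` and `read_all` are hypothetical. It also assumes that "latest" means the file with the most recent write timestamp, which the proposal does not spell out.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class DataFile:
    """Hypothetical stand-in for one carbondata file."""
    written_at: int             # assumed: write timestamp decides "latest"
    columns: List[str]          # schema stored in the file
    rows: List[Dict[str, Any]]  # row data keyed by column name


def latest_schema(files: List[DataFile]) -> List[str]:
    """Take the schema of the most recently written file (no schema merge)."""
    return max(files, key=lambda f: f.written_at).columns


def read_all(files: List[DataFile],
             user_schema: Optional[List[str]] = None) -> List[Dict[str, Any]]:
    """Read every file against one schema; missing columns become None."""
    schema = user_schema if user_schema is not None else latest_schema(files)
    result = []
    for f in sorted(files, key=lambda f: f.written_at):
        for row in f.rows:
            # row.get(col) yields None when the file lacks that column
            result.append({col: row.get(col) for col in schema})
    return result


old = DataFile(1, ["a", "b", "c"], [{"a": 1, "b": 2, "c": 3}])
new = DataFile(2, ["a", "b", "c", "d", "e"],
               [{"a": 4, "b": 5, "c": 6, "d": 7, "e": 8}])

rows = read_all([old, new])
print(rows[0])  # {'a': 1, 'b': 2, 'c': 3, 'd': None, 'e': None}
```

If the 3-column file were the newer one, the same logic would produce a 3-column result and columns d and e would simply not be projected. Passing `user_schema` models the case where the user specifies the schema.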

I have created a JIRA for this:
https://issues.apache.org/jira/browse/CARBONDATA-3287
If you have any input, please let me know.

Regards,
Akash

Liang Chen

Re: [Discussion]read latest schema in case of external table and file format

Administrator

Hi

Can you explain which scenario will generate two carbondata files with
different schema?

Regards
Liang

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

rahul_kumar

Re: [Discussion]read latest schema in case of external table and file format

Hi Akash,
I have one concern related to this change:

*Concern*: Why are we skipping the old data file? Even if the user does not
give a schema, I think we should read the old data file. We can fill columns
*d* and *e* with *None* values.
I guess *if a data file is present at the given location, it means the user
wants to read data from all files*.

*Suggestion*: If we are maintaining the schema somehow in the internal flow,
we can use the alter table flow as well.

On Mon, Feb 4, 2019, 4:25 PM Liang Chen <[hidden email]> wrote:

> Hi
>
> Can you explain which scenario will generate two carbondata files with
> different schema?
>
> Regards
> Liang

akashrn5

Re: [Discussion]read latest schema in case of external table and file format

In reply to this post by Liang Chen

Hi Liang,

When we create a table using a location in the file format case, or create
an external table from a location, the user can place multiple carbondata
files with different schemas in that location and want to read the data at
once. In that scenario we can expect the above condition.

Currently we do not allow that, but we can get the schema from the latest
carbondata file and create the table as I explained in the discussion.
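
To make the current restriction concrete, here is a minimal sketch of a check that rejects a location holding files with different schemas. It is a simplified model, not the actual CarbonData validation: the function name `validate_single_schema` and the file names are hypothetical, and schemas are reduced to lists of column names.

```python
from typing import Dict, List


def validate_single_schema(schemas_by_file: Dict[str, List[str]]) -> List[str]:
    """Model of the current check: every file must carry the same schema."""
    distinct = {tuple(cols) for cols in schemas_by_file.values()}
    if not distinct:
        return []  # empty location: nothing to validate
    if len(distinct) > 1:
        raise ValueError(f"found {len(distinct)} different schemas in location")
    return list(distinct.pop())


# Hypothetical location with two files written with different schemas
location = {
    "file1.carbondata": ["a", "b", "c"],
    "file2.carbondata": ["a", "b", "c", "d", "e"],
}

try:
    validate_single_schema(location)
    outcome = "query runs"
except ValueError as err:
    outcome = f"query fails: {err}"

print(outcome)
```

With the two files shown, the check raises and the query is reported as failed, which is the behavior the proposal wants to relax.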

I hope this clears your doubt.

Thank you.


akashrn5

Re: [Discussion]read latest schema in case of external table and file format

In reply to this post by rahul_kumar

Hi Rahul,

Actually we are not skipping the old file. Currently we just list the
carbondata files in the location and take the first one to infer the
schema. With this change I take the latest carbondata file to infer the
schema instead, and while returning the data, if a column is not present in
the corresponding file, the default value null is filled in and returned to
the user.
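
The difference between the two inference strategies can be sketched as below. The file names and the use of a modification timestamp to decide which file is "latest" are assumptions for illustration; the thread does not say how the files are ordered. `ListedFile`, `infer_from_first` and `infer_from_latest` are hypothetical names.

```python
from typing import List, NamedTuple


class ListedFile(NamedTuple):
    """Hypothetical listing entry for a carbondata file."""
    name: str
    modified_at: int  # assumed ordering key for "latest"


def infer_from_first(listing: List[ListedFile]) -> ListedFile:
    """Current behavior as described: use whichever file is listed first."""
    return listing[0]


def infer_from_latest(listing: List[ListedFile]) -> ListedFile:
    """Proposed behavior: use the most recently modified file."""
    return max(listing, key=lambda f: f.modified_at)


listing = [
    ListedFile("part-0.carbondata", modified_at=100),  # older file, columns a,b,c
    ListedFile("part-1.carbondata", modified_at=200),  # newer file, columns a..e
]

print(infer_from_first(listing).name)   # part-0.carbondata
print(infer_from_latest(listing).name)  # part-1.carbondata
```

In both strategies every file in the listing is still read; only the file used as the schema source changes.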

I hope this clears your doubt.

Thank you
