Hi dev,
Currently we have a validation that fails the query if a location contains two carbondata files with different schemas. I think there is no need to fail; looking at the Parquet behavior also makes this clear.

Rather than failing, we can read the schema from the latest carbondata file in the given location, read all the files based on that schema, and return the query output. For columns that are not present in some data files, the new columns will have null values.

Note that we still do not merge schemas. We can keep that behavior as it is now; the only change is that we take the latest schema.

For example: one data file has columns a, b and c, and a second file has columns a, b, c, d and e. We can then read the files and create the table with 5 columns or 3 columns, whichever belongs to the latest file. This applies when the user does not specify a schema. If the user specifies a schema, the table is created with the specified schema.

I have created a JIRA for this:
https://issues.apache.org/jira/browse/CARBONDATA-3287

If you have any input, please let me know.

Regards,
Akash
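To make the proposal above concrete, here is a minimal sketch in plain Python. It does not use any CarbonData API: the dictionaries standing in for data files, the `written_at` field, and the function names `latest_schema` and `read_all` are all hypothetical, chosen only to illustrate "take the latest schema, fill missing columns with null, do not merge".

```python
from typing import Any, Dict, List, Optional

# Hypothetical in-memory stand-in for a carbondata file: it carries its own
# schema (column names), a write timestamp, and its rows.
DataFile = Dict[str, Any]


def latest_schema(files: List[DataFile]) -> List[str]:
    """Return the column list of the most recently written file."""
    newest = max(files, key=lambda f: f["written_at"])
    return list(newest["schema"])


def read_all(files: List[DataFile],
             schema: Optional[List[str]] = None) -> List[Dict[str, Any]]:
    """Read every file against a single schema.

    A user-specified schema takes precedence; otherwise the schema of the
    latest file is used. Schemas are not merged. Columns that a file does
    not contain are returned as None.
    """
    columns = schema if schema is not None else latest_schema(files)
    rows = []
    for data_file in files:
        for row in data_file["rows"]:
            rows.append({col: row.get(col) for col in columns})
    return rows


# The example from the message: an older file with a, b, c and a newer
# file with a, b, c, d, e.
old_file = {"schema": ["a", "b", "c"], "written_at": 1,
            "rows": [{"a": 1, "b": 2, "c": 3}]}
new_file = {"schema": ["a", "b", "c", "d", "e"], "written_at": 2,
            "rows": [{"a": 4, "b": 5, "c": 6, "d": 7, "e": 8}]}

result = read_all([old_file, new_file])
```

With no schema given, the rows from the older file come back with d and e set to None. If the older 3-column file were the latest one instead, d and e would simply not appear in the output.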
Hi
Can you explain which scenario will generate two carbondata files with different schemas?

Regards
Liang

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Hi Akash,

I have one concern related to this change.

*Concern*: Why are we skipping the old data file? Even if the user does not give a schema, I think we should read the old data file; we can fill columns *d* and *e* with *None* values. I guess that *if a data file is present at the given location, it means the user wants to read data from all files*.

*Suggestion*: If we are somehow maintaining the schema in the internal flow, we can use the alter table flow as well.

On Mon, Feb 4, 2019, 4:25 PM Liang Chen <[hidden email]> wrote:
> Can you explain which scenario will generate two carbondata files with
> different schema?
In reply to this post by Liang Chen
Hi Liang,
When we create a table using a location in the file format case, or when we create an external table from a location, the user can place multiple carbondata files with different schemas in that location and want to read all the data at once. That is the scenario where the above condition arises. Currently we do not allow that, but we can get the schema from the latest carbondata file and create the table, as I explained in the discussion.

I hope this clears your doubt.

Thank you.
In reply to this post by rahul_kumar
Hi Rahul,

Actually, we are not skipping the old file. Currently we just list the carbondata files in the location and take the first one to infer the schema. With this change, I take the latest carbondata file to infer the schema instead. While returning the data, if a column is not present in the corresponding file, the default value null is filled in and returned to the user.

I hope this clears your doubt.

Thank you
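The difference between the two ways of picking the file for schema inference can be sketched as follows. This is only an illustration: it assumes that "latest" means the most recent file modification time, which the thread does not state, and the helper names `list_carbondata_files`, `first_file` and `latest_file` are hypothetical, not CarbonData functions.

```python
import os
from typing import List


def list_carbondata_files(location: str) -> List[str]:
    """List the .carbondata files directly under the location, sorted by name."""
    return sorted(
        os.path.join(location, name)
        for name in os.listdir(location)
        if name.endswith(".carbondata")
    )


def first_file(location: str) -> str:
    """Current behavior as described: infer the schema from the first listed file."""
    return list_carbondata_files(location)[0]


def latest_file(location: str) -> str:
    """Proposed behavior: infer the schema from the latest file.

    Assumption: "latest" is taken to mean the newest modification time.
    """
    return max(list_carbondata_files(location), key=os.path.getmtime)
```

In both cases every file in the location is still read; only the file used to infer the schema changes.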