[ https://issues.apache.org/jira/browse/CARBONDATA-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravindra Pesala resolved CARBONDATA-1700.
-----------------------------------------
    Resolution: Fixed

> Failed to load data to existed table after spark session restarted
> ------------------------------------------------------------------
>
>                 Key: CARBONDATA-1700
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1700
>             Project: CarbonData
>          Issue Type: Bug
>          Components: data-load
>    Affects Versions: 1.3.0
>            Reporter: xuchuanyin
>            Assignee: xuchuanyin
>             Fix For: 1.3.0
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> # SCENARIO
> I encountered a failure when loading data into an existing CarbonData table after querying that table in a restarted Spark session. I hit the failure in Spark local mode (found during a local test) and have not tried other scenarios.
> The problem can be reproduced with the following steps:
> 0. START: start a session;
> 1. CREATE: create table `t1`;
> 2. LOAD: create a dataframe and write append to `t1`;
> 3. STOP: stop the current session;
> 4. START: start a new session;
> 5. QUERY: query table `t1`; ---- This step is essential to reproduce the problem.
> 6. LOAD: create a dataframe and write append to `t1`; ---- This step will fail.
> The error is thrown in step 6. The error message in the console looks like:
> ```
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
> at org.apache.spark.sql.execution.command.management.LoadTableCommand.processData(LoadTableCommand.scala:92)
> at org.apache.spark.sql.execution.command.management.LoadTableCommand.run(LoadTableCommand.scala:60)
> at org.apache.spark.sql.CarbonDataFrameWriter.loadDataFrame(CarbonDataFrameWriter.scala:141)
> at org.apache.spark.sql.CarbonDataFrameWriter.writeToCarbonFile(CarbonDataFrameWriter.scala:50)
> at org.apache.spark.sql.CarbonDataFrameWriter.appendToCarbonFile(CarbonDataFrameWriter.scala:42)
> at org.apache.spark.sql.CarbonSource.createRelation(CarbonSource.scala:110)
> at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
> ```
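> For reference, the steps above can also be condensed into a standalone program. The following is an untested sketch rather than code from this ticket: it assumes the `getOrCreateCarbonSession` builder from `org.apache.spark.sql.CarbonSession._`, a hypothetical local store path, and a small generated dataframe standing in for the original `df`.
> ```
> // Untested sketch of reproduction steps 0-6 outside the test framework.
> import org.apache.spark.sql.{SaveMode, SparkSession}
> import org.apache.spark.sql.CarbonSession._
>
> object Carbondata1700Repro {
>   val storePath = "/tmp/carbon-store" // hypothetical store location
>
>   def newSession(): SparkSession =
>     SparkSession.builder()
>       .master("local")
>       .appName("CARBONDATA-1700-repro")
>       .getOrCreateCarbonSession(storePath)
>
>   def appendDataFrame(spark: SparkSession): Unit = {
>     import spark.implicits._
>     val df = (1 to 1000).map(i => ("a" + i, "b" + i, i)).toDF("c1", "c2", "c3")
>     df.write
>       .format("carbondata")
>       .option("tableName", "t1")
>       .option("tempCSV", "false")
>       .mode(SaveMode.Append)
>       .save()
>   }
>
>   def main(args: Array[String]): Unit = {
>     // Steps 0-2: start a session, create t1, append a dataframe.
>     var spark = newSession()
>     spark.sql(
>       "create table if not exists t1 (c1 string, c2 string, c3 int) " +
>         "stored by 'carbondata'")
>     appendDataFrame(spark)
>
>     // Step 3: stop the session.
>     spark.stop()
>
>     // Steps 4-5: start a new session and query t1 (the essential step).
>     spark = newSession()
>     spark.sql("select * from t1").show()
>
>     // Step 6: append again; before the fix this threw the NPE shown above.
>     appendDataFrame(spark)
>   }
> }
> ```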
> ``` > test("prepare") { > sql("drop table if exists carbon_stand_alone") > sql( "create table if not exists carbon_stand_alone (c1 string, c2 string, c3 int)" + > " stored by 'carbondata'").collect() > sql("select * from carbon_stand_alone").show() > df.write > .format("carbondata") > .option("tableName", "carbon_stand_alone") > .option("tempCSV", "false") > .mode(SaveMode.Append) > .save() > } > test("test load dataframe after query") { > sql("select * from carbon_stand_alone").show() > // the following line will cause failure > df.write > .format("carbondata") > .option("tableName", "carbon_stand_alone") > .option("tempCSV", "false") > .mode(SaveMode.Append) > .save() > // if it works fine, it sould be true > checkAnswer( > sql("select count(*) from carbon_stand_alone where c3 > 500"), Row(31500 * 2) > ) > } > ``` > # ANALYSE > I went through the code and found the problem was caused by NULL `tableProperties` in `tablemeta: tableMeta.carbonTable.getTableInfo > .getFactTable.getTableProperties` (we will name it `propertyInTableInfo` for short) is null in Line89 in `LoadTableCommand.scala`. > After debug, I found that the `propertyInTableInfo` sett in `CarbonTableInputFormat.setTableInfo(...)` had the correct value. But `CarbonTableInputFormat.getTableInfo(...)` had the incorrect value. The setter is used to serialized TableInfo, while the getter is used to deserialized TableInfo ———— That means there are something wrong in serialization-deserialization. > Keep diving into the code, I found that serialization and deserialization in `TableSchema`, a member of `TableInfo`, ignores the `tableProperties` member, thus causing this value empty after deserialization. Since this value has not been initialized in construtor, so the value remains `NULL` and cause the NPE problem. > # RESOLVE > 1. Initialize `tableProperties` in `TableSchema` > 2. Include `tableProperties` in serialization-deserialization of `TableSchema` > # Notes > Although the bug has been fix, I still can't understand why the problem can be triggered in above way. > Tests need the sparksession to be restarted, which is impossible currently, so no tests will be added. -- This message was sent by Atlassian JIRA (v6.4.14#64029) |
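> To make the resolution concrete, here is a minimal, self-contained sketch of the fix shape. This is illustrative Scala, not the actual `TableSchema` code (which lives in CarbonData's Java core); the class name, property key, and method names below are stand-ins chosen for the example.
> ```
> // Sketch (not the real TableSchema): a hand-written write/read round trip
> // that omits a field leaves that field null/empty after deserialization.
> import java.io._
>
> class SchemaSketch {
>   var tableName: String = _
>   // Fix part 1: initialize the map so it is never null after construction.
>   var tableProperties: java.util.Map[String, String] =
>     new java.util.HashMap[String, String]()
>
>   def write(out: DataOutput): Unit = {
>     out.writeUTF(tableName)
>     // Fix part 2: include the properties in serialization. Before the fix,
>     // the equivalent of this block was missing, so the receiving side
>     // never got the entries back.
>     out.writeInt(tableProperties.size())
>     val it = tableProperties.entrySet().iterator()
>     while (it.hasNext) {
>       val e = it.next()
>       out.writeUTF(e.getKey)
>       out.writeUTF(e.getValue)
>     }
>   }
>
>   def readFields(in: DataInput): Unit = {
>     tableName = in.readUTF()
>     tableProperties = new java.util.HashMap[String, String]()
>     var remaining = in.readInt()
>     while (remaining > 0) {
>       tableProperties.put(in.readUTF(), in.readUTF())
>       remaining -= 1
>     }
>   }
> }
>
> object RoundTrip {
>   def main(args: Array[String]): Unit = {
>     val original = new SchemaSketch
>     original.tableName = "t1"
>     original.tableProperties.put("sort_columns", "c1")
>
>     val buffer = new ByteArrayOutputStream()
>     original.write(new DataOutputStream(buffer))
>
>     val restored = new SchemaSketch
>     restored.readFields(new DataInputStream(
>       new ByteArrayInputStream(buffer.toByteArray)))
>     // With both parts of the fix, the properties survive the round trip
>     // instead of coming back null.
>     println(restored.tableProperties) // prints {sort_columns=c1}
>   }
> }
> ```
> With only part 1 (initialization) the NPE disappears but the properties are silently lost; with only part 2 a deserialized object is fine but a freshly constructed one can still carry a null map, which is why the fix needs both pieces.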