[jira] [Resolved] (CARBONDATA-1700) Failed to load data to existed table after spark session restarted


Akash R Nilugal (Jira)

     [ https://issues.apache.org/jira/browse/CARBONDATA-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravindra Pesala resolved CARBONDATA-1700.
-----------------------------------------
    Resolution: Fixed

> Failed to load data to existed table after spark session restarted
> ------------------------------------------------------------------
>
>                 Key: CARBONDATA-1700
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1700
>             Project: CarbonData
>          Issue Type: Bug
>          Components: data-load
>    Affects Versions: 1.3.0
>            Reporter: xuchuanyin
>            Assignee: xuchuanyin
>             Fix For: 1.3.0
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> # scenario
> After restarting the Spark session and then querying an existing CarbonData table, loading data into that table fails. I hit this failure in Spark local mode (found during a local test) and have not tested other scenarios.
> The problem can be reproduced by the following steps:
> 0. START: start a session;
> 1. CREATE: create table `t1`;
> 2. LOAD: create a dataframe and write it to `t1` in append mode;
> 3. STOP: stop current session;
> 4. START: start a session;
> 5. QUERY: query table `t1`;  ----  This step is essential to reproduce the problem.
> 6. LOAD: create a dataframe and write it to `t1` in append mode;  --- This step will fail.
> The error is thrown in Step 6. The error message in the console looks like:
> ```
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
> at org.apache.spark.sql.execution.command.management.LoadTableCommand.processData(LoadTableCommand.scala:92)
> at org.apache.spark.sql.execution.command.management.LoadTableCommand.run(LoadTableCommand.scala:60)
> at org.apache.spark.sql.CarbonDataFrameWriter.loadDataFrame(CarbonDataFrameWriter.scala:141)
> at org.apache.spark.sql.CarbonDataFrameWriter.writeToCarbonFile(CarbonDataFrameWriter.scala:50)
> at org.apache.spark.sql.CarbonDataFrameWriter.appendToCarbonFile(CarbonDataFrameWriter.scala:42)
> at org.apache.spark.sql.CarbonSource.createRelation(CarbonSource.scala:110)
> at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
> ```
> The following code can be pasted into `TestLoadDataFrame.scala` to reproduce the problem, but keep
> in mind that you must run the first test and then the second test in separate runs (to make sure that the SparkSession is restarted in between).
> ```
>   test("prepare") {
>     sql("drop table if exists carbon_stand_alone")
>     sql( "create table if not exists carbon_stand_alone (c1 string, c2 string, c3 int)" +
>     " stored by 'carbondata'").collect()
>     sql("select * from carbon_stand_alone").show()
>     df.write
>       .format("carbondata")
>       .option("tableName", "carbon_stand_alone")
>       .option("tempCSV", "false")
>       .mode(SaveMode.Append)
>       .save()
>   }
>   test("test load dataframe after query") {
>     sql("select * from carbon_stand_alone").show()
>     // the following line will cause failure
>     df.write
>       .format("carbondata")
>       .option("tableName", "carbon_stand_alone")
>       .option("tempCSV", "false")
>       .mode(SaveMode.Append)
>       .save()
>     // if it works fine, this check should pass
>     checkAnswer(
>       sql("select count(*) from carbon_stand_alone where c3 > 500"), Row(31500 * 2)
>     )
>   }
> ```
> # ANALYSE
> I went through the code and found that the problem is caused by `tableMeta.carbonTable.getTableInfo.getFactTable.getTableProperties` (we will call it `propertyInTableInfo` for short) being NULL at line 89 of `LoadTableCommand.scala`.
> After debugging, I found that the `propertyInTableInfo` set in `CarbonTableInputFormat.setTableInfo(...)` had the correct value, but the one returned by `CarbonTableInputFormat.getTableInfo(...)` did not. The setter serializes the TableInfo, while the getter deserializes it, which means something goes wrong in the serialization-deserialization.
> Diving further into the code, I found that the serialization and deserialization of `TableSchema`, a member of `TableInfo`, ignore the `tableProperties` member, so the value is never restored after deserialization. Since this field is also not initialized in the constructor, it remains `NULL` and causes the NPE.
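> To make the failure mode concrete, here is a minimal, self-contained sketch (the class and field names below are hypothetical stand-ins, not CarbonData's actual code): a write/readFields pair that skips a map field hands back `null` after a round trip, and the first access to the map throws exactly this kind of NPE.
> ```
> import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
>
> // Hypothetical stand-in for TableSchema whose custom serialization
> // forgets the tableProperties member.
> class BuggySchema {
>   var tableName: String = _
>   var tableProperties: java.util.Map[String, String] = _ // not initialized in constructor
>
>   def write(out: DataOutputStream): Unit = out.writeUTF(tableName)     // map never written
>   def readFields(in: DataInputStream): Unit = tableName = in.readUTF() // map never read
> }
>
> object NpeRepro extends App {
>   val original = new BuggySchema
>   original.tableName = "t1"
>   original.tableProperties = new java.util.HashMap[String, String]()
>
>   val buf = new ByteArrayOutputStream()
>   original.write(new DataOutputStream(buf))
>
>   val restored = new BuggySchema
>   restored.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray)))
>
>   // tableProperties was dropped in the round trip, so it is null here:
>   restored.tableProperties.get("sort_columns") // throws NullPointerException
> }
> ```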
> # RESOLVE
> 1. Initialize `tableProperties` in `TableSchema`
> 2. Include `tableProperties` in the serialization and deserialization of `TableSchema` (see the sketch below)
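> A hedged sketch of both fixes, using the same hypothetical names as above rather than CarbonData's actual code: initialize the field at declaration so it can never be `null`, and write and read the map explicitly during serialization.
> ```
> import java.io.{DataInput, DataOutput}
> import java.util
>
> // Same hypothetical stand-in as above, with both fixes applied.
> class FixedSchema {
>   var tableName: String = _
>   // Fix 1: initialize at declaration, so the field is non-null even
>   // if an old serialized form omitted it.
>   var tableProperties: util.Map[String, String] = new util.HashMap[String, String]()
>
>   def write(out: DataOutput): Unit = {
>     out.writeUTF(tableName)
>     // Fix 2: include the map in serialization...
>     out.writeInt(tableProperties.size())
>     val it = tableProperties.entrySet().iterator()
>     while (it.hasNext) {
>       val e = it.next()
>       out.writeUTF(e.getKey)
>       out.writeUTF(e.getValue)
>     }
>   }
>
>   def readFields(in: DataInput): Unit = {
>     tableName = in.readUTF()
>     // ...and restore it during deserialization.
>     tableProperties = new util.HashMap[String, String]()
>     var n = in.readInt()
>     while (n > 0) {
>       tableProperties.put(in.readUTF(), in.readUTF())
>       n -= 1
>     }
>   }
> }
> ```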
> # Notes
> Although the bug has been fixed, I still cannot understand why the problem is triggered in exactly the way described above.
> A test would need the SparkSession to be restarted, which is currently not possible in the test framework, so no tests will be added.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)