Posted by
李寅威 on
Feb 15, 2017; 2:07am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/data-lost-when-loading-data-from-csv-file-to-carbon-table-tp7554p7574.html
Hi,
I've set the properties as:
carbon.badRecords.location=hdfs://localhost:9000/data/carbondata/badrecords
and added 'bad_records_action'='force' to the load command:
carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_action'='force')")
but the configuration seems not to work, as no path or file is created under hdfs://localhost:9000/data/carbondata/badrecords.
Here is how I created the CarbonSession:
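For reference, the same property can also be set programmatically on the driver before loading (a sketch, assuming the CarbonProperties API from carbondata-core):

import org.apache.carbondata.core.util.CarbonProperties

// set the bad records location in-process, in case the external
// carbon.properties file is not being picked up
CarbonProperties.getInstance()
  .addProperty("carbon.badRecords.location",
    "hdfs://localhost:9000/data/carbondata/badrecords")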
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._
import org.apache.spark.sql.catalyst.util._
val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
and the following are the bad record log lines:
INFO 15-02 09:43:24,393 - [Executor task launch worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be] Total copy time (ms) to copy file /tmp/1039730591739247/0/_1g/web_sales/Fact/Part0/Segment_0/0/0-0-1487122995007.carbonindex is 65
ERROR 15-02 09:43:24,393 - [Executor task launch worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be] Data Load is partially success for table web_sales
INFO 15-02 09:43:24,393 - Bad Record Found
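For reference, the status of the load can also be checked per segment (a sketch; SHOW SEGMENTS is per the CarbonData DDL docs):

// lists each load segment with its status (Success / Partial Success / ...)
carbon.sql("SHOW SEGMENTS FOR TABLE _1g.web_sales").show(false)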
------------------ Original Message ------------------
From: "Ravindra Pesala" <[hidden email]>
Sent: Tuesday, February 14, 2017, 10:41 PM
To: "dev" <[hidden email]>
Subject: Re: data lost when loading data from csv file to carbon table
Hi,
Please set carbon.badRecords.location in carbon.properties and check
whether any bad records are added to that location.
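For example (a sketch; per the 1.0.x data loading docs, REDIRECT writes the raw bad rows to the bad records location, while FORCE converts them to NULL and loads them anyway):

carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales " +
  "OPTIONS('DELIMITER'='|'," +
  "'BAD_RECORDS_LOGGER_ENABLE'='true'," +  // log each bad record
  "'BAD_RECORDS_ACTION'='REDIRECT')")      // write bad rows to carbon.badRecords.location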
Regards,
Ravindra.
On 14 February 2017 at 15:24, Yinwei Li <[hidden email]> wrote:
> Hi all,
>
>
> I met a data loss problem when loading data from a csv file to a carbon
> table; here are some details:
>
>
> Env: Spark 2.1.0 + Hadoop 2.7.2 + CarbonData 1.0.0
> Total Records: 719,384
> Loaded Records: 606,305 (SQL: select count(1) from table)
>
>
> My Attempts:
>
>
> Attempt 1: Added the option bad_records_action='force' when loading data;
> it also doesn't work, and the count still equals 606,305;
> Attempt 2: Cut lines 1 to 300,000 into a csv file and loaded it; the
> result is right, equal to 300,000;
> Attempt 3: Cut lines 1 to 350,000 into a csv file and loaded it; the
> result is wrong, equal to 305,631;
> Attempt 4: Cut lines 300,000 to 350,000 into a csv file and loaded it;
> the result is right, equal to 50,000;
> Attempt 5: Counted the separator '|' in my csv file; the total equals
> lines * columns, so the source data should be in the correct format
> (see the sketch after this list);
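>
> A quick way to verify per-line field counts (a sketch; the split regex
> escapes the pipe, and -1 keeps trailing empty fields):
>
> // every line should yield the same field count; any outlier is a malformed row
> val fieldCounts = sc.textFile(s"$src/web_sales.csv")
>   .map(_.split("\\|", -1).length)
>   .countByValue()
> println(fieldCounts)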
>
>
> In the spark log, each attempt prints: "Bad Record Found".
>
>
> Anyone have any ideas?
--
Thanks & Regards,
Ravi