Posted by
李寅威 on
Feb 16, 2017; 3:53am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/data-lost-when-loading-data-from-csv-file-to-carbon-table-tp7554p7638.html
Hi Ravindra,
I ran the TPC-DS benchmark data load in two ways; there are 25 tables in total:
First way (using the new data loading solution):
val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|')")
Second way (using the kettle solution):
scala> import org.apache.carbondata.core.util.CarbonProperties
scala> CarbonProperties.getInstance().addProperty("carbon.badRecords.location","hdfs://master:9000/data/carbondata/badrecords/")
scala> CarbonProperties.getInstance().addProperty("carbon.kettle.home","/opt/spark-2.1.0/carbonlib/carbonplugins")
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
scala> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='true')")
Unfortunately, while 23 of the tables load with correct results, two tables, store_returns and web_sales, do not. After loading the data of these two tables, the kettle solution produces the correct result while the new solution in 1.0.0 seems to lose data. I suspect there is a bug; the sketch below shows how I compared the counts.
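Here is a minimal sketch of the check I used to compare the results (it assumes the same session, $src path, and table as in the commands above):

// count raw lines in the source csv file
val csvCount = carbon.sparkContext.textFile(s"$src/web_sales.csv").count()
// count the rows actually loaded into the carbon table
val tableCount = carbon.sql("select count(1) from _1g.web_sales").collect()(0).getLong(0)
println(s"csv rows: $csvCount, table rows: $tableCount, missing: ${csvCount - tableCount}")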
------------------ Original Message ------------------
From: "ﻬ.贝壳里的海" <[hidden email]>
Sent: Thursday, February 16, 2017, 11:14 AM
To: "dev" <[hidden email]>
Subject: Re: data lost when loading data from csv file to carbon table
Thanks, Ravindra.
I've run the script as:
scala> import org.apache.carbondata.core.util.CarbonProperties
scala> CarbonProperties.getInstance().addProperty("carbon.badRecords.location","hdfs://master:9000/data/carbondata/badrecords/")
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
scala> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='true')")
but it raised an exception: java.lang.RuntimeException: carbon.kettle.home is not set
The configuration in my carbon.properties is carbon.kettle.home=/opt/spark-2.1.0/carbonlib/carbonplugins, but it does not seem to take effect.
How can I solve this problem?
------
Hi Liang Chen,
could you add a more detailed document about the bad records feature, showing us how to use it? Thanks~
------------------ Original Message ------------------
From: "Ravindra Pesala" <[hidden email]>
Sent: Wednesday, February 15, 2017, 11:36 AM
To: "dev" <[hidden email]>
Subject: Re: data lost when loading data from csv file to carbon table
Hi,
I guess you are using spark-shell, so it is better to set the bad record location through the CarbonProperties class before creating the carbon session, like below.
CarbonProperties.getInstance().addProperty("carbon.badRecords.location", "<bad record location>")
1. And while loading data you need to enable bad record logging as below.
carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='true')")
Please check the bad records which are added to that bad record location.
2. Alternatively, you can verify by ignoring the bad records using the following command:
carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','bad_records_action'='ignore')")
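To inspect what was actually written to that location, here is a rough sketch using the Hadoop FileSystem API (the directory is just an example matching your earlier configuration):

import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

// list the files the load wrote under the bad record location
val badDir = "hdfs://master:9000/data/carbondata/badrecords/"
val fs = FileSystem.get(new URI(badDir), carbon.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(badDir)).foreach(status => println(status.getPath))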
Regards,
Ravindra.
On 15 February 2017 at 07:37, Yinwei Li <[hidden email]> wrote:
> Hi,
>
>
> I've set the properties as:
>
>
> carbon.badRecords.location=hdfs://localhost:9000/data/carbondata/badrecords
>
>
> and add 'bad_records_action'='force' when loading data as:
>
>
> carbon.sql(s"load data inpath '$src/web_sales.csv' into table
> _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_action'='force')")
>
>
> but the configuration does not seem to work, as no path or file is created under hdfs://localhost:9000/data/carbondata/badrecords.
>
>
> here is how I created the carbon session:
>
>
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.CarbonSession._
> import org.apache.spark.sql.catalyst.util._
> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
>
>
>
>
> and the following are bad record logs:
>
>
> INFO 15-02 09:43:24,393 - [Executor task launch worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be] Total copy time (ms) to copy file /tmp/1039730591739247/0/_1g/web_sales/Fact/Part0/Segment_0/0/0-0-1487122995007.carbonindex is 65
> ERROR 15-02 09:43:24,393 - [Executor task launch worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be] Data Load is partially success for table web_sales
> INFO 15-02 09:43:24,393 - Bad Record Found
>
>
>
>
> ------------------ Original Message ------------------
> From: "Ravindra Pesala" <[hidden email]>
> Sent: Tuesday, February 14, 2017, 10:41 PM
> To: "dev" <[hidden email]>
> Subject: Re: data lost when loading data from csv file to carbon table
>
>
>
> Hi,
>
> Please set carbon.badRecords.location in carbon.properties and check whether any bad records are added to that location.
>
>
> Regards,
> Ravindra.
>
> On 14 February 2017 at 15:24, Yinwei Li <[hidden email]> wrote:
>
> > Hi all,
> >
> >
> > I met a data loss problem when loading data from a csv file to a carbon table; here are some details:
> >
> >
> > Env: Spark 2.1.0 + Hadoop 2.7.2 + CarbonData 1.0.0
> > Total Records: 719,384
> > Loaded Records: 606,305 (SQL: select count(1) from table)
> >
> >
> > My Attempts:
> >
> >
> > Attempt 1: Add the option bad_records_action='force' when loading data. It doesn't help; the count still equals 606,305;
> > Attempt 2: Cut lines 1 to 300,000 into a csv file and load; the result is right, equal to 300,000;
> > Attempt 3: Cut lines 1 to 350,000 into a csv file and load; the result is wrong, equal to 305,631;
> > Attempt 4: Cut lines 300,000 to 350,000 into a csv file and load; the result is right, equal to 50,000;
> > Attempt 5: Count the '|' separators in my csv file; the total equals lines * columns, so the source data should be in the correct format (see the sketch after this list);
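> > Here is a minimal sketch of the check in Attempt 5 (assuming web_sales has 34 columns per the TPC-DS schema, and that dsdgen writes one trailing '|' per line, so each line should contain exactly 34 separators):
> >
> > val expectedSeps = 34  // assumed column count of web_sales
> > val lines = carbon.sparkContext.textFile(s"$src/web_sales.csv")
> > // flag lines whose separator count differs from the expected column count
> > val badLines = lines.filter(line => line.count(_ == '|') != expectedSeps)
> > println(s"lines with unexpected separator count: ${badLines.count()}")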
> >
> >
> > In the spark log, each attempt logs: "Bad Record Found".
> >
> >
> > Anyone have any ideas?
>
>
>
>
> --
> Thanks & Regards,
> Ravi
>
--
Thanks & Regards,
Ravi