
Re: data lost when loading data from csv file to carbon table

Posted by 李寅威 on Feb 16, 2017; 3:53am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/data-lost-when-loading-data-from-csv-file-to-carbon-table-tp7554p7638.html

Hi Ravindra,


I loaded the TPC-DS benchmark data (25 tables in total) in two ways:


First way (using the new data loading solution):


val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")

carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|')")





Second way (using the kettle solution):


scala> import org.apache.carbondata.core.util.CarbonProperties
scala> CarbonProperties.getInstance().addProperty("carbon.badRecords.location","hdfs://master:9000/data/carbondata/badrecords/")
scala> CarbonProperties.getInstance().addProperty("carbon.kettle.home","/opt/spark-2.1.0/carbonlib/carbonplugins")
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
scala> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='true')")
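
Since the load command is assembled by hand each time, a small helper can keep the OPTIONS clause consistent between runs. The following is only a sketch in plain Scala: `LoadStatement`, `build`, and the path `/data/web_sales.csv` are hypothetical names used for illustration, not part of CarbonData. It only builds the SQL string in the same shape as the commands above.

```scala
object LoadStatement {
  // Builds a LOAD DATA statement in the same shape as the commands above.
  // Option keys are trimmed so a stray space or line break cannot end up inside a key;
  // values are left untouched because a delimiter may legitimately be whitespace.
  def build(csvPath: String, table: String, options: Seq[(String, String)]): String = {
    val opts = options.map { case (key, value) => s"'${key.trim}'='$value'" }.mkString(",")
    s"load data inpath '$csvPath' into table $table OPTIONS($opts)"
  }
}

// Hypothetical usage; the path is a placeholder.
val stmt = LoadStatement.build(
  "/data/web_sales.csv",
  "_1g.web_sales",
  Seq("DELIMITER" -> "|", "bad_records_logger_enable" -> "true", "use_kettle\n" -> "true"))
// carbon.sql(stmt)  // assumes the `carbon` session created above
```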



Unfortunately, only 23 of the tables load correctly; the two exceptions are store_returns and web_sales.
For these two tables, the kettle solution produces the correct result, while the new solution in 1.0.0 seems to lose data. I suspect there is a bug.
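
A per-line check of the source file is stricter than comparing the total number of '|' separators with lines * columns (the check quoted further down in this thread), because an extra separator in one row can cancel out a missing one in another. The sketch below is plain Scala with no Spark or CarbonData dependency. `DelimiterCheck`, `suspectLines`, and the sample rows are hypothetical, and it assumes the delimiter never appears inside a quoted field.

```scala
object DelimiterCheck {
  /** Returns (1-based line number, delimiter count) for every line whose
    * delimiter count differs from `expected`. */
  def suspectLines(lines: Iterator[String], expected: Int, delimiter: Char = '|'): List[(Int, Int)] =
    lines.zipWithIndex
      .map { case (line, idx) => (idx + 1, line.count(_ == delimiter)) }
      .filter { case (_, n) => n != expected }
      .toList
}

// Made-up sample: 4 lines with 12 separators in total (4 * 3),
// yet lines 2 and 3 are malformed.
val sample = Iterator("1|a|x|", "2|b|", "3|c|y|z|", "4|d|w|")
val suspects = DelimiterCheck.suspectLines(sample, expected = 3)
// suspects == List((2, 2), (3, 4))
```

For a real file, the iterator could come from `scala.io.Source.fromFile(path).getLines()` for a local copy of the csv. Any reported lines would be the first candidates to compare against the bad record output.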






------------------ Original Message ------------------
From: "ﻬ.贝壳里的海";<[hidden email]>;
Sent: Thursday, February 16, 2017, 11:14 AM
To: "dev"<[hidden email]>;

Subject: Re: data lost when loading data from csv file to carbon table



Thanks, Ravindra.


I ran the script as follows:


scala> import org.apache.carbondata.core.util.CarbonProperties
scala> CarbonProperties.getInstance().addProperty("carbon.badRecords.location","hdfs://master:9000/data/carbondata/badrecords/")
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
scala> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='true')")



but it threw an exception: java.lang.RuntimeException: carbon.kettle.home is not set


The setting in my carbon.properties is carbon.kettle.home=/opt/spark-2.1.0/carbonlib/carbonplugins, but it does not seem to take effect.


How can I solve this problem?


------


Hi Liang Chen,


    Could you add more detailed documentation on bad records that shows us how to use the feature? Thanks.










------------------ Original Message ------------------
From: "Ravindra Pesala";<[hidden email]>;
Sent: Wednesday, February 15, 2017, 11:36 AM
To: "dev"<[hidden email]>;

Subject: Re: data lost when loading data from csv file to carbon table



Hi,

I guess you are using spark-shell, so it is better to set the bad record location on the
CarbonProperties class before creating the carbon session, as below.

CarbonProperties.getInstance().addProperty("carbon.badRecords.location","<bad record location>")


1. While loading data, you need to enable bad record logging as below.

carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true', 'use_kettle'='true')")

Please check the bad records which are added to that bad record location.


2. Alternatively, you can verify by ignoring the bad records with the
following command:
carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true', 'bad_records_action'='ignore')")

Regards,
Ravindra.

On 15 February 2017 at 07:37, Yinwei Li <[hidden email]> wrote:

> Hi,
>
>
>     I've set the properties as:
>
>
>     carbon.badRecords.location=hdfs://localhost:9000/data/carbondata/badrecords
>
>
>     and added 'bad_records_action'='force' when loading data as:
>
>
>     carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_action'='force')")
>
>
>     but the configuration does not seem to work, as no path or file is
> created under the path hdfs://localhost:9000/data/carbondata/badrecords.
>
>
>     Here is how I created the carbon session:
>
>
>     import org.apache.spark.sql.SparkSession
>     import org.apache.spark.sql.CarbonSession._
>     import org.apache.spark.sql.catalyst.util._
>     val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
>
>
>
>
>     and the following are bad record logs:
>
>
>     INFO  15-02 09:43:24,393 - [Executor task launch worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be] Total copy time (ms) to copy file /tmp/1039730591739247/0/_1g/web_sales/Fact/Part0/Segment_0/0/0-0-1487122995007.carbonindex is 65
>     ERROR 15-02 09:43:24,393 - [Executor task launch worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be] Data Load is partially success for table web_sales
>     INFO  15-02 09:43:24,393 - Bad Record Found
>
>
>
>
> ------------------ Original Message ------------------
> From: "Ravindra Pesala";<[hidden email]>;
> Sent: Tuesday, February 14, 2017, 10:41 PM
> To: "dev"<[hidden email]>;
>
> Subject: Re: data lost when loading data from csv file to carbon table
>
>
>
> Hi,
>
> Please set carbon.badRecords.location in carbon.properties and check whether any
> bad records are added to that location.
>
>
> Regards,
> Ravindra.
>
> On 14 February 2017 at 15:24, Yinwei Li <[hidden email]> wrote:
>
> > Hi all,
> >
> >
> >   I ran into a data loss problem when loading data from a csv file into a carbon
> > table. Here are some details:
> >
> >
> >   Env: Spark 2.1.0 + Hadoop 2.7.2 + CarbonData 1.0.0
> >   Total Records:719,384
> >   Loaded Records:606,305 (SQL: select count(1) from table)
> >
> >
> >   My Attempts:
> >
> >
> >     Attempt 1: Add option bad_records_action='force' when loading data. It
> > doesn't help either; the count is still 606,305;
> >     Attempt 2: Cut lines 1 to 300,000 into a csv file and load; the result
> > is right, it equals 300,000;
> >     Attempt 3: Cut lines 1 to 350,000 into a csv file and load; the result
> > is wrong, it equals 305,631;
> >     Attempt 4: Cut lines 300,000 to 350,000 into a csv file and load; the
> > result is right, it equals 50,000;
> >     Attempt 5: Count the separator '|' in my csv file; it equals lines *
> > columns, so the source data may be in the correct format;
> >
> >
> >     In the Spark log, each attempt logs "Bad Record Found".
> >
> >
> >     Does anyone have any ideas?
>
>
>
>
> --
> Thanks & Regards,
> Ravi
>



--
Thanks & Regards,
Ravi