data lost when loading data from csv file to carbon table

李寅威
Hi all,


  I hit a data loss problem when loading data from a csv file into a carbon table; here are some details:


  Env: Spark 2.1.0 + Hadoop 2.7.2 + CarbonData 1.0.0
  Total Records: 719,384
  Loaded Records: 606,305 (SQL: select count(1) from table)


  My Attempts:


    Attempt 1: Add the option bad_records_action='force' when loading data. It doesn't help; the count still equals 606,305;
    Attempt 2: Cut lines 1 to 300,000 into a csv file and load; the result is right, it equals 300,000;
    Attempt 3: Cut lines 1 to 350,000 into a csv file and load; the result is wrong, it equals 305,631;
    Attempt 4: Cut lines 300,000 to 350,000 into a csv file and load; the result is right, it equals 50,000;
    Attempt 5: Count the '|' separators in my csv file; the total equals lines * columns, so the source data should be in the correct format (see the sketch after this list).
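
    A sketch of that separator check from spark-shell (the file path is illustrative, `spark` is the session the shell provides, and it assumes the data contains no embedded or escaped '|' characters):

    // count all '|' separators in the file and compare against lines * columns
    val lines = spark.sparkContext.textFile("hdfs://master:9000/data/tpc-ds/data/web_sales.csv")
    val separators = lines.map(_.count(_ == '|')).sum.toLong
    val rows = lines.count()
    println(s"rows=$rows, separators=$separators, separators per row=${separators / rows}")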


    In the spark log, each attempt prints: "Bad Record Found".


    Does anyone have any ideas?
Re: data lost when loading data from csv file to carbon table

ravipesala
Hi,

Please set carbon.badRecords.location in carbon.properties and check whether
any bad records are written to that location.
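
For example, a minimal carbon.properties entry (the path here is only an illustration; point it at any writable HDFS or local directory):

carbon.badRecords.location=hdfs://<namenode>:9000/carbondata/badrecords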


Regards,
Ravindra.

Re: data lost when loading data from csv file to carbon table

李寅威
Hi,


    I've set the properties as:


    carbon.badRecords.location=hdfs://localhost:9000/data/carbondata/badrecords


    and added 'bad_records_action'='force' when loading data:


    carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_action'='force')")


    but the configuration doesn't seem to take effect, as no path or file is created under hdfs://localhost:9000/data/carbondata/badrecords.


    Here is how I created the carbon session:


    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.CarbonSession._
    import org.apache.spark.sql.catalyst.util._
    val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")




    and the following are the bad record logs:


    INFO  15-02 09:43:24,393 - [Executor task launch worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be] Total copy time (ms) to copy file /tmp/1039730591739247/0/_1g/web_sales/Fact/Part0/Segment_0/0/0-0-1487122995007.carbonindex is 65
    ERROR 15-02 09:43:24,393 - [Executor task launch worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be] Data Load is partially success for table web_sales
    INFO  15-02 09:43:24,393 - Bad Record Found




Re: data lost when loading data from csv file to carbon table

ravipesala
Hi,

I guess you are using spark-shell, so it is better to set the bad record
location on the CarbonProperties class before creating the carbon session,
like below:

CarbonProperties.getInstance().addProperty("carbon.badRecords.location", "<bad record location>")


1. While loading data, you need to enable bad record logging as below:

carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='true')")

Please check the bad records which are added to that bad record location.


2. Alternatively, you can verify by ignoring the bad records using the
following command:

carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','bad_records_action'='ignore')")

Regards,
Ravindra.

Re: data lost when loading data from csv file to carbon table

李寅威
Thanks, Ravindra.


I've run the script as:


scala> import org.apache.carbondata.core.util.CarbonProperties
scala> CarbonProperties.getInstance().addProperty("carbon.badRecords.location","hdfs://master:9000/data/carbondata/badrecords/")
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
scala> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='true')")



but it raised an exception: java.lang.RuntimeException: carbon.kettle.home is not set


The configuration in my carbon.properties is carbon.kettle.home=/opt/spark-2.1.0/carbonlib/carbonplugins, but it doesn't seem to take effect.


How can I solve this problem?


------


Hi Liang Chen,


    Could you add more detailed documentation about bad records, showing us how to use them? Thanks.

Re: data lost when loading data from csv file to carbon table

ravipesala
Please set 'use_kettle'='false' and try again.
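
For example (a sketch that just reuses the table and path from your earlier command):

carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='false')")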

Regards,
Ravindra

Re: data lost when loading data from csv file to carbon table

李寅威
In reply to this post by 李寅威
Hi Ravindra,


I ran the TPC-DS benchmark data load in two ways; there are 25 tables in total.


First way (using the new data loading solution):


val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")

carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|')")





Second way (using the kettle solution):


scala> import org.apache.carbondata.core.util.CarbonProperties
scala> CarbonProperties.getInstance().addProperty("carbon.badRecords.location","hdfs://master:9000/data/carbondata/badrecords/")
scala> CarbonProperties.getInstance().addProperty("carbon.kettle.home","/opt/spark-2.1.0/carbonlib/carbonplugins")
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
scala> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='true')")



Unfortunately, only 23 of the tables give a correct result; the two exceptions are store_returns and web_sales.
After loading the data of those two tables, the kettle solution gives a correct result while the new solution in 1.0.0 seems to lose data. I suspect there is a bug.
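
For reference, the count check I ran on each table after loading (a sketch; _1g is the database used throughout this thread):

carbon.sql("select count(1) from _1g.web_sales").show()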






Re: data lost when loading data from csv file to carbon table

David CaiQiang
Maybe you can check PR594; it fixes a bug that impacts loading results.
Best Regards
David Cai
Re: data lost when loading data from csv file to carbon table

ravipesala
Hi Yinwei,

Thank you for pointing out the issue. I will check with TPC-DS data and
verify the data load with the new flow.

Regards,
Ravindra.

Re: data lost when loading data from csv file to carbon table

ravipesala
Hi Yinwei,

Can you provide the create table scripts for both tables, store_returns and
web_sales?

Regards,
Ravindra.

Re: data lost when loading data from csv file to carbon table

李寅威
Hi Ravindra:


I added DICTIONARY_INCLUDE for each of them:


carbon.sql("create table if not exists _1g.store_returns(sr_returned_date_sk integer, sr_return_time_sk integer, sr_item_sk integer, sr_customer_sk integer, sr_cdemo_sk integer, sr_hdemo_sk integer, sr_addr_sk integer, sr_store_sk integer, sr_reason_sk integer, sr_ticket_number integer, sr_return_quantity integer, sr_return_amt decimal(7,2), sr_return_tax decimal(7,2), sr_return_amt_inc_tax decimal(7,2), sr_fee decimal(7,2), sr_return_ship_cost decimal(7,2), sr_refunded_cash decimal(7,2), sr_reversed_charge decimal(7,2), sr_store_credit decimal(7,2), sr_net_loss decimal(7,2)) STORED BY 'carbondata' TBLPROPERTIES ('DICTIONARY_INCLUDE'='sr_returned_date_sk, sr_return_time_sk, sr_item_sk, sr_customer_sk, sr_cdemo_sk, sr_hdemo_sk, sr_addr_sk, sr_store_sk, sr_reason_sk, sr_ticket_number')");




carbon.sql("create table if not exists _1g.web_sales(ws_sold_date_sk integer, ws_sold_time_sk integer, ws_ship_date_sk integer, ws_item_sk integer, ws_bill_customer_sk integer, ws_bill_cdemo_sk integer, ws_bill_hdemo_sk integer, ws_bill_addr_sk integer, ws_ship_customer_sk integer, ws_ship_cdemo_sk integer, ws_ship_hdemo_sk integer, ws_ship_addr_sk integer, ws_web_page_sk integer, ws_web_site_sk integer, ws_ship_mode_sk integer, ws_warehouse_sk integer, ws_promo_sk integer, ws_order_number integer, ws_quantity integer, ws_wholesale_cost decimal(7,2), ws_list_price decimal(7,2), ws_sales_price decimal(7,2), ws_ext_discount_amt decimal(7,2), ws_ext_sales_price decimal(7,2), ws_ext_wholesale_cost decimal(7,2), ws_ext_list_price decimal(7,2), ws_ext_tax decimal(7,2), ws_coupon_amt decimal(7,2), ws_ext_ship_cost decimal(7,2), ws_net_paid decimal(7,2), ws_net_paid_inc_tax decimal(7,2), ws_net_paid_inc_ship decimal(7,2), ws_net_paid_inc_ship_tax decimal(7,2), ws_net_profit decimal(7,2)) STORED BY 'carbondata' TBLPROPERTIES ('DICTIONARY_INCLUDE'='ws_sold_date_sk, ws_sold_time_sk, ws_ship_date_sk, ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk, ws_bill_hdemo_sk, ws_bill_addr_sk, ws_ship_customer_sk, ws_ship_cdemo_sk, ws_ship_hdemo_sk, ws_ship_addr_sk, ws_web_page_sk, ws_web_site_sk, ws_ship_mode_sk, ws_warehouse_sk, ws_promo_sk, ws_order_number')");



and here is the script I used to generate the TPC-DS data:
[hadoop@master tools]$ ./dsdgen -scale 1 -suffix '.csv' -dir /data/tpc-ds/data/








Re: data lost when loading data from csv file to carbon table

ravipesala
Hi,

I have generated TPC-DS data using https://github.com/brownsys/tpcds and
loaded it with both the kettle flow and the new flow; both give the same
number of rows from a select count(*) query on the table. I even counted
the rows in the Excel file, and it matches the count query.

Number of rows loaded
store_returns : 288279
web_sales : 718931

Scripts:

spark.sql("""
       CREATE TABLE IF NOT EXISTS STORE_RETURNS
  (sr_returned_date_sk  bigint,
   sr_return_time_sk    bigint,
   sr_item_sk           bigint,
   sr_customer_sk       bigint,
   sr_cdemo_sk          bigint,
   sr_hdemo_sk          bigint,
   sr_addr_sk           bigint,
   sr_store_sk          bigint,
   sr_reason_sk         bigint,
   sr_ticket_number     bigint,
   sr_return_quantity   bigint,
   sr_return_amt        decimal(7,2),
   sr_return_tax        decimal(7,2),
   sr_return_amt_inc_tax decimal(7,2),
   sr_fee               decimal(7,2),
   sr_return_ship_cost  decimal(7,2),
   sr_refunded_cash     decimal(7,2),
   sr_reversed_charge   decimal(7,2),
   sr_store_credit      decimal(7,2),
   sr_net_loss          decimal(7,2))
   STORED BY 'carbondata'
   TBLPROPERTIES('dictionary_include'='sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk,sr_cdemo_sk,sr_hdemo_sk,sr_addr_sk,sr_store_sk,sr_reason_sk,sr_ticket_number')
       """)

spark.sql("create table if not exists web_sales(ws_sold_date_sk
integer, ws_sold_time_sk integer, ws_ship_date_sk integer, ws_item_sk
integer, ws_bill_customer_sk integer, ws_bill_cdemo_sk integer,
ws_bill_hdemo_sk integer, ws_bill_addr_sk integer, ws_ship_customer_sk
integer, ws_ship_cdemo_sk integer, ws_ship_hdemo_sk integer,
ws_ship_addr_sk integer, ws_web_page_sk integer, ws_web_site_sk
integer, ws_ship_mode_sk integer, ws_warehouse_sk integer, ws_promo_sk
integer, ws_order_number integer, ws_quantity integer,
ws_wholesale_cost decimal(7,2), ws_list_price decimal(7,2),
ws_sales_price decimal(7,2), ws_ext_discount_amt decimal(7,2),
ws_ext_sales_price decimal(7,2), ws_ext_wholesale_cost decimal(7,2),
ws_ext_list_price decimal(7,2), ws_ext_tax decimal(7,2), ws_coupon_amt
decimal(7,2), ws_ext_ship_cost decimal(7,2), ws_net_paid decimal(7,2),
ws_net_paid_inc_tax decimal(7,2), ws_net_paid_inc_ship decimal(7,2),
ws_net_paid_inc_ship_tax decimal(7,2), ws_net_profit decimal(7,2))
STORED BY 'carbondata' TBLPROPERTIES
('DICTIONARY_INCLUDE'='ws_sold_date_sk, ws_sold_time_sk,
ws_ship_date_sk, ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk,
ws_bill_hdemo_sk, ws_bill_addr_sk, ws_ship_customer_sk,
ws_ship_cdemo_sk, ws_ship_hdemo_sk, ws_ship_addr_sk, ws_web_page_sk,
ws_web_site_sk, ws_ship_mode_sk, ws_warehouse_sk, ws_promo_sk,
ws_order_number')")

spark.sql(s"""
       LOAD DATA LOCAL INPATH
'/home/root1/Downloads/store_returns.csv' into table STORE_RETURNS
options('DELIMITER'='|',
'FILEHEADER'='sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk,sr_cdemo_sk,sr_hdemo_sk,sr_addr_sk,sr_store_sk,sr_reason_sk,sr_ticket_number,sr_return_quantity,sr_return_amt,sr_return_tax,sr_return_amt_inc_tax,sr_fee,sr_return_ship_cost,sr_refunded_cash,sr_reversed_charge,sr_store_credit,sr_net_loss',
'use_kettle'='false')
       """)

spark.sql(s"""
       LOAD DATA LOCAL INPATH
'hdfs://localhost:9000/tpcds/web_sales/part-r-00000-dca70590-4d9d-4cc9-aff4-e20b85970d2b'
into table web_sales options('DELIMITER'='|',
'FILEHEADER'='ws_sold_date_sk, ws_sold_time_sk, ws_ship_date_sk,
ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk, ws_bill_hdemo_sk,
ws_bill_addr_sk, ws_ship_customer_sk, ws_ship_cdemo_sk,
ws_ship_hdemo_sk, ws_ship_addr_sk, ws_web_page_sk, ws_web_site_sk,
ws_ship_mode_sk, ws_warehouse_sk, ws_promo_sk, ws_order_number,
ws_quantity, ws_wholesale_cost, ws_list_price, ws_sales_price,
ws_ext_discount_amt, ws_ext_sales_price, ws_ext_wholesale_cost,
ws_ext_list_price, ws_ext_tax, ws_coupon_amt, ws_ext_ship_cost,
ws_net_paid, ws_net_paid_inc_tax, ws_net_paid_inc_ship,
ws_net_paid_inc_ship_tax, ws_net_profit', 'use_kettle'='false')
       """)


Did I miss something here?


Regards,
Ravindra.

Re: data lost when loading data from csv file to carbon table

ravipesala
In reply to this post by David CaiQiang
Hi QiangCai,

The PR594 fix does not solve the data loss issue; it fixes a data mismatch in
some cases.

Regards,
Ravindra.

Re: data lost when loading data from csv file to carbon table

Liang Chen
Administrator
In reply to this post by 李寅威
Hi

I have already raised one JIRA issue: how to handle the bad records.
https://issues.apache.org/jira/browse/CARBONDATA-714

Regards
Liang