Posted by
ravipesala on
Feb 16, 2017; 7:52am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/data-lost-when-loading-data-from-csv-file-to-carbon-table-tp7554p7655.html
Hi,
I have generated tpcds data using
https://github.com/brownsys/tpcds .
And I have loaded the data using with kettle flow and with new flow, both
gives same number of rows after using select count(*) query on the table.
Even I have counted rows in excel file , it is matching with count query
Number of rows loaded
store_returns : 288279
web_sales : 718931
Scripts :
spark.sql("""
CREATE TABLE IF NOT EXISTS STORE_RETURNS
(sr_returned_date_sk bigint,
sr_return_time_sk bigint,
sr_item_sk bigint,
sr_customer_sk bigint,
sr_cdemo_sk bigint,
sr_hdemo_sk bigint,
sr_addr_sk bigint,
sr_store_sk bigint,
sr_reason_sk bigint,
sr_ticket_number bigint,
sr_return_quantity bigint,
sr_return_amt decimal(7,2),
sr_return_tax decimal(7,2),
sr_return_amt_inc_tax decimal(7,2),
sr_fee decimal(7,2),
sr_return_ship_cost decimal(7,2),
sr_refunded_cash decimal(7,2),
sr_reversed_charge decimal(7,2),
sr_store_credit decimal(7,2),
sr_net_loss decimal(7,2))
STORED BY 'carbondata'
TBLPROPERTIES('dictionary_include'='sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk,sr_cdemo_sk,sr_hdemo_sk,sr_addr_sk,sr_store_sk,sr_reason_sk,sr_ticket_number')
""")
spark.sql("create table if not exists web_sales(ws_sold_date_sk
integer, ws_sold_time_sk integer, ws_ship_date_sk integer, ws_item_sk
integer, ws_bill_customer_sk integer, ws_bill_cdemo_sk integer,
ws_bill_hdemo_sk integer, ws_bill_addr_sk integer, ws_ship_customer_sk
integer, ws_ship_cdemo_sk integer, ws_ship_hdemo_sk integer,
ws_ship_addr_sk integer, ws_web_page_sk integer, ws_web_site_sk
integer, ws_ship_mode_sk integer, ws_warehouse_sk integer, ws_promo_sk
integer, ws_order_number integer, ws_quantity integer,
ws_wholesale_cost decimal(7,2), ws_list_price decimal(7,2),
ws_sales_price decimal(7,2), ws_ext_discount_amt decimal(7,2),
ws_ext_sales_price decimal(7,2), ws_ext_wholesale_cost decimal(7,2),
ws_ext_list_price decimal(7,2), ws_ext_tax decimal(7,2), ws_coupon_amt
decimal(7,2), ws_ext_ship_cost decimal(7,2), ws_net_paid decimal(7,2),
ws_net_paid_inc_tax decimal(7,2), ws_net_paid_inc_ship decimal(7,2),
ws_net_paid_inc_ship_tax decimal(7,2), ws_net_profit decimal(7,2))
STORED BY 'carbondata' TBLPROPERTIES
('DICTIONARY_INCLUDE'='ws_sold_date_sk, ws_sold_time_sk,
ws_ship_date_sk, ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk,
ws_bill_hdemo_sk, ws_bill_addr_sk, ws_ship_customer_sk,
ws_ship_cdemo_sk, ws_ship_hdemo_sk, ws_ship_addr_sk, ws_web_page_sk,
ws_web_site_sk, ws_ship_mode_sk, ws_warehouse_sk, ws_promo_sk,
ws_order_number')")
spark.sql(s"""
LOAD DATA LOCAL INPATH
'/home/root1/Downloads/store_returns.csv' into table STORE_RETURNS
options('DELIMITER'='|',
'FILEHEADER'='sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk,sr_cdemo_sk,sr_hdemo_sk,sr_addr_sk,sr_store_sk,sr_reason_sk,sr_ticket_number,sr_return_quantity,sr_return_amt,sr_return_tax,sr_return_amt_inc_tax,sr_fee,sr_return_ship_cost,sr_refunded_cash,sr_reversed_charge,sr_store_credit,sr_net_loss',
'use_kettle'='false')
""")
spark.sql(s"""
LOAD DATA LOCAL INPATH
'hdfs://localhost:9000/tpcds/web_sales/part-r-00000-dca70590-4d9d-4cc9-aff4-e20b85970d2b'
into table web_sales options('DELIMITER'='|',
'FILEHEADER'='ws_sold_date_sk, ws_sold_time_sk, ws_ship_date_sk,
ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk, ws_bill_hdemo_sk,
ws_bill_addr_sk, ws_ship_customer_sk, ws_ship_cdemo_sk,
ws_ship_hdemo_sk, ws_ship_addr_sk, ws_web_page_sk, ws_web_site_sk,
ws_ship_mode_sk, ws_warehouse_sk, ws_promo_sk, ws_order_number,
ws_quantity, ws_wholesale_cost, ws_list_price, ws_sales_price,
ws_ext_discount_amt, ws_ext_sales_price, ws_ext_wholesale_cost,
ws_ext_list_price, ws_ext_tax, ws_coupon_amt, ws_ext_ship_cost,
ws_net_paid, ws_net_paid_inc_tax, ws_net_paid_inc_ship,
ws_net_paid_inc_ship_tax, ws_net_profit', 'use_kettle'='false')
""")
Did I miss something here?
Regards,
Ravindra.
On 16 February 2017 at 12:24, Yinwei Li <
[hidden email]> wrote:
> Hi Ravindra:
>
>
> I add DICTIONARY_INCLUDE for each of them:
>
>
> carbon.sql("create table if not exists _1g.store_returns(sr_returned_date_sk
> integer, sr_return_time_sk integer, sr_item_sk integer, sr_customer_sk
> integer, sr_cdemo_sk integer, sr_hdemo_sk integer, sr_addr_sk integer,
> sr_store_sk integer, sr_reason_sk integer, sr_ticket_number integer,
> sr_return_quantity integer, sr_return_amt decimal(7,2), sr_return_tax
> decimal(7,2), sr_return_amt_inc_tax decimal(7,2), sr_fee decimal(7,2),
> sr_return_ship_cost decimal(7,2), sr_refunded_cash decimal(7,2),
> sr_reversed_charge decimal(7,2), sr_store_credit decimal(7,2), sr_net_loss
> decimal(7,2)) STORED BY 'carbondata' TBLPROPERTIES
> ('DICTIONARY_INCLUDE'='sr_returned_date_sk, sr_return_time_sk,
> sr_item_sk, sr_customer_sk, sr_cdemo_sk, sr_hdemo_sk, sr_addr_sk,
> sr_store_sk, sr_reason_sk, sr_ticket_number')");
>
>
>
>
> carbon.sql("create table if not exists _1g.web_sales(ws_sold_date_sk
> integer, ws_sold_time_sk integer, ws_ship_date_sk integer, ws_item_sk
> integer, ws_bill_customer_sk integer, ws_bill_cdemo_sk integer,
> ws_bill_hdemo_sk integer, ws_bill_addr_sk integer, ws_ship_customer_sk
> integer, ws_ship_cdemo_sk integer, ws_ship_hdemo_sk integer,
> ws_ship_addr_sk integer, ws_web_page_sk integer, ws_web_site_sk integer,
> ws_ship_mode_sk integer, ws_warehouse_sk integer, ws_promo_sk integer,
> ws_order_number integer, ws_quantity integer, ws_wholesale_cost
> decimal(7,2), ws_list_price decimal(7,2), ws_sales_price decimal(7,2),
> ws_ext_discount_amt decimal(7,2), ws_ext_sales_price decimal(7,2),
> ws_ext_wholesale_cost decimal(7,2), ws_ext_list_price decimal(7,2),
> ws_ext_tax decimal(7,2), ws_coupon_amt decimal(7,2), ws_ext_ship_cost
> decimal(7,2), ws_net_paid decimal(7,2), ws_net_paid_inc_tax decimal(7,2),
> ws_net_paid_inc_ship decimal(7,2), ws_net_paid_inc_ship_tax decimal(7,2),
> ws_net_profit decimal(7,2)) STORED BY 'carbondata' TBLPROPERTIES
> ('DICTIONARY_INCLUDE'='ws_sold_date_sk, ws_sold_time_sk, ws_ship_date_sk,
> ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk, ws_bill_hdemo_sk,
> ws_bill_addr_sk, ws_ship_customer_sk, ws_ship_cdemo_sk, ws_ship_hdemo_sk,
> ws_ship_addr_sk, ws_web_page_sk, ws_web_site_sk, ws_ship_mode_sk,
> ws_warehouse_sk, ws_promo_sk, ws_order_number')");
>
>
>
> and here is my script for generate tpc-ds data:
> [hadoop@master tools]$ ./dsdgen -scale 1 -suffix '.csv' -dir
> /data/tpc-ds/data/
>
>
>
>
>
>
>
>
> ------------------ 原始邮件 ------------------
> 发件人: "Ravindra Pesala";<
[hidden email]>;
> 发送时间: 2017年2月16日(星期四) 下午3:15
> 收件人: "dev"<
[hidden email]>;
>
> 主题: Re: 回复: data lost when loading data from csv file to carbon table
>
>
>
> Hi Yinwei,
>
> Can you provide create table scripts for both the tables store_returns and
> web_sales.
>
> Regards,
> Ravindra.
>
> On 16 February 2017 at 10:07, Ravindra Pesala <
[hidden email]>
> wrote:
>
> > Hi Yinwei,
> >
> > Thank you for pointing out the issue, I will check with TPC-DS data and
> > verify the data load with new flow.
> >
> > Regards,
> > Ravindra.
> >
> > On 16 February 2017 at 09:35, QiangCai <
[hidden email]> wrote:
> >
> >> Maybe you can check PR594, it will fix a bug which will impact the
> result
> >> of
> >> loading.
> >>
> >>
> >>
> >> --
> >> View this message in context:
http://apache-carbondata-maili> >> ng-list-archive.1130556.n5.nabble.com/data-lost-when-load
> >> ing-data-from-csv-file-to-carbon-table-tp7554p7639.html
> >> Sent from the Apache CarbonData Mailing List archive mailing list
> archive
> >> at Nabble.com.
> >>
> >
> >
> >
> > --
> > Thanks & Regards,
> > Ravi
> >
>
>
>
> --
> Thanks & Regards,
> Ravi
>
--
Thanks & Regards,
Ravi