Apache CarbonData Dev Mailing List archive

Data lost when loading data from a CSV file into a carbon table

Posted by 李寅威 on Feb 14, 2017; 9:54am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/data-lost-when-loading-data-from-csv-file-to-carbon-table-tp7554.html

Hi all,

I ran into a data loss problem when loading data from a CSV file into a carbon table. Here are some details:

Env: Spark 2.1.0 + Hadoop 2.7.2 + CarbonData 1.0.0
Total records: 719,384
Loaded records: 606,305 (SQL: select count(1) from table)

My attempts:

Attempt 1: Added the option bad_records_action='force' when loading the data. This did not help; the count was still 606,305.
Attempt 2: Extracted lines 1 to 300,000 into a separate CSV file and loaded it. The result was correct: 300,000 records.
Attempt 3: Extracted lines 1 to 350,000 into a separate CSV file and loaded it. The result was wrong: 305,631 records.
Attempt 4: Extracted lines 300,000 to 350,000 into a separate CSV file and loaded it. The result was correct: 50,000 records.
Attempt 5: Counted the separator '|' in my CSV file. The total equals lines * columns, so the source data appears to be in the correct format.
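
A per-line check would be stricter than the total count in Attempt 5, because a total can match lines * columns even when some lines have too many separators and others too few. The following is only a minimal sketch in Python, not part of CarbonData: the function names and the sample rows are hypothetical, and flagging the quote character is based on the assumption that an unbalanced '"' may cause a CSV parser to merge several lines into one record.

```python
from collections import Counter


def separator_histogram(lines, sep="|"):
    """Map 'separators per line' to the number of lines with that count."""
    return Counter(line.count(sep) for line in lines)


def suspicious_lines(lines, expected_seps, sep="|", quote='"'):
    """Return 1-based numbers of lines whose separator count differs from
    expected_seps, or that contain the quote character (an assumption about
    what might confuse the loader, not a confirmed cause)."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        if line.count(sep) != expected_seps or quote in line:
            flagged.append(number)
    return flagged


# Hypothetical sample: 4 lines, 8 separators in total (4 * 2),
# yet only the first line is clean.
sample = ["a|b|c", "d|e", "f|g|h|i", 'j|"k|l']

print(separator_histogram(sample))
print(suspicious_lines(sample, expected_seps=2))
```

For a real file, the same functions can be fed the lines of the CSV with the trailing newline stripped, with expected_seps set to whatever per-line separator count the table schema implies.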

In the Spark log, each attempt logs "Bad Record Found".

Does anyone have any ideas?