Posted by 李寅威 on Jan 09, 2017; 9:08am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/load-data-error-from-csv-file-at-hdfs-error-in-standalone-spark-cluster-tp5783p5791.html
Hi all:
When I load data from HDFS into a table:
cc.sql(s"load data inpath 'hdfs://master:9000/home/hadoop/sample.csv' into table test_table")
two errors occurred. On slave1:
INFO 09-01 16:17:58,611 - test_table: Graph - CSV Input *****************Started all csv reading***********
INFO 09-01 16:17:58,611 - [pool-20-thread-1][partitionID:PROCESS_BLOCKS;queryID:pool-20-thread-1] *****************started csv reading by thread***********
INFO 09-01 16:17:58,635 - [pool-20-thread-1][partitionID:PROCESS_BLOCKS;queryID:pool-20-thread-1] Total Number of records processed by this thread is: 3
INFO 09-01 16:17:58,635 - [pool-20-thread-1][partitionID:PROCESS_BLOCKS;queryID:pool-20-thread-1] Time taken to processed 3 Number of records: 24
INFO 09-01 16:17:58,636 - [pool-20-thread-1][partitionID:PROCESS_BLOCKS;queryID:pool-20-thread-1] *****************Completed csv reading by thread***********
INFO 09-01 16:17:58,636 - test_table: Graph - CSV Input *****************Completed all csv reading***********
INFO 09-01 16:17:58,642 - [test_table: Graph - Carbon Surrogate Key Generator][partitionID:0] Column cache size not configured. Therefore default behavior will be considered and no LRU based eviction of columns will be done
ERROR 09-01 16:17:58,645 - [test_table: Graph - Carbon Surrogate Key Generator][partitionID:0] org.apache.carbondata.core.util.CarbonUtilException: Either dictionary or its metadata does not exist for column identifier :: ColumnIdentifier [columnId=c70480f9-4336-4186-8bd0-a3bebb50ea6a]
ERROR 09-01 16:17:58,646 - [test_table: Graph - Carbon Surrogate Key Generator][partitionID:0] org.pentaho.di.core.exception.KettleException: org.apache.carbondata.core.util.CarbonUtilException: Either dictionary or its metadata does not exist for column identifier :: ColumnIdentifier [columnId=c70480f9-4336-4186-8bd0-a3bebb50ea6a]
    at org.apache.carbondata.processing.surrogatekeysgenerator.csvbased.FileStoreSurrogateKeyGenForCSV.initDictionaryCacheInfo(FileStoreSurrogateKeyGenForCSV.java:297)
    at org.apache.carbondata.processing.surrogatekeysgenerator.csvbased.FileStoreSurrogateKeyGenForCSV.populateCache(FileStoreSurrogateKeyGenForCSV.java:270)
    at org.apache.carbondata.processing.surrogatekeysgenerator.csvbased.FileStoreSurrogateKeyGenForCSV.<init>(FileStoreSurrogateKeyGenForCSV.java:144)
    at org.apache.carbondata.processing.surrogatekeysgenerator.csvbased.CarbonCSVBasedSeqGenStep.processRow(CarbonCSVBasedSeqGenStep.java:385)
    at org.pentaho.di.trans.step.RunThread.run(RunThread.java:50)
    at java.lang.Thread.run(Thread.java:745)
INFO 09-01 16:17:58,647 - [test_table: Graph - Carbon Slice Mergertest_table][partitionID:table] Record Procerssed For table: test_table
INFO 09-01 16:17:58,647 - [test_table: Graph - Carbon Slice Mergertest_table][partitionID:table] Summary: Carbon Slice Merger Step: Read: 0: Write: 0
INFO 09-01 16:17:58,647 - [test_table: Graph - Sort Key: Sort keystest_table][partitionID:0] Record Processed For table: test_table
INFO 09-01 16:17:58,647 - [test_table: Graph - Sort Key: Sort keystest_table][partitionID:0] Number of Records was Zero
INFO 09-01 16:17:58,647 - [test_table: Graph - Sort Key: Sort keystest_table][partitionID:0] Summary: Carbon Sort Key Step: Read: 0: Write: 0
INFO 09-01 16:17:58,747 - [Executor task launch worker-0][partitionID:default_test_table_632e80a6-77ef-44b2-aed7-2e5bbf56610e] Graph execution is finished.
ERROR 09-01 16:17:58,748 - [Executor task launch worker-0][partitionID:default_test_table_632e80a6-77ef-44b2-aed7-2e5bbf56610e] Graph Execution had errors
INFO 09-01 16:17:58,749 - [Executor task launch worker-0][partitionID:default_test_table_632e80a6-77ef-44b2-aed7-2e5bbf56610e] Deleted the local store location/tmp/259202084415620/0
INFO 09-01 16:17:58,749 - DataLoad complete
INFO 09-01 16:17:58,749 - Data Loaded successfully with LoadCount:0
INFO 09-01 16:17:58,749 - DataLoad failure
ERROR 09-01 16:17:58,749 - [Executor task launch worker-0][partitionID:default_test_table_632e80a6-77ef-44b2-aed7-2e5bbf56610e] org.apache.carbondata.processing.etl.DataLoadingException: Due to internal errors, please check logs for more details.
    at org.apache.carbondata.processing.csvload.DataGraphExecuter.execute(DataGraphExecuter.java:212)
    at org.apache.carbondata.processing.csvload.DataGraphExecuter.executeGraph(DataGraphExecuter.java:144)
    at org.apache.carbondata.spark.load.CarbonLoaderUtil.executeGraph(CarbonLoaderUtil.java:212)
    at org.apache.carbondata.spark.rdd.SparkPartitionLoader.run(CarbonDataLoadRDD.scala:125)
    at org.apache.carbondata.spark.rdd.DataFileLoaderRDD$$anon$1.<init>(CarbonDataLoadRDD.scala:255)
    at org.apache.carbondata.spark.rdd.DataFileLoaderRDD.compute(CarbonDataLoadRDD.scala:232)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
ERROR 09-01 16:17:58,752 - Exception in task 0.3 in stage 3.0 (TID 8)
org.apache.carbondata.processing.etl.DataLoadingException: Due to internal errors, please check logs for more details.
    at org.apache.carbondata.processing.csvload.DataGraphExecuter.execute(DataGraphExecuter.java:212)
    at org.apache.carbondata.processing.csvload.DataGraphExecuter.executeGraph(DataGraphExecuter.java:144)
    at org.apache.carbondata.spark.load.CarbonLoaderUtil.executeGraph(CarbonLoaderUtil.java:212)
    at org.apache.carbondata.spark.rdd.SparkPartitionLoader.run(CarbonDataLoadRDD.scala:125)
    at org.apache.carbondata.spark.rdd.DataFileLoaderRDD$$anon$1.<init>(CarbonDataLoadRDD.scala:255)
    at org.apache.carbondata.spark.rdd.DataFileLoaderRDD.compute(CarbonDataLoadRDD.scala:232)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
On slave2:
INFO 09-01 16:17:55,182 - [test_table: Graph - MDKeyGentest_table][partitionID:0] Copying /tmp/259188927254235/0/default/test_table/Fact/Part0/Segment_0/0/part-0-0-1483949874000.carbondata --> /home/hadoop/carbondata/bin/carbonshellstore/default/test_table/Fact/Part0/Segment_0
INFO 09-01 16:17:55,182 - [test_table: Graph - MDKeyGentest_table][partitionID:0] The configured block size is 1024 MB, the actual carbon file size is 921 Byte, choose the max value 1024 MB as the block size on HDFS
ERROR 09-01 16:17:55,183 - [test_table: Graph - MDKeyGentest_table][partitionID:0] Problem while copying file from local store to carbon store
org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException: Problem while copying file from local store to carbon store
    at org.apache.carbondata.processing.store.writer.AbstractFactDataWriter.copyCarbonDataFileToCarbonStorePath(AbstractFactDataWriter.java:604)
    at org.apache.carbondata.processing.store.writer.AbstractFactDataWriter.closeWriter(AbstractFactDataWriter.java:510)
    at org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar.closeHandler(CarbonFactDataHandlerColumnar.java:879)
    at org.apache.carbondata.processing.mdkeygen.MDKeyGenStep.processingComplete(MDKeyGenStep.java:245)
    at org.apache.carbondata.processing.mdkeygen.MDKeyGenStep.processRow(MDKeyGenStep.java:234)
    at org.pentaho.di.trans.step.RunThread.run(RunThread.java:50)
    at java.lang.Thread.run(Thread.java:745)
INFO 09-01 16:17:55,184 - [test_table: Graph - Carbon Slice Mergertest_table][partitionID:table] Record Procerssed For table: test_table
INFO 09-01 16:17:55,184 - [test_table: Graph - Carbon Slice Mergertest_table][partitionID:table] Summary: Carbon Slice Merger Step: Read: 1: Write: 0
INFO 09-01 16:17:55,284 - [Executor task launch worker-0][partitionID:default_test_table_c3017cd2-8920-488d-a715-c0d02250148e] Graph execution is finished.
ERROR 09-01 16:17:55,284 - [Executor task launch worker-0][partitionID:default_test_table_c3017cd2-8920-488d-a715-c0d02250148e] Graph Execution had errors
INFO 09-01 16:17:55,285 - [Executor task launch worker-0][partitionID:default_test_table_c3017cd2-8920-488d-a715-c0d02250148e] Deleted the local store location/tmp/259188927254235/0
INFO 09-01 16:17:55,285 - DataLoad complete
INFO 09-01 16:17:55,286 - Data Loaded successfully with LoadCount:0
INFO 09-01 16:17:55,286 - DataLoad failure
ERROR 09-01 16:17:55,286 - [Executor task launch worker-0][partitionID:default_test_table_c3017cd2-8920-488d-a715-c0d02250148e] org.apache.carbondata.processing.etl.DataLoadingException: Due to internal errors, please check logs for more details.
    at org.apache.carbondata.processing.csvload.DataGraphExecuter.execute(DataGraphExecuter.java:212)
    at org.apache.carbondata.processing.csvload.DataGraphExecuter.executeGraph(DataGraphExecuter.java:144)
    at org.apache.carbondata.spark.load.CarbonLoaderUtil.executeGraph(CarbonLoaderUtil.java:212)
    at org.apache.carbondata.spark.rdd.SparkPartitionLoader.run(CarbonDataLoadRDD.scala:125)
    at org.apache.carbondata.spark.rdd.DataFileLoaderRDD$$anon$1.<init>(CarbonDataLoadRDD.scala:255)
    at org.apache.carbondata.spark.rdd.DataFileLoaderRDD.compute(CarbonDataLoadRDD.scala:232)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
ERROR 09-01 16:17:55,288 - Exception in task 0.0 in stage 3.0 (TID 5)
org.apache.carbondata.processing.etl.DataLoadingException: Due to internal errors, please check logs for more details.
    at org.apache.carbondata.processing.csvload.DataGraphExecuter.execute(DataGraphExecuter.java:212)
    at org.apache.carbondata.processing.csvload.DataGraphExecuter.executeGraph(DataGraphExecuter.java:144)
    at org.apache.carbondata.spark.load.CarbonLoaderUtil.executeGraph(CarbonLoaderUtil.java:212)
    at org.apache.carbondata.spark.rdd.SparkPartitionLoader.run(CarbonDataLoadRDD.scala:125)
    at org.apache.carbondata.spark.rdd.DataFileLoaderRDD$$anon$1.<init>(CarbonDataLoadRDD.scala:255)
    at org.apache.carbondata.spark.rdd.DataFileLoaderRDD.compute(CarbonDataLoadRDD.scala:232)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
INFO 09-01 16:17:55,926 - Got assigned task 7
INFO 09-01 16:17:55,926 - Running task 0.2 in stage 3.0 (TID 7)
INFO 09-01 16:17:55,930 - Input split: slave2
INFO 09-01 16:17:55,930 - The Block Count in this node :1
INFO 09-01 16:17:55,931 - [Executor task launch worker-0][partitionID:default_test_table_fa5212b0-3e3c-43e1-ae5e-27396dce020c] ************* Is Columnar Storagetrue
INFO 09-01 16:17:56,011 - [Executor task launch worker-0][partitionID:default_test_table_fa5212b0-3e3c-43e1-ae5e-27396dce020c] Kettle environment initialized
INFO 09-01 16:17:56,027 - [Executor task launch worker-0][partitionID:default_test_table_fa5212b0-3e3c-43e1-ae5e-27396dce020c] ** Using csv file **
INFO 09-01 16:17:56,035 - [Executor task launch worker-0][partitionID:default_test_table_fa5212b0-3e3c-43e1-ae5e-27396dce020c] Graph execution is started /tmp/259190107897964/0/etl/default/test_table/0/0/test_table.ktr
INFO 09-01 16:17:56,035 - test_table: Graph - CSV Input *****************Started all csv reading***********
INFO 09-01 16:17:56,035 - [pool-31-thread-1][partitionID:PROCESS_BLOCKS;queryID:pool-31-thread-1] *****************started csv reading by thread***********
INFO 09-01 16:17:56,040 - [pool-31-thread-1][partitionID:PROCESS_BLOCKS;queryID:pool-31-thread-1] Total Number of records processed by this thread is: 3
INFO 09-01 16:17:56,041 - [pool-31-thread-1][partitionID:PROCESS_BLOCKS;queryID:pool-31-thread-1] Time taken to processed 3 Number of records: 6
INFO 09-01 16:17:56,041 - [pool-31-thread-1][partitionID:PROCESS_BLOCKS;queryID:pool-31-thread-1] *****************Completed csv reading by thread***********
INFO 09-01 16:17:56,041 - test_table: Graph - CSV Input *****************Completed all csv reading***********
INFO 09-01 16:17:56,043 - [test_table: Graph - Sort Key: Sort keystest_table][partitionID:0] Sort size for table: 500000
INFO 09-01 16:17:56,043 - [test_table: Graph - Sort Key: Sort keystest_table][partitionID:0] Number of intermediate file to be merged: 20
INFO 09-01 16:17:56,043 - [test_table: Graph - Sort Key: Sort keystest_table][partitionID:0] File Buffer Size: 1048576
INFO 09-01 16:17:56,043 - [test_table: Graph - Sort Key: Sort keystest_table][partitionID:0] temp file location/tmp/259190107897964/0/default/test_table/Fact/Part0/Segment_0/0/sortrowtmp
INFO 09-01 16:17:56,046 - [test_table: Graph - Carbon Surrogate Key Generator][partitionID:0] Level cardinality file written to : /tmp/259190107897964/0/default/test_table/Fact/Part0/Segment_0/0/levelmetadata_test_table.metadata
INFO 09-01 16:17:56,046 - [test_table: Graph - Carbon Surrogate Key Generator][partitionID:0] Record Procerssed For table: test_table
INFO 09-01 16:17:56,047 - [test_table: Graph - Carbon Surrogate Key Generator][partitionID:0] Summary: Carbon CSV Based Seq Gen Step : 3: Write: 3
INFO 09-01 16:17:56,049 - [test_table: Graph - Sort Key: Sort keystest_table][partitionID:0] File based sorting will be used
INFO 09-01 16:17:56,049 - [test_table: Graph - Sort Key: Sort keystest_table][partitionID:0] Record Processed For table: test_table
It seems to be an IOException. The source code is as follows:
/**
 * This method will copy the given file to carbon store location
 *
 * @param localFileName local file name with full path
 * @throws CarbonDataWriterException
 */
private void copyCarbonDataFileToCarbonStorePath(String localFileName)
    throws CarbonDataWriterException {
  long copyStartTime = System.currentTimeMillis();
  LOGGER.info("Copying " + localFileName + " --> " + dataWriterVo.getCarbonDataDirectoryPath());
  try {
    CarbonFile localCarbonFile =
        FileFactory.getCarbonFile(localFileName, FileFactory.getFileType(localFileName));
    String carbonFilePath = dataWriterVo.getCarbonDataDirectoryPath() + localFileName
        .substring(localFileName.lastIndexOf(File.separator));
    copyLocalFileToCarbonStore(carbonFilePath, localFileName,
        CarbonCommonConstants.BYTEBUFFER_SIZE,
        getMaxOfBlockAndFileSize(fileSizeInBytes, localCarbonFile.getSize()));
  } catch (IOException e) {
    throw new CarbonDataWriterException(
        "Problem while copying file from local store to carbon store");
  }
  LOGGER.info(
      "Total copy time (ms) to copy file " + localFileName + " is " + (System.currentTimeMillis()
          - copyStartTime));
}
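As far as I can tell, that catch block drops the original IOException, so whatever HDFS actually complains about (permissions, a missing directory, a wrong store path) never reaches the executor log; only the generic wrapper message does. For debugging I am thinking of patching the catch block locally, roughly like this (just a sketch against my own checkout, not the project's code; I am assuming LOGGER.error(String) is available here the same way LOGGER.info is):

  } catch (IOException e) {
    // Local debugging change (not the official code): log the underlying
    // IOException so the real cause (permission denied, missing directory,
    // wrong filesystem/scheme) shows up in the executor log instead of only
    // the generic wrapper message.
    LOGGER.error("Copy to carbon store failed for " + localFileName
        + " : " + e.getMessage());
    throw new CarbonDataWriterException(
        "Problem while copying file from local store to carbon store: " + e.getMessage());
  }

With that in place I would expect the slave2 executor log to show the real reason the copy to the carbon store path fails.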
Environment:
Spark 1.6.2 standalone cluster + Carbondata 0.2.0 + Hadoop 2.7.2
Could any of you help me? Thanks!
------------------ Original ------------------
From: <[hidden email]>
Date: Mon, Jan 9, 2017 03:56 PM
To: "dev" <[hidden email]>
Subject: load data error from csv file at hdfs error in standalone spark cluster
Hi all,
When I load data from a CSV file on HDFS, a stage of the Spark job fails with the following error. Where can I find a more detailed error message that would help me find the solution? Or does anyone know why this happens and how to solve it?
command:
cc.sql(s"load data inpath 'hdfs://master:9000/opt/sample.csv' into table test_table")
error log:
Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 17, slave2): org.apache.carbondata.processing.etl.DataLoadingException: Due to internal errors, please check logs for more details.
    at org.apache.carbondata.processing.csvload.DataGraphExecuter.execute(DataGraphExecuter.java:212)
    at org.apache.carbondata.processing.csvload.DataGraphExecuter.executeGraph(DataGraphExecuter.java:144)
    at org.apache.carbondata.spark.load.CarbonLoaderUtil.executeGraph(CarbonLoaderUtil.java:212)
    at org.apache.carbondata.spark.rdd.SparkPartitionLoader.run(CarbonDataLoadRDD.scala:125)
    at org.apache.carbondata.spark.rdd.DataFileLoaderRDD$$anon$1.<init>(CarbonDataLoadRDD.scala:255)
    at org.apache.carbondata.spark.rdd.DataFileLoaderRDD.compute(CarbonDataLoadRDD.scala:232)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace: