2 problems with loading data into a carbon table
Posted by Marek Wiewiorka on Aug 28, 2017; 9:06am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/2-problems-with-loading-data-into-a-carbon-table-tp20819.html
Hi All - I came across two weird issues while trying to import data into a
carbon table from another table stored in Parquet format.
I've got two tables with struct columns:
CREATE TABLE `vds_carbon` (
  `variant` STRUCT<`contig`: STRING, `start`: INT, `ref`: STRING,
                   `altAlleles`: ARRAY<STRUCT<`ref`: STRING, `alt`: STRING>>>)
USING org.apache.spark.sql.CarbonSource
OPTIONS (
  `dbName` 'default',
  `carbonSchemaPartsNo` '1',
  `serialization.format` '1',
  `tableName` 'vds_carbon',
  `tablePath` '/default/vds_carbon'
)
and the other one with the variant column flattened:
CREATE TABLE `vds_carbon_flat` (
  `contig` STRING,
  `start` INT,
  `ref` STRING,
  `altAlleles` ARRAY<STRUCT<`ref`: STRING, `alt`: STRING>>)
USING org.apache.spark.sql.CarbonSource
OPTIONS (
  `dbName` 'default',
  `carbonSchemaPartsNo` '1',
  `serialization.format` '1',
  `tableName` 'vds_carbon_flat',
  `tablePath` '/default/vds_carbon_flat'
)
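For reference, the Parquet source table (vds_parquet, used in the inserts
below) is not shown above; its variant column must have the same nested type
as in vds_carbon. An approximate sketch of its DDL (the actual table was
created elsewhere, so details may differ):

// Sketch only: the real vds_parquet DDL is assumed to look roughly like this.
carbon.sql("""
  CREATE TABLE `vds_parquet` (
    `variant` STRUCT<`contig`: STRING, `start`: INT, `ref`: STRING,
                     `altAlleles`: ARRAY<STRUCT<`ref`: STRING, `alt`: STRING>>>)
  USING parquet
""")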
And when I try to import data into these two tables I get the following
exceptions:
1) In the first case:
carbon.sql("insert into vds_carbon select variant from vds_parquet limit 10")
17/08/28 10:58:25 WARN CarbonDataProcessorUtil: Executor task launch worker-48 sort scope is set to LOCAL_SORT
17/08/28 10:58:25 INFO AbstractDataLoadProcessorStep: Thread-56 Rows processed in step Data Writer : 0
17/08/28 10:58:25 INFO AbstractDataLoadProcessorStep: Thread-57 Rows processed in step Data Converter : 0
17/08/28 10:58:25 INFO AbstractDataLoadProcessorStep: Thread-58 Rows processed in step Input Processor : 0
17/08/28 10:58:25 ERROR DataLoadExecutor: Executor task launch worker-48 Data Loading failed for table vds_carbon
java.lang.ArrayIndexOutOfBoundsException: 2
    at org.apache.carbondata.processing.newflow.parser.CarbonParserFactory.createParser(CarbonParserFactory.java:69)
    at org.apache.carbondata.processing.newflow.parser.CarbonParserFactory.createParser(CarbonParserFactory.java:62)
    at org.apache.carbondata.processing.newflow.parser.CarbonParserFactory.createParser(CarbonParserFactory.java:71)
    at org.apache.carbondata.processing.newflow.parser.CarbonParserFactory.createParser(CarbonParserFactory.java:38)
    at org.apache.carbondata.processing.newflow.parser.impl.RowParserImpl.<init>(RowParserImpl.java:46)
    at org.apache.carbondata.processing.newflow.steps.InputProcessorStepImpl.initialize(InputProcessorStepImpl.java:66)
    at org.apache.carbondata.processing.newflow.steps.DataConverterProcessorStepImpl.initialize(DataConverterProcessorStepImpl.java:65)
    at org.apache.carbondata.processing.newflow.steps.CarbonRowDataWriterProcessorStepImpl.initialize(CarbonRowDataWriterProcessorStepImpl.java:87)
    at org.apache.carbondata.processing.newflow.DataLoadExecutor.execute(DataLoadExecutor.java:46)
    at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD$$anon$2.<init>(NewCarbonDataLoadRDD.scala:442)
    at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD.internalCompute(NewCarbonDataLoadRDD.scala:405)
    at org.apache.carbondata.spark.rdd.CarbonRDD.compute(CarbonRDD.scala:62)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/08/28 10:58:25 INFO NewDataFrameLoaderRDD: DataLoad failure
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException: Data Loading failed for table vds_carbon
    at org.apache.carbondata.processing.newflow.DataLoadExecutor.execute(DataLoadExecutor.java:62)
    at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD$$anon$2.<init>(NewCarbonDataLoadRDD.scala:442)
    at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD.internalCompute(NewCarbonDataLoadRDD.scala:405)
    at org.apache.carbondata.spark.rdd.CarbonRDD.compute(CarbonRDD.scala:62)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
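For what it's worth, the "ArrayIndexOutOfBoundsException: 2" makes me suspect
that the row parser keeps one delimiter per level of complex-type nesting, and
that variant (struct > array > struct, i.e. three levels) needs one more
delimiter than exists. A toy Scala sketch of that suspicion (my assumption
about the mechanism, not the actual Carbon code):

// Toy model of the suspected failure: two complex-type delimiters but three
// levels of nesting; delimiters(depth) at depth 2 throws
// java.lang.ArrayIndexOutOfBoundsException: 2, matching the trace above.
val delimiters = Array("$", ":") // hypothetical level-1 and level-2 delimiters
def serialize(value: Any, depth: Int): String = value match {
  case items: Seq[_] => items.map(serialize(_, depth + 1)).mkString(delimiters(depth))
  case leaf          => leaf.toString
}
serialize(Seq(Seq(Seq("chr1", 123))), 0) // struct > array > struct analogue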
2) In the case of the other table:
carbon.sql("insert into vds_carbon_flat select variant.* from vds_parquet")
17/08/28 11:01:40 ERROR SortTempFileChunkHolder: pool-514-thread-1
java.lang.NegativeArraySizeException
    at org.apache.carbondata.processing.sortandgroupby.sortdata.SortTempFileChunkHolder.getRowFromStream(SortTempFileChunkHolder.java:324)
    at org.apache.carbondata.processing.sortandgroupby.sortdata.SortTempFileChunkHolder.prefetchRecordsFromFile(SortTempFileChunkHolder.java:518)
    at org.apache.carbondata.processing.sortandgroupby.sortdata.SortTempFileChunkHolder.access$500(SortTempFileChunkHolder.java:42)
    at org.apache.carbondata.processing.sortandgroupby.sortdata.SortTempFileChunkHolder$DataFetcher.call(SortTempFileChunkHolder.java:497)
    at org.apache.carbondata.processing.sortandgroupby.sortdata.SortTempFileChunkHolder.initialise(SortTempFileChunkHolder.java:227)
    at org.apache.carbondata.processing.sortandgroupby.sortdata.SortTempFileChunkHolder.initialize(SortTempFileChunkHolder.java:215)
    at org.apache.carbondata.processing.store.SingleThreadFinalSortFilesMerger$2.run(SingleThreadFinalSortFilesMerger.java:210)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
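If it helps to narrow this down, I can retry the second load with sorting
disabled, since the failure is inside the sort-temp-file reader. A sketch of
what I'd try (assuming the carbon.load.sort.scope property is honored by
insert-into in this version):

// Sketch: switch the load from the default LOCAL_SORT (see the WARN line in
// case 1) to NO_SORT, to check whether the NegativeArraySizeException is
// specific to sort-temp-file handling. The property name is my assumption.
import org.apache.carbondata.core.util.CarbonProperties
CarbonProperties.getInstance().addProperty("carbon.load.sort.scope", "NO_SORT")
carbon.sql("insert into vds_carbon_flat select variant.* from vds_parquet")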
I think there might be two issues here: one potentially with the structure
that stores delimiters for nested types, and the other related to sorting.
Could you please take a look and help me resolve them?
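In the meantime I'll try to work around the nesting by exploding altAlleles
into primitive columns before the insert (a sketch; vds_carbon_primitive would
be a new all-primitive table, and note that explode() changes row granularity
to one row per alternate allele):

// Sketch of a fallback load path with no complex types at all.
// vds_carbon_primitive is a hypothetical table with only primitive columns.
import org.apache.spark.sql.functions.{col, explode}
val primitive = carbon.table("vds_parquet")
  .select(col("variant.contig"), col("variant.start"), col("variant.ref"),
          explode(col("variant.altAlleles")).as("allele"))
  .select(col("contig"), col("start"), col("ref"),
          col("allele.ref").as("allele_ref"), col("allele.alt").as("allele_alt"))
primitive.createOrReplaceTempView("vds_primitive")
carbon.sql("insert into vds_carbon_primitive select * from vds_primitive")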
We are working on a PoC for genomics, and we believe CarbonData can be a
perfect fit here.
Thanks in advance.
Marek