Hi All,

I came across two weird issues while trying to import data into a CarbonData
table from another one stored in Parquet format. I have two tables with
struct columns:

CREATE TABLE `vds_carbon` (
  `variant` STRUCT<`contig`: STRING, `start`: INT, `ref`: STRING,
                   `altAlleles`: ARRAY<STRUCT<`ref`: STRING, `alt`: STRING>>>
)
USING org.apache.spark.sql.CarbonSource
OPTIONS (
  `dbName` 'default',
  `carbonSchemaPartsNo` '1',
  `serialization.format` '1',
  `tableName` 'vds_carbon',
  `tablePath` '/default/vds_carbon'
)

and the other one with the variant column flattened:

CREATE TABLE `vds_carbon_flat` (
  `contig` STRING,
  `start` INT,
  `ref` STRING,
  `altAlleles` ARRAY<STRUCT<`ref`: STRING, `alt`: STRING>>
)
USING org.apache.spark.sql.CarbonSource
OPTIONS (
  `dbName` 'default',
  `carbonSchemaPartsNo` '1',
  `serialization.format` '1',
  `tableName` 'vds_carbon_flat',
  `tablePath` '/default/vds_carbon_flat'
)

When I try to import data into these two tables, I get the following
exceptions.

1) In the first case:

carbon.sql("insert into vds_carbon select variant from vds_parquet limit 10")

17/08/28 10:58:25 WARN CarbonDataProcessorUtil: Executor task launch worker-48 sort scope is set to LOCAL_SORT
17/08/28 10:58:25 INFO AbstractDataLoadProcessorStep: Thread-56 Rows processed in step Data Writer : 0
17/08/28 10:58:25 INFO AbstractDataLoadProcessorStep: Thread-57 Rows processed in step Data Converter : 0
17/08/28 10:58:25 INFO AbstractDataLoadProcessorStep: Thread-58 Rows processed in step Input Processor : 0
17/08/28 10:58:25 ERROR DataLoadExecutor: Executor task launch worker-48 Data Loading failed for table vds_carbon
java.lang.ArrayIndexOutOfBoundsException: 2
	at org.apache.carbondata.processing.newflow.parser.CarbonParserFactory.createParser(CarbonParserFactory.java:69)
	at org.apache.carbondata.processing.newflow.parser.CarbonParserFactory.createParser(CarbonParserFactory.java:62)
	at org.apache.carbondata.processing.newflow.parser.CarbonParserFactory.createParser(CarbonParserFactory.java:71)
	at org.apache.carbondata.processing.newflow.parser.CarbonParserFactory.createParser(CarbonParserFactory.java:38)
	at org.apache.carbondata.processing.newflow.parser.impl.RowParserImpl.<init>(RowParserImpl.java:46)
	at org.apache.carbondata.processing.newflow.steps.InputProcessorStepImpl.initialize(InputProcessorStepImpl.java:66)
	at org.apache.carbondata.processing.newflow.steps.DataConverterProcessorStepImpl.initialize(DataConverterProcessorStepImpl.java:65)
	at org.apache.carbondata.processing.newflow.steps.CarbonRowDataWriterProcessorStepImpl.initialize(CarbonRowDataWriterProcessorStepImpl.java:87)
	at org.apache.carbondata.processing.newflow.DataLoadExecutor.execute(DataLoadExecutor.java:46)
	at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD$$anon$2.<init>(NewCarbonDataLoadRDD.scala:442)
	at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD.internalCompute(NewCarbonDataLoadRDD.scala:405)
	at org.apache.carbondata.spark.rdd.CarbonRDD.compute(CarbonRDD.scala:62)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
17/08/28 10:58:25 INFO NewDataFrameLoaderRDD: DataLoad failure
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException: Data Loading failed for table vds_carbon
	at org.apache.carbondata.processing.newflow.DataLoadExecutor.execute(DataLoadExecutor.java:62)
	at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD$$anon$2.<init>(NewCarbonDataLoadRDD.scala:442)
	at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD.internalCompute(NewCarbonDataLoadRDD.scala:405)
	at org.apache.carbondata.spark.rdd.CarbonRDD.compute(CarbonRDD.scala:62)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2

2) In the case of the other table:

carbon.sql("insert into vds_carbon_flat select variant.* from vds_parquet")

17/08/28 11:01:40 ERROR SortTempFileChunkHolder: pool-514-thread-1
java.lang.NegativeArraySizeException
	at org.apache.carbondata.processing.sortandgroupby.sortdata.SortTempFileChunkHolder.getRowFromStream(SortTempFileChunkHolder.java:324)
	at org.apache.carbondata.processing.sortandgroupby.sortdata.SortTempFileChunkHolder.prefetchRecordsFromFile(SortTempFileChunkHolder.java:518)
	at org.apache.carbondata.processing.sortandgroupby.sortdata.SortTempFileChunkHolder.access$500(SortTempFileChunkHolder.java:42)
	at org.apache.carbondata.processing.sortandgroupby.sortdata.SortTempFileChunkHolder$DataFetcher.call(SortTempFileChunkHolder.java:497)
	at org.apache.carbondata.processing.sortandgroupby.sortdata.SortTempFileChunkHolder.initialise(SortTempFileChunkHolder.java:227)
	at org.apache.carbondata.processing.sortandgroupby.sortdata.SortTempFileChunkHolder.initialize(SortTempFileChunkHolder.java:215)
	at org.apache.carbondata.processing.store.SingleThreadFinalSortFilesMerger$2.run(SingleThreadFinalSortFilesMerger.java:210)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
17/08/28 11:01:40 ERROR SortTempFileChunkHolder: pool-514-thread-1

I think there might be two separate issues here: one potentially with a wrong
table structure for storing the delimiters of nested structures, and the
other related to sorting. Could you please take a look and help me resolve
them? We are working on a PoC for genomics, and we believe CarbonData could
be a perfect fit here.

Thanks in advance,
Marek
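P.S. In case it helps with reproducing, below is a minimal sketch of how a
tiny source table with the same `variant` struct can be set up. This is
illustration only: `carbon` is assumed to be the CarbonSession-backed
SparkSession, and the sample rows and the /tmp/vds_parquet path are
placeholders, not our real data.

// Minimal sketch: build a tiny Parquet source shaped like vds_parquet.
// Assumes `carbon` is an existing CarbonSession-backed SparkSession;
// the rows and the path are invented placeholders.
case class AltAllele(ref: String, alt: String)
case class Variant(contig: String, start: Int, ref: String,
                   altAlleles: Seq[AltAllele])

import carbon.implicits._
val df = Seq(
  Tuple1(Variant("1", 10177, "A", Seq(AltAllele("A", "AC")))),
  Tuple1(Variant("1", 10352, "T", Seq(AltAllele("T", "TA"))))
).toDF("variant")

df.write.mode("overwrite").parquet("/tmp/vds_parquet")
carbon.read.parquet("/tmp/vds_parquet").createOrReplaceTempView("vds_parquet")

// The two loads that fail as shown in the logs above:
carbon.sql("insert into vds_carbon select variant from vds_parquet limit 10")
carbon.sql("insert into vds_carbon_flat select variant.* from vds_parquet")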
Hi Marek,
From the logs it seems that this is a bug in the code. You can raise a JIRA to track the issue.

Regards,
Manish Gupta