[jira] [Commented] (CARBONDATA-1387) Incorrect partition creation while inserting data from another table

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (CARBONDATA-1387) Incorrect partition creation while inserting data from another table

Akash R Nilugal (Jira)

    [ https://issues.apache.org/jira/browse/CARBONDATA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146793#comment-16146793 ]

Prabhat Kashyap commented on CARBONDATA-1387:
---------------------------------------------

Hello Vandana,

This is not the bug of partition but the problem is data is being inserted into one table from another. If you insert data from another table then schema of both table should match otherwise there are chances of getting bad records, that is what is happening here. The data goes into the bad record (null) and the select query is showing the different result as expected.

*Possible solution:*
# Throw an exception while inserting data and schema mismatch.
# Continue with the bad record inserted into the table when schema mismatched.

> Incorrect partition creation while inserting data from another table
> --------------------------------------------------------------------
>
>                 Key: CARBONDATA-1387
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1387
>             Project: CarbonData
>          Issue Type: Bug
>          Components: data-query
>    Affects Versions: 1.2.0
>         Environment: spark 2.1
>            Reporter: Vandana Yadav
>            Assignee: Prabhat Kashyap
>
> Incorrect partition creation while inserting data from another table.
> Description: While inserting data from another table no of rows in each partition remain same although no of rows in partitioned table get increased.(no of rows in each partition should also increase as we are inserting new data into the partitioned table.
> Steps to reproduce:
> 1) Create Partitioned table:
> CREATE TABLE uniqdata_part (CUST_ID int,CUST_NAME String,ACTIVE_EMUI_VERSION string,DOJ timestamp, BIGINT_COLUMN1 bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2 decimal(36,10),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1 int) PARTITIONED BY (DOB Timestamp) STORED BY 'org.apache.carbondata.format' TBLPROPERTIES ('PARTITION_TYPE'='RANGE','RANGE_INFO'='1971-01-01 01:00:03, 1972-01-01 01:00:03, 1974-01-01 01:00:03',"TABLE_BLOCKSIZE"= "256 MB")
> 2) Load data into the partitioned table:
> LOAD DATA INPATH 'hdfs://localhost:54310/uniqdata/2000_UniqData.csv' into table uniqdata_part OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION,DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2,Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1')
> 3) Create another table:
> CREATE TABLE uniqdata_1 (CUST_ID int,CUST_NAME String,ACTIVE_EMUI_VERSION string, DOB timestamp, DOJ timestamp, BIGINT_COLUMN1 bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2 decimal(36,10),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1 int) STORED BY 'org.apache.carbondata.format' TBLPROPERTIES ("TABLE_BLOCKSIZE"= "256 MB")
> 4) Load data into this table:
> LOAD DATA INPATH 'hdfs://localhost:54310/uniqdata/2000_UniqData.csv' into table uniqdata_1 OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION,DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2,Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1')
> 5)Execute Queries:
> a) show partitions query:
> show partitions uniqdata_part
> Output:
> 0, dob = DEFAULT                                    
>  1, dob < 1971-01-01 01:00:03                        
>  2, 1971-01-01 01:00:03 <= dob < 1972-01-01 01:00:03  
>  3, 1972-01-01 01:00:03 <= dob < 1974-01-01 01:00:03
> b) Query for row count in the partitioned table:
> select count(*) from uniqdata_part
> Output:
>  count(1)  |
> +-----------+--+
> | 2013
> c)query for row count in partition 0:
> select count(*) from uniqdata_part where dob >= '1974-01-01 01:00:03'
> Output:
>  count(1)  |
> +-----------+--+
> | 539
> d) query for row count in partition 1 :
> select count(*) from uniqdata_part where dob < '1971-01-01 01:00:03'
> Output:
>  count(1)  |
> +-----------+--+
> | 366  
> e) query for row count in partition 3:
> select count(*) from uniqdata_part where dob >= '1971-01-01 01:00:03' and dob < '1972-01-01 01:00:03'
> Output:
>  count(1)  |
> +-----------+--+
> | 365
> f) query for row count in partition 4:
> select count(*) from uniqdata_part where dob >= '1972-01-01 01:00:03' and dob < '1974-01-01 01:00:03'
> Output:
> count(1)  |
> +-----------+--+
> | 731
> g) Insert data in partitioned table through the normal table:
> insert into uniqdata_part select * from uniqdata_1;
> h) Query for row count in the partitioned table after insertion operation:
> select count(*) from uniqdata_part
> Output:
>  count(1)  |
> +-----------+--+
> | 4026  
> i) Repeat queries from (c) to (f) for row count in the each partition.
> 6) Actual Result: it shows the same row count for each partition but the partitioned table has more rows in it.
> 7)Expected Result: No of rows in each partition should increase as no of rows increases in partitioned table



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)