[ https://issues.apache.org/jira/browse/CARBONDATA-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anubhav tarar reassigned CARBONDATA-2112:
-----------------------------------------

    Assignee: anubhav tarar

Data getting garbled after datamap creation when table is created with GLOBAL SORT
-----------------------------------------------------------------------------------

                Key: CARBONDATA-2112
                URL: https://issues.apache.org/jira/browse/CARBONDATA-2112
            Project: CarbonData
         Issue Type: Bug
         Components: data-query
        Environment: spark-2.1
           Reporter: Sangeeta Gulia
           Assignee: anubhav tarar
           Priority: Minor
        Attachments: 2000_UniqData.csv

Data is getting garbled after datamap creation when the table is created with BATCH_SORT/GLOBAL_SORT.

Steps to reproduce:

spark.sql("drop table if exists uniqdata_batchsort_compact3")

spark.sql("CREATE TABLE uniqdata_batchsort_compact3 (CUST_ID int,CUST_NAME String,ACTIVE_EMUI_VERSION string, DOB timestamp, DOJ timestamp, BIGINT_COLUMN1 bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2 decimal(36,10),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1 int) STORED BY 'carbondata' TBLPROPERTIES('SORT_SCOPE'='GLOBAL_SORT')").show()

// The same LOAD is executed three times:
spark.sql("LOAD DATA INPATH '/home/sangeeta/Desktop/2000_UniqData.csv' into table " +
  "uniqdata_batchsort_compact3 OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='\"'," +
  "'BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION," +
  "DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2," +
  "Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1','batch_sort_size_inmb'='1')")

spark.sql("LOAD DATA INPATH '/home/sangeeta/Desktop/2000_UniqData.csv' into table " +
  "uniqdata_batchsort_compact3 OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='\"'," +
  "'BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION," +
  "DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2," +
  "Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1','batch_sort_size_inmb'='1')")

spark.sql("LOAD DATA INPATH '/home/sangeeta/Desktop/2000_UniqData.csv' into table " +
  "uniqdata_batchsort_compact3 OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='\"'," +
  "'BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION," +
  "DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2," +
  "Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1','batch_sort_size_inmb'='1')")

spark.sql("select cust_id, avg(cust_id) from uniqdata_batchsort_compact3 group by cust_id ").show(50)

+-------+------------+
|cust_id|avg(cust_id)|
+-------+------------+
| 9376| 9376.0|
| 9427| 9427.0|
| 9465| 9465.0|
| 9852| 9852.0|
| 9900| 9900.0|
| 10206| 10206.0|
| 10362| 10362.0|
| 10623| 10623.0|
| 10817| 10817.0|
| 9182| 9182.0|
| 9564| 9564.0|
| 9879| 9879.0|
| 10081| 10081.0|
| 10121| 10121.0|
| 10230| 10230.0|
| 10462| 10462.0|
| 10703| 10703.0|
| 10914| 10914.0|
| 9162| 9162.0|
| 9383| 9383.0|
| 9454| 9454.0|
| 9517| 9517.0|
| 9558| 9558.0|
| 10708| 10708.0|
| 10798| 10798.0|
| 10862| 10862.0|
| 9071| 9071.0|
| 9169| 9169.0|
| 9946| 9946.0|
| 10468| 10468.0|
| 10745| 10745.0|
| 10768| 10768.0|
| 9153| 9153.0|
| 9206| 9206.0|
| 9403| 9403.0|
| 9597| 9597.0|
| 9647| 9647.0|
| 9775| 9775.0|
| 10032| 10032.0|
| 10395| 10395.0|
| 10527| 10527.0|
| 10567| 10567.0|
| 10632| 10632.0|
| 10788| 10788.0|
| 10815| 10815.0|
| 10840| 10840.0|
| 9181| 9181.0|
| 9344| 9344.0|
| 9575| 9575.0|
| 9675| 9675.0|
+-------+------------+
only showing top 50 rows

Note: Here cust_id is returned correctly.
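To make the regression easier to see when re-running the scenario, one option (not part of the original report; the value name is illustrative) is to materialize this result while no datamap exists yet:

// Hedged sketch: collect the group-by result before any datamap is created,
// so it can later be compared row by row with the post-datamap output.
// collect() forces evaluation now, before the pre-aggregate rewrite can apply.
val rowsBeforeDatamap = spark.sql(
  "select cust_id, avg(cust_id) from uniqdata_batchsort_compact3 group by cust_id").collect()

Re-running the same SELECT after the datamap is created and diffing it against rowsBeforeDatamap then shows exactly which cust_id values changed.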
spark.sql("create datamap uniqdata_agg on table uniqdata_batchsort_compact3 using " +
  "'preaggregate' as select avg(cust_id) from uniqdata_batchsort_compact3 group by cust_id")

spark.sql("select cust_id, avg(cust_id) from uniqdata_batchsort_compact3 group by cust_id ").show(50)

+-------+------------+
|cust_id|avg(cust_id)|
+-------+------------+
| 27651| 9217.0|
| 31944| 10648.0|
| 32667| 10889.0|
| 28242| 9414.0|
| 29841| 9947.0|
| 28728| 9576.0|
| 27255| 9085.0|
| 32571| 10857.0|
| 30276| 10092.0|
| 27276| 9092.0|
| 31503| 10501.0|
| 27687| 9229.0|
| 27183| 9061.0|
| 29334| 9778.0|
| 29913| 9971.0|
| 28683| 9561.0|
| 31545| 10515.0|
| 30405| 10135.0|
| 27693| 9231.0|
| 29649| 9883.0|
| 30537| 10179.0|
| 32709| 10903.0|
| 29586| 9862.0|
| 32895| 10965.0|
| 32415| 10805.0|
| 31644| 10548.0|
| 30030| 10010.0|
| 31713| 10571.0|
| 28083| 9361.0|
| 27813| 9271.0|
| 27171| 9057.0|
| 27189| 9063.0|
| 30444| 10148.0|
| 28623| 9541.0|
| 28566| 9522.0|
| 32655| 10885.0|
| 31164| 10388.0|
| 30321| 10107.0|
| 31452| 10484.0|
| 29829| 9943.0|
| 27468| 9156.0|
| 31212| 10404.0|
| 32154| 10718.0|
| 27531| 9177.0|
| 27654| 9218.0|
| 27105| 9035.0|
| 31113| 10371.0|
| 28479| 9493.0|
| 29094| 9698.0|
| 31551| 10517.0|
+-------+------------+
only showing top 50 rows

Note: After datamap creation, cust_id is returned incorrectly: each value is three times (equal to the number of loads) its original value, while avg(cust_id) is still correct.
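A possible way to narrow down where the inflated values come from, plus a workaround, is sketched below. It assumes the pre-aggregate child table follows the usual <parent_table>_<datamap_name> naming (here uniqdata_batchsort_compact3_uniqdata_agg); only the DROP DATAMAP statement is needed for the workaround itself.

// Hedged sketch, assuming the child table created for the datamap is named
// uniqdata_batchsort_compact3_uniqdata_agg. Inspecting it directly can show
// whether the stored values are already wrong, or whether the query rewrite
// sums the grouping column across the three loads (27651 = 3 * 9217 above).
spark.sql("select * from uniqdata_batchsort_compact3_uniqdata_agg").show(50)

// Workaround: drop the datamap so the group-by query is answered from the
// main table again and cust_id comes back with its original values.
spark.sql("drop datamap if exists uniqdata_agg on table uniqdata_batchsort_compact3")
spark.sql("select cust_id, avg(cust_id) from uniqdata_batchsort_compact3 group by cust_id").show(50)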