Github user chenliang613 commented on the issue:
https://github.com/apache/carbondata/pull/1660
@anubhav100 @sounakr you can also use my example script below to reproduce the issue. The example simulates 7500000 rows and can reproduce issue 1728, and this PR also fixes it. Please @sounakr double check it again. @anubhav100 I still have one question: why do we need to append "return true" after "blockletDetails.get(index).addDeletedRows(blocklet.getDeletedRows());"?
---------------------------------------------------------------------------------------
package org.apache.carbondata.examples

import java.io.File
import java.text.SimpleDateFormat

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession

import org.apache.carbondata.core.constants.CarbonCommonConstants
import org.apache.carbondata.core.util.CarbonProperties

object DataUpdateDeleteExample {

  def main(args: Array[String]) {
    // for local files
    val rootPath = new File(this.getClass.getResource("/").getPath
      + "../../../..").getCanonicalPath
    // for hdfs files
    // var rootPath = "hdfs://hdfs-host/carbon"
    var storeLocation = s"$rootPath/examples/spark2/target/store"
    var warehouse = s"$rootPath/examples/spark2/target/warehouse"
    var metastoredb = s"$rootPath/examples/spark2/target"

    import org.apache.spark.sql.CarbonSession._
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("DataUpdateDeleteExample")
      .config("spark.sql.warehouse.dir", warehouse)
      .config("spark.driver.host", "localhost")
      .config("spark.sql.crossJoin.enabled", "true")
      .getOrCreateCarbonSession(storeLocation)
    spark.sparkContext.setLogLevel("WARN")

    // Specify date format based on raw data
    CarbonProperties.getInstance()
      .addProperty(CarbonCommonConstants.CARBON_DATE_FORMAT, "yyyy-MM-dd")

    import spark.implicits._

    // Drop table
    spark.sql("DROP TABLE IF EXISTS t3")

    // Simulate data and write to table t3
    var sdf = new SimpleDateFormat("yyyy-MM-dd")
    var df = spark.sparkContext.parallelize(1 to 7500000)
      .map(x => (x, new java.sql.Date(sdf.parse("2015-07-" + (x % 10 + 10)).getTime),
        "china", "aaa" + x, "phone" + 555 * x, "ASD" + (60000 + x), 14999 + x))
      .toDF("t3_id", "t3_date", "t3_country", "t3_name",
        "t3_phonetype", "t3_serialname", "t3_salary")
    df.write
      .format("carbondata")
      .option("tableName", "t3")
      .option("tempCSV", "true")
      .option("compress", "true")
      .mode(SaveMode.Overwrite)
      .save()

    // Query data
    spark.sql("""
           SELECT * FROM t3 ORDER BY t3_id
           """).show()

    spark.sql("delete from t3 where exists (select 1 from t3)").show()

    spark.sql("""
           SELECT count(*) FROM t3
           """).show()

    // Drop table
    spark.sql("DROP TABLE IF EXISTS t3")

    spark.stop()
  }
}
---
In reply to this post by qiuchenjian-2
Github user anubhav100 commented on the issue:
https://github.com/apache/carbondata/pull/1660
@chenliang613 The reason I return true is this: when blockletDetails.get(index).addDeletedRows(blocklet.getDeletedRows()) was called, it was adding the same rows again to the deleted-rows TreeSet in the DeleteDeltaBlockletDetails class, and we were validating the result of that add, i.e. whether the TreeSet could add the rows or not. If the rows are duplicates the TreeSet will not add them and will return false. My analysis is that this check is not required here, because whether a row is added to the deleted-rows TreeSet is already validated earlier by blocklet.addDeletedRow(CarbonUpdateUtil.getIntegerValue(offset)). So what I do is: if IsRowAddedForDeletion is true, the deletion is successful, and if that blocklet is present in blockletDetails I simply add the deleted rows to the deleted-rows TreeSet in the DeleteDeltaBlockletDetails class; even if a row is duplicated, the TreeSet will not add it again. But what @sounakr said is correct: if the same row is being added again, the root cause is that somehow the splits have chosen duplicate blocks. When I debugged further, I found that in my table of 15 lakh (1,500,000) rows, the same block is picked up again by the splits after 1408000 rows, which is not correct. I am debugging more to find out why this happened.
---
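For context, the behavior discussed above follows directly from java.util.TreeSet semantics: add returns false for a duplicate element and leaves the set unchanged, so a second insertion of the same row offset is silently deduplicated. A minimal standalone sketch of that check (the addDeletedRow helper and the plain Integer offsets here are simplified stand-ins for illustration, not the actual CarbonData DeleteDeltaBlockletDetails class):

```java
import java.util.TreeSet;

public class DeletedRowsDemo {
    // Simplified stand-in for the per-blocklet deleted-rows set.
    static TreeSet<Integer> deletedRows = new TreeSet<>();

    // Mirrors the "can the TreeSet add the row or not" validation:
    // returns true on first insertion, false for a duplicate offset.
    static boolean addDeletedRow(int offset) {
        return deletedRows.add(offset);
    }

    public static void main(String[] args) {
        System.out.println(addDeletedRow(1408000)); // true: first insertion
        System.out.println(addDeletedRow(1408000)); // false: duplicate, set unchanged
        System.out.println(deletedRows.size());     // 1
    }
}
```

This is why the duplicate add itself is harmless (the set stays correct) and why a false return is a symptom rather than the bug: it only signals that the same block was handed out twice by the splits.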
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1660
Build Failed with Spark 2.2.0, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/943/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1660
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2172/
---
Github user chenliang613 commented on the issue:
https://github.com/apache/carbondata/pull/1660
retest this please
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1660
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2203/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1660
Build Failed with Spark 2.2.0, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/980/
---
Github user chenliang613 commented on the issue:
https://github.com/apache/carbondata/pull/1660
retest this please
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1660
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/2214/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1660
Build Success with Spark 2.2.0, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/992/
---
Github user anubhav100 commented on the issue:
https://github.com/apache/carbondata/pull/1660
We have found the root cause and raised another PR for this issue: https://github.com/apache/carbondata/pull/1719
---
Github user anubhav100 closed the pull request at:
https://github.com/apache/carbondata/pull/1660
---