[ https://issues.apache.org/jira/browse/CARBONDATA-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacky Li resolved CARBONDATA-3626.
----------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

> Improve performance when load data into carbondata
> --------------------------------------------------
>
>                 Key: CARBONDATA-3626
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3626
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: spark-integration
>            Reporter: Hong Shen
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: image-2019-12-21-21-20-19-603.png, screenshot-1.png
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I plan to use CarbonData to improve Spark SQL in our company, but I often found that loading data takes a long time when the carbon table has many fields.
> {code}
> carbon.sql("insert into TABLE table2 select * from table1")
> {code}
> For example, with a production table2 that has more than 100 columns, while the above SQL is running one task takes 10 minutes to load 200 MB of data (with snappy compression). The log is:
> {code}
> 2019-12-21 17:31:29 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37975 is: 110
> 2019-12-21 17:31:35 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37978 is: 64
> 2019-12-21 17:31:42 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37977 is: 64
> 2019-12-21 17:31:48 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37972 is: 66
> 2019-12-21 17:31:54 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37979 is: 68
> 2019-12-21 17:32:00 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37978 is: 62
> 2019-12-21 17:32:07 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37981 is: 65
> 2019-12-21 17:32:13 INFO UnsafeSortDataRows:395 - Time taken to sort row page with size37972 and write is: 226: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192937/carbon19a2dc8d381442129dd0c7d906e7f51f_102100013100001/Fact/Part0/Segment_2/102100013100001/sortrowtmp/table2_0_21949613867659265.sorttemp, sort temp file size in MB is 5.350312232971191
> 2019-12-21 17:32:19 INFO UnsafeSortDataRows:395 - Time taken to sort row page with size37982 and write is: 172: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192937/carbon19a2dc8d381442129dd0c7d906e7f51f_102100013100001/Fact/Part0/Segment_2/102100013100001/sortrowtmp/table2_0_21949620209578293.sorttemp, sort temp file size in MB is 5.293270111083984
> 2019-12-21 17:32:26 INFO UnsafeSortDataRows:395 - Time taken to sort row page with size37974 and write is: 175: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192937/carbon19a2dc8d381442129dd0c7d906e7f51f_102100013100001/Fact/Part0/Segment_2/102100013100001/sortrowtmp/table2_0_21949626542521877.sorttemp, sort temp file size in MB is 5.349262237548828
> ... ...
> {code}
> The task's jstack often looks like this:
> {code}
> "Executor task launch worker for task 164" #77 daemon prio=5 os_prio=0 tid=0x00002ab5768c3800 nid=0xb895 runnable [0x00002ab578afd000]
>    java.lang.Thread.State: RUNNABLE
>         at scala.collection.LinearSeqOptimized$class.length(LinearSeqOptimized.scala:54)
>         at scala.collection.immutable.List.length(List.scala:84)
>         at org.apache.spark.sql.execution.datasources.CarbonOutputWriter.writeCarbon(SparkCarbonTableFormat.scala:360)
>         at org.apache.spark.sql.execution.datasources.AbstractCarbonOutputWriter$class.write(SparkCarbonTableFormat.scala:234)
>         at org.apache.spark.sql.execution.datasources.CarbonOutputWriter.write(SparkCarbonTableFormat.scala:239)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$7.apply(FileFormatWriter.scala:717)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$7.apply(FileFormatWriter.scala:661)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:661)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:334)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:332)
>         at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1418)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:337)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:215)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:214)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>         at org.apache.spark.scheduler.Task.run(Task.scala:109)
>         at org.apache.spark.executor.Executor$TaskRunner$$anon$2.run(Executor.scala:379)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:360)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1787)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:376)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:621)
>         at java.lang.Thread.run(Thread.java:849)
> {code}
> The code is:
> !screenshot-1.png!
> !image-2019-12-21-21-20-19-603.png!
> This is because fieldTypes.length takes a long time when the table has many fields: as the jstack shows, fieldTypes is an immutable List, and List.length traverses the whole list, so evaluating it in the loop condition costs O(columns) on every check, for every row.
> When I edited the code as below, the writeCarbon() time dropped from 7s to 1s.
> {code}
>   def writeCarbon(row: InternalRow): Unit = {
>     val data = new Array[AnyRef](fieldTypes.length + partitionData.length)
>     var i = 0
>     val fieldTypesLen = fieldTypes.length
>     while (i < fieldTypesLen) {
>       if (!row.isNullAt(i)) {
>         fieldTypes(i) match {
>           case StringType =>
>             data(i) = row.getString(i)
>           case d: DecimalType =>
>             data(i) = row.getDecimal(i, d.precision, d.scale).toJavaBigDecimal
>           case other =>
>             data(i) = row.get(i, other)
>         }
>       }
>       i += 1
>     }
>     ......
> {code}
> Here is the new log:
> {code}
> 2019-12-21 20:28:43 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37973 is: 78
> 2019-12-21 20:28:44 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37979 is: 48
> 2019-12-21 20:28:45 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37977 is: 45
> 2019-12-21 20:28:47 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37980 is: 45
> 2019-12-21 20:28:48 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37977 is: 45
> 2019-12-21 20:28:49 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37977 is: 44
> 2019-12-21 20:28:50 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37976 is: 44
> 2019-12-21 20:28:52 INFO UnsafeSortDataRows:395 - Time taken to sort row page with size37977 and write is: 166: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192991/carbon67f270c8ba0a42f38dc7e20335e4999f_103100008100001/Fact/Part0/Segment_3/103100008100001/sortrowtmp/table2_0_1365393348305122.sorttemp, sort temp file size in MB is 5.342463493347168
> 2019-12-21 20:28:53 INFO UnsafeSortDataRows:395 - Time taken to sort row page with size37981 and write is: 134: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192991/carbon67f270c8ba0a42f38dc7e20335e4999f_103100008100001/Fact/Part0/Segment_3/103100008100001/sortrowtmp/table2_0_1365394590239651.sorttemp, sort temp file size in MB is 5.291025161743165
> 2019-12-21 20:28:54 INFO UnsafeSortDataRows:395 - Time taken to sort row page with size37973 and write is: 131: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192991/carbon67f270c8ba0a42f38dc7e20335e4999f_103100008100001/Fact/Part0/Segment_3/103100008100001/sortrowtmp/table2_0_1365395807353135.sorttemp, sort temp file size in MB is 5.34185791015625
> {code}
> I will add a patch to improve it.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
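A minimal standalone sketch (not part of the original issue) of the effect described above: as the jstack shows, fieldTypes in CarbonOutputWriter is a scala.collection.immutable.List, and List.length walks the whole list on every call, so re-evaluating it in the per-row loop condition costs O(columns) per check. The schema size, row count, and timing harness below are illustrative assumptions, not measurements from the issue.

{code}
object FieldTypesLengthSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for a wide schema, similar to the 100+ column production table.
    val fieldTypes: List[String] = List.fill(200)("string")
    val rows = 200000

    // Anti-pattern: List.length is re-evaluated (and the list re-traversed) on every iteration.
    def copyRowLengthInCondition(): Array[AnyRef] = {
      val data = new Array[AnyRef](fieldTypes.length)
      var i = 0
      while (i < fieldTypes.length) {
        data(i) = fieldTypes(i)
        i += 1
      }
      data
    }

    // Fix from the issue's patch: compute the length once before the loop.
    def copyRowLengthHoisted(): Array[AnyRef] = {
      val fieldTypesLen = fieldTypes.length
      val data = new Array[AnyRef](fieldTypesLen)
      var i = 0
      while (i < fieldTypesLen) {
        data(i) = fieldTypes(i)
        i += 1
      }
      data
    }

    // Crude timing harness, enough to show the per-row cost difference.
    def time(label: String)(body: => Unit): Unit = {
      val t0 = System.nanoTime()
      var r = 0
      while (r < rows) { body; r += 1 }
      println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms for $rows rows")
    }

    time("List.length in loop condition")(copyRowLengthInCondition())
    time("List.length hoisted out")(copyRowLengthHoisted())
  }
}
{code}

Note that indexing fieldTypes(i) on a List is also linear, but that cost is common to both versions, so the difference between the two timings isolates the repeated length calls that the patch removes.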