[jira] [Commented] (CARBONDATA-3626) Improve performance when load data into carbondata


Akash R Nilugal (Jira)

    [ https://issues.apache.org/jira/browse/CARBONDATA-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004372#comment-17004372 ]

Jacky Li commented on CARBONDATA-3626:
--------------------------------------

Thanks for reporting this issue.

> Improve performance when load data into carbondata
> --------------------------------------------------
>
>                 Key: CARBONDATA-3626
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3626
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: spark-integration
>            Reporter: Hong Shen
>            Priority: Major
>         Attachments: image-2019-12-21-21-20-19-603.png, screenshot-1.png
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I plan to use CarbonData to improve SparkSQL in our company, but I often find that loading data takes a long time when the carbon table has many fields.
> {code}
> carbon.sql("insert into TABLE table2  select * from table1")
> {code}
> For example, with a production table2 that has more than 100 columns, one task of the above SQL takes 10 min to load 200 MB of data (with snappy compression). The log is:
> {code}
> 2019-12-21 17:31:29 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37975 is: 110
> 2019-12-21 17:31:35 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37978 is: 64
> 2019-12-21 17:31:42 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37977 is: 64
> 2019-12-21 17:31:48 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37972 is: 66
> 2019-12-21 17:31:54 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37979 is: 68
> 2019-12-21 17:32:00 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37978 is: 62
> 2019-12-21 17:32:07 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37981 is: 65
> 2019-12-21 17:32:13 INFO  UnsafeSortDataRows:395 - Time taken to sort row page with size37972 and write is: 226: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192937/carbon19a2dc8d381442129dd0c7d906e7f51f_102100013100001/Fact/Part0/Segment_2/102100013100001/sortrowtmp/table2_0_21949613867659265.sorttemp, sort temp file size in MB is 5.350312232971191
> 2019-12-21 17:32:19 INFO  UnsafeSortDataRows:395 - Time taken to sort row page with size37982 and write is: 172: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192937/carbon19a2dc8d381442129dd0c7d906e7f51f_102100013100001/Fact/Part0/Segment_2/102100013100001/sortrowtmp/table2_0_21949620209578293.sorttemp, sort temp file size in MB is 5.293270111083984
> 2019-12-21 17:32:26 INFO  UnsafeSortDataRows:395 - Time taken to sort row page with size37974 and write is: 175: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192937/carbon19a2dc8d381442129dd0c7d906e7f51f_102100013100001/Fact/Part0/Segment_2/102100013100001/sortrowtmp/table2_0_21949626542521877.sorttemp, sort temp file size in MB is 5.349262237548828
> ... ...
> {code}
> The task's jstack often looks like this:
> {code}
> "Executor task launch worker for task 164" #77 daemon prio=5 os_prio=0 tid=0x00002ab5768c3800 nid=0xb895 runnable [0x00002ab578afd000]
>    java.lang.Thread.State: RUNNABLE
>         at scala.collection.LinearSeqOptimized$class.length(LinearSeqOptimized.scala:54)
>         at scala.collection.immutable.List.length(List.scala:84)
>         at org.apache.spark.sql.execution.datasources.CarbonOutputWriter.writeCarbon(SparkCarbonTableFormat.scala:360)
>         at org.apache.spark.sql.execution.datasources.AbstractCarbonOutputWriter$class.write(SparkCarbonTableFormat.scala:234)
>         at org.apache.spark.sql.execution.datasources.CarbonOutputWriter.write(SparkCarbonTableFormat.scala:239)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$7.apply(FileFormatWriter.scala:717)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$7.apply(FileFormatWriter.scala:661)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:661)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:334)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:332)
>         at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1418)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:337)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:215)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:214)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>         at org.apache.spark.scheduler.Task.run(Task.scala:109)
>         at org.apache.spark.executor.Executor$TaskRunner$$anon$2.run(Executor.scala:379)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:360)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1787)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:376)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:621)
>         at java.lang.Thread.run(Thread.java:849)
> {code}
> The code is:
>  !screenshot-1.png!
>  !image-2019-12-21-21-20-19-603.png!
> This is because fieldTypes is an immutable scala List (see the List.length frame in the jstack above), so every call to fieldTypes.length is O(number of columns), and it is evaluated on each loop iteration; with many fields this dominates the write path.
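> For readers who cannot open the screenshots, the problematic pattern is presumably along these lines (a sketch only, not the exact source; the real code is in SparkCarbonTableFormat.scala and the attachments above):
> {code}
>   def writeCarbon(row: InternalRow): Unit = {
>     val data = new Array[AnyRef](fieldTypes.length + partitionData.length)
>     var i = 0
>     // fieldTypes is an immutable scala List, so each .length call is O(columns);
>     // re-evaluating it in the loop condition makes the loop O(columns^2) per row.
>     while (i < fieldTypes.length) {
>       // ... same per-column conversion as in the fixed version below
>       i += 1
>     }
>   }
> {code}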
> After changing the code as shown below, the writeCarbon() time drops from 7s to 1s.
> {code}
>   def writeCarbon(row: InternalRow): Unit = {
>     val data = new Array[AnyRef](fieldTypes.length + partitionData.length)
>     var i = 0
>     val fieldTypesLen = fieldTypes.length
>     while (i < fieldTypesLen) {
>       if (!row.isNullAt(i)) {
>         fieldTypes(i) match {
>           case StringType =>
>             data(i) = row.getString(i)
>           case d: DecimalType =>
>             data(i) = row.getDecimal(i, d.precision, d.scale).toJavaBigDecimal
>           case other =>
>             data(i) = row.get(i, other)
>         }
>       }
>       i += 1
>     }
> ......
> {code}
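> (Side note, not part of the patch: the new log below was produced with the hoisted-length change above. Since indexed access fieldTypes(i) on a List is also linear, a further, purely hypothetical variant would materialize the field types as an Array once, outside the per-row path, so both length and element access are O(1). The name fieldTypesArr is made up for illustration.)
> {code}
>   import org.apache.spark.sql.types.{DataType, DecimalType, StringType}
>
>   // Hypothetical sketch only: build the array once per writer, not per row.
>   private val fieldTypesArr: Array[DataType] = fieldTypes.toArray
>
>   def writeCarbon(row: InternalRow): Unit = {
>     val data = new Array[AnyRef](fieldTypesArr.length + partitionData.length)
>     var i = 0
>     while (i < fieldTypesArr.length) { // Array length and apply are O(1)
>       if (!row.isNullAt(i)) {
>         fieldTypesArr(i) match {
>           case StringType =>
>             data(i) = row.getString(i)
>           case d: DecimalType =>
>             data(i) = row.getDecimal(i, d.precision, d.scale).toJavaBigDecimal
>           case other =>
>             data(i) = row.get(i, other)
>         }
>       }
>       i += 1
>     }
>     // ... partition handling unchanged
>   }
> {code}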
> Here is the new log:
> {code}
> 2019-12-21 20:28:43 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37973 is: 78
> 2019-12-21 20:28:44 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37979 is: 48
> 2019-12-21 20:28:45 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37977 is: 45
> 2019-12-21 20:28:47 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37980 is: 45
> 2019-12-21 20:28:48 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37977 is: 45
> 2019-12-21 20:28:49 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37977 is: 44
> 2019-12-21 20:28:50 INFO  UnsafeSortDataRows:416 - Time taken to sort row page with size: 37976 is: 44
> 2019-12-21 20:28:52 INFO  UnsafeSortDataRows:395 - Time taken to sort row page with size37977 and write is: 166: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192991/carbon67f270c8ba0a42f38dc7e20335e4999f_103100008100001/Fact/Part0/Segment_3/103100008100001/sortrowtmp/table2_0_1365393348305122.sorttemp, sort temp file size in MB is 5.342463493347168
> 2019-12-21 20:28:53 INFO  UnsafeSortDataRows:395 - Time taken to sort row page with size37981 and write is: 134: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192991/carbon67f270c8ba0a42f38dc7e20335e4999f_103100008100001/Fact/Part0/Segment_3/103100008100001/sortrowtmp/table2_0_1365394590239651.sorttemp, sort temp file size in MB is 5.291025161743165
> 2019-12-21 20:28:54 INFO  UnsafeSortDataRows:395 - Time taken to sort row page with size37973 and write is: 131: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192991/carbon67f270c8ba0a42f38dc7e20335e4999f_103100008100001/Fact/Part0/Segment_3/103100008100001/sortrowtmp/table2_0_1365395807353135.sorttemp, sort temp file size in MB is 5.34185791015625
> {code}
> I will add a patch to improve it.


