Memory leak issue when using DataFrame.coalesce


Memory leak issue when using DataFrame.coalesce

yaojinguo
Hi,
   I am using CarbonData 1.3 + Spark 2.1. My code is:
    val df = carbonSession.sql("select * from t where name like 'aaa%'")
    df.coalesce(n).write.saveAsTable("r") // you can set n = 1 to reproduce
this issue
  The job aborted with an OOM error. I analyzed the dump and found that
there are hundreds of DimensionRawColumnChunk objects, each occupying 50 MB
of memory, as the screenshot shows.
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t174/Screenshot-1.png>
   I investigated the source code of CarbonScanRDD and found that the root
cause of this issue is this code snippet:
     context.addTaskCompletionListener { _ =>
       reader.close()
       close()
     }
   The TaskContext object holds the reader's reference until the task
finishes, and coalesce combines many CarbonSparkPartitions into one task.
My proposals for this issue are:
  (1) Explicitly set objects to null when they are no longer used, so that
they can be released as early as possible. For example, set rawData = null
in DimensionRawColumnChunk's freeMemory function; I made a test and this
really works.
  (2) The TaskContext object should not always hold the reader's reference,
or it should not hold so many readers. Currently, I have no idea how to
implement this.
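Proposal (1) can be sketched as follows. This is a hypothetical simplification, not the real DimensionRawColumnChunk: the class name, constructor, and isFreed helper are invented for illustration; only the idea of nulling the rawData field in freeMemory comes from the proposal above.

```java
// Hypothetical sketch of proposal (1): drop the reference to the large raw
// byte[] as soon as the chunk has been consumed, so the buffer becomes
// eligible for GC even while the TaskContext still holds the enclosing
// reader. Simplified stand-in for DimensionRawColumnChunk, not real code.
public class RawColumnChunkSketch {
    private byte[] rawData;

    public RawColumnChunkSketch(byte[] rawData) {
        this.rawData = rawData;
    }

    public boolean isFreed() {
        return rawData == null;
    }

    // Analogue of freeMemory(): nulling the field lets GC reclaim the
    // buffer immediately instead of at task completion.
    public void freeMemory() {
        rawData = null;
    }
}
```

With hundreds of 50 MB chunks pinned per coalesced task, releasing each buffer at freeMemory time rather than at task completion is what keeps the heap bounded.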

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Memory leak issue when using DataFrame.coalesce

Jacky Li
Hi,

Good catch!
I think proposal 1 is OK; please feel free to open a JIRA ticket and submit a PR. Let the CI and SDV test suites run and see whether it is OK.

Regards,
Jacky

> On March 31, 2018, at 11:35 AM, yaojinguo <[hidden email]> wrote:


Re: Memory leak issue when using DataFrame.coalesce

yaojinguo
   For proposal 2, we can borrow the idea from the Spark source code: the
RecordReaderIterator class closes the reader in its hasNext function, which
helps to release the resource early.
   I am not sure whether I can remove context.addTaskCompletionListener
from CarbonScanRDD without causing other problems.
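The eager-close pattern described above can be sketched like this. This is a minimal illustration of the idea, not Spark's actual RecordReaderIterator: the Reader interface and class names here are invented stand-ins for the real record reader.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch of the pattern used by Spark's RecordReaderIterator: close the
// underlying reader inside hasNext() as soon as it is exhausted, instead
// of waiting for the task-completion listener to fire. The Reader
// interface below is a simplified, hypothetical stand-in.
public class EagerCloseIterator<T> implements Iterator<T> {
    public interface Reader<T> {
        boolean nextKeyValue();   // advance; false when exhausted
        T getCurrentValue();
        void close();
    }

    private Reader<T> reader;
    private boolean havePair = false;
    private boolean finished = false;

    public EagerCloseIterator(Reader<T> reader) {
        this.reader = reader;
    }

    @Override
    public boolean hasNext() {
        if (!finished && !havePair) {
            finished = !reader.nextKeyValue();
            if (finished) {
                // Release the reader (and any column chunks it holds)
                // immediately, long before the task itself completes.
                reader.close();
                reader = null;
            }
            havePair = !finished;
        }
        return !finished;
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        havePair = false;
        return reader.getCurrentValue();
    }
}
```

Because each coalesced task iterates its merged partitions one after another, closing each reader as soon as its partition is drained would cap the number of live DimensionRawColumnChunk buffers at roughly one partition's worth instead of the whole task's.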



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/