Memory leak issue when using DataFrame.coalesce

yaojinguo
6 posts
This post was updated on Mar 31, 2018; 3:38am.
Hi,
   I am using CarbonData 1.3 + Spark 2.1. My code is:
    val df = carbonSession.sql("select * from t where name like 'aaa%'")
    df.coalesce(n).write.saveAsTable("r") // you can set n=1 to reproduce this issue
   The job aborted with an OOM error. I analyzed the heap dump and found
hundreds of DimensionRawColumnChunk objects, each occupying 50 MB of memory,
as the screenshot shows:
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t174/Screenshot-1.png>
   I investigated the source code of CarbonScanRDD and found that the root
cause of this issue is related to this code snippet:
     context.addTaskCompletionListener { _ =>
       reader.close()
       close()
     }
   The TaskContext object holds the reader's reference until the task
finishes, and coalesce combines many CarbonSparkPartitions into one task. My
proposals for this issue are:
  (1) Explicitly set some objects to null when they are no longer used so
that they can be released as early as possible. For example, in
DimensionRawColumnChunk's freeMemory function, set rawData = null; I made a
test and this really works (a minimal sketch follows after this list).
  (2) The TaskContext object should not always hold the reader's reference,
or it should not hold so many readers. Currently, I have no idea how to
implement this.
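
   For illustration, here is a minimal sketch of proposal (1) in Scala. The
real DimensionRawColumnChunk is a Java class in CarbonData's core with much
more state; only the rawData field and the freeMemory function come from the
description above, the rest is assumed:

    // Minimal sketch only, not the real CarbonData class; it just shows why
    // dropping the reference early helps.
    class DimensionRawColumnChunk(var rawData: Array[Byte]) {
      def freeMemory(): Unit = {
        // Even though the TaskContext keeps the reader (and, transitively,
        // this chunk) reachable until the task completes, nulling the field
        // makes the large byte array itself eligible for GC right away.
        rawData = null
      }
    }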






Re: Memory leak issue when using DataFrame.coalesce

Jacky Li
228 posts
Hi,

Good catch!
I think proposal 1 is OK; please feel free to open a JIRA ticket and submit a PR. Let the CI and the SDV test suite run and see whether it is OK.

Regards,
Jacky


Re: Memory leak issue when using DataFrame.coalesce

yaojinguo
6 posts
   For proposal 2, we can borrow an idea from the Spark source code: the
RecordReaderIterator class closes the reader in its hasNext function, which
helps release the resource early. A minimal sketch of that pattern is below.
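
   As an illustration only, here is a sketch in Scala that mirrors the
pattern of Spark's RecordReaderIterator
(org.apache.spark.sql.execution.datasources.RecordReaderIterator); the class
and member names below are my own, not existing Spark or CarbonData code:

    import org.apache.hadoop.mapreduce.RecordReader

    // Sketch of the "close in hasNext" pattern: the underlying reader is
    // released as soon as the input is exhausted instead of waiting for the
    // task-completion listener to fire.
    class EagerCloseIterator[T](private var reader: RecordReader[_, T])
      extends Iterator[T] {

      private var havePair = false
      private var finished = false

      override def hasNext: Boolean = {
        if (!finished && !havePair) {
          finished = !reader.nextKeyValue()
          if (finished) {
            // Close eagerly; the completion listener then only matters for
            // tasks that fail or stop consuming the iterator early.
            close()
          }
          havePair = !finished
        }
        !finished
      }

      override def next(): T = {
        if (!hasNext) throw new java.util.NoSuchElementException("End of stream")
        havePair = false
        reader.getCurrentValue
      }

      def close(): Unit = {
        if (reader != null) {
          try reader.close() finally reader = null
        }
      }
    }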

   I am not sure whether I can remove context.addTaskCompletionListener from
CarbonScanRDD without causing other problems.


