Hi,
I am using CarbonData 1.3 + Spark 2.1. My code is:

    val df = carbonSession.sql("select * from t where name like 'aaa%'")
    df.coalesce(n).write.saveAsTable("r") // you can set n=1 to reproduce this issue

The job aborted with an OOM error. I analyzed the heap dump and found hundreds of DimensionRawColumnChunk objects, each occupying about 50 MB of memory, as this screenshot shows:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t174/Screenshot-1.png

I investigated the source code of CarbonScanRDD and found that the root cause of this issue is this code snippet:

    context.addTaskCompletionListener { _ =>
      reader.close()
      close()
    }

The TaskContext object holds the reader's reference until the task finishes, and coalesce combines many CarbonSparkPartitions into one task. My proposals for this issue are:

(1) Explicitly set objects to null when they are no longer used, so they can be released as early as possible. For example, in DimensionRawColumnChunk's freeMemory function, set rawData = null; I tested this and it really works.

(2) The TaskContext object should not hold the reader's reference for the whole task, or at least should not hold so many readers. Currently, I have no idea how to implement this.
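To make proposal 1 concrete, here is a minimal sketch of the idea, not the actual CarbonData source: the real DimensionRawColumnChunk is a Java class, and the surrounding field types and cleanup steps below are assumptions, but the point is that freeMemory drops the reference to the raw buffer so the GC can reclaim it long before the task-completion listener fires.

    // Hedged sketch (assumed shape, not the real class): null out the raw
    // buffer in freeMemory so the ~50 MB of bytes become garbage-collectable
    // even while TaskContext still holds a reference to the enclosing reader.
    class DimensionRawColumnChunk(var rawData: java.nio.ByteBuffer) {
      def freeMemory(): Unit = {
        // ... release any other per-chunk resources here (assumed) ...
        rawData = null // drop the reference; the listener no longer pins the bytes
      }
    }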
Hi,
Good catch! I think proposal 1 is OK; please feel free to open a JIRA ticket and submit a PR. Let the CI and the SDV test suite run and see whether they pass.

Regards,
Jacky
For proposal 2, we can borrow the idea from the Spark source code. The RecordReaderIterator class closes the reader in its hasNext function, which helps release the resources early. I am not sure whether I can remove context.addTaskCompletionListener from CarbonScanRDD without causing other problems.
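For reference, here is a simplified sketch of that pattern, modeled on Spark's org.apache.spark.sql.execution.datasources.RecordReaderIterator (the name EarlyCloseIterator and the exact shape below are mine, not Spark's code): hasNext closes the underlying Hadoop RecordReader the moment the input is exhausted and nulls the reference, so the resources are released before the task itself completes, even if a task-completion listener still points at the iterator.

    import org.apache.hadoop.mapreduce.RecordReader

    // Sketch of the close-early pattern: release the reader as soon as it is
    // drained instead of waiting for the task-completion listener.
    class EarlyCloseIterator[T](private var reader: RecordReader[_, T]) extends Iterator[T] {
      private var havePair = false
      private var finished = false

      override def hasNext: Boolean = {
        if (!finished && !havePair) {
          finished = !reader.nextKeyValue()
          if (finished) close() // input drained: free resources right away
          havePair = !finished
        }
        !finished
      }

      override def next(): T = {
        if (!hasNext) throw new NoSuchElementException("End of stream")
        havePair = false
        reader.getCurrentValue
      }

      def close(): Unit = {
        if (reader != null) {
          reader.close()
          reader = null // drop the reference so nothing keeps the reader alive
        }
      }
    }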