Hi all,
I went through the code and derived another formula for estimating the unsafe working memory. It is still inaccurate, but we can use this thread to refine it.

# Memory Required For Data Loading per Table

## Version from the community

(carbon.number.of.cores.while.loading) * (offheap.sort.chunk.size.inmb + carbon.blockletgroup.size.in.mb + carbon.blockletgroup.size.in.mb/3.5)

## Version from this proposal

memory_size_required
= max(sort_temp_memory_consumption, data_encoding_consumption)
= max{(number.of.cores + 1) * offheap.sort.chunk.size.inmb, number.of.cores * TABLE_PAGE_SIZE}
= max{(number.of.cores + 1) * offheap.sort.chunk.size.inmb, number.of.cores * (number.of.fields * per.column.page.size + compress.temp.size)}
= max{(number.of.cores + 1) * offheap.sort.chunk.size.inmb, number.of.cores * (number.of.fields * per.column.page.size + per.column.page.size/3.5)}
= max{(number.of.cores + 1) * offheap.sort.chunk.size.inmb, number.of.cores * (number.of.fields * (32000 * 8 * 1.25) + (32000 * 8 * 1.25)/3.5)}

Note:
1. offheap.sort.chunk.size.inmb is the size of one UnsafeCarbonRowPage.
2. per.column.page.size is the size of one ColumnPage.
3. compress.temp.size is the temporary space needed for snappy compression (in UnsafeFixLengthColumnPage.compress).

## Problems with each version

1. Neither considers the local dictionary, which is disabled by default.
2. Neither considers the in-memory intermediate merge, which is disabled by default.

### Community version

1. Within one load, the sort-temp procedure finishes before the producer-consumer procedure, so the two consumptions do not need to be added together.
2. During the producer-consumer procedure, number.of.cores TablePages are generated at a time, and their total size may exceed carbon.blockletgroup.size.in.mb; relying only on carbon.blockletgroup.size.in.mb can therefore still cause a memory shortage, especially when the TablePages held by the number.of.cores threads are large.

### Proposed version

1. It roughly uses 8 bytes * 1.25 (a factor in our code) as the size of one value, which is inaccurate. Besides, 32000 is only the maximum number of records in one page, especially after adaptive page size for long string and complex columns is implemented.
2. We could decompose per.column.page.size further by taking the data type of each column and the data length of string columns into account, but that may be too tedious for users to calculate. We could also run the data loading once and measure TABLE_PAGE_SIZE or per.column.page.size directly, which would be accurate.

## Example

number.of.cores = 15
offheap.sort.chunk.size.inmb = 64
number.of.fields = 300

### Community version

memory_size_required = 15 * (64MB + 64MB + 64MB/3.5) = 2194MB

### Proposed version

memory_size_required
= max{(15 + 1) * 64MB, 15 * (330 * (32000 * 8 * 1.25) + 32000 * 8 * 1.25 / 3.5)}
= max{1073741824, 15 * 108228023}
= max{1073741824, 1623420343}
= 1548MB
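To make the two estimates easy to replay, below is a minimal sketch of the arithmetic in Java. It is not an existing CarbonData API; the class and method names are made up, and the constants (32000 rows per page, 8 bytes per value, the 1.25 and 3.5 factors) are simply taken from the formulas above.

```java
// Rough calculator for the two formulas above. Not part of CarbonData; the
// constants mirror the ones used in this mail.
public class UnsafeWorkingMemoryEstimate {

  static final double ROWS_PER_PAGE = 32000;        // max records per column page
  static final double BYTES_PER_VALUE = 8 * 1.25;   // 8 bytes * 1.25 growth factor
  static final double SNAPPY_TEMP_FACTOR = 3.5;     // compress.temp.size ~= page size / 3.5
  static final double MB = 1024 * 1024;

  // Community formula: cores * (sort chunk + blocklet group + blocklet group / 3.5)
  static double communityEstimateMb(int cores, double sortChunkMb, double blockletGroupMb) {
    return cores * (sortChunkMb + blockletGroupMb + blockletGroupMb / SNAPPY_TEMP_FACTOR);
  }

  // Proposed formula: max(sort-temp consumption, data-encoding consumption)
  static double proposedEstimateMb(int cores, double sortChunkMb, int numberOfFields) {
    double sortTempBytes = (cores + 1) * sortChunkMb * MB;
    double columnPageBytes = ROWS_PER_PAGE * BYTES_PER_VALUE;
    double tablePageBytes =
        numberOfFields * columnPageBytes + columnPageBytes / SNAPPY_TEMP_FACTOR;
    return Math.max(sortTempBytes, cores * tablePageBytes) / MB;
  }

  public static void main(String[] args) {
    // The example above: 15 cores, 64 MB sort chunk, 300 fields. Note that the
    // worked example in the mail plugs in 330 columns and rounds a little
    // differently, which is how it arrives at about 1548 MB.
    System.out.printf("community estimate: %.0f MB%n", communityEstimateMb(15, 64, 64));
    System.out.printf("proposed  estimate: %.0f MB%n", proposedEstimateMb(15, 64, 300));
  }
}
```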
Hi Chuanyin, can you help to answer this question?
I think the formula may not be accurate. And how many factors affect the unsafe working memory? Looking forward to your reply, thanks!

------------------ Original message ------------------
From: "251922566"<[hidden email]>
Sent: Friday, November 30, 2018, 3:22 PM
To: "dev"<[hidden email]>
Subject: Re: [Discussion] How to configure the unsafe working memory for dataloading

Hi Chuanyin,

I found that this formula may not be correct. When I do a load, I set spark.yarn.executor.memoryOverhead = 5120 and set these carbon properties:

carbon.number.of.cores.while.loading=5
carbon.lock.type=HDFSLOCK
enable.unsafe.sort=true
offheap.sort.chunk.size.inmb=64
sort.inmemory.size.inmb=4096
carbon.enable.vector.reader=true
enable.unsafe.in.query.processing=true
carbon.blockletgroup.size.in.mb=64
enable.unsafe.columnpage=true
carbon.unsafe.working.memory.in.mb=4096

But it still reports that unsafe memory is not enough. Yet by the community formula it should only need 5 * (64MB + 64MB + 64MB/3.5) = 732MB, and by the formula you proposed it should only need max{(5 + 1) * 64MB, 5 * (330 * (32000 * 8 * 1.25) + 32000 * 8 * 1.25 / 3.5)} = 530MB.

PS: I have about 300 fields.
Spark version: 2.2.1
Carbon version: apache-carbondata-1.4.1-bin-spark2.2.1-hadoop2.7.2

Looking forward to your reply.
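In case it helps to reproduce this, here is a minimal sketch of applying the same values at runtime. It assumes the CarbonProperties API from org.apache.carbondata.core.util; normally these values would simply be placed in the carbon.properties file, and spark.yarn.executor.memoryOverhead stays a Spark --conf setting.

```java
import org.apache.carbondata.core.util.CarbonProperties;

// Sketch only: mirrors the values reported above. In practice these would be
// set in carbon.properties rather than in code.
public class LoadingMemoryConfig {
  public static void main(String[] args) {
    CarbonProperties props = CarbonProperties.getInstance();
    props.addProperty("carbon.number.of.cores.while.loading", "5");
    props.addProperty("carbon.lock.type", "HDFSLOCK");
    props.addProperty("enable.unsafe.sort", "true");
    props.addProperty("offheap.sort.chunk.size.inmb", "64");
    props.addProperty("sort.inmemory.size.inmb", "4096");
    props.addProperty("carbon.enable.vector.reader", "true");
    props.addProperty("enable.unsafe.in.query.processing", "true");
    props.addProperty("carbon.blockletgroup.size.in.mb", "64");
    props.addProperty("enable.unsafe.columnpage", "true");
    props.addProperty("carbon.unsafe.working.memory.in.mb", "4096");
    // spark.yarn.executor.memoryOverhead=5120 is a Spark setting, passed via
    // spark-defaults.conf or --conf, not through CarbonProperties.
  }
}
```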
Hi, What's the number of cores in your executor?
And is there only one load running when you encounter this failure? Besides, can you check whether the local dictionary is enabled for your table using 'desc formatted table_name'? If it is enabled, more memory will be needed, and the provided formula does not take it into account. In that case you can try setting 'carbon.local.dictionary.decoder.fallback' to false and then try again.
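A minimal sketch of that override, assuming the same CarbonProperties API as above; the property name is the one suggested here, so treat the snippet as illustrative only.

```java
import org.apache.carbondata.core.util.CarbonProperties;

public class DisableLocalDictionaryFallback {
  public static void main(String[] args) {
    // First run 'desc formatted table_name' to confirm local dictionary is
    // enabled for the table; if it is, apply the suggested override and retry.
    CarbonProperties.getInstance()
        .addProperty("carbon.local.dictionary.decoder.fallback", "false");
  }
}
```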