Hi all,

Currently, when fallback is initiated for a column page with local dictionary, we keep both the encoded data and the actual data in memory, then form the new column page without dictionary encoding, and only at the end free the encoded column page. Because of this, the off-heap memory footprint increases.

We can reduce the off-heap memory footprint by using a decoder-based fallback mechanism. This means, no need to keep the actual data along with encoded data in encoded column page. We can keep only the encoded data. To form the new column page, we uncompress the encoded column page to get the dictionary data, use the local dictionary generator to get the actual data from the dictionary data, put it in the newly created column page, compress it again, and give it to the consumer for writing the blocklet.

The above process may slow down loading, but it will reduce the memory footprint. So we can provide a property that decides whether to use the current fallback procedure or the decoder-based fallback mechanism during fallback.

Any inputs or suggestions are welcome.

Regards,
Akash

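
The decoder-based fallback proposed above can be pictured with a minimal Java sketch. Everything in it (the class `DecoderFallbackSketch`, `LocalDictionary`, `encodePage`, `fallbackByDecoding`) is a hypothetical name made up for illustration and is not a CarbonData API; the compress and uncompress steps are left out to keep it short. It only shows the core idea: the encoded page holds surrogate keys, and the plain page is rebuilt from those keys through the dictionary instead of from a second copy of the actual data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are illustrative, not CarbonData classes.
final class DecoderFallbackSketch {

    /** Toy local dictionary: hands out surrogate keys and maps them back to values. */
    static final class LocalDictionary {
        private final Map<String, Integer> keyByValue = new HashMap<>();
        private final List<String> valueByKey = new ArrayList<>();

        int encode(String value) {
            Integer key = keyByValue.get(value);
            if (key == null) {
                key = valueByKey.size();      // next unused surrogate key
                keyByValue.put(value, key);
                valueByKey.add(value);
            }
            return key;
        }

        String decode(int key) {
            return valueByKey.get(key);
        }
    }

    /** Builds the encoded page: surrogate keys only, no copy of the actual data. */
    static int[] encodePage(List<String> rows, LocalDictionary dict) {
        int[] keys = new int[rows.size()];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = dict.encode(rows.get(i));
        }
        return keys;
    }

    /** Fallback: rebuilds the plain (non-dictionary) page from the encoded page. */
    static List<String> fallbackByDecoding(int[] encodedPage, LocalDictionary dict) {
        List<String> plainPage = new ArrayList<>(encodedPage.length);
        for (int key : encodedPage) {
            plainPage.add(dict.decode(key));
        }
        return plainPage;
    }

    public static void main(String[] args) {
        LocalDictionary dict = new LocalDictionary();
        List<String> rows = Arrays.asList("india", "china", "india", "japan");
        int[] encoded = encodePage(rows, dict);
        // The rebuilt page holds the same values as the original rows.
        System.out.println(fallbackByDecoding(encoded, dict).equals(rows));
    }
}
```

The trade-off is the one described above: the extra decode pass costs some loading time, but the actual data no longer has to sit in memory next to the encoded data.
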
> This means, no need to keep the actual data along with encoded data in
> encoded column page.
---
A problem is that the index datamap currently needs the actual data to generate the index. You may affect this procedure if you do not keep the actual data.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

+1
@xuchuanyin
This will not impact the datamap writing flow, as the actual column page will be cleared only after the datamap writer has consumed all the records. There will not be any change in that area.

-Regards
Kumar Vishal

On Mon, Aug 27, 2018 at 1:01 PM xuchuanyin <[hidden email]> wrote:
> This means, no need to keep the actual data along with encoded data in
> encoded column page.
> ---
> A problem is that, currently index datamap needs the actual data to
> generate index. You may affect this procedure if you do not keep the
> actual data.

+1
@Akash, I suggest not exposing any property to the user for this. The design should support making this decision based on a property, but whether to expose it to the end user can be decided once you complete your performance testing.

Regards
Manish Gupta

On Mon, 27 Aug 2018 at 1:57 PM, Kumar Vishal <[hidden email]> wrote:
> +1
> @ xuchuanyin
> This will not impact data map writing flow as actual column page will be
> cleared only after consuming all the records by data map writer,
> there will not be any change in that area.
>
> -Regards
> Kumar Vishal
>
> On Mon, Aug 27, 2018 at 1:01 PM xuchuanyin <[hidden email]> wrote:
> > This means, no need to keep the actual data along with encoded data in
> > encoded column page.
> > ---
> > A problem is that, currently index datamap needs the actual data to
> > generate index. You may affect this procedure if you do not keep the
> > actual data.

As of now I will code it as a user property, and we can take a decision once we get the performance report with this.

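
A minimal sketch of how such a switch might be read, in Java. The property key `example.local.dictionary.decoder.fallback`, the class `FallbackModeSketch`, and the enum names are hypothetical and chosen only for illustration; the default shown here (keep the current behaviour unless the property is set to true) is also an assumption, since the thread leaves that open until the performance report is in.

```java
import java.util.Properties;

// Hypothetical sketch: the key and class names are illustrative only.
final class FallbackModeSketch {

    // Assumed property key, not taken from the thread or from CarbonData.
    static final String DECODER_FALLBACK_KEY = "example.local.dictionary.decoder.fallback";

    enum FallbackMode { KEEP_ACTUAL_DATA, DECODER_BASED }

    /** Picks the fallback procedure; an unset or non-"true" value keeps the current one. */
    static FallbackMode resolve(Properties props) {
        boolean decoderBased =
            Boolean.parseBoolean(props.getProperty(DECODER_FALLBACK_KEY, "false"));
        return decoderBased ? FallbackMode.DECODER_BASED : FallbackMode.KEEP_ACTUAL_DATA;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(resolve(props));            // property not set
        props.setProperty(DECODER_FALLBACK_KEY, "true");
        System.out.println(resolve(props));            // property set to true
    }
}
```

Keeping the decision behind one lookup like this would let the property stay internal or be exposed to users later without touching the fallback code itself.
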
Hi all,

With PR https://github.com/apache/carbondata/pull/2662 I have tested the performance and memory requirement of decoder-based fallback for local dictionary. The results are as below.

1. With the current implementation, loading 3 million records took around 4 GB when local dictionary was enabled, which is almost 10 times the memory required to load the same data when local dictionary is disabled. With decoder-based fallback, the memory requirement is reduced from almost 10 times to almost 2 times.
2. Data loading performance: with the current implementation, loading 1 billion records takes around 1.1 hours, and with decoder-based fallback it takes 1.2 hours. This is not much of a difference, while the memory requirement is reduced considerably.

I think this PR will help.

Consolidated points:
1. Store size is not impacted.
2. GC time is not impacted.
3. The time impact is low, as mentioned above.
4. The memory requirement is reduced considerably.

Regards,
Akash R Nilugal

On Mon, Aug 27, 2018 at 11:51 AM Akash Nilugal <[hidden email]> wrote:
> Hi all,
>
> Currently, when the fallback is initiated for a column page in case of
> local dictionary, we are keeping both encoded data
> and actual data in memory and then we form the new column page without
> dictionary encoding and then at last we free the Encoded Column Page.
> Because of this offheap memory footprint increases.
>
> We can reduce the offheap memory footprint. This can be done using decoder
> based fallback mechanism.
> This means, no need to keep the actual data along with encoded data in
> encoded column page. We can keep only encoded data and to form a new column
> page, get the dictionary data from encoded column page by uncompressing and
> using dictionary data get the actual data using local dictionary generator
> and put it in new column page created and compress it again and give to
> consumer for writing blocklet.
>
> The above process may slow down the loading, but it will reduce the
> memory footprint. So we can give a property which will decide whether to
> take current fallback procedure or decoder based fallback mechanism during
> fallback.
> Any inputs or suggestions are welcomed.
>
> Regards,
> Akash
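
For a rough sense of scale, the figures reported in the results above can be worked through in a few lines of Java. The class and method names are hypothetical, and because the reported numbers are approximate ("around 4 GB", "almost 10 times", "almost 2 times"), the outputs are estimates only. Note also that the memory figures come from the 3 million run and the timings from the 1 billion run.

```java
// Hypothetical helper: back-of-the-envelope arithmetic on the reported figures.
final class FallbackNumbersSketch {

    /** Relative increase from before to after, in percent. */
    static double percentIncrease(double before, double after) {
        return (after - before) / before * 100.0;
    }

    /** Memory at a new multiple of the baseline, given memory observed at another multiple. */
    static double memoryAtMultiple(double observedGb, double observedMultiple, double newMultiple) {
        return observedGb / observedMultiple * newMultiple;
    }

    public static void main(String[] args) {
        double baselineGb = memoryAtMultiple(4.0, 10.0, 1.0);  // local dictionary disabled
        double decoderGb = memoryAtMultiple(4.0, 10.0, 2.0);   // decoder-based fallback
        double slowdown = percentIncrease(1.1, 1.2);           // load time, in hours
        System.out.printf("baseline ~%.1f GB, decoder-based ~%.1f GB, load time +%.1f%%%n",
            baselineGb, decoderGb, slowdown);
    }
}
```

Under these approximations, memory drops to roughly a fifth of the current fallback's requirement while the load takes about 9% longer.
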