Apache CarbonData Dev Mailing List archive - Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Apache CarbonData Dev Mailing List archive

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Posted by hexiaoqiao on Nov 24, 2016; 12:37pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Improvement-Use-Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary-tp3132p3162.html

Hi Liang,

Generally, yes, because the same prefix of items in dictionary does not
require to repeat in DAT, and more data better result.

Actually the cost of DAT is building Tree, and i don't think we need to
consider it since this cost appears only once when load data.

FYI.

Regards,
Xiaoqiao

On Thu, Nov 24, 2016 at 2:42 PM, Liang Chen <[hidden email]> wrote:

> Hi xiaoqiao
>
> For the below example, 600K dictionary data:
> It is to say that using "DAT" can save 36M memory against
> "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
>
> One more question:if increases the dictionary data size, what's the
> comparison results "ConcurrentHashMap" VS "DAT"
>
> Regards
> Liang
> ------------------------------------------------------------
> ------------------------------------------
> a. memory footprint (approximate quantity) in 64-bit JVM:
> ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
>
> b. retrieval performance: total time(ms) of 500 million query:
> 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
>
> Regards
> Liang
>
> hexiaoqiao wrote
> > hi Liang,
> >
> > Thanks for your reply, i need to correct the experiment result because
> > it's
> > wrong order NO.1 column of result data table.
> >
> > In order to compare performance between Trie and HashMap, Two different
> > structures are constructed using the same dictionary data which size is
> > 600K and each item's length is between 2 and 50 bytes.
> >
> > ConcurrentHashMap (structure which is used in CarbonData currently) vs
> > Double
> > Array Trie (one implementation of Trie Structures)
> >
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards,
> > He Xiaoqiao
> >
> >
> > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <
>
> > chenliang6136@
>
> > > wrote:
> >
> >> Hi xiaoqiao
> >>
> >> This improvement looks great!
> >> Can you please explain the below data, what does it mean?
> >> ----------
> >> ConcurrentHashMap
> >> ~68MB 14543
> >> Double Array Trie
> >> ~104MB 12825
> >>
> >> Regards
> >> Liang
> >>
> >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <
>
> > xq.he2009@
>
> > >:
> >>
> >> > Hi All,
> >> >
> >> > I would like to propose Dictionary improvement which using Trie in
> >> place
> >> of
> >> > HashMap.
> >> >
> >> > In order to speedup aggregation, reduce run-time memory footprint,
> >> enable
> >> > fast
> >> > distinct count etc, CarbonData encodes data using dictionary at file
> >> level
> >> > or table level based on cardinality. It is a general and efficient way
> >> in
> >> > many big data systems, but when apply ConcurrentHashMap
> >> > to maintain Dictionary in CarbonData currently, memory overhead of
> >> > Driver is very huge since it has to load whole Dictionary to decode
> >> actual
> >> > data value, especially column cardinality is a large number. and
> >> CarbonData
> >> > will not do dictionary if cardinality > 1 million at default behavior.
> >> >
> >> > I propose using Trie in place of HashMap for the following three
> >> reasons:
> >> > (1) Trie is a proper structure for Dictionary,
> >> > (2) Reduce memory footprint,
> >> > (3) Not impact retrieval performance
> >> >
> >> > The experimental results show that Trie is able to meet the
> >> requirement.
> >> > a. ConcurrentHashMap vs Double Array Trie
> >> > <https://linux.thai.net/~thep/datrie/datrie.html>(one
> >> implementation of
> >> > Trie Structures)
> >> > b. Dictionary size: 600K
> >> > c. Memory footprint and query time
> >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> >> > ConcurrentHashMap
> >> > ~68MB 14543
> >> > Double Array Trie
> >> > ~104MB 12825
> >> >
> >> > Please share your suggestions about the proposed improvement of
> >> Dictionary.
> >> >
> >> > Regards
> >> > He Xiaoqiao
> >> >
> >>
> >>
> >>
> >> --
> >> Regards
> >> Liang
> >>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-maili
> ng-list-archive.1130556.n5.nabble.com/Improvement-Use-Trie-
> in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary
> -tp3132p3143.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>