Apache CarbonData Dev Mailing List archive - Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Apache CarbonData Dev Mailing List archive

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Posted by hexiaoqiao on Nov 25, 2016; 5:52pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Improvement-Use-Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary-tp3132p3216.html

Hi Liang, Kumar Vishal,

I has done a standard benchmark about multiply data structures for
Dictionary following your suggestions. Based on the test results, I think
DAT may be the best choice for CarbonData.

*1. Here are 2 test results:*
-----------------------------------------------------------------------
Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for Dictionary
HashMap : java.util.HashMap
DAT (Double Array Trie):
https://github.com/komiya-atsushi/darts-java
RadixTree:
https://github.com/npgall/concurrent-trees
TrieDict (Dictionary in Kylin):
http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
Dictionary Source (Traditional Chinese):
https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big
================Test Result================
a. Dictionary Size:584429
--------
b. Build Time (ms) :
DAT : 5714
HashMap : 110
RadixTree : 22044
TrieDict : 855
--------
c. Memory footprint in 64-bit JVM (bytes) :
DAT : 16779752
HashMap : 32196592
RadixTree : 46130584
TrieDict : 10443608
--------
d. Retrieval Performance for 9935293 query times (ms) :
DAT : 585
HashMap : 1010
RadixTree : 417639
TrieDict : 8664
================Test Result================

================Test Result================
a. Dictionary Size:584429
--------
b. Build Time (ms) :
DAT : 5867
HashMap : 100
RadixTree : 22082
TrieDict : 840
--------
c. Memory footprint in 64-bit JVM (bytes) :
DAT : 16779752
HashMap : 32196592
RadixTree : 46130584
TrieDict : 10443608
--------
d. Retrieval Performance for 9935293 query times (ms) :
DAT : 593
HashMap : 821
RadixTree : 422297
TrieDict : 8752
================Test Result================

*2. Conclusion:*
a. TrieDict is good for building tree and less memory footprint overhead,
but worst retrieval performance,
b. DAT is a good tradeoff between memory footprint and retrieval
performance,
c. RadixTree has the worst performance in different aspects.

*3. Result Analysis:*
a. With Trie the memory footprint of the TrieDict mapping is kinda
minimized if compared to HashMap, in order to improve performance there is
a cache layer overlays on top of Trie.
b. Because a large number of duplicate prefix data, the total memory
footprint is more than trie, meanwhile i think calculating string hash code
of traditional Chinese consume considerable time overhead, so the
performance is not the best.
c. DAT is a better tradeoff.
d. I have no idea why RadixTree has the worst performance in terms of
memory, retrieval and building tree.

On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <[hidden email]>
wrote:

> Hi xiaoqiao
>
> ok, look forward to seeing your test result.
> Can you take this task for this improvement? Please let me know if you need
> any support :)
>
> Regards
> Liang
>
>
> hexiaoqiao wrote
> > Hi Kumar Vishal,
> >
> > Thanks for your suggestions. As you said, choose Trie replace HashMap we
> > can get better memory footprint and also good performance. Of course, DAT
> > is not only choice, and I will do test about DAT vs Radix Trie and
> release
> > the test result as soon as possible. Thanks your suggestions again.
> >
> > Regards,
> > Xiaoqiao
> >
> > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal <
>
> > kumarvishal1802@
>
> > >
> > wrote:
> >
> >> Hi XIaoqiao He,
> >> +1,
> >> For forward dictionary case it will be very good optimisation, as our
> >> case
> >> is very specific storing byte array to int mapping[data to surrogate key
> >> mapping], I think we will get much better memory footprint and
> >> performance
> >> will be also good(2x). We can also try radix tree(radix trie), it is
> more
> >> optimise for storage.
> >>
> >> -Regards
> >> Kumar Vishal
> >>
> >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen <
>
> > chenliang6136@
>
> > >
> >> wrote:
> >>
> >> > Hi xiaoqiao
> >> >
> >> > For the below example, 600K dictionary data:
> >> > It is to say that using "DAT" can save 36M memory against
> >> > "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
> >> >
> >> > One more question:if increases the dictionary data size, what's the
> >> > comparison results "ConcurrentHashMap" VS "DAT"
> >> >
> >> > Regards
> >> > Liang
> >> > ------------------------------------------------------------
> >> > ------------------------------------------
> >> > a. memory footprint (approximate quantity) in 64-bit JVM:
> >> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >> >
> >> > b. retrieval performance: total time(ms) of 500 million query:
> >> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >> >
> >> > Regards
> >> > Liang
> >> >
> >> > hexiaoqiao wrote
> >> > > hi Liang,
> >> > >
> >> > > Thanks for your reply, i need to correct the experiment result
> >> because
> >> > > it's
> >> > > wrong order NO.1 column of result data table.
> >> > >
> >> > > In order to compare performance between Trie and HashMap, Two
> >> different
> >> > > structures are constructed using the same dictionary data which size
> >> is
> >> > > 600K and each item's length is between 2 and 50 bytes.
> >> > >
> >> > > ConcurrentHashMap (structure which is used in CarbonData currently)
> >> vs
> >> > > Double
> >> > > Array Trie (one implementation of Trie Structures)
> >> > >
> >> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> >> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >> > >
> >> > > b. retrieval performance: total time(ms) of 500 million query:
> >> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >> > >
> >> > > Regards,
> >> > > He Xiaoqiao
> >> > >
> >> > >
> >> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <
> >> >
> >> > > chenliang6136@
> >> >
> >> > > > wrote:
> >> > >
> >> > >> Hi xiaoqiao
> >> > >>
> >> > >> This improvement looks great!
> >> > >> Can you please explain the below data, what does it mean?
> >> > >> ----------
> >> > >> ConcurrentHashMap
> >> > >> ~68MB 14543
> >> > >> Double Array Trie
> >> > >> ~104MB 12825
> >> > >>
> >> > >> Regards
> >> > >> Liang
> >> > >>
> >> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <
> >> >
> >> > > xq.he2009@
> >> >
> >> > > >:
> >> > >>
> >> > >> > Hi All,
> >> > >> >
> >> > >> > I would like to propose Dictionary improvement which using Trie
> in
> >> > >> place
> >> > >> of
> >> > >> > HashMap.
> >> > >> >
> >> > >> > In order to speedup aggregation, reduce run-time memory
> footprint,
> >> > >> enable
> >> > >> > fast
> >> > >> > distinct count etc, CarbonData encodes data using dictionary at
> >> file
> >> > >> level
> >> > >> > or table level based on cardinality. It is a general and
> efficient
> >> way
> >> > >> in
> >> > >> > many big data systems, but when apply ConcurrentHashMap
> >> > >> > to maintain Dictionary in CarbonData currently, memory overhead
> of
> >> > >> > Driver is very huge since it has to load whole Dictionary to
> >> decode
> >> > >> actual
> >> > >> > data value, especially column cardinality is a large number. and
> >> > >> CarbonData
> >> > >> > will not do dictionary if cardinality > 1 million at default
> >> behavior.
> >> > >> >
> >> > >> > I propose using Trie in place of HashMap for the following three
> >> > >> reasons:
> >> > >> > (1) Trie is a proper structure for Dictionary,
> >> > >> > (2) Reduce memory footprint,
> >> > >> > (3) Not impact retrieval performance
> >> > >> >
> >> > >> > The experimental results show that Trie is able to meet the
> >> > >> requirement.
> >> > >> > a. ConcurrentHashMap vs Double Array Trie
> >> > >> > <https://linux.thai.net/~thep/datrie/datrie.html>(one
> >> > >> implementation of
> >> > >> > Trie Structures)
> >> > >> > b. Dictionary size: 600K
> >> > >> > c. Memory footprint and query time
> >> > >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> >> > >> > ConcurrentHashMap
> >> > >> > ~68MB 14543
> >> > >> > Double Array Trie
> >> > >> > ~104MB 12825
> >> > >> >
> >> > >> > Please share your suggestions about the proposed improvement of
> >> > >> Dictionary.
> >> > >> >
> >> > >> > Regards
> >> > >> > He Xiaoqiao
> >> > >> >
> >> > >>
> >> > >>
> >> > >>
> >> > >> --
> >> > >> Regards
> >> > >> Liang
> >> > >>
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context: http://apache-carbondata-
> >> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> >> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> >> > Dictionary-tp3132p3143.html
> >> > Sent from the Apache CarbonData Mailing List archive mailing list
> >> archive
> >> > at Nabble.com.
> >> >
> >>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> Dictionary-tp3132p3186.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>