Login  Register

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Posted by hexiaoqiao on Nov 29, 2016; 3:27am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Improvement-Use-Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary-tp3132p3327.html

Hi Jihong,

Thanks for your attentions and reply.
1. Actually I has done benchmark with English/Chinese dictionary size in
{100K,200K,300K,400K,500K,600K} separately, and test result is basic same
as mentioned in this mail flow before, I will submit and open the benchmark
code and dictionary source to github
<https://github.com/Hexiaoqiao/bigdata_algorithm_benchmark> as soon as
possible.
2. I also notice the license of DAT
<https://github.com/komiya-atsushi/darts-java>, and i think it's necessary
to re-implement another DAT following this paper:
https://linux.thai.net/~thep/datrie/datrie.html.

All kinds of suggestions are welcomed.

Regards,
He Xiaoqiao


On Tue, Nov 29, 2016 at 5:17 AM, Jihong Ma <[hidden email]> wrote:

> Thank you Xiaoqiao for looking into this issue and sharing your result!
>
> Have you tried varied dictionary size for comparison among all the
> alternatives?
>
> And please pay closer attention to the license of DAT implementation, as
> they are under LGPL, generally speaking, it is not legally allowed to be
> included.
>
> Jihong
>
> -----Original Message-----
> From: Xiaoqiao He [mailto:[hidden email]]
> Sent: Friday, November 25, 2016 9:52 AM
> To: [hidden email]
> Subject: Re: [Improvement] Use Trie in place of HashMap to reduce memory
> footprint of Dictionary
>
> Hi Liang, Kumar Vishal,
>
> I has done a standard benchmark about multiply data structures for
> Dictionary following your suggestions. Based on the test results, I think
> DAT may be the best choice for CarbonData.
>
> *1. Here are 2 test results:*
> -----------------------------------------------------------------------
> Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for Dictionary
>   HashMap :                               java.util.HashMap
>   DAT (Double Array Trie):
> https://github.com/komiya-atsushi/darts-java
>   RadixTree:
> https://github.com/npgall/concurrent-trees
>   TrieDict (Dictionary in Kylin):
> http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
> Dictionary Source (Traditional Chinese):
> https://raw.githubusercontent.com/fxsjy/jieba/master/extra_d
> ict/dict.txt.big
> ================Test Result================
> a. Dictionary Size:584429
> --------
> b. Build Time (ms) :
>    DAT       : 5714
>    HashMap   : 110
>    RadixTree : 22044
>    TrieDict  : 855
> --------
> c. Memory footprint in 64-bit JVM (bytes) :
>    DAT       : 16779752
>    HashMap   : 32196592
>    RadixTree : 46130584
>    TrieDict  : 10443608
> --------
> d. Retrieval Performance for 9935293 query times (ms) :
>    DAT       : 585
>    HashMap   : 1010
>    RadixTree : 417639
>    TrieDict  : 8664
> ================Test Result================
>
> ================Test Result================
> a. Dictionary Size:584429
> --------
> b. Build Time (ms) :
>    DAT       : 5867
>    HashMap   : 100
>    RadixTree : 22082
>    TrieDict  : 840
> --------
> c. Memory footprint in 64-bit JVM (bytes) :
>    DAT       : 16779752
>    HashMap   : 32196592
>    RadixTree : 46130584
>    TrieDict  : 10443608
> --------
> d. Retrieval Performance for 9935293 query times (ms) :
>    DAT       : 593
>    HashMap   : 821
>    RadixTree : 422297
>    TrieDict  : 8752
> ================Test Result================
>
> *2. Conclusion:*
> a. TrieDict is good for building tree and less memory footprint overhead,
> but worst retrieval performance,
> b. DAT is a good tradeoff between memory footprint and retrieval
> performance,
> c. RadixTree has the worst performance in different aspects.
>
> *3. Result Analysis:*
> a. With Trie the memory footprint of the TrieDict mapping is kinda
> minimized if compared to HashMap, in order to improve performance there is
> a cache layer overlays on top of Trie.
> b. Because a large number of duplicate prefix data, the total memory
> footprint is more than trie, meanwhile i think calculating string hash code
> of traditional Chinese consume considerable time overhead, so the
> performance is not the best.
> c. DAT is a better tradeoff.
> d. I have no idea why RadixTree has the worst performance in terms of
> memory, retrieval and building tree.
>
>
> On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <[hidden email]>
> wrote:
>
> > Hi xiaoqiao
> >
> > ok, look forward to seeing your test result.
> > Can you take this task for this improvement? Please let me know if you
> need
> > any support :)
> >
> > Regards
> > Liang
> >
> >
> > hexiaoqiao wrote
> > > Hi Kumar Vishal,
> > >
> > > Thanks for your suggestions. As you said, choose Trie replace HashMap
> we
> > > can get better memory footprint and also good performance. Of course,
> DAT
> > > is not only choice, and I will do test about DAT vs Radix Trie and
> > release
> > > the test result as soon as possible. Thanks your suggestions again.
> > >
> > > Regards,
> > > Xiaoqiao
> > >
> > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal &lt;
> >
> > > kumarvishal1802@
> >
> > > &gt;
> > > wrote:
> > >
> > >> Hi XIaoqiao He,
> > >> +1,
> > >> For forward dictionary case it will be very good optimisation, as our
> > >> case
> > >> is very specific storing byte array to int mapping[data to surrogate
> key
> > >> mapping], I think we will get much better memory footprint and
> > >> performance
> > >> will be also good(2x). We can also try radix tree(radix trie), it is
> > more
> > >> optimise for storage.
> > >>
> > >> -Regards
> > >> Kumar Vishal
> > >>
> > >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen &lt;
> >
> > > chenliang6136@
> >
> > > &gt;
> > >> wrote:
> > >>
> > >> > Hi xiaoqiao
> > >> >
> > >> > For the below example, 600K dictionary data:
> > >> > It is to say that using "DAT" can save 36M memory against
> > >> > "ConcurrentHashMap", whereas the performance just lost less
> (1718ms) ?
> > >> >
> > >> > One more question:if increases the dictionary data size, what's the
> > >> > comparison results "ConcurrentHashMap" VS "DAT"
> > >> >
> > >> > Regards
> > >> > Liang
> > >> > ------------------------------------------------------------
> > >> > ------------------------------------------
> > >> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > >> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > >> >
> > >> > b. retrieval performance: total time(ms) of 500 million query:
> > >> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > >> >
> > >> > Regards
> > >> > Liang
> > >> >
> > >> > hexiaoqiao wrote
> > >> > > hi Liang,
> > >> > >
> > >> > > Thanks for your reply, i need to correct the experiment result
> > >> because
> > >> > > it's
> > >> > > wrong order NO.1 column of result data table.
> > >> > >
> > >> > > In order to compare performance between Trie and HashMap, Two
> > >> different
> > >> > > structures are constructed using the same dictionary data which
> size
> > >> is
> > >> > > 600K and each item's length is between 2 and 50 bytes.
> > >> > >
> > >> > > ConcurrentHashMap (structure which is used in CarbonData
> currently)
> > >> vs
> > >> > > Double
> > >> > > Array Trie (one implementation of Trie Structures)
> > >> > >
> > >> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> > >> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > >> > >
> > >> > > b. retrieval performance: total time(ms) of 500 million query:
> > >> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > >> > >
> > >> > > Regards,
> > >> > > He Xiaoqiao
> > >> > >
> > >> > >
> > >> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen &lt;
> > >> >
> > >> > > chenliang6136@
> > >> >
> > >> > > &gt; wrote:
> > >> > >
> > >> > >> Hi xiaoqiao
> > >> > >>
> > >> > >> This improvement looks great!
> > >> > >> Can you please explain the below data, what does it mean?
> > >> > >> ----------
> > >> > >> ConcurrentHashMap
> > >> > >> ~68MB 14543
> > >> > >> Double Array Trie
> > >> > >> ~104MB 12825
> > >> > >>
> > >> > >> Regards
> > >> > >> Liang
> > >> > >>
> > >> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He &lt;
> > >> >
> > >> > > xq.he2009@
> > >> >
> > >> > > &gt;:
> > >> > >>
> > >> > >> >  Hi All,
> > >> > >> >
> > >> > >> > I would like to propose Dictionary improvement which using Trie
> > in
> > >> > >> place
> > >> > >> of
> > >> > >> > HashMap.
> > >> > >> >
> > >> > >> > In order to speedup aggregation, reduce run-time memory
> > footprint,
> > >> > >> enable
> > >> > >> > fast
> > >> > >> > distinct count etc, CarbonData encodes data using dictionary at
> > >> file
> > >> > >> level
> > >> > >> > or table level based on cardinality. It is a general and
> > efficient
> > >> way
> > >> > >> in
> > >> > >> > many big data systems, but when apply ConcurrentHashMap
> > >> > >> > to maintain Dictionary in CarbonData currently, memory overhead
> > of
> > >> > >> > Driver is very huge since it has to load whole Dictionary to
> > >> decode
> > >> > >> actual
> > >> > >> > data value, especially column cardinality is a large number.
> and
> > >> > >> CarbonData
> > >> > >> > will not do dictionary if cardinality > 1 million at default
> > >> behavior.
> > >> > >> >
> > >> > >> > I propose using Trie in place of HashMap for the following
> three
> > >> > >> reasons:
> > >> > >> > (1) Trie is a proper structure for Dictionary,
> > >> > >> > (2) Reduce memory footprint,
> > >> > >> > (3) Not impact retrieval performance
> > >> > >> >
> > >> > >> > The experimental results show that Trie is able to meet the
> > >> > >> requirement.
> > >> > >> > a. ConcurrentHashMap vs Double Array Trie
> > >> > >> > &lt;https://linux.thai.net/~thep/datrie/datrie.html&gt;(one
> > >> > >> implementation of
> > >> > >> > Trie Structures)
> > >> > >> > b. Dictionary size: 600K
> > >> > >> > c. Memory footprint and query time
> > >> > >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> > >> > >> > ConcurrentHashMap
> > >> > >> > ~68MB 14543
> > >> > >> > Double Array Trie
> > >> > >> > ~104MB 12825
> > >> > >> >
> > >> > >> > Please share your suggestions about the proposed improvement of
> > >> > >> Dictionary.
> > >> > >> >
> > >> > >> > Regards
> > >> > >> > He Xiaoqiao
> > >> > >> >
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> --
> > >> > >> Regards
> > >> > >> Liang
> > >> > >>
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > View this message in context: http://apache-carbondata-
> > >> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > >> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > >> > Dictionary-tp3132p3143.html
> > >> > Sent from the Apache CarbonData Mailing List archive mailing list
> > >> archive
> > >> > at Nabble.com.
> > >> >
> > >>
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > Dictionary-tp3132p3186.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>