Apache CarbonData Dev Mailing List archive - Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Apache CarbonData Dev Mailing List archive

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Posted by hexiaoqiao on Nov 27, 2016; 9:27am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Improvement-Use-Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary-tp3132p3240.html

Hi Kumar Vishal,

I'll create task to trace this issue.
Thanks for your suggestions.

Regards，
He Xiaoqiao

On Sun, Nov 27, 2016 at 1:41 AM, Kumar Vishal <[hidden email]>
wrote:

> Hi Xiaoqiao He,
>
> You can go ahead with DAT implementation, based on the result.
> I will look forward for you PR.
>
> Please let me know if you need any support:).
>
> -Regards
> KUmar Vishal
>
> On Fri, Nov 25, 2016 at 11:22 PM, Xiaoqiao He <[hidden email]> wrote:
>
> > Hi Liang, Kumar Vishal,
> >
> > I has done a standard benchmark about multiply data structures for
> > Dictionary following your suggestions. Based on the test results, I think
> > DAT may be the best choice for CarbonData.
> >
> > *1. Here are 2 test results:*
> > -----------------------------------------------------------------------
> > Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for
> Dictionary
> > HashMap : java.util.HashMap
> > DAT (Double Array Trie):
> > https://github.com/komiya-atsushi/darts-java
> > RadixTree:
> > https://github.com/npgall/concurrent-trees
> > TrieDict (Dictionary in Kylin):
> > http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
> > Dictionary Source (Traditional Chinese):
> > https://raw.githubusercontent.com/fxsjy/jieba/master/extra_
> > dict/dict.txt.big
> > ================Test Result================
> > a. Dictionary Size:584429
> > --------
> > b. Build Time (ms) :
> > DAT : 5714
> > HashMap : 110
> > RadixTree : 22044
> > TrieDict : 855
> > --------
> > c. Memory footprint in 64-bit JVM (bytes) :
> > DAT : 16779752
> > HashMap : 32196592
> > RadixTree : 46130584
> > TrieDict : 10443608
> > --------
> > d. Retrieval Performance for 9935293 query times (ms) :
> > DAT : 585
> > HashMap : 1010
> > RadixTree : 417639
> > TrieDict : 8664
> > ================Test Result================
> >
> > ================Test Result================
> > a. Dictionary Size:584429
> > --------
> > b. Build Time (ms) :
> > DAT : 5867
> > HashMap : 100
> > RadixTree : 22082
> > TrieDict : 840
> > --------
> > c. Memory footprint in 64-bit JVM (bytes) :
> > DAT : 16779752
> > HashMap : 32196592
> > RadixTree : 46130584
> > TrieDict : 10443608
> > --------
> > d. Retrieval Performance for 9935293 query times (ms) :
> > DAT : 593
> > HashMap : 821
> > RadixTree : 422297
> > TrieDict : 8752
> > ================Test Result================
> >
> > *2. Conclusion:*
> > a. TrieDict is good for building tree and less memory footprint overhead,
> > but worst retrieval performance,
> > b. DAT is a good tradeoff between memory footprint and retrieval
> > performance,
> > c. RadixTree has the worst performance in different aspects.
> >
> > *3. Result Analysis:*
> > a. With Trie the memory footprint of the TrieDict mapping is kinda
> > minimized if compared to HashMap, in order to improve performance there
> is
> > a cache layer overlays on top of Trie.
> > b. Because a large number of duplicate prefix data, the total memory
> > footprint is more than trie, meanwhile i think calculating string hash
> code
> > of traditional Chinese consume considerable time overhead, so the
> > performance is not the best.
> > c. DAT is a better tradeoff.
> > d. I have no idea why RadixTree has the worst performance in terms of
> > memory, retrieval and building tree.
> >
> >
> > On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <[hidden email]>
> > wrote:
> >
> > > Hi xiaoqiao
> > >
> > > ok, look forward to seeing your test result.
> > > Can you take this task for this improvement? Please let me know if you
> > need
> > > any support :)
> > >
> > > Regards
> > > Liang
> > >
> > >
> > > hexiaoqiao wrote
> > > > Hi Kumar Vishal,
> > > >
> > > > Thanks for your suggestions. As you said, choose Trie replace HashMap
> > we
> > > > can get better memory footprint and also good performance. Of course,
> > DAT
> > > > is not only choice, and I will do test about DAT vs Radix Trie and
> > > release
> > > > the test result as soon as possible. Thanks your suggestions again.
> > > >
> > > > Regards,
> > > > Xiaoqiao
> > > >
> > > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal <
> > >
> > > > kumarvishal1802@
> > >
> > > > >
> > > > wrote:
> > > >
> > > >> Hi XIaoqiao He,
> > > >> +1,
> > > >> For forward dictionary case it will be very good optimisation, as
> our
> > > >> case
> > > >> is very specific storing byte array to int mapping[data to surrogate
> > key
> > > >> mapping], I think we will get much better memory footprint and
> > > >> performance
> > > >> will be also good(2x). We can also try radix tree(radix trie), it is
> > > more
> > > >> optimise for storage.
> > > >>
> > > >> -Regards
> > > >> Kumar Vishal
> > > >>
> > > >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen <
> > >
> > > > chenliang6136@
> > >
> > > > >
> > > >> wrote:
> > > >>
> > > >> > Hi xiaoqiao
> > > >> >
> > > >> > For the below example, 600K dictionary data:
> > > >> > It is to say that using "DAT" can save 36M memory against
> > > >> > "ConcurrentHashMap", whereas the performance just lost less
> > (1718ms) ?
> > > >> >
> > > >> > One more question:if increases the dictionary data size, what's
> the
> > > >> > comparison results "ConcurrentHashMap" VS "DAT"
> > > >> >
> > > >> > Regards
> > > >> > Liang
> > > >> > ------------------------------------------------------------
> > > >> > ------------------------------------------
> > > >> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > > >> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > > >> >
> > > >> > b. retrieval performance: total time(ms) of 500 million query:
> > > >> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > > >> >
> > > >> > Regards
> > > >> > Liang
> > > >> >
> > > >> > hexiaoqiao wrote
> > > >> > > hi Liang,
> > > >> > >
> > > >> > > Thanks for your reply, i need to correct the experiment result
> > > >> because
> > > >> > > it's
> > > >> > > wrong order NO.1 column of result data table.
> > > >> > >
> > > >> > > In order to compare performance between Trie and HashMap, Two
> > > >> different
> > > >> > > structures are constructed using the same dictionary data which
> > size
> > > >> is
> > > >> > > 600K and each item's length is between 2 and 50 bytes.
> > > >> > >
> > > >> > > ConcurrentHashMap (structure which is used in CarbonData
> > currently)
> > > >> vs
> > > >> > > Double
> > > >> > > Array Trie (one implementation of Trie Structures)
> > > >> > >
> > > >> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> > > >> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > > >> > >
> > > >> > > b. retrieval performance: total time(ms) of 500 million query:
> > > >> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > > >> > >
> > > >> > > Regards,
> > > >> > > He Xiaoqiao
> > > >> > >
> > > >> > >
> > > >> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <
> > > >> >
> > > >> > > chenliang6136@
> > > >> >
> > > >> > > > wrote:
> > > >> > >
> > > >> > >> Hi xiaoqiao
> > > >> > >>
> > > >> > >> This improvement looks great!
> > > >> > >> Can you please explain the below data, what does it mean?
> > > >> > >> ----------
> > > >> > >> ConcurrentHashMap
> > > >> > >> ~68MB 14543
> > > >> > >> Double Array Trie
> > > >> > >> ~104MB 12825
> > > >> > >>
> > > >> > >> Regards
> > > >> > >> Liang
> > > >> > >>
> > > >> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <
> > > >> >
> > > >> > > xq.he2009@
> > > >> >
> > > >> > > >:
> > > >> > >>
> > > >> > >> > Hi All,
> > > >> > >> >
> > > >> > >> > I would like to propose Dictionary improvement which using
> Trie
> > > in
> > > >> > >> place
> > > >> > >> of
> > > >> > >> > HashMap.
> > > >> > >> >
> > > >> > >> > In order to speedup aggregation, reduce run-time memory
> > > footprint,
> > > >> > >> enable
> > > >> > >> > fast
> > > >> > >> > distinct count etc, CarbonData encodes data using dictionary
> at
> > > >> file
> > > >> > >> level
> > > >> > >> > or table level based on cardinality. It is a general and
> > > efficient
> > > >> way
> > > >> > >> in
> > > >> > >> > many big data systems, but when apply ConcurrentHashMap
> > > >> > >> > to maintain Dictionary in CarbonData currently, memory
> overhead
> > > of
> > > >> > >> > Driver is very huge since it has to load whole Dictionary to
> > > >> decode
> > > >> > >> actual
> > > >> > >> > data value, especially column cardinality is a large number.
> > and
> > > >> > >> CarbonData
> > > >> > >> > will not do dictionary if cardinality > 1 million at default
> > > >> behavior.
> > > >> > >> >
> > > >> > >> > I propose using Trie in place of HashMap for the following
> > three
> > > >> > >> reasons:
> > > >> > >> > (1) Trie is a proper structure for Dictionary,
> > > >> > >> > (2) Reduce memory footprint,
> > > >> > >> > (3) Not impact retrieval performance
> > > >> > >> >
> > > >> > >> > The experimental results show that Trie is able to meet the
> > > >> > >> requirement.
> > > >> > >> > a. ConcurrentHashMap vs Double Array Trie
> > > >> > >> > <https://linux.thai.net/~thep/datrie/datrie.html>(one
> > > >> > >> implementation of
> > > >> > >> > Trie Structures)
> > > >> > >> > b. Dictionary size: 600K
> > > >> > >> > c. Memory footprint and query time
> > > >> > >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> > > >> > >> > ConcurrentHashMap
> > > >> > >> > ~68MB 14543
> > > >> > >> > Double Array Trie
> > > >> > >> > ~104MB 12825
> > > >> > >> >
> > > >> > >> > Please share your suggestions about the proposed improvement
> of
> > > >> > >> Dictionary.
> > > >> > >> >
> > > >> > >> > Regards
> > > >> > >> > He Xiaoqiao
> > > >> > >> >
> > > >> > >>
> > > >> > >>
> > > >> > >>
> > > >> > >> --
> > > >> > >> Regards
> > > >> > >> Liang
> > > >> > >>
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > View this message in context: http://apache-carbondata-
> > > >> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > > >> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > > >> > Dictionary-tp3132p3143.html
> > > >> > Sent from the Apache CarbonData Mailing List archive mailing list
> > > >> archive
> > > >> > at Nabble.com.
> > > >> >
> > > >>
> > >
> > >
> > >
> > >
> > >
> > > --
> > > View this message in context: http://apache-carbondata-
> > > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > > Dictionary-tp3132p3186.html
> > > Sent from the Apache CarbonData Mailing List archive mailing list
> archive
> > > at Nabble.com.
> > >
> >
>