http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Improvement-Use-Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary-tp3132p3134.html
> Hi All,
>
> I would like to propose Dictionary improvement which using Trie in place of
> HashMap.
>
> In order to speedup aggregation, reduce run-time memory footprint, enable
> fast
> distinct count etc, CarbonData encodes data using dictionary at file level
> or table level based on cardinality. It is a general and efficient way in
> many big data systems, but when apply ConcurrentHashMap
> to maintain Dictionary in CarbonData currently, memory overhead of
> Driver is very huge since it has to load whole Dictionary to decode actual
> data value, especially column cardinality is a large number. and CarbonData
> will not do dictionary if cardinality > 1 million at default behavior.
>
> I propose using Trie in place of HashMap for the following three reasons:
> (1) Trie is a proper structure for Dictionary,
> (2) Reduce memory footprint,
> (3) Not impact retrieval performance
>
> The experimental results show that Trie is able to meet the requirement.
> a. ConcurrentHashMap vs Double Array Trie
> <
https://linux.thai.net/~thep/datrie/datrie.html>(one implementation of
> Trie Structures)
> b. Dictionary size: 600K
> c. Memory footprint and query time
> - memory footprint (64-bit JVM) 500 million query time(ms)
> ConcurrentHashMap
> ~68MB 14543
> Double Array Trie
> ~104MB 12825
>
> Please share your suggestions about the proposed improvement of Dictionary.
>
> Regards
> He Xiaoqiao
>