[Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

[Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

hexiaoqiao
 Hi All,

I would like to propose Dictionary improvement which using Trie in place of
HashMap.

In order to speedup aggregation, reduce run-time memory footprint, enable fast
distinct count etc, CarbonData encodes data using dictionary at file level
or table level based on cardinality. It is a general and efficient way in
many big data systems, but when apply ConcurrentHashMap
to maintain Dictionary in CarbonData currently, memory overhead of
Driver is very huge since it has to load whole Dictionary to decode actual
data value, especially column cardinality is a large number. and CarbonData
will not do dictionary if cardinality > 1 million at default behavior.

I propose using Trie in place of HashMap for the following three reasons:
(1) Trie is a proper structure for Dictionary,
(2) Reduce memory footprint,
(3) Not impact retrieval performance

The experimental results show that Trie is able to meet the requirement.
a. ConcurrentHashMap vs Double Array Trie
<https://linux.thai.net/~thep/datrie/datrie.html>(one implementation of
Trie Structures)
b. Dictionary size: 600K
c. Memory footprint and query time
- memory footprint (64-bit JVM) 500 million query time(ms)
ConcurrentHashMap
~68MB 14543
Double Array Trie
~104MB 12825

Please share your suggestions about the proposed improvement of Dictionary.

Regards
He Xiaoqiao
Reply | Threaded
Open this post in threaded view
|

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Liang Chen
Administrator
Hi xiaoqiao

This improvement looks great!
Can you please explain the below data, what does it mean?
----------
ConcurrentHashMap
~68MB 14543
Double Array Trie
~104MB 12825

Regards
Liang

2016-11-24 2:04 GMT+08:00 Xiaoqiao He <[hidden email]>:

>  Hi All,
>
> I would like to propose Dictionary improvement which using Trie in place of
> HashMap.
>
> In order to speedup aggregation, reduce run-time memory footprint, enable
> fast
> distinct count etc, CarbonData encodes data using dictionary at file level
> or table level based on cardinality. It is a general and efficient way in
> many big data systems, but when apply ConcurrentHashMap
> to maintain Dictionary in CarbonData currently, memory overhead of
> Driver is very huge since it has to load whole Dictionary to decode actual
> data value, especially column cardinality is a large number. and CarbonData
> will not do dictionary if cardinality > 1 million at default behavior.
>
> I propose using Trie in place of HashMap for the following three reasons:
> (1) Trie is a proper structure for Dictionary,
> (2) Reduce memory footprint,
> (3) Not impact retrieval performance
>
> The experimental results show that Trie is able to meet the requirement.
> a. ConcurrentHashMap vs Double Array Trie
> <https://linux.thai.net/~thep/datrie/datrie.html>(one implementation of
> Trie Structures)
> b. Dictionary size: 600K
> c. Memory footprint and query time
> - memory footprint (64-bit JVM) 500 million query time(ms)
> ConcurrentHashMap
> ~68MB 14543
> Double Array Trie
> ~104MB 12825
>
> Please share your suggestions about the proposed improvement of Dictionary.
>
> Regards
> He Xiaoqiao
>



--
Regards
Liang
Reply | Threaded
Open this post in threaded view
|

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

hexiaoqiao
hi Liang,

Thanks for your reply, i need to correct the experiment result because it's
wrong order NO.1 column of result data table.

In order to compare performance between Trie and HashMap, Two different
structures are constructed using the same dictionary data which size is
600K and each item's length is between 2 and 50 bytes.

ConcurrentHashMap (structure which is used in CarbonData currently) vs Double
Array Trie (one implementation of Trie Structures)

a. memory footprint (approximate quantity) in 64-bit JVM:
~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)

b. retrieval performance: total time(ms) of 500 million query:
12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)

Regards,
He Xiaoqiao


On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <[hidden email]> wrote:

> Hi xiaoqiao
>
> This improvement looks great!
> Can you please explain the below data, what does it mean?
> ----------
> ConcurrentHashMap
> ~68MB 14543
> Double Array Trie
> ~104MB 12825
>
> Regards
> Liang
>
> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <[hidden email]>:
>
> >  Hi All,
> >
> > I would like to propose Dictionary improvement which using Trie in place
> of
> > HashMap.
> >
> > In order to speedup aggregation, reduce run-time memory footprint, enable
> > fast
> > distinct count etc, CarbonData encodes data using dictionary at file
> level
> > or table level based on cardinality. It is a general and efficient way in
> > many big data systems, but when apply ConcurrentHashMap
> > to maintain Dictionary in CarbonData currently, memory overhead of
> > Driver is very huge since it has to load whole Dictionary to decode
> actual
> > data value, especially column cardinality is a large number. and
> CarbonData
> > will not do dictionary if cardinality > 1 million at default behavior.
> >
> > I propose using Trie in place of HashMap for the following three reasons:
> > (1) Trie is a proper structure for Dictionary,
> > (2) Reduce memory footprint,
> > (3) Not impact retrieval performance
> >
> > The experimental results show that Trie is able to meet the requirement.
> > a. ConcurrentHashMap vs Double Array Trie
> > <https://linux.thai.net/~thep/datrie/datrie.html>(one implementation of
> > Trie Structures)
> > b. Dictionary size: 600K
> > c. Memory footprint and query time
> > - memory footprint (64-bit JVM) 500 million query time(ms)
> > ConcurrentHashMap
> > ~68MB 14543
> > Double Array Trie
> > ~104MB 12825
> >
> > Please share your suggestions about the proposed improvement of
> Dictionary.
> >
> > Regards
> > He Xiaoqiao
> >
>
>
>
> --
> Regards
> Liang
>
Reply | Threaded
Open this post in threaded view
|

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Liang Chen
Administrator
Hi xiaoqiao

For the below example, 600K dictionary data:
It is to say that using "DAT" can save 36M memory against "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?

One more question:if increases the dictionary data size, what's the comparison results "ConcurrentHashMap" VS "DAT"

Regards
Liang
------------------------------------------------------------------------------------------------------
a. memory footprint (approximate quantity) in 64-bit JVM:
~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)

b. retrieval performance: total time(ms) of 500 million query:
12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)

Regards
Liang
hexiaoqiao wrote
hi Liang,

Thanks for your reply, i need to correct the experiment result because it's
wrong order NO.1 column of result data table.

In order to compare performance between Trie and HashMap, Two different
structures are constructed using the same dictionary data which size is
600K and each item's length is between 2 and 50 bytes.

ConcurrentHashMap (structure which is used in CarbonData currently) vs Double
Array Trie (one implementation of Trie Structures)

a. memory footprint (approximate quantity) in 64-bit JVM:
~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)

b. retrieval performance: total time(ms) of 500 million query:
12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)

Regards,
He Xiaoqiao


On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <[hidden email]> wrote:

> Hi xiaoqiao
>
> This improvement looks great!
> Can you please explain the below data, what does it mean?
> ----------
> ConcurrentHashMap
> ~68MB 14543
> Double Array Trie
> ~104MB 12825
>
> Regards
> Liang
>
> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <[hidden email]>:
>
> >  Hi All,
> >
> > I would like to propose Dictionary improvement which using Trie in place
> of
> > HashMap.
> >
> > In order to speedup aggregation, reduce run-time memory footprint, enable
> > fast
> > distinct count etc, CarbonData encodes data using dictionary at file
> level
> > or table level based on cardinality. It is a general and efficient way in
> > many big data systems, but when apply ConcurrentHashMap
> > to maintain Dictionary in CarbonData currently, memory overhead of
> > Driver is very huge since it has to load whole Dictionary to decode
> actual
> > data value, especially column cardinality is a large number. and
> CarbonData
> > will not do dictionary if cardinality > 1 million at default behavior.
> >
> > I propose using Trie in place of HashMap for the following three reasons:
> > (1) Trie is a proper structure for Dictionary,
> > (2) Reduce memory footprint,
> > (3) Not impact retrieval performance
> >
> > The experimental results show that Trie is able to meet the requirement.
> > a. ConcurrentHashMap vs Double Array Trie
> > <https://linux.thai.net/~thep/datrie/datrie.html>(one implementation of
> > Trie Structures)
> > b. Dictionary size: 600K
> > c. Memory footprint and query time
> > - memory footprint (64-bit JVM) 500 million query time(ms)
> > ConcurrentHashMap
> > ~68MB 14543
> > Double Array Trie
> > ~104MB 12825
> >
> > Please share your suggestions about the proposed improvement of
> Dictionary.
> >
> > Regards
> > He Xiaoqiao
> >
>
>
>
> --
> Regards
> Liang
>
Reply | Threaded
Open this post in threaded view
|

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

kumarvishal09
Hi XIaoqiao He,
+1,
For forward dictionary case it will be very good optimisation, as our case
is very specific storing byte array to int mapping[data to surrogate key
mapping], I think we will get much better memory footprint and performance
will be also good(2x). We can also try radix tree(radix trie), it is more
optimise for storage.

-Regards
Kumar Vishal

On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen <[hidden email]>
wrote:

> Hi xiaoqiao
>
> For the below example, 600K dictionary data:
> It is to say that using "DAT" can save 36M memory against
> "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
>
> One more question:if increases the dictionary data size, what's the
> comparison results "ConcurrentHashMap" VS "DAT"
>
> Regards
> Liang
> ------------------------------------------------------------
> ------------------------------------------
> a. memory footprint (approximate quantity) in 64-bit JVM:
> ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
>
> b. retrieval performance: total time(ms) of 500 million query:
> 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
>
> Regards
> Liang
>
> hexiaoqiao wrote
> > hi Liang,
> >
> > Thanks for your reply, i need to correct the experiment result because
> > it's
> > wrong order NO.1 column of result data table.
> >
> > In order to compare performance between Trie and HashMap, Two different
> > structures are constructed using the same dictionary data which size is
> > 600K and each item's length is between 2 and 50 bytes.
> >
> > ConcurrentHashMap (structure which is used in CarbonData currently) vs
> > Double
> > Array Trie (one implementation of Trie Structures)
> >
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards,
> > He Xiaoqiao
> >
> >
> > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen &lt;
>
> > chenliang6136@
>
> > &gt; wrote:
> >
> >> Hi xiaoqiao
> >>
> >> This improvement looks great!
> >> Can you please explain the below data, what does it mean?
> >> ----------
> >> ConcurrentHashMap
> >> ~68MB 14543
> >> Double Array Trie
> >> ~104MB 12825
> >>
> >> Regards
> >> Liang
> >>
> >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He &lt;
>
> > xq.he2009@
>
> > &gt;:
> >>
> >> >  Hi All,
> >> >
> >> > I would like to propose Dictionary improvement which using Trie in
> >> place
> >> of
> >> > HashMap.
> >> >
> >> > In order to speedup aggregation, reduce run-time memory footprint,
> >> enable
> >> > fast
> >> > distinct count etc, CarbonData encodes data using dictionary at file
> >> level
> >> > or table level based on cardinality. It is a general and efficient way
> >> in
> >> > many big data systems, but when apply ConcurrentHashMap
> >> > to maintain Dictionary in CarbonData currently, memory overhead of
> >> > Driver is very huge since it has to load whole Dictionary to decode
> >> actual
> >> > data value, especially column cardinality is a large number. and
> >> CarbonData
> >> > will not do dictionary if cardinality > 1 million at default behavior.
> >> >
> >> > I propose using Trie in place of HashMap for the following three
> >> reasons:
> >> > (1) Trie is a proper structure for Dictionary,
> >> > (2) Reduce memory footprint,
> >> > (3) Not impact retrieval performance
> >> >
> >> > The experimental results show that Trie is able to meet the
> >> requirement.
> >> > a. ConcurrentHashMap vs Double Array Trie
> >> > &lt;https://linux.thai.net/~thep/datrie/datrie.html&gt;(one
> >> implementation of
> >> > Trie Structures)
> >> > b. Dictionary size: 600K
> >> > c. Memory footprint and query time
> >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> >> > ConcurrentHashMap
> >> > ~68MB 14543
> >> > Double Array Trie
> >> > ~104MB 12825
> >> >
> >> > Please share your suggestions about the proposed improvement of
> >> Dictionary.
> >> >
> >> > Regards
> >> > He Xiaoqiao
> >> >
> >>
> >>
> >>
> >> --
> >> Regards
> >> Liang
> >>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> Dictionary-tp3132p3143.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>
kumar vishal
Reply | Threaded
Open this post in threaded view
|

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

hexiaoqiao
In reply to this post by Liang Chen
Hi Liang,

Generally, yes, because the same prefix of items in dictionary does not
require to repeat in DAT, and more data better result.

Actually the cost of DAT is building Tree, and i don't think we need to
consider it since this cost appears only once when load data.

FYI.

Regards,
Xiaoqiao

On Thu, Nov 24, 2016 at 2:42 PM, Liang Chen <[hidden email]> wrote:

> Hi xiaoqiao
>
> For the below example, 600K dictionary data:
> It is to say that using "DAT" can save 36M memory against
> "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
>
> One more question:if increases the dictionary data size, what's the
> comparison results "ConcurrentHashMap" VS "DAT"
>
> Regards
> Liang
> ------------------------------------------------------------
> ------------------------------------------
> a. memory footprint (approximate quantity) in 64-bit JVM:
> ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
>
> b. retrieval performance: total time(ms) of 500 million query:
> 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
>
> Regards
> Liang
>
> hexiaoqiao wrote
> > hi Liang,
> >
> > Thanks for your reply, i need to correct the experiment result because
> > it's
> > wrong order NO.1 column of result data table.
> >
> > In order to compare performance between Trie and HashMap, Two different
> > structures are constructed using the same dictionary data which size is
> > 600K and each item's length is between 2 and 50 bytes.
> >
> > ConcurrentHashMap (structure which is used in CarbonData currently) vs
> > Double
> > Array Trie (one implementation of Trie Structures)
> >
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards,
> > He Xiaoqiao
> >
> >
> > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen &lt;
>
> > chenliang6136@
>
> > &gt; wrote:
> >
> >> Hi xiaoqiao
> >>
> >> This improvement looks great!
> >> Can you please explain the below data, what does it mean?
> >> ----------
> >> ConcurrentHashMap
> >> ~68MB 14543
> >> Double Array Trie
> >> ~104MB 12825
> >>
> >> Regards
> >> Liang
> >>
> >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He &lt;
>
> > xq.he2009@
>
> > &gt;:
> >>
> >> >  Hi All,
> >> >
> >> > I would like to propose Dictionary improvement which using Trie in
> >> place
> >> of
> >> > HashMap.
> >> >
> >> > In order to speedup aggregation, reduce run-time memory footprint,
> >> enable
> >> > fast
> >> > distinct count etc, CarbonData encodes data using dictionary at file
> >> level
> >> > or table level based on cardinality. It is a general and efficient way
> >> in
> >> > many big data systems, but when apply ConcurrentHashMap
> >> > to maintain Dictionary in CarbonData currently, memory overhead of
> >> > Driver is very huge since it has to load whole Dictionary to decode
> >> actual
> >> > data value, especially column cardinality is a large number. and
> >> CarbonData
> >> > will not do dictionary if cardinality > 1 million at default behavior.
> >> >
> >> > I propose using Trie in place of HashMap for the following three
> >> reasons:
> >> > (1) Trie is a proper structure for Dictionary,
> >> > (2) Reduce memory footprint,
> >> > (3) Not impact retrieval performance
> >> >
> >> > The experimental results show that Trie is able to meet the
> >> requirement.
> >> > a. ConcurrentHashMap vs Double Array Trie
> >> > &lt;https://linux.thai.net/~thep/datrie/datrie.html&gt;(one
> >> implementation of
> >> > Trie Structures)
> >> > b. Dictionary size: 600K
> >> > c. Memory footprint and query time
> >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> >> > ConcurrentHashMap
> >> > ~68MB 14543
> >> > Double Array Trie
> >> > ~104MB 12825
> >> >
> >> > Please share your suggestions about the proposed improvement of
> >> Dictionary.
> >> >
> >> > Regards
> >> > He Xiaoqiao
> >> >
> >>
> >>
> >>
> >> --
> >> Regards
> >> Liang
> >>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-maili
> ng-list-archive.1130556.n5.nabble.com/Improvement-Use-Trie-
> in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary
> -tp3132p3143.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>
Reply | Threaded
Open this post in threaded view
|

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

hexiaoqiao
In reply to this post by kumarvishal09
Hi Kumar Vishal,

Thanks for your suggestions. As you said, choose Trie replace HashMap we
can get better memory footprint and also good performance. Of course, DAT
is not only choice, and I will do test about DAT vs Radix Trie and release
the test result as soon as possible. Thanks your suggestions again.

Regards,
Xiaoqiao

On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal <[hidden email]>
wrote:

> Hi XIaoqiao He,
> +1,
> For forward dictionary case it will be very good optimisation, as our case
> is very specific storing byte array to int mapping[data to surrogate key
> mapping], I think we will get much better memory footprint and performance
> will be also good(2x). We can also try radix tree(radix trie), it is more
> optimise for storage.
>
> -Regards
> Kumar Vishal
>
> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen <[hidden email]>
> wrote:
>
> > Hi xiaoqiao
> >
> > For the below example, 600K dictionary data:
> > It is to say that using "DAT" can save 36M memory against
> > "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
> >
> > One more question:if increases the dictionary data size, what's the
> > comparison results "ConcurrentHashMap" VS "DAT"
> >
> > Regards
> > Liang
> > ------------------------------------------------------------
> > ------------------------------------------
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards
> > Liang
> >
> > hexiaoqiao wrote
> > > hi Liang,
> > >
> > > Thanks for your reply, i need to correct the experiment result because
> > > it's
> > > wrong order NO.1 column of result data table.
> > >
> > > In order to compare performance between Trie and HashMap, Two different
> > > structures are constructed using the same dictionary data which size is
> > > 600K and each item's length is between 2 and 50 bytes.
> > >
> > > ConcurrentHashMap (structure which is used in CarbonData currently) vs
> > > Double
> > > Array Trie (one implementation of Trie Structures)
> > >
> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > >
> > > b. retrieval performance: total time(ms) of 500 million query:
> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > >
> > > Regards,
> > > He Xiaoqiao
> > >
> > >
> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen &lt;
> >
> > > chenliang6136@
> >
> > > &gt; wrote:
> > >
> > >> Hi xiaoqiao
> > >>
> > >> This improvement looks great!
> > >> Can you please explain the below data, what does it mean?
> > >> ----------
> > >> ConcurrentHashMap
> > >> ~68MB 14543
> > >> Double Array Trie
> > >> ~104MB 12825
> > >>
> > >> Regards
> > >> Liang
> > >>
> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He &lt;
> >
> > > xq.he2009@
> >
> > > &gt;:
> > >>
> > >> >  Hi All,
> > >> >
> > >> > I would like to propose Dictionary improvement which using Trie in
> > >> place
> > >> of
> > >> > HashMap.
> > >> >
> > >> > In order to speedup aggregation, reduce run-time memory footprint,
> > >> enable
> > >> > fast
> > >> > distinct count etc, CarbonData encodes data using dictionary at file
> > >> level
> > >> > or table level based on cardinality. It is a general and efficient
> way
> > >> in
> > >> > many big data systems, but when apply ConcurrentHashMap
> > >> > to maintain Dictionary in CarbonData currently, memory overhead of
> > >> > Driver is very huge since it has to load whole Dictionary to decode
> > >> actual
> > >> > data value, especially column cardinality is a large number. and
> > >> CarbonData
> > >> > will not do dictionary if cardinality > 1 million at default
> behavior.
> > >> >
> > >> > I propose using Trie in place of HashMap for the following three
> > >> reasons:
> > >> > (1) Trie is a proper structure for Dictionary,
> > >> > (2) Reduce memory footprint,
> > >> > (3) Not impact retrieval performance
> > >> >
> > >> > The experimental results show that Trie is able to meet the
> > >> requirement.
> > >> > a. ConcurrentHashMap vs Double Array Trie
> > >> > &lt;https://linux.thai.net/~thep/datrie/datrie.html&gt;(one
> > >> implementation of
> > >> > Trie Structures)
> > >> > b. Dictionary size: 600K
> > >> > c. Memory footprint and query time
> > >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> > >> > ConcurrentHashMap
> > >> > ~68MB 14543
> > >> > Double Array Trie
> > >> > ~104MB 12825
> > >> >
> > >> > Please share your suggestions about the proposed improvement of
> > >> Dictionary.
> > >> >
> > >> > Regards
> > >> > He Xiaoqiao
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Regards
> > >> Liang
> > >>
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > Dictionary-tp3132p3143.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Liang Chen
Administrator
Hi xiaoqiao

ok, look forward to seeing your test result.
Can you take this task for this improvement? Please let me know if you need any support :)

Regards
Liang

hexiaoqiao wrote
Hi Kumar Vishal,

Thanks for your suggestions. As you said, choose Trie replace HashMap we
can get better memory footprint and also good performance. Of course, DAT
is not only choice, and I will do test about DAT vs Radix Trie and release
the test result as soon as possible. Thanks your suggestions again.

Regards,
Xiaoqiao

On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal <[hidden email]>
wrote:

> Hi XIaoqiao He,
> +1,
> For forward dictionary case it will be very good optimisation, as our case
> is very specific storing byte array to int mapping[data to surrogate key
> mapping], I think we will get much better memory footprint and performance
> will be also good(2x). We can also try radix tree(radix trie), it is more
> optimise for storage.
>
> -Regards
> Kumar Vishal
>
> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen <[hidden email]>
> wrote:
>
> > Hi xiaoqiao
> >
> > For the below example, 600K dictionary data:
> > It is to say that using "DAT" can save 36M memory against
> > "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
> >
> > One more question:if increases the dictionary data size, what's the
> > comparison results "ConcurrentHashMap" VS "DAT"
> >
> > Regards
> > Liang
> > ------------------------------------------------------------
> > ------------------------------------------
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards
> > Liang
> >
> > hexiaoqiao wrote
> > > hi Liang,
> > >
> > > Thanks for your reply, i need to correct the experiment result because
> > > it's
> > > wrong order NO.1 column of result data table.
> > >
> > > In order to compare performance between Trie and HashMap, Two different
> > > structures are constructed using the same dictionary data which size is
> > > 600K and each item's length is between 2 and 50 bytes.
> > >
> > > ConcurrentHashMap (structure which is used in CarbonData currently) vs
> > > Double
> > > Array Trie (one implementation of Trie Structures)
> > >
> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > >
> > > b. retrieval performance: total time(ms) of 500 million query:
> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > >
> > > Regards,
> > > He Xiaoqiao
> > >
> > >
> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <
> >
> > > chenliang6136@
> >
> > > > wrote:
> > >
> > >> Hi xiaoqiao
> > >>
> > >> This improvement looks great!
> > >> Can you please explain the below data, what does it mean?
> > >> ----------
> > >> ConcurrentHashMap
> > >> ~68MB 14543
> > >> Double Array Trie
> > >> ~104MB 12825
> > >>
> > >> Regards
> > >> Liang
> > >>
> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <
> >
> > > xq.he2009@
> >
> > > >:
> > >>
> > >> >  Hi All,
> > >> >
> > >> > I would like to propose Dictionary improvement which using Trie in
> > >> place
> > >> of
> > >> > HashMap.
> > >> >
> > >> > In order to speedup aggregation, reduce run-time memory footprint,
> > >> enable
> > >> > fast
> > >> > distinct count etc, CarbonData encodes data using dictionary at file
> > >> level
> > >> > or table level based on cardinality. It is a general and efficient
> way
> > >> in
> > >> > many big data systems, but when apply ConcurrentHashMap
> > >> > to maintain Dictionary in CarbonData currently, memory overhead of
> > >> > Driver is very huge since it has to load whole Dictionary to decode
> > >> actual
> > >> > data value, especially column cardinality is a large number. and
> > >> CarbonData
> > >> > will not do dictionary if cardinality > 1 million at default
> behavior.
> > >> >
> > >> > I propose using Trie in place of HashMap for the following three
> > >> reasons:
> > >> > (1) Trie is a proper structure for Dictionary,
> > >> > (2) Reduce memory footprint,
> > >> > (3) Not impact retrieval performance
> > >> >
> > >> > The experimental results show that Trie is able to meet the
> > >> requirement.
> > >> > a. ConcurrentHashMap vs Double Array Trie
> > >> > <https://linux.thai.net/~thep/datrie/datrie.html>(one
> > >> implementation of
> > >> > Trie Structures)
> > >> > b. Dictionary size: 600K
> > >> > c. Memory footprint and query time
> > >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> > >> > ConcurrentHashMap
> > >> > ~68MB 14543
> > >> > Double Array Trie
> > >> > ~104MB 12825
> > >> >
> > >> > Please share your suggestions about the proposed improvement of
> > >> Dictionary.
> > >> >
> > >> > Regards
> > >> > He Xiaoqiao
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Regards
> > >> Liang
> > >>
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > Dictionary-tp3132p3143.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

hexiaoqiao
Hi Liang, Kumar Vishal,

I has done a standard benchmark about multiply data structures for
Dictionary following your suggestions. Based on the test results, I think
DAT may be the best choice for CarbonData.

*1. Here are 2 test results:*
-----------------------------------------------------------------------
Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for Dictionary
  HashMap :                               java.util.HashMap
  DAT (Double Array Trie):
https://github.com/komiya-atsushi/darts-java
  RadixTree:
https://github.com/npgall/concurrent-trees
  TrieDict (Dictionary in Kylin):
http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
Dictionary Source (Traditional Chinese):
https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big
================Test Result================
a. Dictionary Size:584429
--------
b. Build Time (ms) :
   DAT       : 5714
   HashMap   : 110
   RadixTree : 22044
   TrieDict  : 855
--------
c. Memory footprint in 64-bit JVM (bytes) :
   DAT       : 16779752
   HashMap   : 32196592
   RadixTree : 46130584
   TrieDict  : 10443608
--------
d. Retrieval Performance for 9935293 query times (ms) :
   DAT       : 585
   HashMap   : 1010
   RadixTree : 417639
   TrieDict  : 8664
================Test Result================

================Test Result================
a. Dictionary Size:584429
--------
b. Build Time (ms) :
   DAT       : 5867
   HashMap   : 100
   RadixTree : 22082
   TrieDict  : 840
--------
c. Memory footprint in 64-bit JVM (bytes) :
   DAT       : 16779752
   HashMap   : 32196592
   RadixTree : 46130584
   TrieDict  : 10443608
--------
d. Retrieval Performance for 9935293 query times (ms) :
   DAT       : 593
   HashMap   : 821
   RadixTree : 422297
   TrieDict  : 8752
================Test Result================

*2. Conclusion:*
a. TrieDict is good for building tree and less memory footprint overhead,
but worst retrieval performance,
b. DAT is a good tradeoff between memory footprint and retrieval
performance,
c. RadixTree has the worst performance in different aspects.

*3. Result Analysis:*
a. With Trie the memory footprint of the TrieDict mapping is kinda
minimized if compared to HashMap, in order to improve performance there is
a cache layer overlays on top of Trie.
b. Because a large number of duplicate prefix data, the total memory
footprint is more than trie, meanwhile i think calculating string hash code
of traditional Chinese consume considerable time overhead, so the
performance is not the best.
c. DAT is a better tradeoff.
d. I have no idea why RadixTree has the worst performance in terms of
memory, retrieval and building tree.


On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <[hidden email]>
wrote:

> Hi xiaoqiao
>
> ok, look forward to seeing your test result.
> Can you take this task for this improvement? Please let me know if you need
> any support :)
>
> Regards
> Liang
>
>
> hexiaoqiao wrote
> > Hi Kumar Vishal,
> >
> > Thanks for your suggestions. As you said, choose Trie replace HashMap we
> > can get better memory footprint and also good performance. Of course, DAT
> > is not only choice, and I will do test about DAT vs Radix Trie and
> release
> > the test result as soon as possible. Thanks your suggestions again.
> >
> > Regards,
> > Xiaoqiao
> >
> > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal &lt;
>
> > kumarvishal1802@
>
> > &gt;
> > wrote:
> >
> >> Hi XIaoqiao He,
> >> +1,
> >> For forward dictionary case it will be very good optimisation, as our
> >> case
> >> is very specific storing byte array to int mapping[data to surrogate key
> >> mapping], I think we will get much better memory footprint and
> >> performance
> >> will be also good(2x). We can also try radix tree(radix trie), it is
> more
> >> optimise for storage.
> >>
> >> -Regards
> >> Kumar Vishal
> >>
> >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen &lt;
>
> > chenliang6136@
>
> > &gt;
> >> wrote:
> >>
> >> > Hi xiaoqiao
> >> >
> >> > For the below example, 600K dictionary data:
> >> > It is to say that using "DAT" can save 36M memory against
> >> > "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
> >> >
> >> > One more question:if increases the dictionary data size, what's the
> >> > comparison results "ConcurrentHashMap" VS "DAT"
> >> >
> >> > Regards
> >> > Liang
> >> > ------------------------------------------------------------
> >> > ------------------------------------------
> >> > a. memory footprint (approximate quantity) in 64-bit JVM:
> >> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >> >
> >> > b. retrieval performance: total time(ms) of 500 million query:
> >> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >> >
> >> > Regards
> >> > Liang
> >> >
> >> > hexiaoqiao wrote
> >> > > hi Liang,
> >> > >
> >> > > Thanks for your reply, i need to correct the experiment result
> >> because
> >> > > it's
> >> > > wrong order NO.1 column of result data table.
> >> > >
> >> > > In order to compare performance between Trie and HashMap, Two
> >> different
> >> > > structures are constructed using the same dictionary data which size
> >> is
> >> > > 600K and each item's length is between 2 and 50 bytes.
> >> > >
> >> > > ConcurrentHashMap (structure which is used in CarbonData currently)
> >> vs
> >> > > Double
> >> > > Array Trie (one implementation of Trie Structures)
> >> > >
> >> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> >> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >> > >
> >> > > b. retrieval performance: total time(ms) of 500 million query:
> >> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >> > >
> >> > > Regards,
> >> > > He Xiaoqiao
> >> > >
> >> > >
> >> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen &lt;
> >> >
> >> > > chenliang6136@
> >> >
> >> > > &gt; wrote:
> >> > >
> >> > >> Hi xiaoqiao
> >> > >>
> >> > >> This improvement looks great!
> >> > >> Can you please explain the below data, what does it mean?
> >> > >> ----------
> >> > >> ConcurrentHashMap
> >> > >> ~68MB 14543
> >> > >> Double Array Trie
> >> > >> ~104MB 12825
> >> > >>
> >> > >> Regards
> >> > >> Liang
> >> > >>
> >> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He &lt;
> >> >
> >> > > xq.he2009@
> >> >
> >> > > &gt;:
> >> > >>
> >> > >> >  Hi All,
> >> > >> >
> >> > >> > I would like to propose Dictionary improvement which using Trie
> in
> >> > >> place
> >> > >> of
> >> > >> > HashMap.
> >> > >> >
> >> > >> > In order to speedup aggregation, reduce run-time memory
> footprint,
> >> > >> enable
> >> > >> > fast
> >> > >> > distinct count etc, CarbonData encodes data using dictionary at
> >> file
> >> > >> level
> >> > >> > or table level based on cardinality. It is a general and
> efficient
> >> way
> >> > >> in
> >> > >> > many big data systems, but when apply ConcurrentHashMap
> >> > >> > to maintain Dictionary in CarbonData currently, memory overhead
> of
> >> > >> > Driver is very huge since it has to load whole Dictionary to
> >> decode
> >> > >> actual
> >> > >> > data value, especially column cardinality is a large number. and
> >> > >> CarbonData
> >> > >> > will not do dictionary if cardinality > 1 million at default
> >> behavior.
> >> > >> >
> >> > >> > I propose using Trie in place of HashMap for the following three
> >> > >> reasons:
> >> > >> > (1) Trie is a proper structure for Dictionary,
> >> > >> > (2) Reduce memory footprint,
> >> > >> > (3) Not impact retrieval performance
> >> > >> >
> >> > >> > The experimental results show that Trie is able to meet the
> >> > >> requirement.
> >> > >> > a. ConcurrentHashMap vs Double Array Trie
> >> > >> > &lt;https://linux.thai.net/~thep/datrie/datrie.html&gt;(one
> >> > >> implementation of
> >> > >> > Trie Structures)
> >> > >> > b. Dictionary size: 600K
> >> > >> > c. Memory footprint and query time
> >> > >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> >> > >> > ConcurrentHashMap
> >> > >> > ~68MB 14543
> >> > >> > Double Array Trie
> >> > >> > ~104MB 12825
> >> > >> >
> >> > >> > Please share your suggestions about the proposed improvement of
> >> > >> Dictionary.
> >> > >> >
> >> > >> > Regards
> >> > >> > He Xiaoqiao
> >> > >> >
> >> > >>
> >> > >>
> >> > >>
> >> > >> --
> >> > >> Regards
> >> > >> Liang
> >> > >>
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context: http://apache-carbondata-
> >> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> >> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> >> > Dictionary-tp3132p3143.html
> >> > Sent from the Apache CarbonData Mailing List archive mailing list
> >> archive
> >> > at Nabble.com.
> >> >
> >>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> Dictionary-tp3132p3186.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>
Reply | Threaded
Open this post in threaded view
|

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

kumarvishal09
Hi Xiaoqiao He,

You can go ahead with DAT implementation, based on the result.
I will look forward for you PR.

Please let me know if you need any support:).

-Regards
KUmar Vishal

On Fri, Nov 25, 2016 at 11:22 PM, Xiaoqiao He <[hidden email]> wrote:

> Hi Liang, Kumar Vishal,
>
> I has done a standard benchmark about multiply data structures for
> Dictionary following your suggestions. Based on the test results, I think
> DAT may be the best choice for CarbonData.
>
> *1. Here are 2 test results:*
> -----------------------------------------------------------------------
> Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for Dictionary
>   HashMap :                               java.util.HashMap
>   DAT (Double Array Trie):
> https://github.com/komiya-atsushi/darts-java
>   RadixTree:
> https://github.com/npgall/concurrent-trees
>   TrieDict (Dictionary in Kylin):
> http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
> Dictionary Source (Traditional Chinese):
> https://raw.githubusercontent.com/fxsjy/jieba/master/extra_
> dict/dict.txt.big
> ================Test Result================
> a. Dictionary Size:584429
> --------
> b. Build Time (ms) :
>    DAT       : 5714
>    HashMap   : 110
>    RadixTree : 22044
>    TrieDict  : 855
> --------
> c. Memory footprint in 64-bit JVM (bytes) :
>    DAT       : 16779752
>    HashMap   : 32196592
>    RadixTree : 46130584
>    TrieDict  : 10443608
> --------
> d. Retrieval Performance for 9935293 query times (ms) :
>    DAT       : 585
>    HashMap   : 1010
>    RadixTree : 417639
>    TrieDict  : 8664
> ================Test Result================
>
> ================Test Result================
> a. Dictionary Size:584429
> --------
> b. Build Time (ms) :
>    DAT       : 5867
>    HashMap   : 100
>    RadixTree : 22082
>    TrieDict  : 840
> --------
> c. Memory footprint in 64-bit JVM (bytes) :
>    DAT       : 16779752
>    HashMap   : 32196592
>    RadixTree : 46130584
>    TrieDict  : 10443608
> --------
> d. Retrieval Performance for 9935293 query times (ms) :
>    DAT       : 593
>    HashMap   : 821
>    RadixTree : 422297
>    TrieDict  : 8752
> ================Test Result================
>
> *2. Conclusion:*
> a. TrieDict is good for building tree and less memory footprint overhead,
> but worst retrieval performance,
> b. DAT is a good tradeoff between memory footprint and retrieval
> performance,
> c. RadixTree has the worst performance in different aspects.
>
> *3. Result Analysis:*
> a. With Trie the memory footprint of the TrieDict mapping is kinda
> minimized if compared to HashMap, in order to improve performance there is
> a cache layer overlays on top of Trie.
> b. Because a large number of duplicate prefix data, the total memory
> footprint is more than trie, meanwhile i think calculating string hash code
> of traditional Chinese consume considerable time overhead, so the
> performance is not the best.
> c. DAT is a better tradeoff.
> d. I have no idea why RadixTree has the worst performance in terms of
> memory, retrieval and building tree.
>
>
> On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <[hidden email]>
> wrote:
>
> > Hi xiaoqiao
> >
> > ok, look forward to seeing your test result.
> > Can you take this task for this improvement? Please let me know if you
> need
> > any support :)
> >
> > Regards
> > Liang
> >
> >
> > hexiaoqiao wrote
> > > Hi Kumar Vishal,
> > >
> > > Thanks for your suggestions. As you said, choose Trie replace HashMap
> we
> > > can get better memory footprint and also good performance. Of course,
> DAT
> > > is not only choice, and I will do test about DAT vs Radix Trie and
> > release
> > > the test result as soon as possible. Thanks your suggestions again.
> > >
> > > Regards,
> > > Xiaoqiao
> > >
> > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal &lt;
> >
> > > kumarvishal1802@
> >
> > > &gt;
> > > wrote:
> > >
> > >> Hi XIaoqiao He,
> > >> +1,
> > >> For forward dictionary case it will be very good optimisation, as our
> > >> case
> > >> is very specific storing byte array to int mapping[data to surrogate
> key
> > >> mapping], I think we will get much better memory footprint and
> > >> performance
> > >> will be also good(2x). We can also try radix tree(radix trie), it is
> > more
> > >> optimise for storage.
> > >>
> > >> -Regards
> > >> Kumar Vishal
> > >>
> > >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen &lt;
> >
> > > chenliang6136@
> >
> > > &gt;
> > >> wrote:
> > >>
> > >> > Hi xiaoqiao
> > >> >
> > >> > For the below example, 600K dictionary data:
> > >> > It is to say that using "DAT" can save 36M memory against
> > >> > "ConcurrentHashMap", whereas the performance just lost less
> (1718ms) ?
> > >> >
> > >> > One more question:if increases the dictionary data size, what's the
> > >> > comparison results "ConcurrentHashMap" VS "DAT"
> > >> >
> > >> > Regards
> > >> > Liang
> > >> > ------------------------------------------------------------
> > >> > ------------------------------------------
> > >> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > >> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > >> >
> > >> > b. retrieval performance: total time(ms) of 500 million query:
> > >> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > >> >
> > >> > Regards
> > >> > Liang
> > >> >
> > >> > hexiaoqiao wrote
> > >> > > hi Liang,
> > >> > >
> > >> > > Thanks for your reply, i need to correct the experiment result
> > >> because
> > >> > > it's
> > >> > > wrong order NO.1 column of result data table.
> > >> > >
> > >> > > In order to compare performance between Trie and HashMap, Two
> > >> different
> > >> > > structures are constructed using the same dictionary data which
> size
> > >> is
> > >> > > 600K and each item's length is between 2 and 50 bytes.
> > >> > >
> > >> > > ConcurrentHashMap (structure which is used in CarbonData
> currently)
> > >> vs
> > >> > > Double
> > >> > > Array Trie (one implementation of Trie Structures)
> > >> > >
> > >> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> > >> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > >> > >
> > >> > > b. retrieval performance: total time(ms) of 500 million query:
> > >> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > >> > >
> > >> > > Regards,
> > >> > > He Xiaoqiao
> > >> > >
> > >> > >
> > >> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen &lt;
> > >> >
> > >> > > chenliang6136@
> > >> >
> > >> > > &gt; wrote:
> > >> > >
> > >> > >> Hi xiaoqiao
> > >> > >>
> > >> > >> This improvement looks great!
> > >> > >> Can you please explain the below data, what does it mean?
> > >> > >> ----------
> > >> > >> ConcurrentHashMap
> > >> > >> ~68MB 14543
> > >> > >> Double Array Trie
> > >> > >> ~104MB 12825
> > >> > >>
> > >> > >> Regards
> > >> > >> Liang
> > >> > >>
> > >> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He &lt;
> > >> >
> > >> > > xq.he2009@
> > >> >
> > >> > > &gt;:
> > >> > >>
> > >> > >> >  Hi All,
> > >> > >> >
> > >> > >> > I would like to propose Dictionary improvement which using Trie
> > in
> > >> > >> place
> > >> > >> of
> > >> > >> > HashMap.
> > >> > >> >
> > >> > >> > In order to speedup aggregation, reduce run-time memory
> > footprint,
> > >> > >> enable
> > >> > >> > fast
> > >> > >> > distinct count etc, CarbonData encodes data using dictionary at
> > >> file
> > >> > >> level
> > >> > >> > or table level based on cardinality. It is a general and
> > efficient
> > >> way
> > >> > >> in
> > >> > >> > many big data systems, but when apply ConcurrentHashMap
> > >> > >> > to maintain Dictionary in CarbonData currently, memory overhead
> > of
> > >> > >> > Driver is very huge since it has to load whole Dictionary to
> > >> decode
> > >> > >> actual
> > >> > >> > data value, especially column cardinality is a large number.
> and
> > >> > >> CarbonData
> > >> > >> > will not do dictionary if cardinality > 1 million at default
> > >> behavior.
> > >> > >> >
> > >> > >> > I propose using Trie in place of HashMap for the following
> three
> > >> > >> reasons:
> > >> > >> > (1) Trie is a proper structure for Dictionary,
> > >> > >> > (2) Reduce memory footprint,
> > >> > >> > (3) Not impact retrieval performance
> > >> > >> >
> > >> > >> > The experimental results show that Trie is able to meet the
> > >> > >> requirement.
> > >> > >> > a. ConcurrentHashMap vs Double Array Trie
> > >> > >> > &lt;https://linux.thai.net/~thep/datrie/datrie.html&gt;(one
> > >> > >> implementation of
> > >> > >> > Trie Structures)
> > >> > >> > b. Dictionary size: 600K
> > >> > >> > c. Memory footprint and query time
> > >> > >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> > >> > >> > ConcurrentHashMap
> > >> > >> > ~68MB 14543
> > >> > >> > Double Array Trie
> > >> > >> > ~104MB 12825
> > >> > >> >
> > >> > >> > Please share your suggestions about the proposed improvement of
> > >> > >> Dictionary.
> > >> > >> >
> > >> > >> > Regards
> > >> > >> > He Xiaoqiao
> > >> > >> >
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> --
> > >> > >> Regards
> > >> > >> Liang
> > >> > >>
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > View this message in context: http://apache-carbondata-
> > >> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > >> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > >> > Dictionary-tp3132p3143.html
> > >> > Sent from the Apache CarbonData Mailing List archive mailing list
> > >> archive
> > >> > at Nabble.com.
> > >> >
> > >>
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > Dictionary-tp3132p3186.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>
kumar vishal
Reply | Threaded
Open this post in threaded view
|

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

hexiaoqiao
Hi Kumar Vishal,

I'll create task to trace this issue.
Thanks for your suggestions.

Regards,
He Xiaoqiao


On Sun, Nov 27, 2016 at 1:41 AM, Kumar Vishal <[hidden email]>
wrote:

> Hi Xiaoqiao He,
>
> You can go ahead with DAT implementation, based on the result.
> I will look forward for you PR.
>
> Please let me know if you need any support:).
>
> -Regards
> KUmar Vishal
>
> On Fri, Nov 25, 2016 at 11:22 PM, Xiaoqiao He <[hidden email]> wrote:
>
> > Hi Liang, Kumar Vishal,
> >
> > I has done a standard benchmark about multiply data structures for
> > Dictionary following your suggestions. Based on the test results, I think
> > DAT may be the best choice for CarbonData.
> >
> > *1. Here are 2 test results:*
> > -----------------------------------------------------------------------
> > Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for
> Dictionary
> >   HashMap :                               java.util.HashMap
> >   DAT (Double Array Trie):
> > https://github.com/komiya-atsushi/darts-java
> >   RadixTree:
> > https://github.com/npgall/concurrent-trees
> >   TrieDict (Dictionary in Kylin):
> > http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
> > Dictionary Source (Traditional Chinese):
> > https://raw.githubusercontent.com/fxsjy/jieba/master/extra_
> > dict/dict.txt.big
> > ================Test Result================
> > a. Dictionary Size:584429
> > --------
> > b. Build Time (ms) :
> >    DAT       : 5714
> >    HashMap   : 110
> >    RadixTree : 22044
> >    TrieDict  : 855
> > --------
> > c. Memory footprint in 64-bit JVM (bytes) :
> >    DAT       : 16779752
> >    HashMap   : 32196592
> >    RadixTree : 46130584
> >    TrieDict  : 10443608
> > --------
> > d. Retrieval Performance for 9935293 query times (ms) :
> >    DAT       : 585
> >    HashMap   : 1010
> >    RadixTree : 417639
> >    TrieDict  : 8664
> > ================Test Result================
> >
> > ================Test Result================
> > a. Dictionary Size:584429
> > --------
> > b. Build Time (ms) :
> >    DAT       : 5867
> >    HashMap   : 100
> >    RadixTree : 22082
> >    TrieDict  : 840
> > --------
> > c. Memory footprint in 64-bit JVM (bytes) :
> >    DAT       : 16779752
> >    HashMap   : 32196592
> >    RadixTree : 46130584
> >    TrieDict  : 10443608
> > --------
> > d. Retrieval Performance for 9935293 query times (ms) :
> >    DAT       : 593
> >    HashMap   : 821
> >    RadixTree : 422297
> >    TrieDict  : 8752
> > ================Test Result================
> >
> > *2. Conclusion:*
> > a. TrieDict is good for building tree and less memory footprint overhead,
> > but worst retrieval performance,
> > b. DAT is a good tradeoff between memory footprint and retrieval
> > performance,
> > c. RadixTree has the worst performance in different aspects.
> >
> > *3. Result Analysis:*
> > a. With Trie the memory footprint of the TrieDict mapping is kinda
> > minimized if compared to HashMap, in order to improve performance there
> is
> > a cache layer overlays on top of Trie.
> > b. Because a large number of duplicate prefix data, the total memory
> > footprint is more than trie, meanwhile i think calculating string hash
> code
> > of traditional Chinese consume considerable time overhead, so the
> > performance is not the best.
> > c. DAT is a better tradeoff.
> > d. I have no idea why RadixTree has the worst performance in terms of
> > memory, retrieval and building tree.
> >
> >
> > On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <[hidden email]>
> > wrote:
> >
> > > Hi xiaoqiao
> > >
> > > ok, look forward to seeing your test result.
> > > Can you take this task for this improvement? Please let me know if you
> > need
> > > any support :)
> > >
> > > Regards
> > > Liang
> > >
> > >
> > > hexiaoqiao wrote
> > > > Hi Kumar Vishal,
> > > >
> > > > Thanks for your suggestions. As you said, choose Trie replace HashMap
> > we
> > > > can get better memory footprint and also good performance. Of course,
> > DAT
> > > > is not only choice, and I will do test about DAT vs Radix Trie and
> > > release
> > > > the test result as soon as possible. Thanks your suggestions again.
> > > >
> > > > Regards,
> > > > Xiaoqiao
> > > >
> > > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal &lt;
> > >
> > > > kumarvishal1802@
> > >
> > > > &gt;
> > > > wrote:
> > > >
> > > >> Hi XIaoqiao He,
> > > >> +1,
> > > >> For forward dictionary case it will be very good optimisation, as
> our
> > > >> case
> > > >> is very specific storing byte array to int mapping[data to surrogate
> > key
> > > >> mapping], I think we will get much better memory footprint and
> > > >> performance
> > > >> will be also good(2x). We can also try radix tree(radix trie), it is
> > > more
> > > >> optimise for storage.
> > > >>
> > > >> -Regards
> > > >> Kumar Vishal
> > > >>
> > > >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen &lt;
> > >
> > > > chenliang6136@
> > >
> > > > &gt;
> > > >> wrote:
> > > >>
> > > >> > Hi xiaoqiao
> > > >> >
> > > >> > For the below example, 600K dictionary data:
> > > >> > It is to say that using "DAT" can save 36M memory against
> > > >> > "ConcurrentHashMap", whereas the performance just lost less
> > (1718ms) ?
> > > >> >
> > > >> > One more question:if increases the dictionary data size, what's
> the
> > > >> > comparison results "ConcurrentHashMap" VS "DAT"
> > > >> >
> > > >> > Regards
> > > >> > Liang
> > > >> > ------------------------------------------------------------
> > > >> > ------------------------------------------
> > > >> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > > >> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > > >> >
> > > >> > b. retrieval performance: total time(ms) of 500 million query:
> > > >> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > > >> >
> > > >> > Regards
> > > >> > Liang
> > > >> >
> > > >> > hexiaoqiao wrote
> > > >> > > hi Liang,
> > > >> > >
> > > >> > > Thanks for your reply, i need to correct the experiment result
> > > >> because
> > > >> > > it's
> > > >> > > wrong order NO.1 column of result data table.
> > > >> > >
> > > >> > > In order to compare performance between Trie and HashMap, Two
> > > >> different
> > > >> > > structures are constructed using the same dictionary data which
> > size
> > > >> is
> > > >> > > 600K and each item's length is between 2 and 50 bytes.
> > > >> > >
> > > >> > > ConcurrentHashMap (structure which is used in CarbonData
> > currently)
> > > >> vs
> > > >> > > Double
> > > >> > > Array Trie (one implementation of Trie Structures)
> > > >> > >
> > > >> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> > > >> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > > >> > >
> > > >> > > b. retrieval performance: total time(ms) of 500 million query:
> > > >> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > > >> > >
> > > >> > > Regards,
> > > >> > > He Xiaoqiao
> > > >> > >
> > > >> > >
> > > >> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen &lt;
> > > >> >
> > > >> > > chenliang6136@
> > > >> >
> > > >> > > &gt; wrote:
> > > >> > >
> > > >> > >> Hi xiaoqiao
> > > >> > >>
> > > >> > >> This improvement looks great!
> > > >> > >> Can you please explain the below data, what does it mean?
> > > >> > >> ----------
> > > >> > >> ConcurrentHashMap
> > > >> > >> ~68MB 14543
> > > >> > >> Double Array Trie
> > > >> > >> ~104MB 12825
> > > >> > >>
> > > >> > >> Regards
> > > >> > >> Liang
> > > >> > >>
> > > >> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He &lt;
> > > >> >
> > > >> > > xq.he2009@
> > > >> >
> > > >> > > &gt;:
> > > >> > >>
> > > >> > >> >  Hi All,
> > > >> > >> >
> > > >> > >> > I would like to propose Dictionary improvement which using
> Trie
> > > in
> > > >> > >> place
> > > >> > >> of
> > > >> > >> > HashMap.
> > > >> > >> >
> > > >> > >> > In order to speedup aggregation, reduce run-time memory
> > > footprint,
> > > >> > >> enable
> > > >> > >> > fast
> > > >> > >> > distinct count etc, CarbonData encodes data using dictionary
> at
> > > >> file
> > > >> > >> level
> > > >> > >> > or table level based on cardinality. It is a general and
> > > efficient
> > > >> way
> > > >> > >> in
> > > >> > >> > many big data systems, but when apply ConcurrentHashMap
> > > >> > >> > to maintain Dictionary in CarbonData currently, memory
> overhead
> > > of
> > > >> > >> > Driver is very huge since it has to load whole Dictionary to
> > > >> decode
> > > >> > >> actual
> > > >> > >> > data value, especially column cardinality is a large number.
> > and
> > > >> > >> CarbonData
> > > >> > >> > will not do dictionary if cardinality > 1 million at default
> > > >> behavior.
> > > >> > >> >
> > > >> > >> > I propose using Trie in place of HashMap for the following
> > three
> > > >> > >> reasons:
> > > >> > >> > (1) Trie is a proper structure for Dictionary,
> > > >> > >> > (2) Reduce memory footprint,
> > > >> > >> > (3) Not impact retrieval performance
> > > >> > >> >
> > > >> > >> > The experimental results show that Trie is able to meet the
> > > >> > >> requirement.
> > > >> > >> > a. ConcurrentHashMap vs Double Array Trie
> > > >> > >> > &lt;https://linux.thai.net/~thep/datrie/datrie.html&gt;(one
> > > >> > >> implementation of
> > > >> > >> > Trie Structures)
> > > >> > >> > b. Dictionary size: 600K
> > > >> > >> > c. Memory footprint and query time
> > > >> > >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> > > >> > >> > ConcurrentHashMap
> > > >> > >> > ~68MB 14543
> > > >> > >> > Double Array Trie
> > > >> > >> > ~104MB 12825
> > > >> > >> >
> > > >> > >> > Please share your suggestions about the proposed improvement
> of
> > > >> > >> Dictionary.
> > > >> > >> >
> > > >> > >> > Regards
> > > >> > >> > He Xiaoqiao
> > > >> > >> >
> > > >> > >>
> > > >> > >>
> > > >> > >>
> > > >> > >> --
> > > >> > >> Regards
> > > >> > >> Liang
> > > >> > >>
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > View this message in context: http://apache-carbondata-
> > > >> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > > >> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > > >> > Dictionary-tp3132p3143.html
> > > >> > Sent from the Apache CarbonData Mailing List archive mailing list
> > > >> archive
> > > >> > at Nabble.com.
> > > >> >
> > > >>
> > >
> > >
> > >
> > >
> > >
> > > --
> > > View this message in context: http://apache-carbondata-
> > > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > > Dictionary-tp3132p3186.html
> > > Sent from the Apache CarbonData Mailing List archive mailing list
> archive
> > > at Nabble.com.
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

RE: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Jihong Ma
In reply to this post by hexiaoqiao
Thank you Xiaoqiao for looking into this issue and sharing your result!

Have you tried varied dictionary size for comparison among all the alternatives?

And please pay closer attention to the license of DAT implementation, as they are under LGPL, generally speaking, it is not legally allowed to be included.

Jihong

-----Original Message-----
From: Xiaoqiao He [mailto:[hidden email]]
Sent: Friday, November 25, 2016 9:52 AM
To: [hidden email]
Subject: Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Hi Liang, Kumar Vishal,

I has done a standard benchmark about multiply data structures for
Dictionary following your suggestions. Based on the test results, I think
DAT may be the best choice for CarbonData.

*1. Here are 2 test results:*
-----------------------------------------------------------------------
Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for Dictionary
  HashMap :                               java.util.HashMap
  DAT (Double Array Trie):
https://github.com/komiya-atsushi/darts-java
  RadixTree:
https://github.com/npgall/concurrent-trees
  TrieDict (Dictionary in Kylin):
http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
Dictionary Source (Traditional Chinese):
https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big
================Test Result================
a. Dictionary Size:584429
--------
b. Build Time (ms) :
   DAT       : 5714
   HashMap   : 110
   RadixTree : 22044
   TrieDict  : 855
--------
c. Memory footprint in 64-bit JVM (bytes) :
   DAT       : 16779752
   HashMap   : 32196592
   RadixTree : 46130584
   TrieDict  : 10443608
--------
d. Retrieval Performance for 9935293 query times (ms) :
   DAT       : 585
   HashMap   : 1010
   RadixTree : 417639
   TrieDict  : 8664
================Test Result================

================Test Result================
a. Dictionary Size:584429
--------
b. Build Time (ms) :
   DAT       : 5867
   HashMap   : 100
   RadixTree : 22082
   TrieDict  : 840
--------
c. Memory footprint in 64-bit JVM (bytes) :
   DAT       : 16779752
   HashMap   : 32196592
   RadixTree : 46130584
   TrieDict  : 10443608
--------
d. Retrieval Performance for 9935293 query times (ms) :
   DAT       : 593
   HashMap   : 821
   RadixTree : 422297
   TrieDict  : 8752
================Test Result================

*2. Conclusion:*
a. TrieDict is good for building tree and less memory footprint overhead,
but worst retrieval performance,
b. DAT is a good tradeoff between memory footprint and retrieval
performance,
c. RadixTree has the worst performance in different aspects.

*3. Result Analysis:*
a. With Trie the memory footprint of the TrieDict mapping is kinda
minimized if compared to HashMap, in order to improve performance there is
a cache layer overlays on top of Trie.
b. Because a large number of duplicate prefix data, the total memory
footprint is more than trie, meanwhile i think calculating string hash code
of traditional Chinese consume considerable time overhead, so the
performance is not the best.
c. DAT is a better tradeoff.
d. I have no idea why RadixTree has the worst performance in terms of
memory, retrieval and building tree.


On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <[hidden email]>
wrote:

> Hi xiaoqiao
>
> ok, look forward to seeing your test result.
> Can you take this task for this improvement? Please let me know if you need
> any support :)
>
> Regards
> Liang
>
>
> hexiaoqiao wrote
> > Hi Kumar Vishal,
> >
> > Thanks for your suggestions. As you said, choose Trie replace HashMap we
> > can get better memory footprint and also good performance. Of course, DAT
> > is not only choice, and I will do test about DAT vs Radix Trie and
> release
> > the test result as soon as possible. Thanks your suggestions again.
> >
> > Regards,
> > Xiaoqiao
> >
> > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal &lt;
>
> > kumarvishal1802@
>
> > &gt;
> > wrote:
> >
> >> Hi XIaoqiao He,
> >> +1,
> >> For forward dictionary case it will be very good optimisation, as our
> >> case
> >> is very specific storing byte array to int mapping[data to surrogate key
> >> mapping], I think we will get much better memory footprint and
> >> performance
> >> will be also good(2x). We can also try radix tree(radix trie), it is
> more
> >> optimise for storage.
> >>
> >> -Regards
> >> Kumar Vishal
> >>
> >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen &lt;
>
> > chenliang6136@
>
> > &gt;
> >> wrote:
> >>
> >> > Hi xiaoqiao
> >> >
> >> > For the below example, 600K dictionary data:
> >> > It is to say that using "DAT" can save 36M memory against
> >> > "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
> >> >
> >> > One more question:if increases the dictionary data size, what's the
> >> > comparison results "ConcurrentHashMap" VS "DAT"
> >> >
> >> > Regards
> >> > Liang
> >> > ------------------------------------------------------------
> >> > ------------------------------------------
> >> > a. memory footprint (approximate quantity) in 64-bit JVM:
> >> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >> >
> >> > b. retrieval performance: total time(ms) of 500 million query:
> >> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >> >
> >> > Regards
> >> > Liang
> >> >
> >> > hexiaoqiao wrote
> >> > > hi Liang,
> >> > >
> >> > > Thanks for your reply, i need to correct the experiment result
> >> because
> >> > > it's
> >> > > wrong order NO.1 column of result data table.
> >> > >
> >> > > In order to compare performance between Trie and HashMap, Two
> >> different
> >> > > structures are constructed using the same dictionary data which size
> >> is
> >> > > 600K and each item's length is between 2 and 50 bytes.
> >> > >
> >> > > ConcurrentHashMap (structure which is used in CarbonData currently)
> >> vs
> >> > > Double
> >> > > Array Trie (one implementation of Trie Structures)
> >> > >
> >> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> >> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >> > >
> >> > > b. retrieval performance: total time(ms) of 500 million query:
> >> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >> > >
> >> > > Regards,
> >> > > He Xiaoqiao
> >> > >
> >> > >
> >> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen &lt;
> >> >
> >> > > chenliang6136@
> >> >
> >> > > &gt; wrote:
> >> > >
> >> > >> Hi xiaoqiao
> >> > >>
> >> > >> This improvement looks great!
> >> > >> Can you please explain the below data, what does it mean?
> >> > >> ----------
> >> > >> ConcurrentHashMap
> >> > >> ~68MB 14543
> >> > >> Double Array Trie
> >> > >> ~104MB 12825
> >> > >>
> >> > >> Regards
> >> > >> Liang
> >> > >>
> >> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He &lt;
> >> >
> >> > > xq.he2009@
> >> >
> >> > > &gt;:
> >> > >>
> >> > >> >  Hi All,
> >> > >> >
> >> > >> > I would like to propose Dictionary improvement which using Trie
> in
> >> > >> place
> >> > >> of
> >> > >> > HashMap.
> >> > >> >
> >> > >> > In order to speedup aggregation, reduce run-time memory
> footprint,
> >> > >> enable
> >> > >> > fast
> >> > >> > distinct count etc, CarbonData encodes data using dictionary at
> >> file
> >> > >> level
> >> > >> > or table level based on cardinality. It is a general and
> efficient
> >> way
> >> > >> in
> >> > >> > many big data systems, but when apply ConcurrentHashMap
> >> > >> > to maintain Dictionary in CarbonData currently, memory overhead
> of
> >> > >> > Driver is very huge since it has to load whole Dictionary to
> >> decode
> >> > >> actual
> >> > >> > data value, especially column cardinality is a large number. and
> >> > >> CarbonData
> >> > >> > will not do dictionary if cardinality > 1 million at default
> >> behavior.
> >> > >> >
> >> > >> > I propose using Trie in place of HashMap for the following three
> >> > >> reasons:
> >> > >> > (1) Trie is a proper structure for Dictionary,
> >> > >> > (2) Reduce memory footprint,
> >> > >> > (3) Not impact retrieval performance
> >> > >> >
> >> > >> > The experimental results show that Trie is able to meet the
> >> > >> requirement.
> >> > >> > a. ConcurrentHashMap vs Double Array Trie
> >> > >> > &lt;https://linux.thai.net/~thep/datrie/datrie.html&gt;(one
> >> > >> implementation of
> >> > >> > Trie Structures)
> >> > >> > b. Dictionary size: 600K
> >> > >> > c. Memory footprint and query time
> >> > >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> >> > >> > ConcurrentHashMap
> >> > >> > ~68MB 14543
> >> > >> > Double Array Trie
> >> > >> > ~104MB 12825
> >> > >> >
> >> > >> > Please share your suggestions about the proposed improvement of
> >> > >> Dictionary.
> >> > >> >
> >> > >> > Regards
> >> > >> > He Xiaoqiao
> >> > >> >
> >> > >>
> >> > >>
> >> > >>
> >> > >> --
> >> > >> Regards
> >> > >> Liang
> >> > >>
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context: http://apache-carbondata-
> >> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> >> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> >> > Dictionary-tp3132p3143.html
> >> > Sent from the Apache CarbonData Mailing List archive mailing list
> >> archive
> >> > at Nabble.com.
> >> >
> >>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> Dictionary-tp3132p3186.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>
Reply | Threaded
Open this post in threaded view
|

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

hexiaoqiao
Hi Jihong,

Thanks for your attentions and reply.
1. Actually I has done benchmark with English/Chinese dictionary size in
{100K,200K,300K,400K,500K,600K} separately, and test result is basic same
as mentioned in this mail flow before, I will submit and open the benchmark
code and dictionary source to github
<https://github.com/Hexiaoqiao/bigdata_algorithm_benchmark> as soon as
possible.
2. I also notice the license of DAT
<https://github.com/komiya-atsushi/darts-java>, and i think it's necessary
to re-implement another DAT following this paper:
https://linux.thai.net/~thep/datrie/datrie.html.

All kinds of suggestions are welcomed.

Regards,
He Xiaoqiao


On Tue, Nov 29, 2016 at 5:17 AM, Jihong Ma <[hidden email]> wrote:

> Thank you Xiaoqiao for looking into this issue and sharing your result!
>
> Have you tried varied dictionary size for comparison among all the
> alternatives?
>
> And please pay closer attention to the license of DAT implementation, as
> they are under LGPL, generally speaking, it is not legally allowed to be
> included.
>
> Jihong
>
> -----Original Message-----
> From: Xiaoqiao He [mailto:[hidden email]]
> Sent: Friday, November 25, 2016 9:52 AM
> To: [hidden email]
> Subject: Re: [Improvement] Use Trie in place of HashMap to reduce memory
> footprint of Dictionary
>
> Hi Liang, Kumar Vishal,
>
> I has done a standard benchmark about multiply data structures for
> Dictionary following your suggestions. Based on the test results, I think
> DAT may be the best choice for CarbonData.
>
> *1. Here are 2 test results:*
> -----------------------------------------------------------------------
> Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for Dictionary
>   HashMap :                               java.util.HashMap
>   DAT (Double Array Trie):
> https://github.com/komiya-atsushi/darts-java
>   RadixTree:
> https://github.com/npgall/concurrent-trees
>   TrieDict (Dictionary in Kylin):
> http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
> Dictionary Source (Traditional Chinese):
> https://raw.githubusercontent.com/fxsjy/jieba/master/extra_d
> ict/dict.txt.big
> ================Test Result================
> a. Dictionary Size:584429
> --------
> b. Build Time (ms) :
>    DAT       : 5714
>    HashMap   : 110
>    RadixTree : 22044
>    TrieDict  : 855
> --------
> c. Memory footprint in 64-bit JVM (bytes) :
>    DAT       : 16779752
>    HashMap   : 32196592
>    RadixTree : 46130584
>    TrieDict  : 10443608
> --------
> d. Retrieval Performance for 9935293 query times (ms) :
>    DAT       : 585
>    HashMap   : 1010
>    RadixTree : 417639
>    TrieDict  : 8664
> ================Test Result================
>
> ================Test Result================
> a. Dictionary Size:584429
> --------
> b. Build Time (ms) :
>    DAT       : 5867
>    HashMap   : 100
>    RadixTree : 22082
>    TrieDict  : 840
> --------
> c. Memory footprint in 64-bit JVM (bytes) :
>    DAT       : 16779752
>    HashMap   : 32196592
>    RadixTree : 46130584
>    TrieDict  : 10443608
> --------
> d. Retrieval Performance for 9935293 query times (ms) :
>    DAT       : 593
>    HashMap   : 821
>    RadixTree : 422297
>    TrieDict  : 8752
> ================Test Result================
>
> *2. Conclusion:*
> a. TrieDict is good for building tree and less memory footprint overhead,
> but worst retrieval performance,
> b. DAT is a good tradeoff between memory footprint and retrieval
> performance,
> c. RadixTree has the worst performance in different aspects.
>
> *3. Result Analysis:*
> a. With Trie the memory footprint of the TrieDict mapping is kinda
> minimized if compared to HashMap, in order to improve performance there is
> a cache layer overlays on top of Trie.
> b. Because a large number of duplicate prefix data, the total memory
> footprint is more than trie, meanwhile i think calculating string hash code
> of traditional Chinese consume considerable time overhead, so the
> performance is not the best.
> c. DAT is a better tradeoff.
> d. I have no idea why RadixTree has the worst performance in terms of
> memory, retrieval and building tree.
>
>
> On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <[hidden email]>
> wrote:
>
> > Hi xiaoqiao
> >
> > ok, look forward to seeing your test result.
> > Can you take this task for this improvement? Please let me know if you
> need
> > any support :)
> >
> > Regards
> > Liang
> >
> >
> > hexiaoqiao wrote
> > > Hi Kumar Vishal,
> > >
> > > Thanks for your suggestions. As you said, choose Trie replace HashMap
> we
> > > can get better memory footprint and also good performance. Of course,
> DAT
> > > is not only choice, and I will do test about DAT vs Radix Trie and
> > release
> > > the test result as soon as possible. Thanks your suggestions again.
> > >
> > > Regards,
> > > Xiaoqiao
> > >
> > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal &lt;
> >
> > > kumarvishal1802@
> >
> > > &gt;
> > > wrote:
> > >
> > >> Hi XIaoqiao He,
> > >> +1,
> > >> For forward dictionary case it will be very good optimisation, as our
> > >> case
> > >> is very specific storing byte array to int mapping[data to surrogate
> key
> > >> mapping], I think we will get much better memory footprint and
> > >> performance
> > >> will be also good(2x). We can also try radix tree(radix trie), it is
> > more
> > >> optimise for storage.
> > >>
> > >> -Regards
> > >> Kumar Vishal
> > >>
> > >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen &lt;
> >
> > > chenliang6136@
> >
> > > &gt;
> > >> wrote:
> > >>
> > >> > Hi xiaoqiao
> > >> >
> > >> > For the below example, 600K dictionary data:
> > >> > It is to say that using "DAT" can save 36M memory against
> > >> > "ConcurrentHashMap", whereas the performance just lost less
> (1718ms) ?
> > >> >
> > >> > One more question:if increases the dictionary data size, what's the
> > >> > comparison results "ConcurrentHashMap" VS "DAT"
> > >> >
> > >> > Regards
> > >> > Liang
> > >> > ------------------------------------------------------------
> > >> > ------------------------------------------
> > >> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > >> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > >> >
> > >> > b. retrieval performance: total time(ms) of 500 million query:
> > >> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > >> >
> > >> > Regards
> > >> > Liang
> > >> >
> > >> > hexiaoqiao wrote
> > >> > > hi Liang,
> > >> > >
> > >> > > Thanks for your reply, i need to correct the experiment result
> > >> because
> > >> > > it's
> > >> > > wrong order NO.1 column of result data table.
> > >> > >
> > >> > > In order to compare performance between Trie and HashMap, Two
> > >> different
> > >> > > structures are constructed using the same dictionary data which
> size
> > >> is
> > >> > > 600K and each item's length is between 2 and 50 bytes.
> > >> > >
> > >> > > ConcurrentHashMap (structure which is used in CarbonData
> currently)
> > >> vs
> > >> > > Double
> > >> > > Array Trie (one implementation of Trie Structures)
> > >> > >
> > >> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> > >> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > >> > >
> > >> > > b. retrieval performance: total time(ms) of 500 million query:
> > >> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > >> > >
> > >> > > Regards,
> > >> > > He Xiaoqiao
> > >> > >
> > >> > >
> > >> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen &lt;
> > >> >
> > >> > > chenliang6136@
> > >> >
> > >> > > &gt; wrote:
> > >> > >
> > >> > >> Hi xiaoqiao
> > >> > >>
> > >> > >> This improvement looks great!
> > >> > >> Can you please explain the below data, what does it mean?
> > >> > >> ----------
> > >> > >> ConcurrentHashMap
> > >> > >> ~68MB 14543
> > >> > >> Double Array Trie
> > >> > >> ~104MB 12825
> > >> > >>
> > >> > >> Regards
> > >> > >> Liang
> > >> > >>
> > >> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He &lt;
> > >> >
> > >> > > xq.he2009@
> > >> >
> > >> > > &gt;:
> > >> > >>
> > >> > >> >  Hi All,
> > >> > >> >
> > >> > >> > I would like to propose Dictionary improvement which using Trie
> > in
> > >> > >> place
> > >> > >> of
> > >> > >> > HashMap.
> > >> > >> >
> > >> > >> > In order to speedup aggregation, reduce run-time memory
> > footprint,
> > >> > >> enable
> > >> > >> > fast
> > >> > >> > distinct count etc, CarbonData encodes data using dictionary at
> > >> file
> > >> > >> level
> > >> > >> > or table level based on cardinality. It is a general and
> > efficient
> > >> way
> > >> > >> in
> > >> > >> > many big data systems, but when apply ConcurrentHashMap
> > >> > >> > to maintain Dictionary in CarbonData currently, memory overhead
> > of
> > >> > >> > Driver is very huge since it has to load whole Dictionary to
> > >> decode
> > >> > >> actual
> > >> > >> > data value, especially column cardinality is a large number.
> and
> > >> > >> CarbonData
> > >> > >> > will not do dictionary if cardinality > 1 million at default
> > >> behavior.
> > >> > >> >
> > >> > >> > I propose using Trie in place of HashMap for the following
> three
> > >> > >> reasons:
> > >> > >> > (1) Trie is a proper structure for Dictionary,
> > >> > >> > (2) Reduce memory footprint,
> > >> > >> > (3) Not impact retrieval performance
> > >> > >> >
> > >> > >> > The experimental results show that Trie is able to meet the
> > >> > >> requirement.
> > >> > >> > a. ConcurrentHashMap vs Double Array Trie
> > >> > >> > &lt;https://linux.thai.net/~thep/datrie/datrie.html&gt;(one
> > >> > >> implementation of
> > >> > >> > Trie Structures)
> > >> > >> > b. Dictionary size: 600K
> > >> > >> > c. Memory footprint and query time
> > >> > >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> > >> > >> > ConcurrentHashMap
> > >> > >> > ~68MB 14543
> > >> > >> > Double Array Trie
> > >> > >> > ~104MB 12825
> > >> > >> >
> > >> > >> > Please share your suggestions about the proposed improvement of
> > >> > >> Dictionary.
> > >> > >> >
> > >> > >> > Regards
> > >> > >> > He Xiaoqiao
> > >> > >> >
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> --
> > >> > >> Regards
> > >> > >> Liang
> > >> > >>
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > View this message in context: http://apache-carbondata-
> > >> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > >> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > >> > Dictionary-tp3132p3143.html
> > >> > Sent from the Apache CarbonData Mailing List archive mailing list
> > >> archive
> > >> > at Nabble.com.
> > >> >
> > >>
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > Dictionary-tp3132p3186.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>