[GitHub] incubator-carbondata pull request #644: [CARBONDATA-757]Big decimal optimiza...

qiuchenjian-2

[GitHub] incubator-carbondata pull request #644: [CARBONDATA-757]Big decimal optimiza...

GitHub user ravipesala opened a pull request:

https://github.com/apache/incubator-carbondata/pull/644

[CARBONDATA-757]Big decimal optimization

Currently a decimal is converted to bytes and written to the store in LV (length + value) format. On read, the bytes are parsed in LV format and converted back to a BigDecimal.
The following changes can improve storage and processing:
1. If the decimal precision is less than 9, the value fits in an int (4 bytes).
2. If the decimal precision is less than 18, the value fits in a long (8 bytes).
3. If the decimal precision is more than 18, the value fits in a fixed-length byte array (the length depends on the precision, but it is always fixed).
With this approach we no longer need to store BigDecimal in LV format; we can store it in a fixed-length format, which reduces memory usage.
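
The three cases above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the class `FixedLengthDecimalSketch` and the helpers `fixedLength`, `encode`, and `decode` are hypothetical names. It also assumes the boundaries are inclusive (precision up to 9 for int, up to 18 for long), which is safe because 10^9 - 1 fits in a signed 32-bit int and 10^18 - 1 fits in a signed 64-bit long.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class FixedLengthDecimalSketch {

  // Hypothetical helper: bytes needed to store any unscaled value of the given precision.
  static int fixedLength(int precision) {
    if (precision <= 9) {
      return 4;   // assumption: inclusive bound, since 10^9 - 1 < 2^31 - 1
    }
    if (precision <= 18) {
      return 8;   // assumption: inclusive bound, since 10^18 - 1 < 2^63 - 1
    }
    // Largest magnitude is 10^precision - 1; add one bit for the sign, round up to bytes.
    BigInteger max = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
    return max.bitLength() / 8 + 1;
  }

  // Encodes the unscaled value as big-endian two's complement of fixed width.
  static byte[] encode(BigDecimal value, int precision, int scale) {
    // setScale without a rounding mode throws if the value would need rounding.
    BigInteger unscaled = value.setScale(scale).unscaledValue();
    int length = fixedLength(precision);
    if (length == 4) {
      return ByteBuffer.allocate(4).putInt(unscaled.intValueExact()).array();
    }
    if (length == 8) {
      return ByteBuffer.allocate(8).putLong(unscaled.longValueExact()).array();
    }
    byte[] raw = unscaled.toByteArray();
    if (raw.length > length) {
      throw new ArithmeticException("value exceeds precision " + precision);
    }
    byte[] out = new byte[length];
    // Sign-extend on the left so the padded array decodes to the same number.
    byte pad = (byte) (unscaled.signum() < 0 ? 0xFF : 0x00);
    Arrays.fill(out, 0, length - raw.length, pad);
    System.arraycopy(raw, 0, out, length - raw.length, raw.length);
    return out;
  }

  // No length prefix is needed: the width is known from the column's precision.
  static BigDecimal decode(byte[] bytes, int scale) {
    return new BigDecimal(new BigInteger(bytes), scale);
  }

  public static void main(String[] args) {
    byte[] bytes = encode(new BigDecimal("12345.67"), 7, 2);
    System.out.println(bytes.length + " bytes -> " + decode(bytes, 2));
  }
}
```

Because every value in a column has the same width, a reader can locate the n-th value by offset arithmetic instead of walking length prefixes, which is where the processing benefit described above would come from.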

CarbonData format changes -> Added fixedLength in datachunk to record the column length of a big decimal. This attribute can also be used for char(fixedlength) or varchar(fixedlength) data types.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ravipesala/incubator-carbondata bigdecimal-optimize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/644.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #644

----
commit 241c032f3e54facb59ba0b946f3c0c0c67dab59c
Author: ravipesala <[hidden email]>
Date: 2017-03-09T12:42:47Z

BigDecimal optimization

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

qiuchenjian-2

[GitHub] incubator-carbondata issue #644: [CARBONDATA-757]Big decimal optimization

Github user ravipesala commented on the issue:

https://github.com/apache/incubator-carbondata/pull/644

The build will not compile because there are carbon-format changes.

qiuchenjian-2

[GitHub] incubator-carbondata issue #644: [CARBONDATA-757]Big decimal optimization

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/644

Build failed with Spark 1.6.2. Please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1082/

qiuchenjian-2

[GitHub] incubator-carbondata issue #644: [CARBONDATA-757]Big decimal optimization

Github user ravipesala commented on the issue:

https://github.com/apache/incubator-carbondata/pull/644

Test results with 100 million data
**DDL**
CREATE TABLE perftesta (c1 string,c2 string,c3 string,c4 string,c5 string,c6 bigint,c7 decimal(7,2),c8 int,c9 decimal(7,2),c10 decimal(15,2)) STORED BY 'carbondata'

**Queries**
Q1 -> SELECT count(c1),count(c2),count(c3),count(c4),count(c5),count(c6),count(c7),count(c8),count(c9),count(c10) FROM perftesta99;
Q2 -> SELECT sum(c7), sum(c8), sum(c9), sum(c10) FROM perftesta99 WHERE c2="P2_75" and c7<5;
Q3 -> SELECT c2, c5, count(distinct c1), sum(c7) FROM perftesta99 WHERE c4="P4_4" and c5="P5_7" GROUP BY c2, c5;

**Master Code**
Load time -> 576 seconds
Data size after load -> 1800MB
Query(first_reading, second_reading)
Q1(25.27, 21.794)
Q2(27.296, 28.21)
Q3(7.383, 5.103)

**This PR Code**
Load time -> 431 seconds
Data size after load -> 1720MB
Query(first_reading, second_reading)
Q1(18.507, 14.427)
Q2(24.102, 23.322)
Q3(6.87, 5.079)
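
To put the two runs side by side, the relative change can be computed from the figures above. The snippet below is only a convenience sketch: the class `BenchmarkDeltaSketch` and the method `reductionPercent` are hypothetical names, and the inputs are copied from the readings reported in this comment.

```java
public class BenchmarkDeltaSketch {

  // Percentage by which the PR reading is lower than the master reading.
  static double reductionPercent(double master, double pr) {
    return (master - pr) / master * 100.0;
  }

  public static void main(String[] args) {
    // Inputs are the values reported above (master first, this PR second).
    System.out.printf("Load time reduction: %.2f%%%n", reductionPercent(576, 431));
    System.out.printf("Data size reduction: %.2f%%%n", reductionPercent(1800, 1720));
    System.out.printf("Q1 first reading:    %.2f%%%n", reductionPercent(25.27, 18.507));
    System.out.printf("Q2 first reading:    %.2f%%%n", reductionPercent(27.296, 24.102));
    System.out.printf("Q3 first reading:    %.2f%%%n", reductionPercent(7.383, 6.87));
  }
}
```

By this calculation the load time drops by roughly a quarter and the data size by a little over 4 percent.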
