Hi,
When querying data through Spark or Presto, CarbonData is not well optimized for reading data and filling the vector. The major issues are as follows:
1. CarbonData has a long method stack for reading the data and filling it into the vector.
2. Many conditions and checks are evaluated before the data is filled into the vector.
3. Maintaining intermediate copies of the data leads to higher CPU utilization.
Because of these issues there is a high chance of CPU cache misses during processing, which leads to poor performance.

So here I am proposing an optimization that fills the vector with a short method stack, no condition checks, and no intermediate copies, so that the CPU cache is used more effectively.

*Full Scan queries:*
After decompressing a page in our V3 reader, we can immediately fill the data into the vector without any condition checks inside the loops. The complete column page data is set into the column vector in a single batch and handed back to Spark/Presto.

*Filter Queries:*
First, apply page-level pruning using the min/max of each page to get the valid pages of the blocklet. Then decompress only the valid pages and fill the vector directly, as in the full scan scenario.

With this method we also gain the advantage of avoiding double filtering, since Spark/Presto apply the filter again even when we return already-filtered data.

Please find the *TPCH performance report of the updated carbon* with the changes mentioned above. Please note that the changes are of POC quality, so it will take some time to stabilize them.

*Configurations*
Laptop with i7 processor and 16 GB RAM.
TPCH Data Scale: 100 GB
No sort, with no inverted index data.
Total CarbonData Size: 32 GB
Total Parquet Size: 31 GB

Queries           Parquet  Carbon New  Carbon Old  Carbon Old vs Carbon New  Carbon New vs Parquet  Carbon Old vs Parquet
Q1                101      96          128         25.00%                    4.95%                  -26.73%
Q2                85       82          85          3.53%                     3.53%                  0.00%
Q3                118      112         135         17.04%                    5.08%                  -14.41%
Q4                473      424         486         12.76%                    10.36%                 -2.75%
Q5                228      201         205         1.95%                     11.84%                 10.09%
Q6                19.2     19.2        48          60.00%                    0.00%                  -150.00%
Q7                194      181         198         8.59%                     6.70%                  -2.06%
Q8                285      263         275         4.36%                     7.72%                  3.51%
Q9                362      345         363         4.96%                     4.70%                  -0.28%
Q10               101      92          93          1.08%                     8.91%                  7.92%
Q11               64       61          62          1.61%                     4.69%                  3.13%
Q12               41.4     44          63          30.16%                    -6.28%                 -52.17%
Q13               43.4     43.6        43.7        0.23%                     -0.46%                 -0.69%
Q14               36.9     31.5        41          23.17%                    14.63%                 -11.11%
Q15               70       59          80          26.25%                    15.71%                 -14.29%
Q16               64       60          64          6.25%                     6.25%                  0.00%
Q17               426      418         432         3.24%                     1.88%                  -1.41%
Q18               1015     921         1001        7.99%                     9.26%                  1.38%
Q19               62       53          59          10.17%                    14.52%                 4.84%
Q20               406      326         426         23.47%                    19.70%                 -4.93%
Full Scan Query*  140      116         164         29.27%                    17.14%                 -17.14%

*Full Scan Query means a count of every column of lineitem; in this way we can check the full scan query performance.

The above optimization is not limited to the fileformat and Presto integrations; it also improves the CarbonSession integration.
We can further optimize carbon through tasks (Vishal is already working on them) like adaptive encoding for all types of columns and storing lengths and values in separate pages for the string datatype. Please refer to
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html

--
Thanks & Regards,
Ravi
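As a rough illustration of the two ideas above (a minimal sketch under assumed names: Page, prunePages and fillVector are hypothetical, not actual CarbonData APIs), min/max page pruning followed by a single bulk copy per surviving page could look like this:

```java
// Minimal, self-contained sketch of the proposal's two ideas.
// NOTE: Page, prunePages and fillVector are illustrative names only,
// not actual CarbonData APIs.
import java.util.ArrayList;
import java.util.List;

public class VectorFillSketch {

    /** Stand-in for a decoded column page together with its min/max metadata. */
    static final class Page {
        final int[] values;
        final int min;
        final int max;

        Page(int[] values) {
            this.values = values;
            int mn = values[0], mx = values[0];
            for (int v : values) {
                mn = Math.min(mn, v);
                mx = Math.max(mx, v);
            }
            this.min = mn;
            this.max = mx;
        }
    }

    /**
     * Page-level pruning for a predicate like "value >= lower": a page whose
     * max is below the lower bound cannot contain a matching row, so it is
     * skipped before decompression.
     */
    static List<Page> prunePages(List<Page> pages, int lower) {
        List<Page> valid = new ArrayList<>();
        for (Page p : pages) {
            if (p.max >= lower) {
                valid.add(p);
            }
        }
        return valid;
    }

    /**
     * Fills the destination vector with one bulk copy per valid page, with no
     * per-row condition checks inside the loop. Row-level filtering is left to
     * Spark/Presto, which re-apply the filter anyway.
     */
    static int fillVector(List<Page> validPages, int[] vector) {
        int rows = 0;
        for (Page p : validPages) {
            System.arraycopy(p.values, 0, vector, rows, p.values.length);
            rows += p.values.length;
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Page> pages = List.of(
            new Page(new int[]{1, 2, 3}),      // max = 3, pruned for lower = 5
            new Page(new int[]{10, 11, 12}));  // max = 12, kept
        List<Page> valid = prunePages(pages, 5);
        int[] vector = new int[6];
        int rows = fillVector(valid, vector);
        System.out.println(rows);       // prints 3
        System.out.println(vector[0]);  // prints 10
    }
}
```

The point of the sketch is that the inner fill loop contains no branches at all: a page is either skipped entirely by its min/max or copied in one arraycopy call.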
Hi
+1, great proposal. Really looking forward to your pull request.

Regards
Liang
+1
> On 21 September 2018, at 10:20 AM, Ravindra Pesala <[hidden email]> wrote:
> [...]
+1
Regards
Kumar Vishal

On Thu, Sep 27, 2018 at 8:57 AM Jacky Li <[hidden email]> wrote:
> +1
> [...]
+1
Regards
Manish Gupta

On Thu, 27 Sep 2018 at 11:36 AM, Kumar Vishal <[hidden email]> wrote:
> +1
> [...]
So excited. Good optimization.