Hi,
When querying data through Spark or Presto, CarbonData is not well optimized for reading data and filling the vector. The major issues are as follows:
1. CarbonData has a long method stack for reading the data and filling it into the vector.
2. Many conditions and checks are evaluated before the data is filled into the vector.
3. Maintaining intermediate copies of the data leads to higher CPU utilization.
Because of these issues there is a high chance of CPU cache misses during processing, which leads to poor performance.

So here I am proposing an optimization that fills the vector with a short method stack, no condition checks, and no intermediate copies, so that the CPU cache is used more effectively.

*Full Scan queries:*
After decompressing a page in our V3 reader, we can immediately fill the data into the vector without any condition checks inside the loops. The complete column page data is set into the column vector in a single batch and handed back to Spark/Presto.

*Filter Queries:*
First, apply page-level pruning using the min/max of each page to get the valid pages of the blocklet. Then decompress only the valid pages and fill the vector directly, as in the full scan scenario.

With this method we also gain the advantage of avoiding double filtering, since Spark/Presto apply the filter again even when we return already-filtered data.

Please find the *TPCH performance report of the updated carbon* with the changes mentioned above. Please note that the changes are of POC quality, so it will take some time to stabilize them.

*Configurations*
Laptop with i7 processor and 16 GB RAM.
TPCH Data Scale: 100 GB
No sort, with no inverted index data.
Total CarbonData Size: 32 GB
Total Parquet Size: 31 GB

Queries           Parquet  Carbon New  Carbon Old  Carbon Old vs Carbon New  Carbon New vs Parquet  Carbon Old vs Parquet
Q1                101      96          128         25.00%                    4.95%                  -26.73%
Q2                85       82          85          3.53%                     3.53%                  0.00%
Q3                118      112         135         17.04%                    5.08%                  -14.41%
Q4                473      424         486         12.76%                    10.36%                 -2.75%
Q5                228      201         205         1.95%                     11.84%                 10.09%
Q6                19.2     19.2        48          60.00%                    0.00%                  -150.00%
Q7                194      181         198         8.59%                     6.70%                  -2.06%
Q8                285      263         275         4.36%                     7.72%                  3.51%
Q9                362      345         363         4.96%                     4.70%                  -0.28%
Q10               101      92          93          1.08%                     8.91%                  7.92%
Q11               64       61          62          1.61%                     4.69%                  3.13%
Q12               41.4     44          63          30.16%                    -6.28%                 -52.17%
Q13               43.4     43.6        43.7        0.23%                     -0.46%                 -0.69%
Q14               36.9     31.5        41          23.17%                    14.63%                 -11.11%
Q15               70       59          80          26.25%                    15.71%                 -14.29%
Q16               64       60          64          6.25%                     6.25%                  0.00%
Q17               426      418         432         3.24%                     1.88%                  -1.41%
Q18               1015     921         1001        7.99%                     9.26%                  1.38%
Q19               62       53          59          10.17%                    14.52%                 4.84%
Q20               406      326         426         23.47%                    19.70%                 -4.93%
Full Scan Query*  140      116         164         29.27%                    17.14%                 -17.14%

*Full Scan Query means a count of every column of lineitem; in this way we can check the full scan query performance.

The above optimization is not limited to the fileformat and Presto integrations; it also improves the CarbonSession integration.
We can further optimize carbon through tasks (Vishal is already working on them) like adaptive encoding for all types of columns and storing lengths and values in separate pages for the string datatype. Please refer to
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html

--
Thanks & Regards,
Ravi
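As a rough illustration of the two ideas above (a minimal sketch under assumed names: Page, prunePages and fillVector are hypothetical, not actual CarbonData APIs), min/max page pruning followed by a single bulk copy per surviving page could look like this:

```java
// Minimal, self-contained sketch of the proposal's two ideas.
// NOTE: Page, prunePages and fillVector are illustrative names only,
// not actual CarbonData APIs.
import java.util.ArrayList;
import java.util.List;

public class VectorFillSketch {

    /** Stand-in for a decoded column page together with its min/max metadata. */
    static final class Page {
        final int[] values;
        final int min;
        final int max;

        Page(int[] values) {
            this.values = values;
            int mn = values[0], mx = values[0];
            for (int v : values) {
                mn = Math.min(mn, v);
                mx = Math.max(mx, v);
            }
            this.min = mn;
            this.max = mx;
        }
    }

    /**
     * Page-level pruning for a predicate like "value >= lower": a page whose
     * max is below the lower bound cannot contain a matching row, so it is
     * skipped before decompression.
     */
    static List<Page> prunePages(List<Page> pages, int lower) {
        List<Page> valid = new ArrayList<>();
        for (Page p : pages) {
            if (p.max >= lower) {
                valid.add(p);
            }
        }
        return valid;
    }

    /**
     * Fills the destination vector with one bulk copy per valid page, with no
     * per-row condition checks inside the loop. Row-level filtering is left to
     * Spark/Presto, which re-apply the filter anyway.
     */
    static int fillVector(List<Page> validPages, int[] vector) {
        int rows = 0;
        for (Page p : validPages) {
            System.arraycopy(p.values, 0, vector, rows, p.values.length);
            rows += p.values.length;
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Page> pages = List.of(
            new Page(new int[]{1, 2, 3}),      // max = 3, pruned for lower = 5
            new Page(new int[]{10, 11, 12}));  // max = 12, kept
        List<Page> valid = prunePages(pages, 5);
        int[] vector = new int[6];
        int rows = fillVector(valid, vector);
        System.out.println(rows);       // prints 3
        System.out.println(vector[0]);  // prints 10
    }
}
```

The point of the sketch is that the inner fill loop contains no branches at all: a page is either skipped entirely by its min/max or copied in one arraycopy call.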
Hi
+1, great proposal. Really looking forward to your pull request.

Regards
Liang
+1
> On 21 September 2018, at 10:20 AM, Ravindra Pesala <[hidden email]> wrote:
> [...]
+1
Regards
Kumar Vishal

On Thu, Sep 27, 2018 at 8:57 AM Jacky Li <[hidden email]> wrote:
> +1
> [...]
+1
Regards
Manish Gupta

On Thu, 27 Sep 2018 at 11:36 AM, Kumar Vishal <[hidden email]> wrote:
> +1
> [...]
So excited. Good optimization.