Hi dev:
Recently I compared the full-scan performance of Parquet and CarbonData, and found that CarbonData's full scan was slower than Parquet's.

*My test:*
1. Spark 2.2 + Parquet vs Spark 2.2 + CarbonData (master branch);
2. Run in local[1] mode;
3. There are 8 parquet files in one folder, 47474456 records in total; each file is about *170* MB;
4. There are 8 segments in one carbondata table, 47474456 records in total; each segment has one file of about *220* MB, and each carbondata file has *4 blocklets and 186 pages*;
5. Each parquet file and the corresponding carbondata file contain the same data;
6. Create table SQL:

CREATE TABLE IF NOT EXISTS cll_carbon_small (
  ftype string, chan string, ts int, fcip string, fratio int, size long,
  host string, acarea string, url string, rt string, pdate string,
  ptime string, code int, ua string, uabro string, uabrov string,
  uaos string, uaptfm string, uadvc string, cache string, bsize long,
  msecdl long, refer string, upsize long, fvarf string
)
STORED BY 'carbondata'
TBLPROPERTIES(
  'streaming'='false',
  'sort_columns'='chan,ftype,ts,fcip,cache,code',
  'LOCAL_DICTIONARY_ENABLE'='true',
  'LOCAL_DICTIONARY_THRESHOLD'='50000',
  'LOCAL_DICTIONARY_EXCLUDE'='ptime,ua,refer,url,rt,uadvc,fvarf,host',
  'MAJOR_COMPACTION_SIZE'='8192',
  'COMPACTION_LEVEL_THRESHOLD'='2,8',
  'AUTO_LOAD_MERGE'='false',
  'SORT_SCOPE'='LOCAL_SORT',
  'TABLE_BLOCKSIZE'='512'
);

7. Test SQL:
1). select count(chan), count(fcip), sum(size) from table;
2). select chan, fcip, sum(size) from table group by chan, fcip order by chan, fcip;

*Test result:*
SQL1: Parquet: 4s 4s 4s; CarbonData: 12s 11s 12s
SQL2: Parquet: 11s 10s 11s; CarbonData: 18s 18s 19s

*Analysis:*
I added some timing code, changed the batch size of CarbonVectorProxy from 4 * 1024 to 32 * 1024, and used non-prefetch mode. The time stats (from one test run):
1. BlockletFullScanner.readBlocklet: 169ms;
2. BlockletFullScanner.scanBlocklet: 176ms;
3. DictionaryBasedVectorResultCollector.collectResultInColumnarBatch: 7958ms. In this part it takes about 200-300ms to handle each blocklet, so handling one carbondata file takes about 1s in total, but the carbon stat log shows about 1-2s per file for SQL1 and 2-3s per file for SQL2;
4. In CarbonScanRDD.internalCompute the iterator executes 1464 times; each iteration takes about 8-9ms for SQL1 and 10-15ms for SQL2;
5. The total time of steps 1-3 is almost the same for SQL1 and SQL2.

*Questions:*
1. Is there any optimization possible for DictionaryBasedVectorResultCollector.collectResultInColumnarBatch?
2. My timing shows about 1s to handle one carbondata file, but in the Spark UI it actually takes about 1-2s for SQL1 and 2-3s for SQL2. Why? Shuffle? Compute?
3. Can the batch size of CarbonVectorProxy be made configurable, to reduce the number of iterations? The default value is 4 * 1024 and the iterator executes 11616 times.

BTW, once the optimization mentioned in this mailing thread (http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html) is done, I will run this test case again.

Any feedback is welcome.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
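The iteration counts quoted above line up with the record count divided by the ColumnarBatch capacity. A quick back-of-the-envelope check in plain Python, using only the numbers reported in this message (the small excess in the observed counts is presumably from partially filled batches at file or blocklet boundaries):

```python
import math

total_records = 47_474_456

# Default ColumnarBatch capacity in CarbonVectorProxy: 4 * 1024 rows
default_batch = 4 * 1024
print(math.ceil(total_records / default_batch))  # -> 11591, close to the 11616 iterations observed

# Enlarged capacity: 32 * 1024 rows
large_batch = 32 * 1024
print(math.ceil(total_records / large_batch))    # -> 1449, close to the 1464 iterations observed
```

So an 8x larger batch cuts the per-iteration overhead roughly 8x, which is why question 3 asks to make this size configurable.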
If this property is configurable, how do you want to use it?
Does changing this property benefit all your queries? If it doesn't, a single system-wide property may not suit every query. Then how about a hint for this property?

> On Sep 20, 2018, at 00:02, xm_zzc <[hidden email]> wrote:
>
> 3. Can the batch size of CarbonVectorProxy be made configurable, to
> reduce the number of iterations? The default value is 4 * 1024 and the
> iterator executes 11616 times.
Hi chuanyin:
I used SQL1 and SQL2 as test cases and ran in local[4] mode.

When the rowNum of CarbonVectorProxy (actually the capacity of ColumnarBatch) is 4 * 1024 (default):
SQL1: 8s, 9s (two runs); SQL2: 12s, 11s

But when it is 16 * 1024:
SQL1: 6s, 6s; SQL2: 9s, 8s

So changing this property benefits both of my test cases.
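For reference, the speedup implied by those timings, as simple arithmetic over the averages of the two runs reported above:

```python
# Average runtimes (seconds) from the two runs at each batch size
sql1_4k  = (8 + 9) / 2    # rowNum = 4 * 1024
sql1_16k = (6 + 6) / 2    # rowNum = 16 * 1024
sql2_4k  = (12 + 11) / 2
sql2_16k = (9 + 8) / 2

print(round(sql1_4k / sql1_16k, 2))  # -> 1.42 (SQL1 speedup)
print(round(sql2_4k / sql2_16k, 2))  # -> 1.35 (SQL2 speedup)
```

Both queries gain roughly 1.3-1.4x from a 4x larger batch, which supports making the size tunable rather than fixed.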
In reply to this post by xm_zzc
Hi,
Thanks for testing the performance. We have also observed this performance difference and are working on improving it. Please check my latest discussion (http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/CarbonData-Performance-Optimization-td62950.html) on improving scan performance; I have raised a PR (still WIP) for it. There is also another discussion (http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html) on optimizing the store and improving performance.

Regards,
Ravindra.
Hi Ravindra:
I re-tested my test cases mentioned above with Spark 2.3.2 + CarbonData master branch; the query performance of carbondata is now almost the same as parquet:

*Test result:*
SQL1: Parquet: 4.6s 4s 3.8s; CarbonData: 4.7s 3.6s 3.5s
SQL2: Parquet: 9s 8s 8s; CarbonData: 9s 8s 8s

The query performance of CarbonData has improved a lot (SQL1: 12s to 4s, SQL2: 18s to 8s), while the query performance of parquet has also improved (SQL2: 10s to 8s). That's great.

But in the test results you mentioned in 'http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/CarbonData-Performance-Optimization-td62950.html', the query performance of carbondata was mostly better than parquet. I want to know how you tested those cases, and are there other optimizations that have not been merged yet?

Regards,
Zhichao.
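The improvement described above, expressed as ratios (plain Python over the first-run times quoted in this thread):

```python
# CarbonData, old numbers (Spark 2.2) vs new numbers (Spark 2.3.2 + master)
print(12 / 4)  # SQL1: -> 3.0x faster
print(18 / 8)  # SQL2: -> 2.25x faster

# Parquet also improved between the two test rounds
print(10 / 8)  # SQL2: -> 1.25x faster
```

So CarbonData's gain (2-3x) far exceeds the general Spark-version gain seen on Parquet (about 1.25x), which suggests most of the improvement came from CarbonData-side changes.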