Posted by xm_zzc on Sep 19, 2018; 4:02pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Low-Performance-of-full-scan-tp62859.html
Hi dev:
Recently I compared full-scan performance between Parquet and
CarbonData, and found that CarbonData's full scan was noticeably slower
than Parquet's.
*My test:*
1. Spark 2.2 + Parquet vs. Spark 2.2 + CarbonData (master branch);
2. Run in local[1] mode;
3. 8 Parquet files in one folder, 47,474,456 records in total; each file is about *170 MB*;
4. 8 segments in one CarbonData table, 47,474,456 records in total; each segment has one file of about *220 MB*, and each carbondata file contains *4 blocklets and 186 pages*;
5. Each Parquet file and the corresponding carbondata file contain the same data;
6. Create table SQL:
CREATE TABLE IF NOT EXISTS cll_carbon_small (
  ftype string,
  chan string,
  ts int,
  fcip string,
  fratio int,
  size long,
  host string,
  acarea string,
  url string,
  rt string,
  pdate string,
  ptime string,
  code int,
  ua string,
  uabro string,
  uabrov string,
  uaos string,
  uaptfm string,
  uadvc string,
  cache string,
  bsize long,
  msecdl long,
  refer string,
  upsize long,
  fvarf string
)
STORED BY 'carbondata'
TBLPROPERTIES(
  'streaming'='false',
  'sort_columns'='chan,ftype,ts,fcip,cache,code',
  'LOCAL_DICTIONARY_ENABLE'='true',
  'LOCAL_DICTIONARY_THRESHOLD'='50000',
  'LOCAL_DICTIONARY_EXCLUDE'='ptime,ua,refer,url,rt,uadvc,fvarf,host',
  'MAJOR_COMPACTION_SIZE'='8192',
  'COMPACTION_LEVEL_THRESHOLD'='2,8',
  'AUTO_LOAD_MERGE'='false',
  'SORT_SCOPE'='LOCAL_SORT',
  'TABLE_BLOCKSIZE'='512'
);
7. Test SQL:
1). select count(chan),count(fcip),sum(size) from table;
2). select chan,fcip,sum(size) from table group by chan, fcip order by chan, fcip;
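For reference, a minimal sketch of how I drive such a run (the store path
and the three-run loop are my own scaffolding around the usual
CarbonSession entry point, not the exact benchmark code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._ // enables getOrCreateCarbonSession

object FullScanBench {
  def main(args: Array[String]): Unit = {
    // local[1], matching the single-core setup above;
    // the store path is a placeholder for my environment
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("full-scan-bench")
      .getOrCreateCarbonSession("/path/to/carbonstore")

    val queries = Seq(
      "select count(chan),count(fcip),sum(size) from cll_carbon_small",
      "select chan,fcip,sum(size) from cll_carbon_small " +
        "group by chan, fcip order by chan, fcip")

    // run each query three times and report wall-clock seconds
    for (q <- queries; run <- 1 to 3) {
      val t0 = System.nanoTime()
      spark.sql(q).collect() // force full evaluation
      println(f"run $run: ${(System.nanoTime() - t0) / 1e9}%.1f s  $q")
    }
  }
}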
*Test result:*
SQL1: Parquet:    4s  4s  4s
      CarbonData: 12s 11s 12s
SQL2: Parquet:    11s 10s 11s
      CarbonData: 18s 18s 19s
*Analysis:*
I added some timing counters to the code, changed the batch size of CarbonVectorProxy from 4 * 1024 to 32 * 1024, and used non-prefetch mode. The timing stats from one test run (a sketch of the instrumentation helper I used follows the list):
1. BlockletFullScanner.readBlocklet: 169ms;
2. BlockletFullScanner.scanBlocklet: 176ms;
3. DictionaryBasedVectorResultCollector.collectResultInColumnarBatch: 7958ms; within this part it takes about 200-300ms to handle each blocklet, so about 1s in total for one carbondata file, but the carbon stat log shows that handling one file takes about 1-2s for SQL1 and 2-3s for SQL2;
4. In CarbonScanRDD.internalCompute, the iterator executes 1464 times; each iteration takes about 8-9ms for SQL1 and 10-15ms for SQL2 (1464 x 8-9ms is roughly 12-13s, which matches SQL1's end-to-end times, and 1464 x 10-15ms brackets SQL2's 18-19s);
5. The total time of steps 1-3 is almost the same for SQL1 and SQL2.
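For reference, a sketch of the timing-helper style mentioned above (the
object and method names here are mine, not from the CarbonData code base):

object Timing {
  // accumulate elapsed nanoseconds per label across many calls
  private val totals =
    scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)

  def time[T](label: String)(body: => T): T = {
    val t0 = System.nanoTime()
    try body
    finally totals(label) += System.nanoTime() - t0
  }

  // print the accumulated totals once the query has finished
  def dump(): Unit =
    totals.foreach { case (label, ns) => println(f"$label: ${ns / 1e6}%.0f ms") }
}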
*Questions:*
1. Is there any optimization possible for DictionaryBasedVectorResultCollector.collectResultInColumnarBatch?
2. My timing shows about 1s to handle one carbondata file, but in the Spark UI it actually takes about 1-2s for SQL1 and 2-3s for SQL2. Where does the extra time go? Shuffle? Compute?
3. Could the batch size of CarbonVectorProxy be made configurable, to reduce the number of iterations? The default is 4 * 1024, and with it the iterator executes 11616 times (see the back-of-envelope check below).
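A quick back-of-envelope check on those iteration counts (my own
arithmetic, not output from CarbonData): they line up with the total row
count divided by the ColumnarBatch size:

val totalRows = 47474456L
println(totalRows / 4096.0)  // ~11590, close to the observed 11616 iterations
println(totalRows / 32768.0) // ~1449,  close to the observed 1464 iterations
// the small surplus likely comes from partial batches at file/blocklet boundaries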
BTW, once the optimization mentioned in this mailing thread
(http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html)
is done, I will rerun this test case.
Any feedback is welcome.
--
Sent from:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/