
Low Performance of full scan.

Posted by xm_zzc on Sep 19, 2018; 4:02pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Low-Performance-of-full-scan-tp62859.html

Hi dev:  
  Recently I compared full-scan performance between Parquet and CarbonData, and found that CarbonData's full scan performed worse than Parquet's.

*My test:*
    1. Spark 2.2 + Parquet vs. Spark 2.2 + CarbonData (master branch);
    2. Run in local[1] mode;
    3. There are 8 Parquet files in one folder, 47474456 records in total; each file is about *170* MB;
    4. There are 8 segments in one CarbonData table, 47474456 records in total; each segment has one file of about *220* MB, and each CarbonData file contains *4 blocklets and 186 pages*;
    5. The data in each Parquet file and the corresponding CarbonData file is the same;
    6. Create table SQL:
        CREATE TABLE IF NOT EXISTS cll_carbon_small (
            ftype    string,
            chan     string,
            ts       int   ,
            fcip     string,
            fratio   int   ,
            size     long,
            host     string,
            acarea   string,
            url      string,
            rt       string,
            pdate    string,
            ptime    string,
            code     int   ,
            ua       string,
            uabro    string,
            uabrov   string,
            uaos     string,
            uaptfm   string,
            uadvc    string,
            cache    string,
            bsize    long,
            msecdl   long,
            refer    string,
            upsize   long,
            fvarf    string
        )
        STORED BY 'carbondata'
        TBLPROPERTIES(
            'streaming'='false',
            'sort_columns'='chan,ftype,ts,fcip,cache,code',
            'LOCAL_DICTIONARY_ENABLE'='true',
            'LOCAL_DICTIONARY_THRESHOLD'='50000',
            'LOCAL_DICTIONARY_EXCLUDE'='ptime,ua,refer,url,rt,uadvc,fvarf,host',
            'MAJOR_COMPACTION_SIZE'='8192',
            'COMPACTION_LEVEL_THRESHOLD'='2,8',
            'AUTO_LOAD_MERGE'='false',
            'SORT_SCOPE'='LOCAL_SORT',
            'TABLE_BLOCKSIZE'='512'
        );
    7. Test SQLs (see the timing sketch after this list):
        1). select count(chan),count(fcip),sum(size) from table;  
        2). select chan,fcip,sum(size) from table group by chan, fcip order by chan, fcip;
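
For reference, below is a minimal sketch of how the two test SQLs could be timed in spark-shell; the timing helper is illustrative and not part of the actual test code, and the table name is taken from the create statement above (the Parquet table would be queried the same way).

    // Minimal sketch, assuming a SparkSession/CarbonSession named `spark`
    // with the tables already registered; the helper below is illustrative.
    def time[A](label: String)(body: => A): A = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.1f s")
      result
    }

    val sql1 = "select count(chan), count(fcip), sum(size) from cll_carbon_small"
    val sql2 = "select chan, fcip, sum(size) from cll_carbon_small " +
               "group by chan, fcip order by chan, fcip"

    time("SQL1")(spark.sql(sql1).collect())
    time("SQL2")(spark.sql(sql2).collect())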

*Test result:*
  SQL1:    Parquet:      4s    4s    4s
           CarbonData:  12s   11s   12s
  SQL2:    Parquet:     11s   10s   11s
           CarbonData:  18s   18s   19s

*Analysis:*
    I added some timing counters in the code, changed the batch size of CarbonVectorProxy from 4 * 1024 to 32 * 1024, and used non-prefetch mode. The timing stats (from one test run):

    1. BlockletFullScanner.readBlocklet:  169ms;
    2. BlockletFullScanner.scanBlocklet:  176ms;
    3. DictionaryBasedVectorResultCollector.collectResultInColumnarBatch: 7958ms. In this part it takes about 200-300ms to handle each blocklet, so about 1s in total to handle one CarbonData file; but the Carbon stat log shows it takes about 1-2s per CarbonData file for SQL1 and 2-3s per file for SQL2;
    4. In CarbonScanRDD.internalCompute, the iterator executes 1464 times; each iteration takes about 8-9ms for SQL1 and 10-15ms for SQL2 (see the arithmetic sketch after this list);
    5. The total time of steps 1-3 is almost the same for SQL1 and SQL2;
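
As a rough sanity check on the iteration counts in item 4 (this arithmetic is mine, not from the measurements), the number of iterations roughly tracks total rows divided by batch size, plus a few partial batches at blocklet/page boundaries:

    // Back-of-the-envelope check of the iteration counts.
    val totalRows     = 47474456L
    val defaultBatch  = 4 * 1024     // CarbonVectorProxy default
    val enlargedBatch = 32 * 1024    // size used in this test
    println(totalRows / defaultBatch)   // 11590, close to the observed 11616
    println(totalRows / enlargedBatch)  // 1448, close to the observed 1464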

*Questions:*
    1. Is there any optimization possible for DictionaryBasedVectorResultCollector.collectResultInColumnarBatch?
    2. According to my timing it takes about 1s to handle one CarbonData file, but in the Spark UI it actually takes about 1-2s for SQL1 and 2-3s for SQL2. Why? Shuffle? Compute?
    3. Could the batch size of CarbonVectorProxy be made configurable to reduce the number of iterations? The default value is 4 * 1024, with which the iterator executes 11616 times (see the hypothetical sketch after this list).
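
Regarding question 3, here is a purely hypothetical sketch of what a configurable batch size might look like; the property key and the plumbing are invented for illustration and are not an existing CarbonData API:

    // Hypothetical sketch only: read the vector batch size from a system
    // property instead of hard-coding 4 * 1024; the key name is invented.
    val DefaultBatchSize = 4 * 1024
    val batchSize = sys.props.get("carbon.vector.batch.size")   // hypothetical key
      .map(_.toInt)
      .getOrElse(DefaultBatchSize)
    // ...pass `batchSize` wherever CarbonVectorProxy currently uses 4 * 1024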

BTW, once the optimization discussed in this mailing thread (http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html) is done, I will rerun this test case.

Any feedback is welcome.


