Low Performance of full scan.


xm_zzc
Hi dev,
  Recently I compared full-scan performance between Parquet and CarbonData,
and found that CarbonData's full scan was slower than Parquet's.

*My test:*
    1. Spark 2.2 + Parquet vs. Spark 2.2 + CarbonData (master branch);
    2. Run on local[1] mode;
    3. There are 8 parquet files in one folder, total: 47474456 records; the size of each file is about *170* MB;
    4. There are 8 segments in one carbondata table, total: 47474456 records; each segment has one file, the size of each file is about *220* MB, and there are *4 blocklets and 186 pages* in one carbondata file;
    5. The data of each parquet file and carbondata file is the same;
    6. create table sql:
        CREATE TABLE IF NOT EXISTS cll_carbon_small (
            ftype    string,
            chan     string,
            ts       int   ,
            fcip     string,
            fratio   int   ,
            size     long,
            host     string,
            acarea   string,
            url      string,
            rt       string,
            pdate    string,
            ptime    string,
            code     int   ,
            ua       string,
            uabro    string,
            uabrov   string,
            uaos     string,
            uaptfm   string,
            uadvc    string,
            cache    string,
            bsize    long,
            msecdl   long,
            refer    string,
            upsize   long,
            fvarf    string
        )
        STORED BY 'carbondata'
        TBLPROPERTIES(
            'streaming'='false',
            'sort_columns'='chan,ftype,ts,fcip,cache,code',
            'LOCAL_DICTIONARY_ENABLE'='true',
            'LOCAL_DICTIONARY_THRESHOLD'='50000',
            'LOCAL_DICTIONARY_EXCLUDE'='ptime,ua,refer,url,rt,uadvc,fvarf,host',
            'MAJOR_COMPACTION_SIZE'='8192',
            'COMPACTION_LEVEL_THRESHOLD'='2,8',
            'AUTO_LOAD_MERGE'='false',
            'SORT_SCOPE'='LOCAL_SORT',
            'TABLE_BLOCKSIZE'='512'
        );
    7. test sql:    
        1). select count(chan),count(fcip),sum(size) from table;  
        2). select chan,fcip,sum(size) from table group by chan, fcip order by chan, fcip;
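The timings below appear to be wall-clock measurements over three runs per query. A minimal, generic sketch of such a harness (the `run_query` callable and the repeat count are my assumptions, not something stated in the thread):

```python
import time

def time_query(run_query, repeats=3):
    """Call `run_query` `repeats` times and return per-run wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # in a real benchmark: spark.sql(sql_text).collect()
        timings.append(time.perf_counter() - start)
    return timings

# Stubbed usage; swap the lambda for an actual Spark action.
print(time_query(lambda: sum(range(100_000))))
```

In a real run you would also account for warm-up effects (JVM/JIT and file-system cache), which this sketch ignores.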

*Test result:*
  SQL1:    Parquet:      4s     4s     4s
           CarbonData:   12s    11s    12s
  SQL2:    Parquet:      11s    10s    11s
           CarbonData:   18s    18s    19s

*Analysis:*
    I added some timing counters in the code and changed the batch size of
CarbonVectorProxy from 4 * 1024 to 32 * 1024, using non-prefetch mode. The
timing stats (from one test run):

    1. BlockletFullScanner.readBlocklet:  169ms;
    2. BlockletFullScanner.scanBlocklet:  176ms;
    3. DictionaryBasedVectorResultCollector.collectResultInColumnarBatch: 7958ms. In this part it takes about 200-300ms to handle each blocklet, so about 1s in total to handle one carbondata file; but the carbon stat log shows about 1-2s per file for SQL1 and 2-3s per file for SQL2;
    4. In CarbonScanRDD.internalCompute, the iterator executes 1464 times; each iteration takes about 8-9ms for SQL1 and 10-15ms for SQL2;
    5. The total times of steps 1-3 are almost the same for SQL1 and SQL2.
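As a sanity check on the iteration counts above: if batches never cross blocklet boundaries, then 8 files x 4 blocklets = 32 blocklets of roughly 47474456 / 32 ≈ 1.48M rows each, which reproduces the observed 11616 iterations at the default 4 * 1024 batch size (the even-split assumption is mine):

```python
import math

total_rows = 47_474_456
blocklets = 8 * 4  # 8 carbondata files x 4 blocklets per file
rows_per_blocklet = math.ceil(total_rows / blocklets)  # ~1.48M rows, assuming an even split

# Default batch size of 4 * 1024, with batches bounded by blocklet edges:
batches_4k = math.ceil(rows_per_blocklet / (4 * 1024)) * blocklets
print(batches_4k)   # 11616, matching the observed iterate count

# With a 32 * 1024 batch the count drops by roughly 8x:
batches_32k = math.ceil(rows_per_blocklet / (32 * 1024)) * blocklets
print(batches_32k)  # 1472; the thread observed 1464 (real blocklet sizes vary a little)
```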

*Questions:*
    1. Is there any optimization planned for DictionaryBasedVectorResultCollector.collectResultInColumnarBatch?
    2. My timing shows it takes about 1s to handle one carbondata file, but in the Spark UI it actually takes about 1-2s for SQL1 and 2-3s for SQL2. Why? Shuffle? Compute?
    3. Could the batch size of CarbonVectorProxy be made configurable to reduce the number of iterations? The default value is 4 * 1024, and the iterator executes 11616 times.

BTW, once the optimization mentioned in this mailing thread
(http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html)
is done, I will use this case to test again.

Any feedback is welcome.



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ 

Re: Low Performance of full scan.

xuchuanyin
If this property is configurable, how do you want to use it?

Does changing this property benefit all your queries? If it doesn't, a system-wide property may not suit every query. In that case, how about a hint for this property instead?

> On Sep 20, 2018, at 00:02, xm_zzc <[hidden email]> wrote:
>
> 3. Can it support to configurate the size of CarbonVectorProxy to
> reduce times of iterate? Default value is 4 * 1024 and iterate executes
> 11616 times.


Re: Low Performance of full scan.

xm_zzc
Hi chuanyin:
  I used SQL1 and SQL2 as test cases and ran on local[4] mode.
  When the rowNum of CarbonVectorProxy (actually the capacity of
ColumnarBatch) is 4 * 1024 (default):
  SQL1: 8s, 9s (two runs),    SQL2: 12s, 11s
  but when it is 16 * 1024:
  SQL1: 6s, 6s,               SQL2: 9s, 8s

  So changing this property benefits both of my test cases.
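For reference, averaging the two runs per configuration, the 16 * 1024 batch cuts query time by roughly a quarter in both cases:

```python
# Averages of the two reported runs per configuration (local[4] mode).
sql1_4k, sql1_16k = (8 + 9) / 2, (6 + 6) / 2    # seconds, batch 4*1024 vs 16*1024
sql2_4k, sql2_16k = (12 + 11) / 2, (9 + 8) / 2
print(round(1 - sql1_16k / sql1_4k, 2))  # 0.29 -> ~29% less time for SQL1
print(round(1 - sql2_16k / sql2_4k, 2))  # 0.26 -> ~26% less time for SQL2
```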




Re: Low Performance of full scan.

ravipesala
In reply to this post by xm_zzc
Hi,

Thanks for testing the performance. We have also observed this performance
difference and are working to improve it. Please check my latest discussion
(http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/CarbonData-Performance-Optimization-td62950.html)
on improving scan performance; I have raised a PR (still WIP) for the same.
There is also one more discussion
(http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html)
on optimizing the store and improving performance.

Regards,
Ravindra.




Re: Low Performance of full scan.

xm_zzc
Hi Ravindra:
    I re-tested my test cases mentioned above with Spark 2.3.2 + CarbonData
master branch; the query performance of CarbonData is now almost the same as
Parquet's:

*Test result:*
  SQL1:    Parquet:      4.6s   4s     3.8s
           CarbonData:   4.7s   3.6s   3.5s
  SQL2:    Parquet:      9s     8s     8s
           CarbonData:   9s     8s     8s

  The query performance of CarbonData has improved a lot (SQL1: 12s to 4s,
SQL2: 18s to 8s), while the query performance of Parquet has also improved
(SQL2: 10s to 8s). That's great.
  But in the test results you mentioned in
'http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/CarbonData-Performance-Optimization-td62950.html',
the query performance of CarbonData was almost always better than Parquet. I
want to know how you tested those cases, and whether there are other
optimizations that have not been merged yet?

Regards,
Zhichao.


