Hi dev:
Recently I compared the full-scan performance of Parquet and CarbonData, and found that CarbonData's full scan was slower than Parquet's.

*My test:*
1. Spark 2.2 + Parquet vs Spark 2.2 + CarbonData (master branch);
2. Run in local[1] mode;
3. There are 8 parquet files in one folder, 47474456 records in total; each file is about *170* MB;
4. There are 8 segments in one carbondata table, 47474456 records in total; each segment has one file of about *220* MB, and each carbondata file has *4 blocklets and 186 pages*;
5. Each parquet file and the corresponding carbondata file contain the same data;
6. Create table SQL:

CREATE TABLE IF NOT EXISTS cll_carbon_small (
  ftype string, chan string, ts int, fcip string, fratio int, size long,
  host string, acarea string, url string, rt string, pdate string,
  ptime string, code int, ua string, uabro string, uabrov string,
  uaos string, uaptfm string, uadvc string, cache string, bsize long,
  msecdl long, refer string, upsize long, fvarf string
)
STORED BY 'carbondata'
TBLPROPERTIES(
  'streaming'='false',
  'sort_columns'='chan,ftype,ts,fcip,cache,code',
  'LOCAL_DICTIONARY_ENABLE'='true',
  'LOCAL_DICTIONARY_THRESHOLD'='50000',
  'LOCAL_DICTIONARY_EXCLUDE'='ptime,ua,refer,url,rt,uadvc,fvarf,host',
  'MAJOR_COMPACTION_SIZE'='8192',
  'COMPACTION_LEVEL_THRESHOLD'='2,8',
  'AUTO_LOAD_MERGE'='false',
  'SORT_SCOPE'='LOCAL_SORT',
  'TABLE_BLOCKSIZE'='512'
);

7. Test SQL:
1). select count(chan), count(fcip), sum(size) from table;
2). select chan, fcip, sum(size) from table group by chan, fcip order by chan, fcip;

*Test result:*
SQL1: Parquet: 4s 4s 4s; CarbonData: 12s 11s 12s
SQL2: Parquet: 11s 10s 11s; CarbonData: 18s 18s 19s

*Analysis:*
I added some timing code, changed the batch size of CarbonVectorProxy from 4 * 1024 to 32 * 1024, and used non-prefetch mode. The time stats (from one test run):
1. BlockletFullScanner.readBlocklet: 169ms;
2. BlockletFullScanner.scanBlocklet: 176ms;
3. DictionaryBasedVectorResultCollector.collectResultInColumnarBatch: 7958ms. In this part it takes about 200-300ms to handle each blocklet, so handling one carbondata file takes about 1s in total, but the carbon stat log shows about 1-2s per file for SQL1 and 2-3s per file for SQL2;
4. In CarbonScanRDD.internalCompute the iterator executes 1464 times; each iteration takes about 8-9ms for SQL1 and 10-15ms for SQL2;
5. The total time of steps 1-3 is almost the same for SQL1 and SQL2.

*Questions:*
1. Is there any optimization possible for DictionaryBasedVectorResultCollector.collectResultInColumnarBatch?
2. My timing shows about 1s to handle one carbondata file, but in the Spark UI it actually takes about 1-2s for SQL1 and 2-3s for SQL2. Why? Shuffle? Compute?
3. Can the batch size of CarbonVectorProxy be made configurable, to reduce the number of iterations? The default value is 4 * 1024 and the iterator executes 11616 times.

BTW, once the optimization mentioned in this mailing thread (http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html) is done, I will run this test case again.

Any feedback is welcome.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
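The iteration counts quoted above line up with the record count divided by the ColumnarBatch capacity. A quick back-of-the-envelope check in plain Python, using only the numbers reported in this message (the small excess in the observed counts is presumably from partially filled batches at file or blocklet boundaries):

```python
import math

total_records = 47_474_456

# Default ColumnarBatch capacity in CarbonVectorProxy: 4 * 1024 rows
default_batch = 4 * 1024
print(math.ceil(total_records / default_batch))  # -> 11591, close to the 11616 iterations observed

# Enlarged capacity: 32 * 1024 rows
large_batch = 32 * 1024
print(math.ceil(total_records / large_batch))    # -> 1449, close to the 1464 iterations observed
```

So an 8x larger batch cuts the per-iteration overhead roughly 8x, which is why question 3 asks to make this size configurable.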
If this property is configurable, how do you want to use it?
Does changing this property benefit all your queries? If it doesn't, a single system-wide property may not suit every query. Then how about a hint for this property?

> On Sep 20, 2018, at 00:02, xm_zzc <[hidden email]> wrote:
>
> 3. Can the batch size of CarbonVectorProxy be made configurable, to
> reduce the number of iterations? The default value is 4 * 1024 and the
> iterator executes 11616 times.
Hi chuanyin:
I used SQL1 and SQL2 as test cases and ran in local[4] mode.

When the rowNum of CarbonVectorProxy (actually the capacity of ColumnarBatch) is 4 * 1024 (default):
SQL1: 8s, 9s (two runs); SQL2: 12s, 11s

But when it is 16 * 1024:
SQL1: 6s, 6s; SQL2: 9s, 8s

So changing this property benefits both of my test cases.
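For reference, the speedup implied by those timings, as simple arithmetic over the averages of the two runs reported above:

```python
# Average runtimes (seconds) from the two runs at each batch size
sql1_4k  = (8 + 9) / 2    # rowNum = 4 * 1024
sql1_16k = (6 + 6) / 2    # rowNum = 16 * 1024
sql2_4k  = (12 + 11) / 2
sql2_16k = (9 + 8) / 2

print(round(sql1_4k / sql1_16k, 2))  # -> 1.42 (SQL1 speedup)
print(round(sql2_4k / sql2_16k, 2))  # -> 1.35 (SQL2 speedup)
```

Both queries gain roughly 1.3-1.4x from a 4x larger batch, which supports making the size tunable rather than fixed.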
In reply to this post by xm_zzc
Hi,
Thanks for testing the performance. We have also observed this performance difference and are working on improving it. Please check my latest discussion (http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/CarbonData-Performance-Optimization-td62950.html) on improving scan performance; I have raised a PR (still WIP) for it. There is also another discussion (http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html) on optimizing the store and improving performance.

Regards,
Ravindra.
Hi Ravindra:
I re-tested my test cases mentioned above with Spark 2.3.2 + CarbonData master branch; the query performance of carbondata is now almost the same as parquet:

*Test result:*
SQL1: Parquet: 4.6s 4s 3.8s; CarbonData: 4.7s 3.6s 3.5s
SQL2: Parquet: 9s 8s 8s; CarbonData: 9s 8s 8s

The query performance of CarbonData has improved a lot (SQL1: 12s to 4s, SQL2: 18s to 8s), while the query performance of parquet has also improved (SQL2: 10s to 8s). That's great.

But in the test results you mentioned in 'http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/CarbonData-Performance-Optimization-td62950.html', the query performance of carbondata was mostly better than parquet. I want to know how you tested those cases, and are there other optimizations that have not been merged yet?

Regards,
Zhichao.
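The improvement described above, expressed as ratios (plain Python over the first-run times quoted in this thread):

```python
# CarbonData, old numbers (Spark 2.2) vs new numbers (Spark 2.3.2 + master)
print(12 / 4)  # SQL1: -> 3.0x faster
print(18 / 8)  # SQL2: -> 2.25x faster

# Parquet also improved between the two test rounds
print(10 / 8)  # SQL2: -> 1.25x faster
```

So CarbonData's gain (2-3x) far exceeds the general Spark-version gain seen on Parquet (about 1.25x), which suggests most of the improvement came from CarbonData-side changes.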