Hi All,

I would like to propose some improvements to the CarbonReader implementation to increase performance:

1. When no filter expression is provided by the user, instead of calling the getSplits method we can list the carbondata files and treat one file as one split. This would improve performance because the time spent loading the block/blocklet datamap would be avoided.

2. Implement a Vectorized Reader and expose an API for the user to switch between the CarbonReader and the vectorized reader. Additionally, an API would be provided for the user to extract the columnar batch instead of rows, which would allow the user to have a deeper integration with carbon. The reduction in method calls for the vector reader would also improve read time.

3. Add concurrent reading functionality to CarbonReader. This can be enabled by the user passing the number of splits required. If the user passes 2 as the split count for the reader, the user would be returned 2 CarbonReaders with an equal number of RecordReaders in each. The user can then run each CarbonReader instance in a separate thread to read the data concurrently.

The performance report would be shared soon. Any suggestion from the community is greatly appreciated.

Thanks
Kunal Kapoor
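As a rough illustration of point 3, the split could distribute one RecordReader per file round-robin across N readers, each driven by its own thread. This is a stdlib-only sketch; all names here (ConcurrentReadSketch, partition) are illustrative and not the actual SDK API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed split: distribute one-reader-per-file
// round-robin across N logical readers, each of which can run in its own thread.
public class ConcurrentReadSketch {

    // Partition a list of file names into numSplits groups of (almost) equal size.
    public static List<List<String>> partition(List<String> files, int numSplits) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            splits.get(i % numSplits).add(files.get(i));
        }
        return splits;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> files = Arrays.asList("part-0", "part-1", "part-2", "part-3");
        List<List<String>> splits = partition(files, 2);

        // One thread per split, mimicking one CarbonReader instance per thread.
        List<Thread> threads = new ArrayList<>();
        for (List<String> split : splits) {
            Thread t = new Thread(() -> {
                for (String file : split) {
                    // A real reader would iterate rows here via hasNext()/readNextRow().
                    System.out.println(Thread.currentThread().getName() + " reads " + file);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}
```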
In reply to this post by kunalkapoor
1. Some users want to use filters and plan to use them in the next few months. Besides avoiding loading the dataMap, we should also avoid deleting the dataMap and checking the schema folder in the SDK.

2. It's nice. When I call hasNext in the SDK, sometimes it needs a long time to read, and the default batch size is only 100 rows. What batch sizes will the columnar batch support? Can a columnar batch cover one whole block? Sometimes we want to read all the data.

3. It will improve performance on multi-core machines. Can we keep the sequence when we use multiple threads to read on different machines?
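On the batch-size concern in point 2 above, a configurable batch size (rather than a fixed 100 rows) could look like the following stdlib-only sketch. The names (BatchReaderSketch, readNextBatch) are illustrative assumptions, not the SDK's real batch API:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch: wrap a row iterator so callers pull configurable-size
// batches instead of single rows (the SDK's real batch API may differ).
public class BatchReaderSketch {

    // Drain up to batchSize rows from the iterator into one batch.
    public static <T> List<T> readNextBatch(Iterator<T> rows, int batchSize) {
        List<T> batch = new ArrayList<>();
        while (rows.hasNext() && batch.size() < batchSize) {
            batch.add(rows.next());
        }
        return batch;
    }

    public static void main(String[] args) {
        List<Integer> allRows = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            allRows.add(i);
        }
        Iterator<Integer> it = allRows.iterator();
        int batches = 0;
        while (it.hasNext()) {
            // With batch size 100, 250 rows yield batches of 100, 100 and 50.
            List<Integer> batch = readNextBatch(it, 100);
            batches++;
            System.out.println("batch " + batches + " has " + batch.size() + " rows");
        }
    }
}
```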
In reply to this post by kunalkapoor
A question here:

"""
3. Add concurrent reading functionality to Carbon Reader. This can be enabled by passing the number of splits required by the user. If the user passes 2 as the split for reader then the user would be returned 2 CarbonReaders with equal number of RecordReaders in each. The user can then run each CarbonReader instance in a separate thread to read the data concurrently.
"""
===

1. What is the relationship between `RecordReaders` and the number of `DataFiles`/`Blocklets`/`Pages`/`records`?

2. Will the returned CarbonReaders process almost the same number of `DataFiles`/`Blocklets`/`Pages`/`records`?

3. After the user gets 2 new CarbonReaders from the old CarbonReader, can the user just close the old CarbonReader immediately? What if the user doesn't close it and still uses the old CarbonReader alongside the new CarbonReaders? -- This question is about potential shared state if the RecordReaders from the old one are used directly.

After all, I think it would be better if the changes were in different PRs so that we can review them easily.
> After all, I think it would be better if the changes are in different PR so
> that we can review it easily.

https://github.com/apache/carbondata/pull/2850

> 1. What is the relationship between `RecordReaders` and the number of
> `DataFiles`/`Blocklets`/`Pages`/`records`?
> 2. Will the returned CarbonReaders process almost the same number of
> `DataFiles`/`Blocklets`/`Pages`/`records`?

The number of CarbonRecordReaders / RecordReaders in a CarbonReader is the same as the number of files it is reading / going to read. Please go through the PR to get more details on the implementation.

> 3. After the user gets 2 new CarbonReaders from the old CarbonReader, can
> the user just close the old CarbonReader immediately? What if the user doesn't
> close it and still uses the old CarbonReader alongside the new
> CarbonReaders? -- This question is about potential shared state if the
> RecordReaders from the old one are used directly.

The user does not have to close it explicitly, and it must not be closed: if it gets closed, the child CarbonReaders will not iterate over the files. This is taken care of internally inside split, such that the original CarbonReader will not be able to read the files.

---
Thanks
Naman Rastogi
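The split semantics described above (children take over the files; the parent can no longer read them) can be sketched with the standard library alone. The class and method names here are hypothetical, chosen only to illustrate the ownership transfer, not the SDK's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the described split semantics: split(n) hands the parent's
// per-file readers over to n children and empties the parent, so the
// parent can no longer read anything. Names are illustrative only.
public class SplitSemanticsSketch {
    private final List<String> fileReaders = new ArrayList<>();

    public SplitSemanticsSketch(List<String> fileReaders) {
        this.fileReaders.addAll(fileReaders);
    }

    public boolean hasNext() {
        return !fileReaders.isEmpty();
    }

    // Transfer ownership of the readers round-robin to n children.
    public List<SplitSemanticsSketch> split(int n) {
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < fileReaders.size(); i++) {
            groups.get(i % n).add(fileReaders.get(i));
        }
        fileReaders.clear(); // the original reader gives up its readers
        List<SplitSemanticsSketch> children = new ArrayList<>();
        for (List<String> g : groups) {
            children.add(new SplitSemanticsSketch(g));
        }
        return children;
    }
}
```

After split, the parent's hasNext() returns false while each child still sees its share of the files, which matches the behaviour described above without requiring the user to close the parent.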