Propose configurable page size in MB (via carbon property)


Propose configurable page size in MB (via carbon property)

Ajantha Bhat
Hi all,
For better in-memory processing of CarbonData pages, I am proposing a
configurable page size in MB (via a carbon property).

The detailed background, problem, and solution are described in the design
document, which is attached to the JIRA below.
https://issues.apache.org/jira/browse/CARBONDATA-3001

Please go through the document in the JIRA and let me know if I can go ahead
with the implementation.

Thanks,
Ajantha

Re: Propose configurable page size in MB (via carbon property)

xuchuanyin
Hi Ajantha,

I just went through your PR and I think we may need to rethink this feature,
especially its impact. I left a comment under your PR and will paste it here
for further discussion in the community.

I'm afraid that in common scenarios, even when we do not hit the page size
problem and stay in the safe area, CarbonData will still call this method to
check the boundaries, which will degrade data loading performance.
So is there a way to avoid the unnecessary check here?

In my opinion, to determine the upper bound on the number of rows in a page,
the default strategy is 'number based' (32000 rows as the upper bound). Now you
are adding an additional strategy, 'capacity based' (xx MB as the upper bound).

There can be multiple strategies per load: the default is [number based], but
the user can also configure [number based, capacity based]. Before loading, we
can get the strategies and apply them while processing. If the configured
strategies are only [number based], we do not need to check the capacity, thus
avoiding the problem I mentioned above.
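
As a rough illustration of this idea (not actual CarbonData code), the
page-full check could be composed from the configured strategies so that the
capacity check only runs when the capacity based strategy has been enabled;
all interface and class names below are hypothetical:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a pluggable "page full" check.
interface PageFullStrategy {
  // Returns true when the current page should be cut and a new one started.
  boolean isPageFull(int rowCount, long pageSizeInBytes);
}

// Always-on default: rowId is stored as a short, so at most 32000 rows per page.
class NumberBasedStrategy implements PageFullStrategy {
  private static final int MAX_ROWS_PER_PAGE = 32000;
  public boolean isPageFull(int rowCount, long pageSizeInBytes) {
    return rowCount >= MAX_ROWS_PER_PAGE;
  }
}

// Optional: cut the page once it reaches the configured size in MB.
class CapacityBasedStrategy implements PageFullStrategy {
  private final long maxBytes;
  CapacityBasedStrategy(int pageSizeInMb) {
    this.maxBytes = pageSizeInMb * 1024L * 1024L;
  }
  public boolean isPageFull(int rowCount, long pageSizeInBytes) {
    return pageSizeInBytes >= maxBytes;
  }
}

class PageCutter {
  private final List<PageFullStrategy> strategies = new ArrayList<>();

  // The number based strategy is always present; capacity based only if configured.
  PageCutter(Integer pageSizeInMbOrNull) {
    strategies.add(new NumberBasedStrategy());
    if (pageSizeInMbOrNull != null) {
      strategies.add(new CapacityBasedStrategy(pageSizeInMbOrNull));
    }
  }

  boolean shouldCutPage(int rowCount, long pageSizeInBytes) {
    for (PageFullStrategy strategy : strategies) {
      if (strategy.isPageFull(rowCount, pageSizeInBytes)) {
        return true;
      }
    }
    return false;
  }
}

With only the number based strategy configured, the loop degenerates to the
existing row-count comparison, so ordinary loads pay essentially nothing extra.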

Note that we store the rowId in each page as a short, which means the number
based strategy is the default and is always required.

Also, by default, the capacity based strategy is not configured. The user can
enable it in any of the following (see the sketch after this list):
1. TBLProperties in creating table
2. Options in loading data
3. Options in SdkWriter
4. Options in creating table using spark file format
5. Options in DataFrameWriter
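
For example, enabling it at table level (item 1) and at write time (item 5)
might look like the sketch below. The key 'page_size_inmb' is only a
placeholder, since the exact property/option name had not been finalized in
this discussion, and the SQL syntax and data source name may differ between
CarbonData versions:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CapacityBasedPageSizeExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("capacity-based-page-size-example")
        .getOrCreate();

    // Table level: set the (hypothetical) key in TBLPROPERTIES at create time.
    spark.sql(
        "CREATE TABLE sales_carbon (id INT, description STRING) "
            + "STORED AS carbondata "
            + "TBLPROPERTIES ('page_size_inmb'='1')");

    // Writer level: pass the same (hypothetical) key as a DataFrameWriter option.
    Dataset<Row> df = spark.read().parquet("/tmp/sales_input");
    df.write()
        .format("carbondata")
        .option("tableName", "sales_carbon_df")
        .option("page_size_inmb", "1")
        .mode(SaveMode.Overwrite)
        .save();
  }
}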

By all means, we should not configure it as a system property: only a few
tables use this feature, yet a system-level setting would decrease loading
performance for all tables.




Re: Propose configurable page size in MB (via carbon property)

Ajantha Bhat
Hi xuchuanyin,

Thanks for your inputs. Please find some details below.

1. There was already a size-based validation in the code for each row
processed, in the 'isVarCharColumnFul()' method. It was checking only varchar
columns; now I am checking complex and string columns as well (see the sketch
after this list).

2. The logic for dividing the complex byte array into a flat byte array is
taken from TablePage.addComplexColumn(). This computation will be moved into
my new method and will no longer be done in addComplexColumn(), so there is no
extra computation.

3. Yes, I will make it a create-table property instead of a carbon (system)
property. I will also measure the load performance once the changes are made.
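
To make point 1 concrete, below is a minimal sketch of the kind of per-row
check being discussed. It only accumulates the encoded byte size of the
variable-length columns (varchar, string, complex) and ignores fixed-width
columns; the class and method names are hypothetical and not taken from the
actual patch:

// Hypothetical sketch: track the running size of variable-length columns in the
// current page and signal a page cut when the configured limit (in MB) is reached.
class VariableLengthPageSizeTracker {
  private final long pageSizeLimitInBytes;
  private long currentPageBytes = 0;

  VariableLengthPageSizeTracker(int pageSizeInMb) {
    this.pageSizeLimitInBytes = pageSizeInMb * 1024L * 1024L;
  }

  // Called once per row with the encoded sizes of that row's varchar, string
  // and (flattened) complex column values. Fixed-width columns are not counted,
  // since they cannot blow up the page size.
  boolean addRowAndCheckFull(long varcharBytes, long stringBytes, long complexFlatBytes) {
    currentPageBytes += varcharBytes + stringBytes + complexFlatBytes;
    return currentPageBytes >= pageSizeLimitInBytes;
  }

  // Reset when the page is cut and a new page starts.
  void onPageCut() {
    currentPageBytes = 0;
  }
}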

Thanks,
Ajantha



Re: Propose configurable page size in MB (via carbon property)

xuchuanyin
OK. Anyway, please take care of the loading performance. The validation only
needs to be performed for fields that may cross the boundary (e.g. varchar and
complex); for ordinary fields, just skip the validation.


