
Re: Propose configurable page size in MB (via carbon property)

Posted by xuchuanyin on Oct 19, 2018; 8:26am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Propose-configurable-page-size-in-MB-via-carbon-property-tp64889p65461.html

Hi, ajantha.

I just went through your PR and think we may need to rethink this feature,
especially its impact. I left a comment under your PR and will paste it
here for further discussion in the community.

I'm afraid that in common scenarios, even when we do not hit the page size
problem and stay in the safe area, carbondata will still call this method
to check the boundaries, which will decrease data loading performance.
So is there a way to avoid the unnecessary checking here?

In my opinion, the default strategy for determining the upper bound of the
number of rows in a page is 'number based' (32000 as the upper bound). Now
you are adding a second strategy, 'capacity based' (xxMB as the upper
bound).

There can be multiple strategies per load. The default is [number based],
but the user can also configure [number based, capacity based]. So before
loading, we can get the strategies and apply them while processing.
If the strategies are only [number based], we do not need to check the
capacity, thus avoiding the problem I mentioned above.
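
To make the idea concrete, here is a minimal sketch in Java. All class and
method names below are hypothetical and are not existing CarbonData APIs. I
also assume that a value <= 0 means the capacity based strategy is not
configured. The strategy list is built once before loading, so a default
load only ever runs the row count check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; none of these are existing CarbonData classes.
interface PageBoundaryStrategy {
    // Returns true when the current page should be closed.
    boolean isPageFull(int rowCount, long pageSizeInBytes);
}

final class NumberBasedStrategy implements PageBoundaryStrategy {
    private final int maxRows;

    NumberBasedStrategy(int maxRows) {
        this.maxRows = maxRows;
    }

    @Override
    public boolean isPageFull(int rowCount, long pageSizeInBytes) {
        return rowCount >= maxRows;
    }
}

final class CapacityBasedStrategy implements PageBoundaryStrategy {
    private final long maxBytes;

    CapacityBasedStrategy(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    @Override
    public boolean isPageFull(int rowCount, long pageSizeInBytes) {
        return pageSizeInBytes >= maxBytes;
    }
}

final class PageCutter {
    // 32000 is the default row limit mentioned in this thread.
    static final int DEFAULT_MAX_ROWS = 32000;

    private final List<PageBoundaryStrategy> strategies = new ArrayList<>();

    // Assumption: maxPageSizeInMb <= 0 means "capacity based" is not configured.
    PageCutter(int maxPageSizeInMb) {
        // The number based strategy is always present.
        strategies.add(new NumberBasedStrategy(DEFAULT_MAX_ROWS));
        if (maxPageSizeInMb > 0) {
            strategies.add(new CapacityBasedStrategy(maxPageSizeInMb * 1024L * 1024L));
        }
    }

    int strategyCount() {
        return strategies.size();
    }

    // Lets the loader skip per-row size accounting when no capacity limit is set.
    boolean tracksCapacity() {
        return strategies.size() > 1;
    }

    boolean isPageFull(int rowCount, long pageSizeInBytes) {
        for (PageBoundaryStrategy strategy : strategies) {
            if (strategy.isPageFull(rowCount, pageSizeInBytes)) {
                return true;
            }
        }
        return false;
    }
}
```

With new PageCutter(0), the list holds only the number based strategy and
tracksCapacity() returns false, so the loader does not need to compute page
sizes at all. With new PageCutter(1), a page is cut at 32000 rows or 1 MB,
whichever is reached first.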

Note that we store the rowId in each page as a short, which means that the
number based strategy is the default and is also always required.

Also, by default, the capacity based strategy is not configured. To enable
this strategy, the user can add it in:
1. TBLProperties in creating table
2. Options in loading data
3. Options in SdkWriter
4. Options in creating table using spark file format
5. Options in DataFrameWriter

In any case, we should not configure it as a system property, because only
a few tables use this feature, yet setting it as a system property would
decrease loading performance for the other tables as well.


