Hi all,
To improve in-memory processing of carbondata pages, I am proposing a configurable page size in MB (set via a carbon property). The detailed background, problem and solution are described in the design document, which is attached to the JIRA below:

https://issues.apache.org/jira/browse/CARBONDATA-3001

Please go through the document in the JIRA and let me know if I can go ahead with the implementation.

Thanks,
Ajantha

Hi, ajantha.
I just went through your PR, and I think we may need to rethink this feature, especially its impact. I left a comment under your PR and am pasting it here for further discussion in the community.

I'm afraid that in common scenarios, even when we do not hit the page size problem and stay in the safe area, carbondata will still call this method to check the boundaries, which will degrade data loading performance. So is there a way to avoid the unnecessary checking here?

In my opinion, for determining the upper bound of the number of rows in a page, the default strategy is 'number based' (32000 as the upper bound). Now you are adding another strategy, 'capacity based' (xxMB as the upper bound).

There can be multiple strategies per load: the default is [number based], but the user can also configure [number based, capacity based]. So before loading, we can get the strategies and apply them while processing. If the strategies are only [number based], we do not need to check the capacity, which avoids the problem I mentioned above.

Note that we store the rowId in each page as a short, which means the number based strategy is both the default and a required strategy.

Also, by default, the capacity based strategy is not configured. The user can add this strategy through:
1. TBLProperties in creating table
2. Options in loading data
3. Options in SdkWriter
4. Options in creating table using spark file format
5. Options in DataFrameWriter

In any case, we should not configure it as a system property: only a few tables use this feature, but setting it as a system property would decrease loading performance for all tables.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

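To make the strategy idea above concrete, here is a minimal Java sketch. It is not CarbonData code: every class and method name in it (PageBoundaryStrategy, RowCountStrategy, CapacityStrategy, PageBoundaryChecker) is hypothetical, and only the 32000-row limit comes from the discussion. It illustrates how the list of strategies could be resolved once before loading, so that a load configured with [number based] alone never needs to track page size.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical interface: decides whether the current page must be closed. */
interface PageBoundaryStrategy {
  boolean isPageFull(int rowCount, long pageSizeInBytes);
}

/** Number based strategy; always present because rowId is stored as a short. */
final class RowCountStrategy implements PageBoundaryStrategy {
  static final int MAX_ROWS_PER_PAGE = 32000;

  @Override
  public boolean isPageFull(int rowCount, long pageSizeInBytes) {
    return rowCount >= MAX_ROWS_PER_PAGE;
  }
}

/** Capacity based strategy; only created when the user configures a limit. */
final class CapacityStrategy implements PageBoundaryStrategy {
  private final long maxBytes;

  CapacityStrategy(int maxPageSizeInMb) {
    this.maxBytes = maxPageSizeInMb * 1024L * 1024L;
  }

  @Override
  public boolean isPageFull(int rowCount, long pageSizeInBytes) {
    return pageSizeInBytes >= maxBytes;
  }
}

/** Resolves the strategies once per load and applies them per row. */
final class PageBoundaryChecker {
  private final List<PageBoundaryStrategy> strategies = new ArrayList<>();
  private final boolean sizeTrackingNeeded;

  /** @param maxPageSizeInMb null when the capacity based strategy is not configured */
  PageBoundaryChecker(Integer maxPageSizeInMb) {
    strategies.add(new RowCountStrategy());
    sizeTrackingNeeded = maxPageSizeInMb != null;
    if (sizeTrackingNeeded) {
      strategies.add(new CapacityStrategy(maxPageSizeInMb));
    }
  }

  /** The loader can skip all per-row size computation when this is false. */
  boolean needsSizeTracking() {
    return sizeTrackingNeeded;
  }

  boolean isPageFull(int rowCount, long pageSizeInBytes) {
    for (PageBoundaryStrategy strategy : strategies) {
      if (strategy.isPageFull(rowCount, pageSizeInBytes)) {
        return true;
      }
    }
    return false;
  }
}

final class PageBoundaryDemo {
  public static void main(String[] args) {
    PageBoundaryChecker defaultLoad = new PageBoundaryChecker(null);
    PageBoundaryChecker cappedLoad = new PageBoundaryChecker(1);
    System.out.println(defaultLoad.needsSizeTracking());
    System.out.println(cappedLoad.isPageFull(10, 2L * 1024 * 1024));
  }
}
```

With this shape, the per-row size calculation can be guarded by needsSizeTracking(), so loads that rely on the number based strategy alone pay no extra cost.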
Hi xuchuanyin,
Thanks for your inputs. Please find some details below.

1. There was already a size-based validation in the code for each row processed, in the 'isVarCharColumnFul()' method. It checked only varchar columns. Now I am checking complex as well as string columns.
2. The logic for dividing the complex byte array into a flat byte array is taken from TablePage.addComplexColumn(). This computation will be moved to my new method and removed from there, so there is no extra computation.
3. Yes, I will make it a create table property instead of a carbon property.

Also, I will measure the load performance once the changes are made.

Thanks,
Ajantha

(In reply to xuchuanyin's message of Fri, Oct 19, 2018 at 1:56 PM.)

OK. In any case, please take care of the loading performance. The validation only needs to run for fields that may cross the boundary (e.g. varchar and complex); for ordinary fields, just skip the validation.
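
The suggestion to validate only the fields that may cross the boundary could look roughly like the Java sketch below. All names (ColumnKind, SelectiveSizeValidator) and the representation of a row as byte[][] are assumptions made for illustration, not CarbonData's actual classes. The indexes of the variable-length columns are computed once, so fixed-width columns are never touched per row.

```java
import java.util.stream.IntStream;

/** Hypothetical column classification used only for this sketch. */
enum ColumnKind { FIXED_WIDTH, VARCHAR, COMPLEX }

/** Tracks page size using only the columns that can grow without bound. */
final class SelectiveSizeValidator {
  private final int[] checkedColumns;
  private final long maxBytes;
  private long currentBytes;

  SelectiveSizeValidator(ColumnKind[] schema, long maxBytes) {
    // Computed once per load: indexes of the varchar and complex columns.
    this.checkedColumns = IntStream.range(0, schema.length)
        .filter(i -> schema[i] != ColumnKind.FIXED_WIDTH)
        .toArray();
    this.maxBytes = maxBytes;
  }

  int checkedColumnCount() {
    return checkedColumns.length;
  }

  /** Adds one row and reports whether the page has reached the size limit. */
  boolean addRowAndCheck(byte[][] row) {
    for (int index : checkedColumns) {
      currentBytes += row[index].length;
    }
    return currentBytes >= maxBytes;
  }

  /** Called when a new page starts. */
  void reset() {
    currentBytes = 0;
  }
}
```

If a table has no varchar or complex columns, checkedColumns is empty and the per-row loop does no work.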