GitHub user xuchuanyin opened a pull request:
https://github.com/apache/carbondata/pull/2383

[CARBONDATA-2615][32K] Support page size less than 32000 in CarbondataV3

Since we support super long strings, a column page with 32000 rows can exceed 2GB when the strings are long enough, so we support pages with fewer than 32000 rows.

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

- [x] Any interfaces changed?
  `NO`
- [x] Any backward compatibility impacted?
  `NO`
- [x] Document update required?
  `NO`
- [x] Testing done
  Please provide details on
  - Whether new unit test cases have been added or why no new tests are required?
    `Tests added`
  - How it is tested? Please attach test report.
    `Tested in local`
  - Is it a performance related change? Please attach the performance test report.
  - Any additional information to help reviewers in testing this change.
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata 0620_long_string_decrease_pagesize

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2383.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2383

----
commit b689d66493521452ff9938415e0d0aa66b56c2c5
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-06-02T07:17:04Z

    Support string longer than 32000 characters

    Add a table property 'long_string_columns' in create table DDL that indicates those columns will contain more than 32000 characters.

    Internally in Carbondata,
    1. add a new datatype called `text` to represent the long string column
    2. add a new encoding called `DIRECT_COMPRESS_TEXT` to the text column page meta
    3. Use an integer (previously short) to store the length of bytes content.

commit f145c6c60238c400b5db6a6bf2696246b698154a
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-06-05T12:46:26Z

    rename datatype name from text to varchar

commit 4180f8118d1ff90205b0f1567bef2cdfee3a1b62
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-06-12T12:35:58Z

    Add 2GB constraint for one column page

commit 710845b155ed5b7a611a900c70b0d766d80ae48d
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-06-14T12:11:40Z

    update tests

commit 74106d2793ed97615a439576b1c16d34bfaa3ab7
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-06-19T07:49:57Z

    support write long string from dataframe

commit 7d4325aa31dccbe4f7858f39de3378eafff30016
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-06-19T09:21:04Z

    Support page size less than 32000 in CarbondataV3

    Since we support super long string, if it is long enough, a column page with 32000 rows will exceed 2GB, so we support a page less than 32000 rows.

----
---
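The first commit message above mentions storing the byte length in an integer instead of a short. The following is a minimal sketch of why that matters, not CarbonData's actual column page code: the class and method names are hypothetical, and the layout (a 4-byte length followed by the UTF-8 bytes) is an assumption made for illustration. A signed 2-byte length can describe at most 32767 bytes, while a 4-byte length can describe values far beyond that.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; names and layout are assumptions, not CarbonData internals.
final class LengthPrefixSketch {

  // Encode a value as [4-byte length][UTF-8 bytes].
  static byte[] encodeWithIntLength(String value) {
    byte[] content = value.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES + content.length);
    buffer.putInt(content.length);
    buffer.put(content);
    return buffer.array();
  }

  // Decode a value written by encodeWithIntLength.
  static String decodeWithIntLength(byte[] encoded) {
    ByteBuffer buffer = ByteBuffer.wrap(encoded);
    int length = buffer.getInt();
    byte[] content = new byte[length];
    buffer.get(content);
    return new String(content, StandardCharsets.UTF_8);
  }

  // True if the UTF-8 byte length would still fit a signed 2-byte length field.
  static boolean fitsInShortLength(String value) {
    return value.getBytes(StandardCharsets.UTF_8).length <= Short.MAX_VALUE;
  }

  public static void main(String[] args) {
    char[] chars = new char[100_000];
    Arrays.fill(chars, 'a');
    String longValue = new String(chars);

    byte[] encoded = encodeWithIntLength(longValue);
    System.out.println(fitsInShortLength(longValue));          // false
    System.out.println(decodeWithIntLength(encoded).length()); // 100000
  }
}
```

With a short length field, the 100000-byte value in `main` could not be described at all; with an int length field it round-trips unchanged.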
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2383
This PR depends on #2382
---
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2383
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6373/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2383
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5211/
---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2383
retest it please
---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2383
retest this please
---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2383
SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5321/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2383
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6380/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2383
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5215/
---
Github user kumarvishal09 commented on the issue:
https://github.com/apache/carbondata/pull/2383
@xuchuanyin I think it is better to restrict based on the number of bytes 67104 for each column value. The user may not know how many characters will be present, so it is hard for the user to configure the blocklet size.
---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2383
@kumarvishal09 I asked someone who has the long string requirement and got the response that the length of the string is about 100K. Since we don't want to change the internal implementation of the column page, decreasing the number of rows in a page may be the only way to solve the problem.
---
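The arithmetic behind this reply can be sketched as follows. This is a rough estimate under stated assumptions, not CarbonData code: the class and method names are hypothetical, "100K" is taken as 100,000 bytes per value, the 2GB limit is approximated by `Integer.MAX_VALUE` bytes, and per-value overhead such as length fields or compression is ignored.

```java
// Hypothetical estimate only; the constants and names are assumptions for illustration.
final class PageRowEstimate {

  // Approximation of the 2GB column page limit discussed in the thread.
  static final long PAGE_BYTE_LIMIT = Integer.MAX_VALUE;

  // Default number of rows per column page in V3, as stated in the thread.
  static final int DEFAULT_ROWS_PER_PAGE = 32000;

  // Raw size of one column page holding `rows` values of the given average size.
  static long pageBytes(int rows, long avgBytesPerValue) {
    return (long) rows * avgBytesPerValue;
  }

  // Largest row count, capped at the default, that keeps the page under the limit.
  static int maxRowsPerPage(long avgBytesPerValue) {
    if (avgBytesPerValue <= 0) {
      throw new IllegalArgumentException("avgBytesPerValue must be positive");
    }
    long rows = PAGE_BYTE_LIMIT / avgBytesPerValue;
    return (int) Math.min(rows, DEFAULT_ROWS_PER_PAGE);
  }

  public static void main(String[] args) {
    long avgBytes = 100_000L;
    System.out.println(pageBytes(DEFAULT_ROWS_PER_PAGE, avgBytes)); // 3200000000
    System.out.println(maxRowsPerPage(avgBytes));                   // 21474
  }
}
```

Under these assumptions, a default page of 32000 such values needs about 3.2 billion bytes, which is over the limit, while roughly 21474 rows would still fit.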
Github user kumarvishal09 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2383#discussion_r196487039
--- Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java ---
@@ -371,8 +371,13 @@ private void setWritingConfiguration() throws CarbonDataWriterException {
     this.pageSize = Integer.parseInt(CarbonProperties.getInstance()
         .getProperty(CarbonCommonConstants.BLOCKLET_SIZE, CarbonCommonConstants.BLOCKLET_SIZE_DEFAULT_VAL));
+    // support less than 32000 rows in one page, because we support super long string,
+    // if it is long enough, a clomun page with 32000 rows will exceed 2GB
     if (version == ColumnarFormatVersion.V3) {
-      this.pageSize = CarbonV3DataFormatConstants.NUMBER_OF_ROWS_PER_BLOCKLET_COLUMN_PAGE_DEFAULT;
+      this.pageSize =
--- End diff --
What is the default value for the page size?
---
Github user kumarvishal09 commented on the issue:
https://github.com/apache/carbondata/pull/2383
@xuchuanyin Then the number of rows will depend on the number of characters in the long string columns, right?
---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2383#discussion_r196631555
--- Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java ---
@@ -371,8 +371,13 @@ private void setWritingConfiguration() throws CarbonDataWriterException {
     this.pageSize = Integer.parseInt(CarbonProperties.getInstance()
         .getProperty(CarbonCommonConstants.BLOCKLET_SIZE, CarbonCommonConstants.BLOCKLET_SIZE_DEFAULT_VAL));
+    // support less than 32000 rows in one page, because we support super long string,
+    // if it is long enough, a clomun page with 32000 rows will exceed 2GB
     if (version == ColumnarFormatVersion.V3) {
-      this.pageSize = CarbonV3DataFormatConstants.NUMBER_OF_ROWS_PER_BLOCKLET_COLUMN_PAGE_DEFAULT;
+      this.pageSize =
--- End diff --
In V3, it is 32000 by default. Here we use min(32000, user_specified).
---
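The quoted diff is cut off right after `this.pageSize =`, so the actual assignment is not visible in this thread. As a standalone illustration of the min(32000, user_specified) rule described in the reply, a sketch might look like the following. The class and method names are hypothetical, and the fallback to the default for missing, non-numeric, or non-positive input is an assumption, not something the thread confirms.

```java
// Hypothetical sketch of the min(32000, user_specified) rule; not the PR's actual code.
final class PageSizeSketch {

  // Default number of rows per column page in V3, as stated in the thread.
  static final int V3_DEFAULT_ROWS_PER_PAGE = 32000;

  // Resolve the effective page size from a user-supplied property value.
  // Assumption: invalid or missing input falls back to the V3 default.
  static int resolvePageSize(String userSpecified) {
    if (userSpecified == null) {
      return V3_DEFAULT_ROWS_PER_PAGE;
    }
    try {
      int requested = Integer.parseInt(userSpecified.trim());
      if (requested <= 0) {
        return V3_DEFAULT_ROWS_PER_PAGE;
      }
      // The user can only lower the page size, never raise it above the default.
      return Math.min(V3_DEFAULT_ROWS_PER_PAGE, requested);
    } catch (NumberFormatException e) {
      return V3_DEFAULT_ROWS_PER_PAGE;
    }
  }

  public static void main(String[] args) {
    System.out.println(resolvePageSize("1000"));  // 1000
    System.out.println(resolvePageSize("64000")); // 32000
    System.out.println(resolvePageSize(null));    // 32000
  }
}
```

Capping at the default means existing configurations that request more than 32000 rows keep the previous V3 behavior.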
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2383
@kumarvishal09 If the string is too long, the user has to adjust the page size manually. We cannot do it dynamically for now.
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2383
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6399/
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2383
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5233/
---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2383
SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5344/
---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2383
Created a JIRA, CARBONDATA-2613, to do this automatically.
---
Github user kumarvishal09 commented on the issue:
https://github.com/apache/carbondata/pull/2383
@xuchuanyin Please rebase
---