Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1071/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/875/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9140/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1073/ --- |
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2814#discussion_r226567910

--- Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java ---
@@ -221,50 +224,106 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException {
   /**
    * Check if column page can be added more rows after adding this row to page.
+   * only few no-dictionary dimensions columns (string, varchar,
+   * complex columns) can grow huge in size.
    *
-   * A varchar column page uses SafeVarLengthColumnPage/UnsafeVarLengthColumnPage to store data
-   * and encoded using HighCardDictDimensionIndexCodec which will call getByteArrayPage() from
-   * column page and flatten into byte[] for compression.
-   * Limited by the index of array, we can only put number of Integer.MAX_VALUE bytes in a page.
    *
-   * Another limitation is from Compressor. Currently we use snappy as default compressor,
-   * and it will call MaxCompressedLength method to estimate the result size for preparing output.
-   * For safety, the estimate result is oversize: `32 + source_len + source_len/6`.
-   * So the maximum bytes to compress by snappy is (2GB-32)*6/7≈1.71GB.
-   *
-   * Size of a row does not exceed 2MB since UnsafeSortDataRows uses 2MB byte[] as rowBuffer.
-   * Such that we can stop adding more row here if any long string column reach this limit.
-   *
-   * If use unsafe column page, please ensure the memory configured is enough.
-   * @param row
-   * @return false if any varchar column page cannot add one more value(2MB)
+   * @param row carbonRow
+   * @return false if next rows can be added to same page.
+   *         true if next rows cannot be added to same page
    */
-  private boolean isVarcharColumnFull(CarbonRow row) {
-    //TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB
-    if (model.getVarcharDimIdxInNoDict().size() > 0) {
+  private boolean needToCutThePage(CarbonRow row) {
--- End diff --

I'm afraid that in common scenarios, even when we do not hit the page size problem and stay in the safe area, carbondata will still call this method to check the boundaries, which will degrade data loading performance. Is there a way to avoid the unnecessary checking here?

In my opinion, to determine the upper bound of the number of rows in a page, the default strategy is 'number based' (32000 as the upper bound). Now you are adding an additional strategy, 'capacity based' (xx MB as the upper bound). There can be multiple strategies per load: the default is `[number based]`, but the user can also configure `[number based, capacity based]`. Before loading, we can read the configured strategies and apply them while processing. If the strategy list is only `[number based]`, we never need to check the capacity, which avoids the problem mentioned above (a rough sketch of this idea follows below).

Note that we store the rowId of each page as a short, so the `number based` strategy is a default yet required strategy. By default, the `capacity based` strategy is not configured; the user can add it in:

1. TBLProperties when creating a table
2. Options when loading data
3. Options in SdkWriter
4. Options when creating a table using the spark file format
5. Options in DataFrameWriter

In any case, we should not configure it as a system property, because only a few tables use this feature, yet a system property would decrease loading performance for all tables.
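A minimal sketch of the strategy-list idea above, assuming a hypothetical `PageCutStrategy` interface with `NumberBasedStrategy` and `CapacityBasedStrategy` implementations (none of these names exist in CarbonData; the 32000-row default and the short rowId constraint come from the discussion above):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed strategy list; names are illustrative only.
interface PageCutStrategy {
  // Returns true if the current page must be cut before accepting more rows.
  boolean needCut(int rowCountInPage, long pageSizeInBytes);
}

// Required default: rowId is stored as a short, so a page holds at most 32000 rows.
class NumberBasedStrategy implements PageCutStrategy {
  private final int maxRowsPerPage;
  NumberBasedStrategy(int maxRowsPerPage) {
    this.maxRowsPerPage = maxRowsPerPage;
  }
  public boolean needCut(int rowCountInPage, long pageSizeInBytes) {
    return rowCountInPage >= maxRowsPerPage;
  }
}

// Optional strategy, added only when a page size (in MB) is configured for the
// table or load; when absent, no per-row capacity accounting is done at all.
class CapacityBasedStrategy implements PageCutStrategy {
  private final long maxPageSizeInBytes;
  CapacityBasedStrategy(long maxPageSizeInMB) {
    this.maxPageSizeInBytes = maxPageSizeInMB << 20;
  }
  public boolean needCut(int rowCountInPage, long pageSizeInBytes) {
    return pageSizeInBytes >= maxPageSizeInBytes;
  }
}

// Built once per load from table properties / load options, then consulted per row.
class PageCutDecider {
  private final List<PageCutStrategy> strategies = new ArrayList<>();

  PageCutDecider(Long configuredPageSizeInMB) {
    strategies.add(new NumberBasedStrategy(32000));
    if (configuredPageSizeInMB != null) {
      strategies.add(new CapacityBasedStrategy(configuredPageSizeInMB));
    }
  }

  boolean needToCutThePage(int rowCountInPage, long pageSizeInBytes) {
    for (PageCutStrategy strategy : strategies) {
      if (strategy.needCut(rowCountInPage, pageSizeInBytes)) {
        return true;
      }
    }
    return false;
  }
}
```

With this shape, a load whose strategy list is only `[number based]` never touches any byte accounting, which addresses the performance concern above.

---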
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/937/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1143/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9201/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/941/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9205/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1147/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/976/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9242/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1189/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/979/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/980/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9246/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1193/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2814 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/982/ --- |
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2814#discussion_r227667228

--- Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java ---
@@ -221,50 +224,106 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException {
   /**
    * Check if column page can be added more rows after adding this row to page.
+   * only few no-dictionary dimensions columns (string, varchar,
+   * complex columns) can grow huge in size.
    *
-   * A varchar column page uses SafeVarLengthColumnPage/UnsafeVarLengthColumnPage to store data
-   * and encoded using HighCardDictDimensionIndexCodec which will call getByteArrayPage() from
-   * column page and flatten into byte[] for compression.
-   * Limited by the index of array, we can only put number of Integer.MAX_VALUE bytes in a page.
    *
-   * Another limitation is from Compressor. Currently we use snappy as default compressor,
-   * and it will call MaxCompressedLength method to estimate the result size for preparing output.
-   * For safety, the estimate result is oversize: `32 + source_len + source_len/6`.
-   * So the maximum bytes to compress by snappy is (2GB-32)*6/7≈1.71GB.
-   *
-   * Size of a row does not exceed 2MB since UnsafeSortDataRows uses 2MB byte[] as rowBuffer.
-   * Such that we can stop adding more row here if any long string column reach this limit.
-   *
-   * If use unsafe column page, please ensure the memory configured is enough.
-   * @param row
-   * @return false if any varchar column page cannot add one more value(2MB)
+   * @param row carbonRow
+   * @return false if next rows can be added to same page.
+   *         true if next rows cannot be added to same page
    */
-  private boolean isVarcharColumnFull(CarbonRow row) {
-    //TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB
-    if (model.getVarcharDimIdxInNoDict().size() > 0) {
+  private boolean needToCutThePage(CarbonRow row) {
--- End diff --

@xuchuanyin:
a. Yes, made this a table property.
b. We need to keep this validation for string columns as well (they can grow up to 1.8 GB); if the page fits into the cache, it can give better read performance.
c. There is no impact on load performance from checking this on each row, because no extra computation happens: just a few checks per row based on the data type (see the sketch below).

TODO: find a default value and set it if the page size is not configured. Working on it; will handle it in the same PR.

@ravipesala, @xuchuanyin: please check.
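To illustrate why the per-row check stays cheap, here is a rough sketch that keeps one running byte counter per large (string/varchar/complex) no-dictionary column and compares it against the configured limit and the snappy-safe bound quoted in the diff above; all names here are hypothetical, not the actual CarbonFactDataHandlerColumnar code:

```java
// Hypothetical sketch; not the actual CarbonFactDataHandlerColumnar implementation.
public class PageSizeCheckSketch {

  // Snappy's maxCompressedLength estimate is `32 + len + len/6`, and the compressed
  // output must still fit in a byte[] (Integer.MAX_VALUE bytes), so the safe amount
  // of raw bytes per column page is roughly (2GB - 32) * 6 / 7, about 1.71 GB.
  static final long SNAPPY_SAFE_PAGE_BYTES = ((long) Integer.MAX_VALUE - 32) * 6 / 7;

  // Running byte count per large (string/varchar/complex) column in the current page;
  // updated once per row, so the check is just an addition and a comparison.
  private final long[] columnPageBytes;

  public PageSizeCheckSketch(int largeColumnCount) {
    this.columnPageBytes = new long[largeColumnCount];
  }

  /** Returns true if, after adding this row, any large column page reaches the limit. */
  public boolean needToCutThePage(byte[][] largeColumnValues, long configuredPageSizeInBytes) {
    long limit = Math.min(configuredPageSizeInBytes, SNAPPY_SAFE_PAGE_BYTES);
    for (int i = 0; i < largeColumnValues.length; i++) {
      columnPageBytes[i] += largeColumnValues[i].length;
      if (columnPageBytes[i] >= limit) {
        return true;
      }
    }
    return false;
  }
}
```

---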