Github user kunal642 commented on the issue:
https://github.com/apache/carbondata/pull/3046 @xuchuanyin The problem was that, when using a varchar column with email data, the keys of the dictionary map become very large. When fallback happens, the same data is kept in memory twice, which causes the application to throw an OOM exception. Since this is a minor version we do not want to expose this property to the user; we can take it up in the next major version. Also, Parquet has a size-based limitation mechanism which ensures the size does not grow beyond what the system can handle. This PR just adds a size-based limitation so that the map size can be controlled. ---
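To make the size-based limitation concrete, here is a minimal, hypothetical sketch (the class and method names are invented for illustration, not CarbonData's actual API) of a dictionary that tracks the accumulated byte size of its keys and signals fallback once a configured byte budget would be exceeded:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: bound the local dictionary by total key bytes
// instead of entry count, so huge varchar values (e.g. email strings)
// cannot blow up memory before fallback is triggered.
public class SizeBoundedDictionary {
  private final long maxBytes;
  private long currentBytes = 0;
  private final Map<String, Integer> dictionary = new HashMap<>();

  public SizeBoundedDictionary(long maxBytes) {
    this.maxBytes = maxBytes;
  }

  /** Returns the surrogate key, or -1 if adding the value would exceed the byte budget. */
  public int addIfFits(String value) {
    Integer existing = dictionary.get(value);
    if (existing != null) {
      return existing; // already in the dictionary, no extra memory needed
    }
    long valueBytes = value.getBytes(StandardCharsets.UTF_8).length;
    if (currentBytes + valueBytes > maxBytes) {
      return -1; // caller should trigger fallback to plain encoding
    }
    currentBytes += valueBytes;
    int key = dictionary.size();
    dictionary.put(value, key);
    return key;
  }
}
```

The point of the sketch is that the budget is charged only for new keys, so repeated values stay cheap while pathological high-cardinality columns hit the limit early.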
In reply to this post by qiuchenjian-2
Github user kunal642 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3046#discussion_r245555497

--- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
@@ -2076,4 +2076,15 @@ private CarbonCommonConstants() {
    */
   public static final String CARBON_QUERY_DATAMAP_BLOOM_CACHE_SIZE_DEFAULT_VAL = "512";

+  public static final String CARBON_LOCAL_DICTIONARY_MAX_THRESHOLD =
--- End diff --

changed to CARBON_LOCAL_DICTIONARY_MAX_SIZE_THRESHOLD

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3046 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2186/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3046 Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10442/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3046 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2402/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3046 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2188/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3046 Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10444/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3046 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2404/ ---
Github user kumarvishal09 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3046#discussion_r245630539

--- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java ---
@@ -1491,6 +1491,27 @@ private void validateSortMemorySpillPercentage() {
     }
   }

+  public int getMaxDictionaryThreshold() {
+    int localDictionaryMaxThreshold = Integer.parseInt(carbonProperties
+        .getProperty(CarbonCommonConstants.CARBON_LOCAL_DICTIONARY_MAX_SIZE_THRESHOLD,
+            CarbonCommonConstants.CARBON_LOCAL_DICTIONARY_MAX_SIZE_THRESHOLD_DEFAULT));
+    if (localDictionaryMaxThreshold
--- End diff --

add min check also

---
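As an illustration of the requested bounds check, here is a standalone sketch (the constant names and numeric bounds are invented, not CarbonData's real values) that validates the configured threshold against both a minimum and a maximum and falls back to the default otherwise:

```java
// Hypothetical sketch of a min/max-bounded property validator.
// Out-of-range or unparsable values are replaced by the default,
// which is the usual CarbonProperties-style behavior for bad config.
public class ThresholdValidator {
  static final int MIN_THRESHOLD = 1000;      // illustrative lower bound
  static final int MAX_THRESHOLD = 100000;    // illustrative upper bound
  static final int DEFAULT_THRESHOLD = 10000; // illustrative default

  public static int validate(String configured) {
    int value;
    try {
      value = Integer.parseInt(configured);
    } catch (NumberFormatException e) {
      return DEFAULT_THRESHOLD; // unparsable value: use the default
    }
    if (value < MIN_THRESHOLD || value > MAX_THRESHOLD) {
      return DEFAULT_THRESHOLD; // out of range: use the default
    }
    return value;
  }
}
```

The min check matters because a zero or negative threshold would silently disable the dictionary, which is a different failure mode from an oversized one.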
Github user kunal642 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3046#discussion_r245653091

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/DecoderBasedFallbackEncoder.java ---
@@ -57,10 +57,7 @@ public DecoderBasedFallbackEncoder(EncodedColumnPage encodedColumnPage, int page
     int pageSize = encodedColumnPage.getActualPage().getPageSize();
     int offset = 0;
-    int[] reverseInvertedIndex = new int[pageSize];
--- End diff --

In the No_Sort case, where there is no inverted index, this variable was being created unnecessarily.

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3046 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2204/ ---
Github user kunal642 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3046#discussion_r245877242

--- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java ---
@@ -1491,6 +1491,27 @@ private void validateSortMemorySpillPercentage() {
     }
   }

+  public int getMaxDictionaryThreshold() {
+    int localDictionaryMaxThreshold = Integer.parseInt(carbonProperties
+        .getProperty(CarbonCommonConstants.CARBON_LOCAL_DICTIONARY_MAX_SIZE_THRESHOLD,
+            CarbonCommonConstants.CARBON_LOCAL_DICTIONARY_MAX_SIZE_THRESHOLD_DEFAULT));
+    if (localDictionaryMaxThreshold
--- End diff --

ok

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3046 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2421/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3046 Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10461/ ---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3046#discussion_r246056127

--- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
@@ -2076,4 +2076,15 @@ private CarbonCommonConstants() {
    */
   public static final String CARBON_QUERY_DATAMAP_BLOOM_CACHE_SIZE_DEFAULT_VAL = "512";

+  public static final String CARBON_LOCAL_DICTIONARY_MAX_THRESHOLD =
--- End diff --

It still fails to make clear what kind of size this is supposed to be, since we have both a storage size and an entry count.

---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/3046 > This PR is just to add the size based limitation so that the map size can be controlled. @kunal642 Yeah, I noticed that. So my proposal is that we keep the changes minimal for now, reserving room for when we implement that feature (automatic size detection and fallback) later. ---
Github user kumarvishal09 commented on the issue:
https://github.com/apache/carbondata/pull/3046 @xuchuanyin In the near future we are planning to change the local dictionary threshold from count-based to size-based. A size-based threshold gives more control, and the current changes in this PR help toward that; later we just have to expose the table property in the CREATE TABLE command so the user can control the size threshold. Also, I didn't get the meaning of your comment: these changes are already minimal. ---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/3046 We do not need to expose this threshold to the user. Instead, carbondata can decide this itself.
Step 1: get the size of the non-dictionary-encoded page (say M) and the size of the dictionary-encoded page (say N).
Step 2: if M/N >= 1 (or M/N >= 0.9), fall back automatically.
Parquet (and maybe ORC) behaves like this. ---
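A minimal sketch of this ratio-based decision (the names are hypothetical, and the direction of the ratio is interpreted here as "fall back when dictionary encoding saves less than about 10% of the page size"):

```java
// Hypothetical sketch of an automatic fallback decision based on
// comparing the plain-encoded page size (M) against the
// dictionary-encoded page size (N), as proposed in the discussion.
public class FallbackDecision {
  /**
   * @param plainSizeBytes size of the non-dictionary-encoded page (M)
   * @param dictSizeBytes  size of the dictionary-encoded page (N)
   * @return true if the page should fall back to plain encoding
   */
  public static boolean shouldFallback(long plainSizeBytes, long dictSizeBytes) {
    if (plainSizeBytes <= 0) {
      return true; // degenerate page: nothing to gain from a dictionary
    }
    // Keep dictionary encoding only when it shrinks the page noticeably;
    // the ~0.9 cutoff is one of the thresholds mentioned above.
    return (double) dictSizeBytes / plainSizeBytes >= 0.9;
  }
}
```

Because both sizes are already known at encode time, this check needs no user-facing property at all, which is the point of the proposal.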