GitHub user manishgupta88 opened a pull request:
https://github.com/apache/carbondata/pull/1077

[CARBONDATA-1213] Removed rowCountPercentage check and fixed IUD data load issue

Problems:
1. The row count percentage check is not required alongside the high cardinality threshold check.
2. IUD returns incorrect results after an update on a high cardinality column.

Analysis:
1. Even when a column is identified as a high cardinality column, it is not converted to a no-dictionary column because of an additional parameter check called rowCountPercentage. The default value of rowCountPercentage is 80%. As a result, a column identified as high cardinality is still treated as a dictionary column if its cardinality is less than 80% of the total number of rows. This can still lead to executor lost failures due to memory constraints.
2. RLE on a column is not set correctly. Due to incorrect code design, whether RLE is applicable to a column is decided in a different part of the code from the one that actually applies RLE to the column. Because of this, the footer is filled with incorrect RLE information and the query fails.

Fix:
1. Remove the unwanted rowCountPercentage check.
2. Decide RLE applicability for a column from a common place in the code.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishgupta88/incubator-carbondata high_cardinlaity_identification_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1077.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1077

----
commit 9c1291dc84b8fc4247a9d6e32d4482685d40325a
Author: manishgupta88 <[hidden email]>
Date: 2017-06-22T09:07:13Z
----

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA.
---
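The first analysis point can be illustrated with a minimal Java sketch. It is not the CarbonData implementation: the class name, method names, and the threshold value are assumptions made for illustration, and only the 80% default comes from the description above.

```java
// Hypothetical sketch of the high cardinality decision before and after the fix.
class HighCardinalityCheck {
    // Illustrative value only; the thread does not state the actual threshold.
    static final long THRESHOLD = 1_000_000L;
    // Default value stated in the pull request description.
    static final double ROW_COUNT_PERCENTAGE = 80.0;

    // Before the fix: both the threshold check and the row share check must pass.
    static boolean isNoDictionaryBeforeFix(long cardinality, long totalRows) {
        boolean aboveThreshold = cardinality > THRESHOLD;
        boolean aboveRowShare = cardinality >= totalRows * ROW_COUNT_PERCENTAGE / 100.0;
        return aboveThreshold && aboveRowShare;
    }

    // After the fix: the threshold check alone decides.
    static boolean isNoDictionaryAfterFix(long cardinality) {
        return cardinality > THRESHOLD;
    }

    public static void main(String[] args) {
        long totalRows = 10_000_000L;
        long cardinality = 5_000_000L; // above the threshold, but only 50% of the rows
        System.out.println(isNoDictionaryBeforeFix(cardinality, totalRows)); // false
        System.out.println(isNoDictionaryAfterFix(cardinality));             // true
    }
}
```

With these assumed numbers, the column stays a dictionary column before the fix even though it holds five million distinct values, which is the memory risk the description refers to.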
Github user asfgit commented on the issue:
https://github.com/apache/carbondata/pull/1077 Can one of the admins verify this patch?
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1077 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/2657/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1077 Build Failed with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/86/
Github user asfgit commented on the issue:
https://github.com/apache/carbondata/pull/1077

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/carbondata-pr-spark-1.6/572/

Failed Tests: 1

carbondata-pr-spark-1.6/org.apache.carbondata:carbondata-spark-common-test: 1
(https://builds.apache.org/job/carbondata-pr-spark-1.6/572/org.apache.carbondata$carbondata-spark-common-test/testReport)

  * org.apache.carbondata.spark.testsuite.dataload.TestDataLoadWithColumnsMoreThanSchema.test for duplicate column name in the Fileheader options in load command
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1077 Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/87/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1077 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/2658/
Github user asfgit commented on the issue:
https://github.com/apache/carbondata/pull/1077 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/carbondata-pr-spark-1.6/575/
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1077#discussion_r123677612

    --- Diff: processing/src/main/java/org/apache/carbondata/processing/store/TablePageEncoder.java ---
    @@ -149,29 +151,26 @@ private void encodeAndCompressDimensions(TablePage tablePage, EncodedData encode
           switch (dimensionSpec.getType(i)) {
             case GLOBAL_DICTIONARY:
               // dictionary dimension
    -          indexStorages[indexStorageOffset] =
    -              encodeAndCompressDictDimension(
    -                  tablePage.getDictDimensionPage()[++dictionaryColumnCount].getByteArrayPage(),
    -                  isSortColumn,
    -                  isUseInvertedIndex[i] & isSortColumn);
    +          indexStorages[indexStorageOffset] = encodeAndCompressDictDimension(
    +              tablePage.getDictDimensionPage()[++dictionaryColumnCount].getByteArrayPage(),
    +              isSortColumn, isUseInvertedIndex[i] & isSortColumn,
    +              CarbonDataProcessorUtil.isRleApplicableForColumn(DimensionType.GLOBAL_DICTIONARY));
               flattened = ByteUtil.flatten(indexStorages[indexStorageOffset].getDataPage());
               break;
             case DIRECT_DICTIONARY:
               // timestamp and date column
    -          indexStorages[indexStorageOffset] =
    -              encodeAndCompressDirectDictDimension(
    -                  tablePage.getDictDimensionPage()[++dictionaryColumnCount].getByteArrayPage(),
    -                  isSortColumn,
    -                  isUseInvertedIndex[i] & isSortColumn);
    +          indexStorages[indexStorageOffset] = encodeAndCompressDirectDictDimension(
    +              tablePage.getDictDimensionPage()[++dictionaryColumnCount].getByteArrayPage(),
    +              isSortColumn, isUseInvertedIndex[i] & isSortColumn,
    +              CarbonDataProcessorUtil.isRleApplicableForColumn(DimensionType.DIRECT_DICTIONARY));
    --- End diff --

Pass `dimensionSpec.getType(i)` to `CarbonDataProcessorUtil.isRleApplicableForColumn`.
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1077#discussion_r123677638

    --- Diff: processing/src/main/java/org/apache/carbondata/processing/store/TablePageEncoder.java ---
    @@ -149,29 +151,26 @@ private void encodeAndCompressDimensions(TablePage tablePage, EncodedData encode
           switch (dimensionSpec.getType(i)) {
             case GLOBAL_DICTIONARY:
               // dictionary dimension
    -          indexStorages[indexStorageOffset] =
    -              encodeAndCompressDictDimension(
    -                  tablePage.getDictDimensionPage()[++dictionaryColumnCount].getByteArrayPage(),
    -                  isSortColumn,
    -                  isUseInvertedIndex[i] & isSortColumn);
    +          indexStorages[indexStorageOffset] = encodeAndCompressDictDimension(
    +              tablePage.getDictDimensionPage()[++dictionaryColumnCount].getByteArrayPage(),
    +              isSortColumn, isUseInvertedIndex[i] & isSortColumn,
    +              CarbonDataProcessorUtil.isRleApplicableForColumn(DimensionType.GLOBAL_DICTIONARY));
               flattened = ByteUtil.flatten(indexStorages[indexStorageOffset].getDataPage());
               break;
             case DIRECT_DICTIONARY:
               // timestamp and date column
    -          indexStorages[indexStorageOffset] =
    -              encodeAndCompressDirectDictDimension(
    -                  tablePage.getDictDimensionPage()[++dictionaryColumnCount].getByteArrayPage(),
    -                  isSortColumn,
    -                  isUseInvertedIndex[i] & isSortColumn);
    +          indexStorages[indexStorageOffset] = encodeAndCompressDirectDictDimension(
    +              tablePage.getDictDimensionPage()[++dictionaryColumnCount].getByteArrayPage(),
    +              isSortColumn, isUseInvertedIndex[i] & isSortColumn,
    +              CarbonDataProcessorUtil.isRleApplicableForColumn(DimensionType.DIRECT_DICTIONARY));
               flattened = ByteUtil.flatten(indexStorages[indexStorageOffset].getDataPage());
               break;
             case PLAIN_VALUE:
               // high cardinality dimension, encoded as plain string
    -          indexStorages[indexStorageOffset] =
    -              encodeAndCompressNoDictDimension(
    -                  tablePage.getNoDictDimensionPage()[++noDictionaryColumnCount].getByteArrayPage(),
    -                  isSortColumn,
    -                  isUseInvertedIndex[i] & isSortColumn);
    +          indexStorages[indexStorageOffset] = encodeAndCompressNoDictDimension(
    +              tablePage.getNoDictDimensionPage()[++noDictionaryColumnCount].getByteArrayPage(),
    +              isSortColumn, isUseInvertedIndex[i] & isSortColumn,
    +              CarbonDataProcessorUtil.isRleApplicableForColumn(DimensionType.PLAIN_VALUE));
    --- End diff --

Pass `dimensionSpec.getType(i)` to `CarbonDataProcessorUtil.isRleApplicableForColumn`.
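The review suggestion and the "common place" fix can be sketched as follows. This is a simplified, hypothetical model rather than the CarbonData code: the `DimensionType` names come from the diff, but the rule inside `isRleApplicableForColumn` (only global dictionary columns use RLE) and the `encode` and `describeInFooter` stand-ins are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one helper decides RLE applicability for every caller.
class RleDecisionSketch {
    // Type names taken from the diff in this review comment.
    enum DimensionType { GLOBAL_DICTIONARY, DIRECT_DICTIONARY, PLAIN_VALUE }

    // Assumed rule, for illustration only: only global dictionary columns use RLE.
    static boolean isRleApplicableForColumn(DimensionType type) {
        return type == DimensionType.GLOBAL_DICTIONARY;
    }

    // Stand-in for the encoder: records whether RLE is applied to each column.
    static List<Boolean> encode(DimensionType[] columnTypes) {
        List<Boolean> applied = new ArrayList<>();
        for (DimensionType type : columnTypes) {
            // The type of the current column is passed, not a hard-coded constant.
            applied.add(isRleApplicableForColumn(type));
        }
        return applied;
    }

    // Stand-in for the footer writer: consults the same helper as the encoder.
    static List<Boolean> describeInFooter(DimensionType[] columnTypes) {
        List<Boolean> declared = new ArrayList<>();
        for (DimensionType type : columnTypes) {
            declared.add(isRleApplicableForColumn(type));
        }
        return declared;
    }

    public static void main(String[] args) {
        DimensionType[] types = DimensionType.values();
        // Both sides agree because they share a single decision point.
        System.out.println(encode(types).equals(describeInFooter(types)));
    }
}
```

Because both callers go through the same helper with the column's own type, the RLE flag written to the footer cannot drift from what the encoder actually applied.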
Github user manishgupta88 commented on the issue:
https://github.com/apache/carbondata/pull/1077 @ravipesala Handled the review comments. Kindly review.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1077 Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/92/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1077 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/2664/
Github user asfgit commented on the issue:
https://github.com/apache/carbondata/pull/1077 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/carbondata-pr-spark-1.6/584/
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/1077 LGTM
Github user asfgit closed the pull request at:
https://github.com/apache/carbondata/pull/1077