GitHub user xuchuanyin opened a pull request:
https://github.com/apache/carbondata/pull/2169

[CARBONDATA-2344][DataMap] Fix bugs in mapping blocklet to UnsafeDMStore rows

In BlockletDataMap, CarbonData stores a DMRow in an array for each blocklet. Currently, CarbonData accesses a DMRow only by blockletId (0, 1, etc.), which causes problems because different blocks can have the same blockletId. This PR adds a map from blockId#blockletId to the array index, so that CarbonData can access a DMRow by blockId and blockletId.

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

- [x] Any interfaces changed? `NO, only internal interfaces have been changed`
- [x] Any backward compatibility impacted? `NO`
- [x] Document update required? `NO`
- [x] Testing done. Please provide details on
  - Whether new unit test cases have been added or why no new tests are required? `NO`
  - How it is tested? Please attach test report. `Tested in local`
  - Is it a performance related change? Please attach the performance test report. `No`
  - Any additional information to help reviewers in testing this change. `NO`
- [x] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. `Not related`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuchuanyin/carbondata 0413_bug_blocklet_dm_unsafe_row

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/carbondata/pull/2169.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #2169

----
commit dd010297c7f7428dc8f42ec1a292b8cdddcc09aa
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-04-13T08:18:23Z

Fix bugs in mapping blocklet to UnsafeDMStore

In BlockletDataMap, CarbonData stores a DMRow in an array for each blocklet. Currently, CarbonData accesses a DMRow only by blockletId (0, 1, etc.), which causes problems because different blocks can have the same blockletId. This PR adds a map from blockId#blockletId to the array index, so that CarbonData can access a DMRow by blockId and blockletId.
----
--- |
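To make the proposed lookup concrete, here is a minimal sketch of a map from `blockId#blockletId` to an array index. It is not the code from the PR: the class `BlockletRowIndex`, its methods, and the block names used with it are hypothetical and only illustrate the idea described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the actual BlockletDataMap code.
final class BlockletRowIndex {
  // Maps "blockId#blockletId" to the position of the row in the DMRow array.
  private final Map<String, Integer> keyToRowIndex = new HashMap<>();
  private int nextRowIndex = 0;

  private static String key(String blockId, int blockletId) {
    return blockId + "#" + blockletId;
  }

  /** Registers a blocklet in insertion order and returns its array index. */
  int addRow(String blockId, int blockletId) {
    String k = key(blockId, blockletId);
    if (keyToRowIndex.containsKey(k)) {
      throw new IllegalArgumentException("Duplicate blocklet key: " + k);
    }
    int index = nextRowIndex++;
    keyToRowIndex.put(k, index);
    return index;
  }

  /** Returns the array index for the blocklet, or -1 if it is unknown. */
  int rowIndexOf(String blockId, int blockletId) {
    Integer index = keyToRowIndex.get(key(blockId, blockletId));
    return index == null ? -1 : index;
  }
}
```

With two blocks that each start their blocklet numbering at 0, the composite key keeps the rows apart, whereas a lookup by blockletId alone could not. The cost is one string key per blocklet held in memory.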
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2169 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/3780/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2169 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4996/ --- |
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2169 SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4440/ --- |
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2169 retest this please --- |
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2169 SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4441/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2169 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5007/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2169 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/3791/ --- |
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2169 retest this please --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2169 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/3867/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2169 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5091/ --- |
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2169 @xuchuanyin What issue are you actually facing? Blocklet IDs here are only virtual; they are counted according to the number of blocklets present in the index file. If the issue is with other datamaps such as Lucene, it is better to correct the blocklet order to match the index file while writing the datamap. That also saves memory and simplifies datamap writing by avoiding the block name. Maintaining block names here is not memory efficient. --- |
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2169 @ravipesala Thanks for helping me understand the design purpose. The original problem is that I found the query result duplicates some records and misses others. In my scenario, a datamap prunes the data down to 2 blocks (each containing 3 blocklets). When it comes to BlockletDataMap, it returns 6 blocklets, but they are duplicates: the result actually contains only the blocklets from the first block. I'll work on the relativeBlockletId and fix the problem. --- |
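The symptom described above can be reproduced with a small sketch. This is an assumed mechanism that matches the reported behavior, not the actual CarbonData code path: the class `BlockletIdCollisionDemo`, the block names `A` and `B`, and the row labels are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the reported symptom; not CarbonData code.
final class BlockletIdCollisionDemo {
  // One row per blocklet, in index-file order: block A first, then block B.
  static final String[] ROWS = {"A#0", "A#1", "A#2", "B#0", "B#1", "B#2"};

  /** Looks up rows by per-block blockletId only, ignoring which block is asked for. */
  static List<String> lookupByBlockletIdOnly(String[] blocks, int blockletsPerBlock) {
    List<String> result = new ArrayList<>();
    for (String block : blocks) {
      for (int blockletId = 0; blockletId < blockletsPerBlock; blockletId++) {
        // The block name is never used, so every block resolves to block A's rows.
        result.add(ROWS[blockletId]);
      }
    }
    return result;
  }
}
```

Calling `lookupByBlockletIdOnly(new String[] {"A", "B"}, 3)` yields six entries, but they are block A's three rows twice; block B's rows never appear.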
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2169 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/3941/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2169 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5226/ --- |
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2169 @ravipesala After studying the code, I found that we must keep a map from the unique blockletId to the DMRow pointer index. The relative blockletId in the previous code was generated before datamap pruning and is related to the DMRow pointer index. After pruning, some blocks have been filtered out, so we can no longer get the real relative blockletId. --- |
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2169 retest this please --- |
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2169 @xuchuanyin What I meant is that, instead of adding the mapping in the datamap, this should be handled while writing the datamap. Currently the blocklet number is relative to each block while writing the datamap; instead, generate the blocklet number relative to the complete index file. With this approach, we can eliminate the block-to-blocklet mapping completely, even inside datamaps. --- |
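A minimal sketch of the numbering scheme suggested above, assuming blocks are visited in index-file order and the writer knows how many blocklets each block has. The class `IndexFileBlockletNumbering` and the method `assignIds` are hypothetical names, not part of CarbonData.

```java
// Hypothetical illustration only; not the actual datamap writer.
final class IndexFileBlockletNumbering {
  /**
   * Assigns blocklet numbers relative to the whole index file.
   * blockletsPerBlock[b] is the blocklet count of the b-th block in index-file order.
   * The returned ids[b][i] is the number of the i-th blocklet of block b.
   */
  static int[][] assignIds(int[] blockletsPerBlock) {
    int[][] ids = new int[blockletsPerBlock.length][];
    int next = 0; // running counter across all blocks of the index file
    for (int b = 0; b < blockletsPerBlock.length; b++) {
      ids[b] = new int[blockletsPerBlock[b]];
      for (int i = 0; i < ids[b].length; i++) {
        ids[b][i] = next++;
      }
    }
    return ids;
  }
}
```

For two blocks with three blocklets each, the first block gets 0, 1, 2 and the second gets 3, 4, 5. Because each number is unique within the index file, it can serve directly as an array index, so no per-block name needs to be kept in memory.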
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2169 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4078/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2169 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5259/ --- |