GitHub user manishgupta88 opened a pull request:
https://github.com/apache/carbondata/pull/1715
[CARBONDATA-1934] Incorrect results are returned by select query in case when the number of blocklets for one part file are > 1 in the same task
Problem: When a select query is triggered, the driver prunes the segments and returns the list of blocklets that need to be scanned. The number of tasks launched by Spark equals the number of blocklets identified.
When one task has more than one blocklet for the same file, the BlockExecutionInfo that gets built is incorrect. As a result, the query returns incorrect results.
Fix: Use the abstract index to fill in all the details in BlockExecutionInfo.
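To illustrate the kind of failure described in the problem statement above, here is a toy Python sketch. It is not CarbonData code: the `Blocklet` tuple, both `scan_*` functions, and the file name are hypothetical, and the overwrite-by-file-key behaviour is only an assumed way such a bug could arise when one task holds several blocklets of the same part file.

```python
from collections import namedtuple

# Hypothetical stand-in for a pruned blocklet: the part file it belongs to,
# its id within that file, and the rows it holds.
Blocklet = namedtuple("Blocklet", ["file_path", "blocklet_id", "rows"])


def scan_keyed_by_file(blocklets):
    """Assumed buggy variant: one execution entry per file, so a later
    blocklet of the same file replaces the earlier one."""
    plan = {}
    for b in blocklets:
        plan[b.file_path] = b
    return [row for b in plan.values() for row in b.rows]


def scan_keyed_by_blocklet(blocklets):
    """Assumed fixed variant: one execution entry per (file, blocklet)."""
    plan = {}
    for b in blocklets:
        plan[(b.file_path, b.blocklet_id)] = b
    return [row for b in plan.values() for row in b.rows]


# One task that was handed two blocklets of the same (made-up) part file.
task = [
    Blocklet("part-0-0.carbondata", 0, [1, 2]),
    Blocklet("part-0-0.carbondata", 1, [3, 4]),
]

rows_buggy = scan_keyed_by_file(task)
rows_fixed = scan_keyed_by_blocklet(task)
```

In this sketch the file-keyed plan silently drops the rows of the first blocklet, while the blocklet-keyed plan returns all four rows.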
- [ ] Any interfaces changed?
No
- [ ] Any backward compatibility impacted?
No
- [ ] Document update required?
No
- [ ] Testing done
Manual testing
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
NA
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/manishgupta88/carbondata data_loss_fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/1715.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1715
----
commit b0c518d4aa7d4b2387899deefc0f9ed39b5c463c
Author: manishgupta88 <tomanishgupta18@...>
Date: 2017-12-22T10:35:31Z
Incorrect results are returned by select query in case when the number of blocklets for one part file are > 1 in the same task
----