[DISCUSSION]Improve Simple updates and delete performance in carbondata

Posted by akashnilugal@gmail.com on Nov 19, 2020; 2:11pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-Improve-Simple-updates-and-delete-performance-in-carbondata-tp103294.html

Hi Community,

CarbonData supports update and delete through Spark. An update is
essentially a delete plus an insert, while a delete is just a delete.
Both are implemented with Spark APIs and actions on collections (map,
partition, etc.), and these run as Spark jobs. Spark therefore adds the
overhead of task serialization, job execution on remote nodes, shuffle,
and so on. As a result, even simple updates take a long time in Carbon,
and the same holds for deletes.
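
To illustrate the "update is delete + insert" point, here is a toy sketch
in plain Java. It is not CarbonData code: the ToyTable class, its methods,
and the BitSet that stands in for a delete delta are all hypothetical names
made up for this example. It only shows that a simple update or delete can
be expressed as marking old rows deleted and appending new ones, with no
Spark job involved.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy in-memory table (hypothetical, not a CarbonData class). */
class ToyTable {
    // Stored rows are never modified in place; new versions are appended.
    private final List<String[]> rows = new ArrayList<>();
    // Marks rows that are logically deleted (stand-in for a delete delta).
    private final BitSet deleted = new BitSet();

    void insert(String... row) {
        rows.add(row.clone());
    }

    /** Delete: only mark the matching live rows as deleted. */
    int delete(int filterCol, String filterVal) {
        int count = 0;
        for (int i = 0; i < rows.size(); i++) {
            if (!deleted.get(i) && rows.get(i)[filterCol].equals(filterVal)) {
                deleted.set(i);
                count++;
            }
        }
        return count;
    }

    /** Update: mark the old row deleted, then append the changed copy. */
    int update(int filterCol, String filterVal, int targetCol, String newVal) {
        int existing = rows.size(); // do not revisit rows appended by this update
        int count = 0;
        for (int i = 0; i < existing; i++) {
            String[] row = rows.get(i);
            if (!deleted.get(i) && row[filterCol].equals(filterVal)) {
                String[] changed = row.clone();
                changed[targetCol] = newVal;
                deleted.set(i);    // the delete half
                rows.add(changed); // the insert half
                count++;
            }
        }
        return count;
    }

    /** Read path: return only rows that are not marked deleted. */
    List<String[]> scan() {
        List<String[]> live = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (!deleted.get(i)) {
                live.add(rows.get(i));
            }
        }
        return live;
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable();
        table.insert("1", "alice");
        table.insert("2", "bob");
        table.update(0, "1", 1, "carol"); // set column 1 to "carol" where column 0 is "1"
        table.delete(0, "2");             // delete where column 0 is "2"
        for (String[] row : table.scan()) {
            System.out.println(String.join(",", row)); // prints "1,carol"
        }
    }
}
```

In a real table the deleted-row markers and the appended rows would have to
be persisted and committed together; the sketch leaves that out.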

CarbonData 2.1.0 supports update and delete in the SDK, implemented at
the carbon file format level. We can reuse the same implementation to
perform simple updates and deletes on transactional tables using plain
Java code, avoiding Spark completely. This removes the Spark overhead
and makes updates and deletes faster.

I have added an initial V1 design document. Please review it and share
your comments, inputs, and suggestions.

https://docs.google.com/document/d/1-M6xPKZG8l6yAu0c9qo3jdUKhpXHWgUR-h8HeUUmk8M/edit?usp=sharing

Thanks,

Regards,
Akash R Nilugal