Ravindra Pesala created CARBONDATA-742:
------------------------------------------
Summary: Add batch sort to improve the loading performance
Key: CARBONDATA-742
URL: https://issues.apache.org/jira/browse/CARBONDATA-742
Project: CarbonData
Issue Type: Improvement
Reporter: Ravindra Pesala
Hi,
Current Problem:
The sort step is a major bottleneck because it is a blocking step. It must receive all the data and write the sort temp files to disk; only after that can the data writer step start.
Solution:
Make the sort step non-blocking so that the data writer step does not have to wait for it.
Process the data in the sort step in batches sized to the in-memory capacity of the machine. For example, if the machine can allocate 4 GB for in-memory processing, the sort step can sort the data in 2 GB batches and hand each sorted batch to the data writer step. While the data writer step consumes one batch, the sort step receives and sorts the next. All steps therefore work continuously, and the sort step does no disk IO.
The data writer step never waits for the whole sort to finish: as soon as the sort step has sorted a batch in memory, the data writer can start writing it.
This can significantly improve the loading performance.
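To illustrate the idea, here is a minimal sketch in Python (not CarbonData code). It assumes rows are plain comparable values and that the batch size is given as a row count rather than in bytes. The names batch_sort_load, sort_step, writer_step and max_pending are hypothetical and chosen only for this example. The sort step sorts each batch in memory and hands it over through a bounded queue, and the writer step consumes batches as they arrive.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input


def batch_sort_load(rows, batch_size, max_pending=1):
    """Sort `rows` in fixed-size in-memory batches and hand each sorted
    batch to a writer thread while the next batch is collected and sorted."""
    # Bounded queue: limits how many sorted batches are held in memory.
    handoff = queue.Queue(maxsize=max_pending)
    written = []  # stands in for the output of the data writer step

    def sort_step():
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) >= batch_size:
                batch.sort()        # in-memory sort, no temp files
                handoff.put(batch)  # blocks only if the writer is behind
                batch = []
        if batch:                   # last, possibly smaller, batch
            batch.sort()
            handoff.put(batch)
        handoff.put(_DONE)

    def writer_step():
        while True:
            batch = handoff.get()
            if batch is _DONE:
                break
            written.append(batch)   # a real writer would write to storage here

    threads = [threading.Thread(target=sort_step),
               threading.Thread(target=writer_step)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return written


print(batch_sort_load([5, 3, 8, 1, 9, 2, 7], batch_size=3))
# prints [[3, 5, 8], [1, 2, 9], [7]]
```

Note that each batch is sorted on its own and the output as a whole is not globally sorted, so every batch ends up as a separately sorted unit.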
Advantages:
Loading performance increases because there is no intermediate IO and the sort step no longer blocks.
No extra effort is needed for compaction; the current flow can handle it.
Disadvantages:
The number of driver-side B-trees will increase, so memory usage might grow, but this could be controlled by the current LRU cache implementation.
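As a rough way to see how the count grows, the sketch below estimates the number of sorted batches produced by one load. It assumes, purely for illustration, that each sorted batch results in one separately indexed unit on the driver side; the actual mapping depends on the implementation. The function name estimated_sorted_batches and the 100 GB load size are hypothetical.

```python
import math


def estimated_sorted_batches(load_size_gb, batch_size_gb):
    """Number of in-memory batches needed to cover the whole load.
    Assumption: one separately sorted (and indexed) unit per batch."""
    if batch_size_gb <= 0:
        raise ValueError("batch size must be positive")
    return math.ceil(load_size_gb / batch_size_gb)


# Hypothetical 100 GB load with the 2 GB batch size from the example above.
print(estimated_sorted_batches(100, 2))  # prints 50
```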
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)