Apache CarbonData Dev Mailing List archive

[Discussion] CarbonReader performance improvement

Posted by kunalkapoor on Oct 28, 2018; 6:33pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-CarbonReader-performance-improvement-tp66650.html

Hi All,
I would like to propose some improvements to the CarbonReader implementation
to improve its performance.

1. When the user does not provide a filter expression, we can list the
carbondata files and treat each file as one split instead of calling the
getSplits method. This would improve performance because the time spent
loading the block/blocklet datamap would be avoided.

2. Implement a vectorized reader and expose an API that lets the user switch
between CarbonReader and the vectorized reader. An additional API would let
the user extract a columnar batch instead of rows, which would allow a
deeper integration with carbon. The vectorized reader would also make fewer
method calls, which would improve the read time.

3. Add concurrent reading functionality to CarbonReader. The user would
enable it by passing the number of splits required. For example, if the user
passes 2 as the number of splits, 2 CarbonReaders would be returned, each
holding an equal number of RecordReaders. The user can then run each
CarbonReader instance in a separate thread to read the data concurrently.
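
To make points 1 and 3 concrete, here is a rough sketch using only the JDK. It
is not CarbonData code: SplitPlanner and its methods are hypothetical names,
the ".carbondata" suffix check is an assumption about how data files would be
recognised, and the "read" step only counts splits where a real reader would
iterate rows. The round-robin grouping gives groups whose sizes differ by at
most one, which is one possible reading of "equal number of RecordReaders".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of the CarbonData SDK.
final class SplitPlanner {

    // Point 1: with no filter, treat every *.carbondata file as one split
    // (assumed suffix), so no block/blocklet datamap has to be loaded.
    static List<Path> listDataFiles(Path tableDir) {
        try (Stream<Path> paths = Files.walk(tableDir)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".carbondata"))
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Point 3: deal the splits round-robin into `parts` groups, so group
    // sizes differ by at most one. Each group would back one reader.
    static <T> List<List<T>> partition(List<T> splits, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        List<List<T>> groups = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < splits.size(); i++) {
            groups.get(i % parts).add(splits.get(i));
        }
        return groups;
    }

    // Process each group on its own thread and return the total number of
    // splits handled. Counting stands in for the actual row reading.
    static <T> int readConcurrently(List<List<T>> groups) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, groups.size()));
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<T> group : groups) {
                Callable<Integer> task = group::size;
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("concurrent read failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller would chain the three steps, for example
SplitPlanner.readConcurrently(SplitPlanner.partition(SplitPlanner.listDataFiles(dir), 2)),
where dir is the table directory and 2 is the number of splits requested.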

The performance report will be shared soon.

Any suggestions from the community are greatly appreciated.

Thanks
Kunal Kapoor