Apache CarbonData Dev Mailing List archive

Re: Discussion regrading design of data load after kettle removal.

Posted by Jacky Li on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-regrading-design-of-data-load-after-kettle-removal-tp1672p1783.html

Hi Ravindra,

Regarding the design (https://drive.google.com/file/d/0B4TWTVbFSTnqTF85anlDOUQ5S1BqYzFpLWcwZnBLSVVqSWpj/view), I have the following questions:

1. In SortProcessorStep, I think it would be better to include MergeSort in this step as well, so that the step contains all of the sorting logic. A developer could then implement an external sort that spills to files only if necessary, so that loading becomes a purely in-memory (on-line) sort when memory is sufficient. I think this would improve loading performance a lot.

2. In EncoderProcessorStep, apart from dictionary encoding, what other processing will it do? What about delta, RLE, etc.?

3. In InputProcessorStep, it needs some schema definition to parse the input and convert it to rows, right? For example, how would it read from JSON or Avro files?
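
To make point 1 more concrete, here is a minimal sketch in Python of a sorter that spills to files only if necessary. It is not CarbonData code: the `SpillingSorter` class, its `max_rows_in_memory` parameter, and the use of `pickle` for the spill files are all assumptions made for illustration, and a real step would measure memory in bytes rather than rows.

```python
import heapq
import pickle
import tempfile


class SpillingSorter:
    """Hypothetical sorter: sorts in memory and spills sorted runs to
    temporary files only when the in-memory buffer is full."""

    def __init__(self, max_rows_in_memory):
        # Assumption: the memory budget is expressed as a row count.
        self.max_rows_in_memory = max_rows_in_memory
        self.buffer = []
        self.spill_files = []

    def add(self, row):
        # Spill only when the buffer cannot take one more row.
        if len(self.buffer) >= self.max_rows_in_memory:
            self._spill()
        self.buffer.append(row)

    def _spill(self):
        # Write the current buffer to a temp file as one sorted run.
        self.buffer.sort()
        run = tempfile.TemporaryFile()
        for row in self.buffer:
            pickle.dump(row, run)
        run.seek(0)
        self.spill_files.append(run)
        self.buffer = []

    @staticmethod
    def _read_run(run):
        # Stream rows back from a spilled run until end of file.
        while True:
            try:
                yield pickle.load(run)
            except EOFError:
                return

    def sorted_rows(self):
        self.buffer.sort()
        if not self.spill_files:
            # Everything fit in memory: no file I/O and no merge needed.
            return iter(self.buffer)
        # Otherwise merge the spilled runs with the remaining buffer.
        runs = [self._read_run(run) for run in self.spill_files]
        runs.append(iter(self.buffer))
        return heapq.merge(*runs)


sorter = SpillingSorter(max_rows_in_memory=3)
for key in [9, 2, 7, 4, 6, 1, 8, 3, 5, 0]:
    sorter.add(key)
result = list(sorter.sorted_rows())
```

With a budget larger than the input, `spill_files` stays empty and the sort is purely in memory. The merge is only paid for when the data does not fit, which is why I think sort and merge belong in the same step.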

Regards,
Jacky