Re: [Discussion] Parsing values during data load should adopt a strict check or lenient check mechanism

Posted by Liang Chen on Dec 07, 2016; 2:03am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Parsing-values-during-data-load-should-adopt-a-strict-check-or-lenient-check-mechanism-tp3826p3893.html

Hi

Thank you for starting a good discussion.

I propose adopting a strict check mechanism to avoid the problems you mention below.
The behavior should be the same for both dimensions and measures. In short, values need to be processed according to the data type the user specified.

Regards
Liang

manishgupta88 wrote
Hi All,

Currently in Carbon we treat Short and Int as Long, and when storing to
Carbon data files we use delta compression, which compresses the data
based on the min and max values of the column.

While parsing the values for these datatypes, we use the Double data type
parser and extract the long value from the result. The code snippet is as below.
Double.valueOf(msrValue).longValue()
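
For illustration, the behavior of this expression can be checked with a small standalone sketch. The class and method names are hypothetical and not taken from the Carbon code base; only the expression itself comes from the snippet above.

```java
// Illustrative only: LenientParseDemo and lenientParse are hypothetical names.
final class LenientParseDemo {

    // The parse quoted above: read the text as a double, then truncate to long.
    static long lenientParse(String msrValue) {
        return Double.valueOf(msrValue).longValue();
    }

    public static void main(String[] args) {
        // 40000 does not fit in a Short (max 32767), yet it is accepted.
        System.out.println(lenientParse("40000"));      // prints 40000
        // The fractional part is silently dropped.
        System.out.println(lenientParse("12.75"));      // prints 12
        // 3000000000 does not fit in an Int (max 2147483647), yet it is accepted.
        System.out.println(lenientParse("3000000000")); // prints 3000000000
    }
}
```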

This has the following problems.

1. Measure values beyond the range of Int and Short are parsed
successfully. This conflicts with the behavior seen when the same column is
listed in dictionary_include and becomes a dimension. At query time, each
dimension value is parsed according to its datatype for result conversion;
a NumberFormatException is thrown and null is displayed in the result,
whereas for a measure the loaded values are displayed. This also impacts
aggregate queries. That is why a strict check mechanism is adopted for
parsing dimension values.

2. Data inconsistency in the case of measures: for decimal values, only the
part before the decimal point is considered for Int and Short datatypes.

3. For measures, if values beyond the datatype range are allowed, the
compression ratio will decrease.
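
To make the third point concrete, here is a rough sketch of why a wider value range costs more storage. It does not reproduce Carbon's actual encoder. It only assumes a generic delta scheme in which each value is stored as an offset from the column minimum, so the bits needed per value depend on max - min. The class and method names are hypothetical.

```java
// Illustrative only: a generic delta-width calculation, not Carbon's encoder.
final class DeltaWidthSketch {

    // Bits needed to store any offset in [0, max - min].
    // Returns 0 when all values are equal. Assumes max >= min and that
    // max - min does not overflow a long.
    static int bitsPerValue(long min, long max) {
        long range = max - min;
        return 64 - Long.numberOfLeadingZeros(range);
    }

    public static void main(String[] args) {
        // A column whose values stay inside the positive Short range.
        System.out.println(bitsPerValue(0L, 32767L));      // prints 15
        // One out-of-range value widens every stored offset in the column.
        System.out.println(bitsPerValue(0L, 3000000000L)); // prints 32
    }
}
```

Under this assumption, a single out-of-range value accepted by the lenient parser roughly doubles the per-value width in the example above.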

Please comment on what the parsing behavior should be. Should Carbon adopt
a strict check mechanism or a lenient check mechanism, considering that the
behavior should be the same for both dimensions and measures, as both are
ultimately table columns?
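
As a point of reference for the discussion, a strict check could look roughly like the sketch below. This is only one possible shape, not the Carbon implementation: the class and method names are hypothetical, and both trimming whitespace and mapping a bad value to null (rather than rejecting the row) are assumptions.

```java
// Illustrative only: StrictParseSketch and its methods are hypothetical names.
final class StrictParseSketch {

    // Returns the parsed Int, or null if the text is not a valid Int
    // (out of range, has a fractional part, or is not numeric).
    static Integer parseIntStrict(String value) {
        if (value == null) {
            return null;
        }
        try {
            return Integer.valueOf(value.trim());
        } catch (NumberFormatException e) {
            return null; // assumption: bad values become null
        }
    }

    // Same check against the Short range (-32768 to 32767).
    static Short parseShortStrict(String value) {
        if (value == null) {
            return null;
        }
        try {
            return Short.valueOf(value.trim());
        } catch (NumberFormatException e) {
            return null; // assumption: bad values become null
        }
    }

    public static void main(String[] args) {
        System.out.println(parseIntStrict("42"));         // prints 42
        System.out.println(parseIntStrict("3000000000")); // prints null
        System.out.println(parseIntStrict("12.75"));      // prints null
        System.out.println(parseShortStrict("40000"));    // prints null
    }
}
```

Using the same parser for both dimensions and measures would give both kinds of column the same result for the same input, which is the consistency asked for above.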

Regards
Manish Gupta