Manish Gupta created CARBONDATA-542:
---------------------------------------
Summary: Parsing values for measures and dimensions during data load should adopt a strict check
Key: CARBONDATA-542
URL: https://issues.apache.org/jira/browse/CARBONDATA-542
Project: CarbonData
Issue Type: Improvement
Reporter: Manish Gupta
Assignee: Manish Gupta
Priority: Minor
Fix For: 1.0.0-incubating
Currently, Carbon treats Short and Int values as long. When the values are stored in Carbon data files, delta compression is applied, which compresses the data based on the min and max values of the column.
While parsing values for these datatypes, we use the Double parser and extract the long value from the result. The code snippet is as below:
    Double.valueOf(msrValue).longValue()
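To make the effect of that expression concrete, here is a minimal Java sketch. The class and method names are illustrative only and are not taken from the CarbonData code base; only the parsing expression itself comes from the description above.

```java
class LenientParseDemo {
    // Wraps the expression quoted above; the method name is hypothetical.
    static long lenientParse(String msrValue) {
        return Double.valueOf(msrValue).longValue();
    }

    public static void main(String[] args) {
        // Larger than Short.MAX_VALUE (32767), yet accepted without error.
        System.out.println(lenientParse("40000"));       // prints 40000
        // Larger than Integer.MAX_VALUE (2147483647), yet accepted without error.
        System.out.println(lenientParse("3000000000"));  // prints 3000000000
        // The fractional part is dropped without any warning.
        System.out.println(lenientParse("12.75"));       // prints 12
    }
}
```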
This approach has the following problems:
1. Measure values beyond the range of Int and Short are parsed successfully. This conflicts with the behavior seen when the same column is listed in dictionary_include and becomes a dimension. At query time, each dimension value is parsed according to its datatype for result conversion; a NumberFormatException is thrown and null is displayed in the result, whereas for a measure the loaded value is displayed. This also impacts aggregate queries. That is why a strict check mechanism is adopted for parsing dimension values.
2. Data inconsistency for measures: for decimal input, only the part before the decimal point is considered for Int and Short datatypes.
3. For measures, if values beyond the datatype range are allowed, the compression ratio will decrease.
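Point 3 can be illustrated with a rough Java sketch. It assumes a simplified delta scheme in which each value is stored as an offset from the column minimum, so the bits needed per value depend on (max - min). This is an assumption made for illustration, not a description of CarbonData's actual encoder, and the helper name is hypothetical.

```java
class DeltaWidthDemo {
    // Hypothetical helper: bits needed to store any offset in [0, max - min].
    // Assumes max >= min and that (max - min) does not overflow a long.
    static int bitsPerValue(long min, long max) {
        long range = max - min;
        return range == 0 ? 1 : 64 - Long.numberOfLeadingZeros(range);
    }

    public static void main(String[] args) {
        // Column whose values stay within a small range.
        System.out.println(bitsPerValue(0, 1000));            // prints 10
        // Same column after a single out-of-range value was accepted.
        System.out.println(bitsPerValue(0, 3_000_000_000L));  // prints 32
    }
}
```

Under this assumption, one stray value widens the per-value storage for the whole column.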
Therefore, we will have to adopt strict parsing behavior for both dimensions and measures.
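A strict check could look roughly like the following Java sketch, which relies on the range and format validation built into Short.valueOf and Integer.valueOf. The class and method names are hypothetical, and returning null for a rejected value is an assumption; the issue does not specify how rejected values should be handled.

```java
class StrictParseDemo {
    // Hypothetical helper: returns null when the text is not a valid Short.
    static Short parseShortStrict(String value) {
        if (value == null) {
            return null;
        }
        try {
            // Throws NumberFormatException for out-of-range or decimal input.
            return Short.valueOf(value.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Hypothetical helper: returns null when the text is not a valid Integer.
    static Integer parseIntStrict(String value) {
        if (value == null) {
            return null;
        }
        try {
            return Integer.valueOf(value.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseShortStrict("123"));       // prints 123
        System.out.println(parseShortStrict("40000"));     // prints null (out of Short range)
        System.out.println(parseIntStrict("12.75"));       // prints null (not an integer)
        System.out.println(parseIntStrict("3000000000"));  // prints null (out of Int range)
    }
}
```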
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)