Hi All,
Currently in Carbon we treat Short and Int as Long. When the values are stored in Carbon data files, delta compression is applied, which compresses the data based on the min and max values of the column.

While parsing values for these data types, we use the Double parser and extract the long value from the result. The code snippet is below.

Double.valueOf(msrValue).longValue()

This has the following problems.

1. Measure values beyond the range of Int and Short are parsed successfully. This conflicts with what happens when the same measure is listed in dictionary_include and becomes a dimension. At query time, each dimension value is parsed according to its data type for result conversion; a NumberFormatException is thrown and null is displayed in the result, whereas for a measure the loaded values are displayed. This also impacts aggregate queries. That is why a strict check mechanism is adopted for parsing dimension values.

2. Data inconsistency for measures: for decimal values, only the part before the decimal point is kept for Int and Short data types.

3. For measures, if values beyond the data type range are allowed, compression becomes less effective.

Please comment on what the parsing behavior should be. Should Carbon adopt a strict check mechanism or a lenient check mechanism, considering that the behavior should be the same for both dimensions and measures, since both are ultimately table columns?

Regards
Manish Gupta
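The three problems listed in this post can be reproduced outside Carbon with a few lines of Java. This is only an illustrative sketch: `lenientParse` wraps the snippet quoted in the post, while `bitsForDelta` is a hypothetical helper that stands in for delta compression in a very simplified way (bits needed to store each value as an offset from the column minimum). It is not CarbonData's actual compression code.

```java
final class LenientParseDemo {

    // The lenient parsing quoted in the post: parse as Double, keep the long part.
    static long lenientParse(String msrValue) {
        return Double.valueOf(msrValue).longValue();
    }

    // Hypothetical helper (not a CarbonData API): number of bits needed to store
    // (value - min) for any value in [min, max].
    static int bitsForDelta(long min, long max) {
        long range = max - min;
        return range == 0 ? 1 : 64 - Long.numberOfLeadingZeros(range);
    }

    public static void main(String[] args) {
        // Problem 1: 40000 does not fit in a Short, yet the lenient parser accepts it.
        System.out.println(lenientParse("40000"));
        try {
            Short.parseShort("40000");
        } catch (NumberFormatException e) {
            // A type-aware parse of the same text fails.
            System.out.println("Short.parseShort rejects 40000");
        }

        // Problem 2: the fractional part is silently dropped.
        System.out.println(lenientParse("12.99"));

        // Problem 3: a single out-of-range value widens the min/max span,
        // so every stored delta needs more bits.
        System.out.println(bitsForDelta(0, Short.MAX_VALUE));
        System.out.println(bitsForDelta(0, 5_000_000_000L));
    }
}
```

In this sketch a column whose values stay within the Short range needs 15 bits per delta, while one stray value of 5,000,000,000 pushes that to 33 bits for the whole column.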
Hi
Thank you for starting a good discussion.

I propose adopting a strict check mechanism to avoid the problems you mentioned. The behavior should be the same for both dimensions and measures. In short, we need to process values according to the actual data type the user specified.

Regards
Liang
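For reference, a strict check of the kind proposed here could look roughly like the sketch below. The class and method names are hypothetical, and returning null for a value that fails the check is an assumption made for illustration; the thread does not decide whether such values should become null or cause the row to be rejected. The point is only that one type-aware parser can be shared by dimension and measure columns.

```java
final class StrictParseDemo {

    // Hypothetical helper: parse text as a Short, or return null when the text
    // is not a whole number within the Short range.
    static Short parseShortStrict(String value) {
        if (value == null) {
            return null;
        }
        try {
            return Short.valueOf(value.trim());
        } catch (NumberFormatException e) {
            return null; // assumption: bad values are surfaced as null
        }
    }

    // Hypothetical helper: same check for Int columns.
    static Integer parseIntStrict(String value) {
        if (value == null) {
            return null;
        }
        try {
            return Integer.valueOf(value.trim());
        } catch (NumberFormatException e) {
            return null; // assumption: bad values are surfaced as null
        }
    }

    public static void main(String[] args) {
        System.out.println(parseShortStrict("123"));       // in range
        System.out.println(parseShortStrict("40000"));     // beyond Short range
        System.out.println(parseIntStrict("40000"));       // fits in Int
        System.out.println(parseIntStrict("12.99"));       // decimal text for an Int column
        System.out.println(parseIntStrict("3000000000"));  // beyond Int range
    }
}
```

Because both column kinds would go through the same check, a value that loads as a measure would also convert cleanly when the column is used as a dimension.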
+1 for strict check mechanism.
Users should see consistent behavior for both dimension and measure columns.

Best Regards,
Aniket

On Tue, Dec 6, 2016 at 6:03 PM, Liang Chen <[hidden email]> wrote:
> Hi
>
> Thank you started a good discussion.
>
> I propose to do strict check mechanism to avoid these problems what you
> mentioned in the below.
> And the behavior should be same for both dimensions and measures. In a word,
> need to process the actual data type as per users input.
>
> Regards
> Liang
Hi All,
As suggested, I have raised a JIRA to track this issue.

https://issues.apache.org/jira/browse/CARBONDATA-542

Regards
Manish Gupta