[Discussion] Parsing values during data load should adopt a strict check or lenient check mechanism

[Discussion] Parsing values during data load should adopt a strict check or lenient check mechanism

manishgupta88
Hi All,

Currently in carbon we treat Short and Int as Long, and at the time of
storing in carbon data files delta compression is used, which compresses the
data based on the min and max values of the column.

While parsing the values for these data types, we use the Double data type
parser and extract a long value from it. Code snippet as below:
Double.valueOf(msrValue).longValue()
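
As a minimal sketch of what this lenient parse accepts (the values below are
illustrative only, not taken from carbon's code):

// Lenient parse as done today for a SHORT/INT measure column
String msrValue = "70000.9";                          // exceeds the Short range and has a fraction
long parsed = Double.valueOf(msrValue).longValue();   // yields 70000, accepted silently
// neither the fractional part nor the declared data type's range is validated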

This has the following problems.

1. Measure values beyond the range of Int and Short are parsed
successfully. This conflicts with the behavior when the same column is
included in dictionary_include and becomes a dimension: at query time each
dimension value is parsed against its data type for result conversion, a
NumberFormatException is thrown, and null is displayed in the result, while
for the measure the loaded values are displayed. This also impacts
aggregate queries. That is why a strict check mechanism is adopted for
parsing dimension values (see the sketch after this list).

2. Data inconsistency for measures: for decimal values, only the part
before the decimal point is considered for the Int and Short data types.

3. For measures, if values beyond the data type's range are allowed, the
compression becomes less effective, since delta compression depends on the
column's min and max values.
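
A strict check would reject such values at load time, matching the dimension
behavior. A hypothetical sketch (not carbon's current code):

String msrValue = "70000";
// Lenient: accepted even though it exceeds Short.MAX_VALUE (32767)
long lenientValue = Double.valueOf(msrValue).longValue();   // 70000
// Strict: rejected, same as dimension parsing
short strictValue = Short.parseShort(msrValue);             // throws NumberFormatException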

Please comment on what the parsing behavior should be: should carbon adopt
a strict check mechanism or a lenient check mechanism? Keep in mind that the
behavior should be the same for both dimensions and measures, as both are
finally table columns.

Regards
Manish Gupta

Re: [Discussion] Parsing values during data load should adopt a strict check or lenient check mechanism

Liang Chen
Administrator
Hi

Thank you for starting a good discussion.

I propose adopting the strict check mechanism to avoid the problems you
mentioned below. The behavior should be the same for both dimensions and
measures. In a word, we need to process the actual data type as per the
user's input.
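
For example, strict parsing keyed on the declared data type could look
roughly like the following (a hypothetical sketch for discussion only;
DataType stands in for carbon's column data type and parseStrict is a
made-up name):

// Sketch: parse strictly according to the column's declared numeric type
static long parseStrict(String value, DataType dataType) {
  switch (dataType) {
    case SHORT: return Short.parseShort(value);   // rejects out-of-range and decimal input
    case INT:   return Integer.parseInt(value);
    case LONG:  return Long.parseLong(value);
    default:    throw new NumberFormatException("unsupported numeric type: " + dataType);
  }
}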

Regards
Liang


Re: [Discussion] Parsing values during data load should adopt a strict check or lenient check mechanism

Aniket Adnaik
+1 for the strict check mechanism.
Users should see consistent behavior for both dimension and measure
columns.

Best Regards,
Aniket



Re: [Discussion] Parsing values during data load should adopt a strict check or lenient check mechanism

manishgupta88
Hi All,

As per the suggestions, I have raised a JIRA issue to track this:

https://issues.apache.org/jira/browse/CARBONDATA-542

Regards
Manish Gupta