Improving Non-dictionary storage & performance.

ravipesala
Hi,

To make the storage and performance of non-dictionary columns more
efficient, I suggest the following improvements.

1. Always make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary.
   Right now only date and timestamp are direct dictionary columns. We can
make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary if these columns
are included in SORT_COLUMNS.

2. Consider delta/value compression when storing direct dictionary values.
Right now direct dictionary values are always stored using the INT datatype,
so we can consider value/delta compression to compact the storage.

3. Use a separator instead of LV format to store String values in
no-dictionary format.
Currently String values of non-dictionary columns are stored in
LV (length-value) format, where the length is always a Short (2 bytes).
To keep the storage compact we can use a separator instead (a 0 byte),
which takes just a single byte. While reading, we can traverse the data
and compute the offsets, as we do now.

4. Add range filters for no-dictionary columns.
Currently range filters such as greater-than/less-than are not implemented
for no-dictionary columns, so we should implement them to avoid row-level
filtering and improve performance.
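
To make point 3 concrete, here is a minimal Python sketch comparing the two layouts. The function names are hypothetical and this is not CarbonData code. It also assumes that a value never contains a 0 byte, which the separator scheme would have to guarantee.

```python
import struct

def encode_lv(values):
    """Length-value layout: a 2-byte big-endian length before each value."""
    out = bytearray()
    for v in values:
        out += struct.pack(">H", len(v)) + v
    return bytes(out)

def encode_sep(values):
    """Separator layout: each value is followed by a single 0 byte."""
    out = bytearray()
    for v in values:
        if b"\x00" in v:
            # Assumption of this sketch: the separator byte never occurs in data.
            raise ValueError("value contains the separator byte")
        out += v + b"\x00"
    return bytes(out)

def decode_sep(data):
    """Traverse the data once, collecting (offset, length) of every value."""
    offsets, start = [], 0
    for i, b in enumerate(data):
        if b == 0:
            offsets.append((start, i - start))
            start = i + 1
    return [data[o:o + n] for o, n in offsets]

values = [s.encode("utf-8") for s in ["china", "india", "", "usa"]]
lv, sep = encode_lv(values), encode_sep(values)
print(len(lv), len(sep))  # 21 17
```

The saving is one byte per value. The cost is that the reader has to scan for separators instead of jumping ahead by the stored length.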

Regards,
Ravindra.

Re: Improving Non-dictionary storage & performance.

kumarvishal09
Hi Ravi,
Sorting of data for no-dictionary columns should be based on the data type,
and the same applies to filters. Please add this point.
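
A small Python illustration of why the sort order has to follow the data type: comparing numbers by the bytes of their text form gives a different order than comparing them as numbers, and a range filter goes wrong in the same way. The sample values are made up.

```python
raw = ["10", "9", "-3", "100", "25"]

# Comparing the raw bytes of the text form (what an untyped byte[] sort does).
byte_order = sorted(raw, key=lambda s: s.encode("utf-8"))

# Comparing by the declared data type (INT in this example).
typed_order = sorted(raw, key=int)

print(byte_order)   # ['-3', '10', '100', '25', '9']
print(typed_order)  # ['-3', '9', '10', '25', '100']

# A range filter must use the same comparison, or "> 9" selects the wrong rows.
gt9_bytes = [s for s in raw if s.encode("utf-8") > b"9"]
gt9_typed = [s for s in raw if int(s) > 9]
print(gt9_bytes)    # []
print(gt9_typed)    # ['10', '100', '25']
```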

-Regards
Kumar Vishal

In reply to Ravindra Pesala's post of Wed, Mar 1, 2017 at 8:34 PM.

Re: Improving Non-dictionary storage & performance.

ravipesala
Hi Vishal,

You are right; that's why we can use no-dictionary only for the String datatype.
Please look at my first point: for sort_columns we can always use direct
dictionary for the data types that allow it, such as short, int, long, double & float.
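
As a rough sketch of what direct dictionary could mean for an integer sort column: the key is computed from the value itself (here, the value minus the column minimum), so no lookup table is needed and the byte order of the keys matches the numeric order of the values. Storing each key in the smallest fixed width that fits also ties in with point 2 of the proposal. All names are hypothetical and this is Python for illustration only, not the CarbonData implementation. It covers integer types only.

```python
def fixed_width(min_value, max_value):
    """Bytes needed to store any delta in [0, max_value - min_value]."""
    span = max_value - min_value
    return max(1, (span.bit_length() + 7) // 8)

def encode_column(values):
    lo, hi = min(values), max(values)
    width = fixed_width(lo, hi)
    # Big-endian unsigned deltas: byte-wise comparison equals numeric comparison.
    keys = [(v - lo).to_bytes(width, "big") for v in values]
    return lo, width, keys

def decode_key(lo, key):
    """Recover the original value from its key and the column minimum."""
    return lo + int.from_bytes(key, "big")

values = [1005, -20, 300, 70000, 42]
lo, width, keys = encode_column(values)
print(width)  # 3 bytes per value instead of 4 for a plain INT
```

Because the keys sort the same way as the values, sorting and range comparison can work on the stored bytes directly.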

Regards,
Ravindra.

In reply to Kumar Vishal's post of 1 March 2017 at 18:18.

Re: Improving Non-dictionary storage & performance.

Jacky Li
Hi Ravindra & Vishal,

Yes, I think this work needs to be done before switching to no-dictionary as the default. So as of now, we should keep dictionary as the default.
I think we can suggest that users load data as follows:
1. First load: use 2-pass mode. The first scan should discover the cardinality and check it against the user-specified options. We should define rules to pass or fail the validation, and finalize the load options for subsequent loads.
2. Subsequent loads: use single-pass mode with the options finalized by the first load.
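
The first-load step could be sketched roughly as below. The function names, the option keys, and the pass/fail rule (fail when a user-chosen dictionary column exceeds a threshold) are all assumptions for illustration, since the actual rules still have to be defined.

```python
def discover_cardinality(rows, columns):
    """Pass 1: scan the data once and count distinct values per column."""
    distinct = {c: set() for c in columns}
    for row in rows:
        for c in columns:
            distinct[c].add(row[c])
    return {c: len(s) for c, s in distinct.items()}

def finalize_load_options(cardinality, user_dictionary_columns, max_cardinality):
    """Validate the user's choice and return the options for later loads."""
    too_high = {c: n for c, n in cardinality.items()
                if c in user_dictionary_columns and n > max_cardinality}
    if too_high:
        raise ValueError(
            f"dictionary columns exceed {max_cardinality}: {too_high}")
    return {"dictionary_columns": sorted(user_dictionary_columns),
            "single_pass": True}

rows = [{"country": "cn", "uid": "u1"},
        {"country": "in", "uid": "u2"},
        {"country": "cn", "uid": "u3"}]
card = discover_cardinality(rows, ["country", "uid"])
options = finalize_load_options(card, {"country"}, max_cardinality=2)
print(card)     # {'country': 2, 'uid': 3}
print(options)  # {'dictionary_columns': ['country'], 'single_pass': True}
```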

What is your idea?

Regards,
Jacky

In reply to Ravindra Pesala's post of March 1, 2017, 11:31 PM.

Re: Improving Non-dictionary storage & performance.

ravipesala
Hi Likun,

Yes, Likun, we had better keep dictionary as the default until we optimize
no-dictionary columns.
As you mentioned, we can suggest 2-pass for the first load; subsequent loads
will use single-pass to improve performance.

Regards,
Ravindra.

In reply to Jacky Li's post of 2 March 2017 at 06:48.

Re: Improving Non-dictionary storage & performance.

Jacky Li
Hi Ravindra,

Another suggestion: to avoid causing trouble for users while loading in single-pass mode, if the number of dictionary keys generated for a column exceeds the configured value, the loading process should stop and log an error that explicitly reports the cardinality of all columns.
That way, users will know the reason for the data load failure.
How about this idea?
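
A minimal Python sketch of this check follows. The class and exception names are hypothetical, and raising an exception stands in for stopping the load and logging the error.

```python
class CardinalityLimitExceeded(Exception):
    """Raised when a column needs more dictionary keys than configured."""

class SinglePassDictionary:
    """Assigns surrogate keys on the fly and enforces a per-column limit."""

    def __init__(self, columns, max_keys):
        self.max_keys = max_keys
        self.keys = {c: {} for c in columns}

    def cardinality(self):
        return {c: len(m) for c, m in self.keys.items()}

    def key_for(self, column, value):
        mapping = self.keys[column]
        if value not in mapping:
            if len(mapping) >= self.max_keys:
                # Report every column, not only the one that failed.
                raise CardinalityLimitExceeded(
                    f"column '{column}' exceeded {self.max_keys} dictionary "
                    f"keys; cardinality so far: {self.cardinality()}")
            mapping[value] = len(mapping) + 1
        return mapping[value]

d = SinglePassDictionary(["country", "uid"], max_keys=2)
country_keys = [d.key_for("country", v) for v in ["cn", "in", "cn"]]
uid_keys = [d.key_for("uid", v) for v in ["u1", "u2"]]
try:
    d.key_for("uid", "u3")
    error = None
except CardinalityLimitExceeded as e:
    error = str(e)
print(error)
```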

Regards,
Jacky

In reply to Ravindra Pesala's post of March 3, 2017, 1:26 AM.

Re: Improving Non-dictionary storage & performance.

bill.zhou
hi Jacky
    I think this is not easy for users to control when Carbon is running online. For one table, two different loads may have different cardinality for the same column, but the user cannot specify different dictionary columns for the same table.

Regards


Re: Improving Non-dictionary storage & performance.

bill.zhou
In reply to this post by ravipesala
hi Ravindra

  Columns of type DOUBLE or FLOAT are always measures with high cardinality, so I think making them dictionary columns by default will cause performance problems.

> 1. Always make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary.
>    Right now only date and timestamp are direct dictionary columns. We can
> make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary if these columns
> are included in SORT_COLUMNS.
 
Regards
Bill

Re: Improving Non-dictionary storage & performance.

David CaiQiang
In reply to this post by ravipesala
+1

I agree.

About non-dictionary columns in sort_columns:
1. Sort the column data in ColumnChunk2.

2. Compress the column by datatype:
string: RLE, or snappy if RLE does not compress well
short, int, bigint: delta and number compressor (ValueCompressor and NumberCompressor)
float, double: delta and snappy (ValueCompressor and SnappyCompressor)

3. Store the column by datatype:
string: byte[], with a null character as separator
short, int, bigint: byte[], using max/min to calculate a fixed length for storing the delta values
float, double: byte[], decompressed to float[] or double[]

4. Filter the column:
column level: ExcludeFilterExecuterImpl, IncludeFilterExecuterImpl, RangeFilterExecuter
The column-level RangeFilterExecuter should calculate the index range (start and end) in the sorted data chunk to get the bitset of the uncompressed result.
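
The index-range idea in item 4 can be sketched in Python with two binary searches over the sorted chunk. The function names are hypothetical, and the sketch assumes the bitset is indexed by position in the sorted chunk; mapping those positions back to the original row ids is left out.

```python
from bisect import bisect_left, bisect_right

def index_range(sorted_values, low=None, high=None,
                low_inclusive=True, high_inclusive=True):
    """Return (start, end) so that sorted_values[start:end] matches the range."""
    start, end = 0, len(sorted_values)
    if low is not None:
        search = bisect_left if low_inclusive else bisect_right
        start = search(sorted_values, low)
    if high is not None:
        search = bisect_right if high_inclusive else bisect_left
        end = search(sorted_values, high)
    return start, max(start, end)

def to_bitset(start, end, size):
    """One flag per row of the chunk; flag i is set when row i passes."""
    return [start <= i < end for i in range(size)]

chunk = [3, 7, 7, 10, 15, 20]

# 7 <= x < 15
start, end = index_range(chunk, low=7, high=15, high_inclusive=False)
bits = to_bitset(start, end, len(chunk))
print((start, end), chunk[start:end])  # (1, 4) [7, 7, 10]

# x > 20 matches nothing, so the range is empty
empty = index_range(chunk, low=20, low_inclusive=False)
print(empty)  # (6, 6)
```

Two binary searches replace a comparison per row, which is the benefit of evaluating the range filter on sorted data instead of row by row.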

@Ravindra, please correct me if I am wrong.

Best Regards
David Cai