Apache CarbonData Dev Mailing List archive

Re: Improving Non-dictionary storage & performance.

Posted by Jacky Li on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Improving-Non-dictionary-storage-performance-tp8146p8215.html

Hi Ravindra,

Another suggestion: to avoid causing trouble for users during loading, in single-pass mode, if the number of dictionary keys generated for any column exceeds the configured value, the loading process should stop and log an error that explicitly reports the cardinality of all columns.
That way, the user will know what caused the data load to fail.
What do you think of this idea?
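
A rough sketch of the check, in Python for brevity. This is not CarbonData code: `load_single_pass`, `max_cardinality` and `CardinalityLimitExceeded` are names made up for illustration, and the dictionary is just an in-memory map.

```python
import logging


class CardinalityLimitExceeded(Exception):
    """Hypothetical error raised when a column's dictionary grows too large."""

    def __init__(self, column, cardinalities):
        super().__init__(
            f"column {column!r} exceeded the cardinality limit; "
            f"cardinalities so far: {cardinalities}"
        )
        self.column = column
        self.cardinalities = cardinalities


def load_single_pass(rows, columns, max_cardinality):
    """Build one dictionary per column while scanning the rows once.

    Stops as soon as any column has more distinct values than
    max_cardinality (an assumed, user-configured limit).
    """
    dictionaries = {col: {} for col in columns}
    for row in rows:
        for col, value in zip(columns, row):
            dictionary = dictionaries[col]
            if value not in dictionary:
                # Assign the next surrogate key to the new value.
                dictionary[value] = len(dictionary) + 1
                if len(dictionary) > max_cardinality:
                    # Report every column, not only the offending one.
                    seen = {c: len(d) for c, d in dictionaries.items()}
                    logging.error(
                        "Load stopped: column %s exceeded limit %d; "
                        "cardinalities so far: %s",
                        col, max_cardinality, seen,
                    )
                    raise CardinalityLimitExceeded(col, seen)
    return dictionaries
```

Note that the reported numbers are the cardinalities observed up to the point of failure, not the final cardinalities of the full input.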

Regards,
Jacky

> On March 3, 2017, at 1:26 AM, Ravindra Pesala <[hidden email]> wrote:
>
> Hi Likun,
>
> Yes, Likun, we had better keep dictionary as the default until we optimize
> no-dictionary columns.
> As you mentioned, we can suggest 2-pass for the first load, and subsequent
> loads will use single-pass to improve performance.
>
> Regards,
> Ravindra.
>
> On 2 March 2017 at 06:48, Jacky Li <[hidden email]> wrote:
>
>> Hi Ravindra & Vishal,
>>
>> Yes, I think this work needs to be done before switching to no-dictionary
>> as the default. So as of now, we should use dictionary as the default.
>> I think we can suggest that users load data as follows:
>> 1. First load: use 2-pass mode. The first scan should discover the
>> cardinality and check it against the user-specified option. We should
>> define rules to pass or fail the validation, and finalize the load options
>> for subsequent loads.
>> 2. Subsequent loads: use single-pass mode with the options defined
>> by the first load.
>>
>> What do you think?
>>
>> Regards,
>> Jacky
>>
>>> On March 1, 2017, at 11:31 PM, Ravindra Pesala <[hidden email]> wrote:
>>>
>>> Hi Vishal,
>>>
>>> You are right; that's why we can use no-dictionary only for the String
>>> datatype.
>>> Please look at my first point. We can always use direct dictionary for
>>> eligible data types like short, int, long, double & float in
>>> sort_columns.
>>>
>>> Regards,
>>> Ravindra.
>>>
>>> On 1 March 2017 at 18:18, Kumar Vishal <[hidden email]> wrote:
>>>
>>>> Hi Ravi,
>>>> Sorting of data for no-dictionary columns should be based on the data
>>>> type, and the same applies to filters. Please add this point.
>>>>
>>>> -Regards
>>>> Kumar Vishal
>>>>
>>>> On Wed, Mar 1, 2017 at 8:34 PM, Ravindra Pesala <[hidden email]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> To make the storage and performance of non-dictionary columns more
>>>>> efficient, I am suggesting the following improvements.
>>>>>
>>>>> 1. Always make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary.
>>>>> Right now only date and timestamp are direct dictionary columns. We
>>>>> can make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary if these
>>>>> columns are included in SORT_COLUMNS.
>>>>>
>>>>> 2. Consider delta/value compression while storing direct dictionary
>>>>> values.
>>>>> Right now the INT datatype is always used to store direct dictionary
>>>>> values. We can consider value/delta compression to make the storage
>>>>> more compact.
>>>>>
>>>>> 3. Use a separator instead of LV format to store String values in
>>>>> no-dictionary format.
>>>>> Currently, String values for non-dictionary columns are stored in
>>>>> LV (length-value) format, which always uses a Short (2 bytes) for the
>>>>> length. To keep the storage compact, we can use a separator (a 0 byte),
>>>>> which takes just a single byte. While reading, we can traverse the
>>>>> data and get the offsets as we do now.
>>>>>
>>>>> 4. Add range filters for no-dictionary columns.
>>>>> Currently, range filters such as greater-than/less-than are not
>>>>> implemented for no-dictionary columns. We should implement them to
>>>>> avoid row-level filtering and improve performance.
>>>>>
>>>>> Regards,
>>>>> Ravindra.
>>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Ravi
>>
>>
>>
>>
>
>
> --
> Thanks & Regards,
> Ravi
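
To make point 3 of the original proposal quoted above more concrete, here is a minimal Python sketch comparing the two layouts: LV with a 2-byte length prefix, and a 0 byte as separator. It is an illustration under stated assumptions, not CarbonData's actual encoder. The function names are hypothetical, values are assumed to be UTF-8 encoded, and the separator variant assumes no value contains a 0 byte (the sketch rejects such values).

```python
import struct


def encode_lv(values):
    """Length-value layout: 2-byte big-endian length, then the bytes."""
    out = bytearray()
    for value in values:
        data = value.encode("utf-8")
        # struct.pack raises struct.error if len(data) > 65535.
        out += struct.pack(">H", len(data)) + data
    return bytes(out)


def decode_lv(buf):
    """Walk the buffer, reading each length prefix to find the next value."""
    values, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        values.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return values


def encode_separated(values):
    """Separator layout: the bytes of each value followed by one 0 byte."""
    out = bytearray()
    for value in values:
        data = value.encode("utf-8")
        if b"\x00" in data:
            raise ValueError("value contains the separator byte")
        out += data + b"\x00"
    return bytes(out)


def decode_separated(buf):
    """Split on the 0 byte; the final piece after the last separator is empty."""
    return [part.decode("utf-8") for part in buf.split(b"\x00")[:-1]]


values = ["ab", "", "c"]
lv = encode_lv(values)
sep = encode_separated(values)
print(len(lv), len(sep))  # 2 bytes vs. 1 byte of overhead per value
```

The separator layout saves one byte per value, at the cost of needing a rule for values that contain the separator byte itself.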