http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Improving-Non-dictionary-storage-performance-tp8146p8215.html
Another suggestion: to avoid causing trouble for users during loading, in single-pass mode, if the number of dictionary keys generated for any column exceeds the configured value, the loading process should stop and log an error that explicitly reports the cardinality of all columns.
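
A minimal sketch of such a guard, written in Python only for illustration. The names `build_dictionaries`, `max_cardinality` and `CardinalityExceededError`, and the row shape (a dict per row), are hypothetical and are not CarbonData APIs:

```python
class CardinalityExceededError(Exception):
    """Raised when a dictionary column grows past the configured limit."""


def build_dictionaries(rows, dictionary_columns, max_cardinality):
    """Assign surrogate keys in a single pass over the rows.

    rows: iterable of dicts mapping column name -> value (assumed shape).
    Aborts as soon as any column has more than max_cardinality distinct values.
    """
    dictionaries = {col: {} for col in dictionary_columns}
    for row in rows:
        for col in dictionary_columns:
            mapping = dictionaries[col]
            value = row[col]
            if value not in mapping:
                mapping[value] = len(mapping) + 1  # next surrogate key
                if len(mapping) > max_cardinality:
                    # Report the cardinality seen so far for every column,
                    # not only the one that crossed the limit.
                    report = {c: len(d) for c, d in dictionaries.items()}
                    raise CardinalityExceededError(
                        f"column {col!r} exceeded {max_cardinality} distinct "
                        f"values; cardinality so far: {report}"
                    )
    return dictionaries
```

The error carries the per-column counts, so the user can see which columns are candidates for no-dictionary before retrying the load.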
> On 3 March 2017, at 1:26 AM, Ravindra Pesala <[hidden email]> wrote:
>
> Hi Likun,
>
> Yes, Likun, we had better keep dictionary as the default until we optimize
> no-dictionary columns.
> As you mentioned, we can suggest 2-pass mode for the first load, and
> subsequent loads can use single-pass mode to improve performance.
>
> Regards,
> Ravindra.
>
> On 2 March 2017 at 06:48, Jacky Li <[hidden email]> wrote:
>
>> Hi Ravindra & Vishal,
>>
>> Yes, I think this work needs to be done before switching to no-dictionary as
>> the default. So as of now, we should use dictionary as the default.
>> I think we can suggest that users load data as follows:
>> 1. First load: use 2-pass mode. The first scan should discover the
>> cardinality and check it against the user-specified option. We should define
>> rules to pass or fail the validation, and finalize the load options for
>> subsequent loads.
>> 2. Subsequent loads: use single-pass mode with the options defined by the
>> first load.
>>
>> What do you think?
>>
>> Regards,
>> Jacky
>>
>>> On 1 March 2017, at 11:31 PM, Ravindra Pesala <[hidden email]> wrote:
>>>
>>> Hi Vishal,
>>>
>>> You are right, that's why we can use no-dictionary only for the String
>>> datatype. Please look at my first point: we can always use direct dictionary
>>> for data types such as short, int, long, double & float when they are in
>>> sort_columns.
>>>
>>> Regards,
>>> Ravindra.
>>>
>>> On 1 March 2017 at 18:18, Kumar Vishal <[hidden email]> wrote:
>>>
>>>> Hi Ravi,
>>>> Sorting of data for no-dictionary columns should be based on the data type,
>>>> and the same applies to filters. Please add this point.
>>>>
>>>> -Regards
>>>> Kumar Vishal
>>>>
>>>> On Wed, Mar 1, 2017 at 8:34 PM, Ravindra Pesala <[hidden email]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> In order to make non-dictionary column storage and performance more
>>>>> efficient, I am suggesting the following improvements.
>>>>>
>>>>> 1. Always make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary.
>>>>> Right now only date and timestamp are direct dictionary columns. We can
>>>>> make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary if these columns
>>>>> are included in SORT_COLUMNS.
>>>>>
>>>>> 2. Consider delta/value compression while storing direct dictionary values.
>>>>> Right now it always uses the INT datatype to store direct dictionary
>>>>> values, so we can consider value/delta compression to compact the storage.
>>>>>
>>>>> 3. Use a separator instead of LV format to store String values in
>>>>> no-dictionary format.
>>>>> Currently String datatypes for non-dictionary columns are stored in
>>>>> LV (length-value) format, where we always use a Short (2 bytes) as the
>>>>> length. To keep storage compact we can use a separator (a 0 byte), which
>>>>> takes just a single byte. While reading, we can traverse the data and get
>>>>> the offsets as we do now.
>>>>>
>>>>> 4. Add range filters for no-dictionary columns.
>>>>> Currently range filters such as greater-than/less-than filters are not
>>>>> implemented for no-dictionary columns. We should implement them to avoid
>>>>> row-level filtering and improve performance.
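
The difference between the LV layout and the separator layout in point 3 can be sketched as follows. This is an illustration in Python, not CarbonData's on-disk format: the function names are hypothetical, and the big-endian unsigned 2-byte length is an assumption. The sketch also makes one constraint visible: a 0-byte separator only works if no value contains a 0 byte.

```python
import struct


def encode_lv(values):
    """Length-value layout: a 2-byte length (assumed big-endian) before each value."""
    out = bytearray()
    for v in values:
        out += struct.pack(">H", len(v)) + v
    return bytes(out)


def encode_separated(values):
    """Separator layout: each value is followed by a single 0x00 byte."""
    out = bytearray()
    for v in values:
        if b"\x00" in v:
            # The separator must never occur inside a value.
            raise ValueError("value contains the separator byte")
        out += v + b"\x00"
    return bytes(out)


def decode_separated(data):
    """Traverse the data once, splitting at every separator byte."""
    values = []
    start = 0
    for i, byte in enumerate(data):
        if byte == 0:
            values.append(data[start:i])
            start = i + 1
    return values
```

For three values with 10 payload bytes in total, `encode_lv` spends 6 bytes on lengths while `encode_separated` spends 3 on separators, so the saving is one byte per value.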
>>>>>
>>>>> Regards,
>>>>> Ravindra.
>>>>>
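
For point 2, a minimal sketch of delta compression over integer values, again in Python for illustration only. The names `delta_encode` and `delta_decode` are hypothetical, the input is assumed to be a non-empty list of integers, and the choice of widths (1, 2, 4 or 8 bytes) is an assumption rather than what CarbonData implements:

```python
def delta_encode(values):
    """Store values as unsigned offsets from the minimum, in the narrowest width.

    Assumes a non-empty list of integers.
    """
    base = min(values)
    deltas = [v - base for v in values]
    max_delta = max(deltas)
    for width in (1, 2, 4, 8):
        if max_delta < 1 << (8 * width):
            break
    else:
        raise ValueError("value range does not fit in 8 bytes")
    payload = b"".join(d.to_bytes(width, "big") for d in deltas)
    return base, width, payload


def delta_decode(base, width, payload):
    """Rebuild the original values from the base and fixed-width deltas."""
    return [
        base + int.from_bytes(payload[i:i + width], "big")
        for i in range(0, len(payload), width)
    ]
```

When the values in a block are close together, as date surrogates often are, each delta fits in one or two bytes instead of a fixed 4-byte INT.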
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Ravi
>>
>>
>>
>>
>
>
> --
> Thanks & Regards,
> Ravi