Apache CarbonData Dev Mailing List archive

Re: Improving Non-dictionary storage & performance.

Posted by ravipesala on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Improving-Non-dictionary-storage-performance-tp8146p8202.html

Hi Likun,

Yes, Likun, we had better keep dictionary as the default until we optimize
no-dictionary columns.
As you mentioned, we can suggest 2-pass mode for the first load, and
subsequent loads can use single-pass mode to improve performance.
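
To make the suggested flow concrete, here is a rough sketch of what the first pass could do: discover per-column cardinality, validate it against the user's options, and produce the options that later single-pass loads reuse. This is only an illustration in Python, not CarbonData code. The names `discover_cardinality`, `finalize_load_options` and `CARDINALITY_LIMIT`, and the pass/fail rule itself, are assumptions; the actual validation rules still have to be defined.

```python
from typing import Dict, Iterable, Set

# Hypothetical cutoff (not a CarbonData setting): a column with more distinct
# values than this is considered too high-cardinality for dictionary encoding.
CARDINALITY_LIMIT = 3


def discover_cardinality(rows: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """First scan of a 2-pass load: count distinct values per column."""
    distinct: Dict[str, Set[str]] = {}
    for row in rows:
        for column, value in row.items():
            distinct.setdefault(column, set()).add(value)
    return {column: len(values) for column, values in distinct.items()}


def finalize_load_options(
    cardinality: Dict[str, int],
    no_dictionary_columns: Set[str],
    limit: int = CARDINALITY_LIMIT,
) -> Dict[str, str]:
    """Check the discovered cardinality against the user-specified option.

    Dictionary is the default. Assumed fail rule: a column left on dictionary
    encoding whose cardinality exceeds `limit` fails the validation.
    """
    options: Dict[str, str] = {}
    for column, count in cardinality.items():
        if column in no_dictionary_columns:
            options[column] = "no_dictionary"
        elif count > limit:
            raise ValueError(
                f"column {column!r} has cardinality {count} > {limit}; "
                "mark it as no-dictionary"
            )
        else:
            options[column] = "dictionary"
    return options


# First load (2-pass): scan once, validate, and finalize the options.
first_load_rows = [
    {"city": "a", "id": "1"},
    {"city": "a", "id": "2"},
    {"city": "b", "id": "3"},
    {"city": "b", "id": "4"},
]
cardinality = discover_cardinality(first_load_rows)
load_options = finalize_load_options(cardinality, no_dictionary_columns={"id"})
```

Subsequent loads would then skip the first scan and reuse `load_options` as finalized by the first load.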

Regards,
Ravindra.

On 2 March 2017 at 06:48, Jacky Li <[hidden email]> wrote:

> Hi Ravindra & Vishal,
>
> Yes, I think this work needs to be done before switching to no-dictionary
> as the default. So as of now, we should use dictionary as the default.
> I think we can suggest that users load data as follows:
> 1. First load: use 2-pass mode. The first scan should discover the
> cardinality and check it against the user-specified options. We should
> define rules to pass or fail the validation, and finalize the load options
> for subsequent loads.
> 2. Subsequent loads: use single-pass mode with the options defined by the
> first load.
>
> What do you think?
>
> Regards,
> Jacky
>
> > On 1 March 2017, at 11:31 PM, Ravindra Pesala <[hidden email]> wrote:
> >
> > Hi Vishal,
> >
> > You are right, that's why we can use no-dictionary only for the String
> > datatype.
> > Please look at my first point. We can always use direct dictionary for
> > possible data types like short, int, long, double & float for
> > sort_columns.
> >
> > Regards,
> > Ravindra.
> >
> > On 1 March 2017 at 18:18, Kumar Vishal <[hidden email]> wrote:
> >
> >> Hi Ravi,
> >> Sorting of data for no-dictionary columns should be based on the data
> >> type, and the same applies to filters. Please add this point.
> >>
> >> -Regards
> >> Kumar Vishal
> >>
> >> On Wed, Mar 1, 2017 at 8:34 PM, Ravindra Pesala <[hidden email]>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> To make the storage and performance of non-dictionary columns more
> >>> efficient, I suggest the following improvements.
> >>>
> >>> 1. Always make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary.
> >>> Right now only date and timestamp are direct dictionary columns. We can
> >>> make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary if these
> >>> columns are included in SORT_COLUMNS.
> >>>
> >>> 2. Consider delta/value compression while storing direct dictionary
> >>> values.
> >>> Right now it always uses the INT datatype to store direct dictionary
> >>> values, so we can consider value/delta compression to compact the
> >>> storage.
> >>>
> >>> 3. Use a separator instead of LV format to store String values in
> >>> no-dictionary format.
> >>> Currently String values for non-dictionary columns are stored in
> >>> LV (length-value) format, where we always use a Short (2 bytes) as the
> >>> length. To keep the storage compact we can use a separator (a 0 byte),
> >>> which takes just a single byte. While reading, we can traverse the
> >>> data and get the offsets like we are doing now.
> >>>
> >>> 4. Add range filters for no-dictionary columns.
> >>> Currently range filters such as greater-than/less-than filters are not
> >>> implemented for no-dictionary columns. We should implement them to
> >>> avoid row-level filtering and improve performance.
> >>>
> >>> Regards,
> >>> Ravindra.
> >>>
> >>
> >
> >
> > --
> > Thanks & Regards,
> > Ravi
>
>
>
>

--
Thanks & Regards,
Ravi