Apache CarbonData Dev Mailing List archive

Re: carbon data performance doubts

Posted by Swapnil Shinde on Jul 20, 2017; 5:53am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/carbon-data-performance-doubts-tp18438p18547.html

Thank you, Manish.
Is dictionary exclude supported for datatypes other than String?
https://github.com/apache/carbondata/blob/6488bc018a2ec715b31407d12290680d388a43b3/integration/spark-common/src/main/scala/org/apache/spark/sql/catalyst/CarbonDDLSqlParser.scala#L706

-
Swapnil

On Wed, Jul 19, 2017 at 10:44 PM, manishgupta88 <[hidden email]>
wrote:

> Hi Swapnil
>
> Please find my answers inline.
>
> 1. What is the use of the *carbon.number.of.cores* property and how is it
> different from spark's executor cores?
>
> - carbon.number.of.cores controls how many threads read the footer and
> header of a carbondata file during query execution. Spark executor cores
> is a Spark property that Spark uses to parallelize tasks. After task
> distribution, each task opens carbon.number.of.cores threads in parallel
> to read the carbondata file footer and header; these threads are managed
> by carbon code.
>
> 2. Documentation says, by default, all non-numeric columns (except complex
> types) become dimensions and numeric columns become measures. How are
> dimension and measure columns handled differently? What are the pros and
> cons of keeping any column as dimension vs measure?
>
> - By default, dimensions take part in sorting the complete data from left
> to right, and because it is columnar storage each dimension column is
> further sorted individually. Measures, on the other hand, neither take
> part in sorting the data nor are individually sorted.
> - Because dimensions are sorted, filter queries on them can use binary
> search, which gives faster results.
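
A minimal sketch of why a sorted dimension helps an equality filter, written in Python rather than taken from CarbonData. The function name `matching_row_range` and the sample column are hypothetical; this only illustrates binary search over a sorted column and is not the CarbonData implementation.

```python
from bisect import bisect_left, bisect_right


def matching_row_range(sorted_column, value):
    """Return the half-open row range [start, end) where the column equals value.

    Assumes sorted_column is sorted ascending, as a dimension column would be.
    Both lookups are binary searches, so the cost is O(log n), not a full scan.
    """
    start = bisect_left(sorted_column, value)
    end = bisect_right(sorted_column, value)
    return start, end


# Hypothetical sorted dimension values for illustration only.
column = [1, 1, 2, 2, 2, 5, 7]
start, end = matching_row_range(column, 2)
matched_rows = column[start:end]
```

An empty range (start equal to end) means no row matches the filter value. An unsorted measure column would need every value inspected to reach the same answer.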
>
> 3. What is the best way to handle an ID INT column which will be used
> heavily for filtering/agg/joins but can't be a dimension by default?
> Documentation says to include these kinds of numeric columns with
> "dictionary_include" or "dictionary_exclude" in the table definition so
> that the column will be considered a dimension. It is not supported to keep
> non-string data types as "dictionary_exclude" (link
> <https://github.com/apache/carbondata/blob/6488bc018a2ec715b31407d12290680d388a43b3/integration/spark-common/src/main/scala/org/apache/spark/sql/catalyst/CarbonDDLSqlParser.scala#L690>)
> Then do we have to enable dictionary encoding for the ID INT columns that
> are beneficial to encode?
>
> -- In the current system the best way is to declare the ID column as
> dictionary include if the cardinality of the column is low, or dictionary
> exclude if the cardinality of the column is high. Measure filter
> optimization has already been implemented in branch 1.1
> (https://github.com/apache/carbondata/commits/branch-1.1) and will be
> available in the coming releases (1.2 or 1.3).
> For your reference you can go through PR-1124
> (https://github.com/apache/carbondata/pull/1124)
>
> 4. How does the MDK get generated and how can we alter it? Is there any API
> to find out the MDK for a given table?
>
> -- Only dictionary include columns take part in generation of the MDKey.
> The MDKey is generated based on the cardinality of the column. It is one of
> the data compression techniques used to reduce storage space in carbondata
> storage.
> Computation example:
> Number of bytes for each integer value - 4
> Total number of rows - 100000
> Total number of bytes - 100000*4
> Cardinality of column (total number of unique values of a column) - 5
> As the cardinality is only 5 and we store only the unique values for a
> dictionary column, 5 unique values require a total of 3 bits for storage.
> But we take the byte as the minimum storage unit, so we can consider 1 byte
> here for storing 5 unique values. So we have reduced the space from 4 bytes
> to 1 byte for each primitive integer value. This is the concept of MDKey.
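
The arithmetic in the computation example can be checked with a short Python sketch. The helper names `bits_needed` and `bytes_needed` are hypothetical, and the sketch assumes dictionary codes run from 0 to cardinality - 1; it reproduces only the numbers quoted above, not the logic of MultiDimKeyVarLengthGenerator.

```python
def bits_needed(cardinality: int) -> int:
    """Smallest number of bits that gives each of `cardinality` values its own code.

    Assumption for this sketch: codes are 0 .. cardinality - 1.
    """
    if cardinality < 1:
        raise ValueError("cardinality must be at least 1")
    # The largest code is cardinality - 1; keep at least 1 bit for a single value.
    return max(1, (cardinality - 1).bit_length())


def bytes_needed(cardinality: int) -> int:
    """Round the bit width up to whole bytes, the minimum storage unit."""
    return (bits_needed(cardinality) + 7) // 8


rows = 100000
raw_bytes = rows * 4                    # 4-byte integer per row
encoded_bytes = rows * bytes_needed(5)  # cardinality 5: 3 bits, stored as 1 byte
```

With these figures the column shrinks from 400000 bytes to 100000 bytes, matching the 4-byte to 1-byte reduction per value described in the example.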
>
> - You cannot alter an MDKey after table creation. MDKey will be created in
> the order you have specified the dictionary columns during table creation.
>
> - For MDKey generation logic you can check the class
> MultiDimKeyVarLengthGenerator
>
> Regards
> Manish Gupta
>
>
>
> --
> View this message in context:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/carbon-data-performance-doubts-tp18438p18523.html