http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Questions-about-dictionary-encoded-column-and-MDK-tp9457p9619.html
specify columns to make the MDK. The PR number is 635. Invite all of you to review it.
> 1. Dictionary encoding makes column storage more efficient (smaller size)
> and improves search performance.
> 2. When searching, the MDK and Min-Max indexes can be used to do
> block/blocklet pruning in order to reduce IO. For now, the MDK is composed
> of the dimensions in the order they are declared in the create table
> statement.
>
> On Thu, Mar 23, 2017 at 11:51 PM, Liang Chen <[hidden email]> wrote:
>
> > Hi
> >
> > 1. The system builds the MDK index from dimensions (string columns are
> > treated as dimensions, numeric columns as measures), so you have to
> > specify at least one dimension (string column) to build the MDK index.
> >
> > 2. You can set a numeric column with DICTIONARY_INCLUDE or
> > DICTIONARY_EXCLUDE to build the MDK index.
> > For case 2, you can change the script like this:
> > carbon.sql("create table if not exists test(a integer, b integer, c
> > integer) STORED BY 'carbondata' TBLPROPERTIES ('DICTIONARY_INCLUDE'='a')");
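For reference, the carbon value used in these snippets is a CarbonSession; below is a minimal sketch of how it is typically obtained per the CarbonData 1.x quick start, assuming spark-shell was launched with the CarbonData assembly jar on the classpath and that the store path shown is only a placeholder:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.CarbonSession._   // adds getOrCreateCarbonSession to the builder

  // sc is the SparkContext that spark-shell provides; the store path is a placeholder.
  val carbon = SparkSession
    .builder()
    .config(sc.getConf)
    .getOrCreateCarbonSession("hdfs://namenode:9000/carbon/store")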
> >
> > Regards
> > Liang
> >
> > 2017-03-23 18:39 GMT+05:30 Jin Zhou <[hidden email]>:
> >
> > > Exception info:
> > > scala> carbon.sql("create table if not exists test(a integer, b integer, c integer) STORED BY 'carbondata'");
> > > org.apache.carbondata.spark.exception.MalformedCarbonCommandException: Table default.test can not be created without key columns. Please use DICTIONARY_INCLUDE or DICTIONARY_EXCLUDE to set at least one key column if all specified columns are numeric types
> > >   at org.apache.spark.sql.catalyst.CarbonDDLSqlParser.prepareTableModel(CarbonDDLSqlParser.scala:240)
> > >   at org.apache.spark.sql.parser.CarbonSqlAstBuilder.visitCreateTable(CarbonSparkSqlParser.scala:162)
> > >   at org.apache.spark.sql.parser.CarbonSqlAstBuilder.visitCreateTable(CarbonSparkSqlParser.scala:60)
> > >   at org.apache.spark.sql.catalyst.parser.SqlBaseParser$CreateTableContext.accept(SqlBaseParser.java:503)
> > >   at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)
> > >   at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:66)
> > >   at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:66)
> > >   at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:93)
> > >   at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:65)
> > >   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:54)
> > >   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
> > >   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
> > >   at org.apache.spark.sql.parser.CarbonSparkSqlParser.parse(CarbonSparkSqlParser.scala:56)
> > >   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
> > >   at org.apache.spark.sql.parser.CarbonSparkSqlParser.parsePlan(CarbonSparkSqlParser.scala:46)
> > >   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
> > >   ... 50 elided
> > >
> > > I didn't notice "if all specified columns are numeric types" in the
> > > exception info. So I did more tests and found the issue only occurs
> > > when all columns are numeric types.
> > >
> > > Below are the cases I tested:
> > > case 1:
> > > carbon.sql("create table if not exists test(a string, b string, c string) STORED BY 'carbondata' TBLPROPERTIES ('DICTIONARY_EXCLUDE'='a,b,c')");
> > > ====> ok, no dictionary column
> > >
> > > case 2:
> > > carbon.sql("create table if not exists test(a integer, b integer, c integer) STORED BY 'carbondata'");
> > > ====> fail
> > >
> > > case 3:
> > > carbon.sql("create table if not exists test(a integer, b integer, c integer) STORED BY 'carbondata' TBLPROPERTIES ('DICTIONARY_INCLUDE'='a')");
> > > ====> ok, at least one dictionary column
> > >
> > > One small problem with case 2 is that there is no suitable dictionary
> > > column to choose when all the columns have high cardinality.
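A possible interim workaround for that situation, sketched below with hypothetical table and column names: force the numeric column with the lowest cardinality into the dictionary via DICTIONARY_INCLUDE, as in case 3, accepting that a truly high-cardinality dictionary column makes the dictionary large and can slow down data loading:

  // sensor_id is assumed to have the lowest cardinality of the three numeric
  // columns, so it is the least costly choice for a dictionary/key column.
  carbon.sql(
    """create table if not exists metrics(
      |  sensor_id integer,
      |  ts bigint,
      |  value double)
      |STORED BY 'carbondata'
      |TBLPROPERTIES ('DICTIONARY_INCLUDE'='sensor_id')""".stripMargin)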
> > >
> > >
> > >
> > >
> > > --
> > > View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Questions-about-dictionary-encoded-column-and-MDK-tp9457p9484.html
> > > Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
> > >
> >
> >
> >
> > --
> > Regards
> > Liang
> >
>
>
>
> --
> Best Regards
> WilliamZhu 祝海林
[hidden email]