Questions about dictionary-encoded column and MDK


Questions about dictionary-encoded column and MDK

Jin Zhou
Hi,

Recently I have been doing some tests on Spark 2.1.0 + CarbonData 1.0.0 and have some questions:

1) An exception is thrown when a table is created without any dictionary column. Does that mean a carbon table must have at least one dictionary column?

2) What is the connection between a dictionary-encoded column and the MDK? Does the MDK only contain dictionary-encoded columns?

Re: Questions about dictionary-encoded column and MDK

Liang Chen
Administrator
Hi

Can you provide the full exception info?

Regards
Liang


Re: Questions about dictionary-encoded column and MDK

Jin Zhou
Exception info:
scala> carbon.sql("create table if not exists test(a integer, b integer, c integer) STORED BY 'carbondata'");
org.apache.carbondata.spark.exception.MalformedCarbonCommandException: Table default.test can not be created without key columns. Please use DICTIONARY_INCLUDE or DICTIONARY_EXCLUDE to set at least one key column if all specified columns are numeric types
  at org.apache.spark.sql.catalyst.CarbonDDLSqlParser.prepareTableModel(CarbonDDLSqlParser.scala:240)
  at org.apache.spark.sql.parser.CarbonSqlAstBuilder.visitCreateTable(CarbonSparkSqlParser.scala:162)
  at org.apache.spark.sql.parser.CarbonSqlAstBuilder.visitCreateTable(CarbonSparkSqlParser.scala:60)
  at org.apache.spark.sql.catalyst.parser.SqlBaseParser$CreateTableContext.accept(SqlBaseParser.java:503)
  at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)
  at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:66)
  at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:66)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:93)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:65)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:54)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
  at org.apache.spark.sql.parser.CarbonSparkSqlParser.parse(CarbonSparkSqlParser.scala:56)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
  at org.apache.spark.sql.parser.CarbonSparkSqlParser.parsePlan(CarbonSparkSqlParser.scala:46)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
  ... 50 elided

I hadn't noticed “if all specified columns are numeric types” in the exception message. So I ran more tests and found that the issue only occurs when all columns are numeric types.

Below are cases I tested:
case 1:
carbon.sql("create table if not exists test(a string, b string, c string) STORED BY 'carbondata' TBLPROPERTIES ('DICTIONARY_EXCLUDE'='a,b,c')");
====> ok, no dictionary column

case 2:
carbon.sql("create table if not exists test(a integer, b integer, c integer) STORED BY 'carbondata'");
====> fail

case 3:
carbon.sql("create table if not exists test(a integer, b integer, c integer) STORED BY 'carbondata' TBLPROPERTIES ('DICTIONARY_INCLUDE'='a')");
====> ok, at least one dictionary column

One remaining problem with case 2 is that there is no suitable column to dictionary-encode when all columns have high cardinality.

Re: Questions about dictionary-encoded column and MDK

Liang Chen
Administrator
Hi

1. The system builds the MDK index from dimensions (string columns are treated as dimensions, numeric columns as measures), so you have to specify at least one dimension (string column) to build the MDK index.

2. You can mark a numeric column with DICTIONARY_INCLUDE or DICTIONARY_EXCLUDE so that it is treated as a dimension and participates in the MDK index.
For case 2, you can change the script like this:
carbon.sql("create table if not exists test(a integer, b integer, c integer) STORED BY 'carbondata' TBLPROPERTIES ('DICTIONARY_INCLUDE'='a')");
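To make the relationship concrete, here is a rough sketch (illustrative only, not CarbonData's internal code) of how dictionary encoding assigns surrogate integer keys to dimension values, and how an MDK can then be composed from those keys in column-declaration order:

```python
# Hypothetical sketch: dictionary-encode two dimension columns, then form a
# per-row "MDK" as the tuple of surrogate keys in declared column order.

def build_dictionary(values):
    """Assign a surrogate integer key to each distinct value, in first-seen order."""
    dictionary = {}
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
    return dictionary

rows = [("CN", "beijing"), ("US", "nyc"), ("CN", "shanghai")]
country_dict = build_dictionary(r[0] for r in rows)
city_dict = build_dictionary(r[1] for r in rows)

# MDK per row: the surrogate keys of all dimensions, in table-declaration order.
mdks = [(country_dict[c], city_dict[t]) for c, t in rows]
print(mdks)  # [(0, 0), (1, 1), (0, 2)]
```

This is also why a measure-only table has nothing to build the key from: with no dimensions, there are no surrogate keys to compose.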

Regards
Liang


Re: Questions about dictionary-encoded column and MDK

ZhuWilliam
1. Dictionary encoding makes column storage more efficient (smaller size) and improves search performance.
2. During a search, the MDK and min/max indexes can be used for block/blocklet pruning in order to reduce IO. For now, the MDK is composed of the dimensions in the order in which they are declared in the create table statement.
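The min/max pruning idea can be sketched like this (a toy illustration with made-up data, not the actual CarbonData implementation): each blocklet records per-column min and max values, and a filter can skip any blocklet whose range cannot contain the filter value:

```python
# Toy min/max blocklet pruning: skip any blocklet whose [min, max] range
# cannot contain the filter value, without reading its row data.

blocklets = [
    {"id": 0, "min": 10, "max": 25, "rows": [10, 17, 25]},
    {"id": 1, "min": 26, "max": 40, "rows": [26, 33, 40]},
    {"id": 2, "min": 41, "max": 60, "rows": [41, 55, 60]},
]

def prune(blocklets, value):
    """Keep only the ids of blocklets whose range could contain `value`."""
    return [b["id"] for b in blocklets if b["min"] <= value <= b["max"]]

print(prune(blocklets, 33))  # only blocklet 1 survives -> [1]
print(prune(blocklets, 5))   # no blocklet can match -> []
```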


Re: Questions about dictionary-encoded column and MDK

Liang Chen
Administrator
Hi William,

Exactly! Your understanding is correct.

Also, the community is currently developing the sort_columns feature, which lets users specify the columns that make up the MDK. The PR number is 635. I invite all of you to review this PR's code.
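Independent of that PR's actual implementation, the reason choosing sort columns matters for pruning can be sketched with a toy example (made-up data): sorting rows on the filter column before slicing them into blocklets gives each blocklet a narrow, non-overlapping min/max range, so far more blocklets can be skipped:

```python
# Toy illustration: sorted data yields tight, disjoint per-blocklet min/max
# ranges, so min/max pruning skips more blocklets on a point lookup.

def blocklet_ranges(values, size):
    """Split values into blocklets of `size` rows and record each one's (min, max)."""
    return [(min(values[i:i + size]), max(values[i:i + size]))
            for i in range(0, len(values), size)]

data = [42, 7, 19, 88, 3, 55, 61, 24]

unsorted_ranges = blocklet_ranges(data, 4)        # wide, overlapping ranges
sorted_ranges = blocklet_ranges(sorted(data), 4)  # narrow, disjoint ranges

print(unsorted_ranges)  # [(7, 88), (3, 61)] -- a point lookup must scan both
print(sorted_ranges)    # [(3, 24), (42, 88)] -- a point lookup hits only one
```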

Regards
Liang
