Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release


Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

xm_zzc
Hi, does this version support update and delete with Spark 2.1? It seems it does not; when is support planned?


------------------ Original ------------------
From:  "ravipesala [via Apache CarbonData Mailing List archive]";<[hidden email]>;
Date:  Sun, Mar 26, 2017 01:16 PM
To:  "恩爸"<[hidden email]>;
Subject:  [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

Hi All,

As planned, we are going to release Apache CarbonData-1.1.0. Please discuss
and vote to initiate the 1.1.0 release; I will start preparing the
release after 3 days of discussion. It will have the following features.

 1. Introduced a new data format called V3 (version 3).

   Improves sequential IO by keeping larger blocklets, so more data is read
into memory at once.
   Introduced pages of 32,000 rows each for every column inside a
blocklet, and min/max is maintained for each page to improve filter
queries.
   Improved compression/decompression of row pages.
Overall performance is improved by 50% compared to the old format, as per
TPC-H benchmark results.
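The page-level min/max idea above can be sketched in plain Java (a hypothetical illustration, not CarbonData's actual classes or method names): a reader can skip any page whose [min, max] range cannot contain the filter value.

```java
// Sketch of page pruning via per-page min/max metadata (hypothetical names).
// Each page covers 32000 rows; a page is scanned only if the filter value
// could fall inside its recorded [min, max] range.
public class PagePruneSketch {

  // true if a page whose values lie in [min, max] might contain `value`
  static boolean mightContain(int min, int max, int value) {
    return value >= min && value <= max;
  }

  public static void main(String[] args) {
    // three pages with their min/max column values
    int[][] pageMinMax = { {0, 31999}, {32000, 63999}, {64000, 95999} };
    int filterValue = 40000;
    for (int p = 0; p < pageMinMax.length; p++) {
      boolean scan = mightContain(pageMinMax[p][0], pageMinMax[p][1], filterValue);
      System.out.println("page " + p + " scan=" + scan);
    }
  }
}
```

Only the second page is scanned for the filter value 40000; the other two are skipped without touching their data.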


2. Alter table support in CarbonData. (Only for Spark 2.1)

   Support renaming an existing table.
   Support adding a new column.
   Support removing an existing column.
   Support upcasting a datatype (e.g., from smallint to int).
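For illustration, the alter operations listed above might look like the following DDL (a sketch only; the exact CarbonData syntax may differ by version, and the table and column names here are hypothetical):

```sql
ALTER TABLE t3 RENAME TO t3_new;
ALTER TABLE t3_new ADD COLUMNS (city String);
ALTER TABLE t3_new DROP COLUMNS (city);
ALTER TABLE t3_new CHANGE salary salary BIGINT;  -- upcast to a wider type
```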


3. Supported batch sort to improve data loading performance.

   It makes the sort step non-blocking: each batch is sorted entirely in
memory and then converted to a CarbonData file.
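The batch-sort idea can be sketched as follows (an illustrative toy, not the loader's actual code): each batch is sorted independently in memory, so a sorted batch can be written out while the next batch is still being collected, instead of blocking until the whole load is sorted.

```java
import java.util.Arrays;

// Sketch of batch sort: sort each in-memory batch on its own; in the real
// loader each sorted batch would then be converted to a CarbonData file.
public class BatchSortSketch {

  // sort one batch in memory, leaving the input untouched
  static int[] sortBatch(int[] batch) {
    int[] copy = Arrays.copyOf(batch, batch.length);
    Arrays.sort(copy);
    return copy;
  }

  public static void main(String[] args) {
    int[][] batches = { {5, 3, 9}, {2, 8, 1} };
    for (int[] batch : batches) {
      // each batch is independently sorted and could be flushed immediately
      System.out.println(Arrays.toString(sortBatch(batch)));
    }
  }
}
```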


4. Improved single-pass load by upgrading to the latest Netty framework and
launching a dictionary client for each load.

5. Supported range filters: between filters on the same column are combined
into one filter to improve filter performance.
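The range-filter combination can be illustrated like this (a hypothetical sketch, not CarbonData's filter classes): two separate comparisons on one column, `col >= low` and `col <= high`, are evaluated as a single range check instead of two independent filters.

```java
// Sketch of a combined range filter (hypothetical names): one range check
// replaces a pair of independent >= and <= comparison filters.
public class RangeFilterSketch {

  // equivalent to evaluating `value >= low AND value <= high` in one pass
  static boolean inRange(int value, int low, int high) {
    return value >= low && value <= high;
  }

  public static void main(String[] args) {
    // stands in for `WHERE col >= 10 AND col <= 20`
    int[] values = {5, 10, 15, 25};
    for (int v : values) {
      System.out.println(v + " -> " + inRange(v, 10, 20));
    }
  }
}
```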

6. Apart from these features, many bug fixes and improvements are included in this release.

--
Thanks & Regards,
Ravindra




Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

Liang Chen
Administrator
Hi

Yes, the update and delete feature with Spark 2.x will be supported after 1.1.0.
As planned, 1.2 (or possibly an earlier release) will support it.

Regards
Liang

xm_zzc wrote
Hi, does this version support update and delete with Spark 2.1? It seems it does not; when is support planned?





Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

xm_zzc
Hi, Liang:
  Thanks for your reply.

question about dimension's sort order in blocklet level

simafengyun
CONTENTS DELETED
The author has deleted this message.

Re: question about dimension's sort order in blocklet level

Liang Chen
Administrator
Hi

Please start a new mailing-list discussion for your topic,
and please provide the cardinality of all columns.

For a high-cardinality column, the system does not do dictionary encoding:
-------------------------------------------------------
## threshold to identify a high-cardinality column
#high.cardinality.threshold=1000000
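The threshold rule can be sketched as follows (names here are hypothetical, not CarbonData internals): a column is dictionary-encoded only when its cardinality stays below `high.cardinality.threshold`.

```java
// Sketch of the dictionary decision driven by the cardinality threshold
// above (hypothetical class and method names, not CarbonData's own).
public class DictionaryDecisionSketch {

  // mirrors the default high.cardinality.threshold=1000000 setting
  static final long HIGH_CARDINALITY_THRESHOLD = 1000000L;

  // dictionary-encode only low-cardinality columns
  static boolean useDictionary(long cardinality) {
    return cardinality < HIGH_CARDINALITY_THRESHOLD;
  }

  public static void main(String[] args) {
    System.out.println("country (5): " + useDictionary(5L));
    System.out.println("name (10000000): " + useDictionary(10000000L));
  }
}
```

Under this rule the "name" columns in the table below (cardinality 10000000) would not be dictionary-encoded, which matches what the poster observed.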

Regards
Liang

simafengyun wrote
Hi DEV,

I create table according to the below SQL

    cc.sql("""
           CREATE TABLE IF NOT EXISTS t3
           (ID Int, date Timestamp, country String,
           name String, phonetype String, serialname String, salary Int,
           name1 String, name2 String, name3 String, name4 String, name5 String, name6 String,name7 String,name8 String
           )
           STORED BY 'carbondata'
           """)

After I load data into this table, I found that the dimension columns "name" and "name7" both have no dictionary encoding.
Column "name" has no inverted index but column "name7" has an inverted index.
Questions:
1. Why do they have no dictionary encoding by default, and why do some columns have no inverted index?
2. Is there any document describing these loading strategies?
3. The dimension column "name" has no inverted index; is its data still ordered in the DataChunk2 blocklet?
4. As I understand, dimension column data is usually sorted when stored in the DataChunk2 blocklet.
 In which cases is dimension column data not sorted in the DataChunk2 blocklet, other than when the user specifies the column as no-inverted-index?


5. As I understand, the first column of the MDK key is always sorted in the DataChunk2 blocklet, so why is isExplicitSorted not set to true?

question about dimension's sort order in blocklet level

simafengyun
Hi DEV,

 I created a table using the SQL below:

    cc.sql("""
           CREATE TABLE IF NOT EXISTS t3
           (ID Int, date Timestamp, country String,
           name String, phonetype String, serialname String, salary Int,
           name1 String, name2 String, name3 String, name4 String,
           name5 String, name6 String, name7 String, name8 String
           )
           STORED BY 'carbondata'
           """)

 

data cardinality as below:

| column      | name     | name1    | name2    | name3    | name4    | name5    | name6    | name7    | name8    |
| cardinality | 10000000 | 10000000 | 10000000 | 10000000 | 10000000 | 10000000 | 10000000 | 10000000 | 10000000 |

After I load data into this table, I found that the dimension columns "name" and "name7" both have no dictionary encoding,

but column "name" has no inverted index while column "name7" has an inverted index.

Questions:

1. The dimension column "name" has dictionary encoding but no inverted index; is its data still ordered in the DataChunk2 blocklet?

2. Is there any document describing these loading strategies?


3. If a dimension column has no dictionary encoding and no inverted index, and the user also did not specify the column as no-inverted-index when creating the table,
    is its data still ordered in the DataChunk2 blocklet?

4. As I understand, by default all dimension column data is sorted when stored in the DataChunk2 blocklet, except for columns the user specifies as no-inverted-index, right?

5. As I understand, the first dimension column of the MDK key is always sorted in the DataChunk2 blocklet, so why is isExplicitSorted not set to true?

 

 The attached code is used to generate data.csv:

package test;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.Map;

public class CreateData {

  public static void main(String[] args) {
    FileOutputStream outStr = null;
    BufferedOutputStream buff = null;

    try {
      outStr = new FileOutputStream(new File("data.csv"));
      buff = new BufferedOutputStream(outStr);

      long begin = System.currentTimeMillis();
      buff.write(
          "ID,date,country,name,phonetype,serialname,salary,name1,name2,name3,name4,name5,name6,name7,name8\n"
              .getBytes());

      int idcount = 10000000;
      int datecount = 30;
      int countrycount = 5;
      int phonetypecount = 10000;
      int serialnamecount = 50000;

      Map<Integer, String> countryMap = new HashMap<Integer, String>();
      countryMap.put(0, "canada");
      countryMap.put(1, "usa");
      countryMap.put(2, "uk");
      countryMap.put(3, "china");
      countryMap.put(4, "indian");

      StringBuilder sb;
      for (int i = idcount; i >= 0; i--) {
        sb = new StringBuilder();
        sb.append(4000000 + i).append(",");                                   // ID
        sb.append("2015/8/" + (i % datecount + 1)).append(",");               // date
        sb.append(countryMap.get(i % countrycount)).append(",");              // country
        sb.append("name" + (1600000 - i)).append(",");                        // name
        sb.append("phone" + i % phonetypecount).append(",");                  // phonetype
        sb.append("serialname" + (100000 + i % serialnamecount)).append(","); // serialname
        sb.append(i + 500000).append(",");                                    // salary
        sb.append("name1" + (i + 100000)).append(",");
        sb.append("name2" + (i + 200000)).append(",");
        sb.append("name3" + (i + 300000)).append(",");
        sb.append("name4" + (i + 400000)).append(",");
        sb.append("name5" + (i + 500000)).append(",");
        sb.append("name6" + (i + 600000)).append(",");
        sb.append("name7" + (i + 700000)).append(",");
        sb.append("name8" + (i + 800000)).append(",").append('\n');

        buff.write(sb.toString().getBytes());
      }

      buff.flush();
      long end = System.currentTimeMillis();
      System.out.println("BufferedOutputStream execution time: " + (end - begin) + " ms");
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      try {
        if (buff != null) buff.close();
        if (outStr != null) outStr.close();
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
  }
}

Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

kumarvishal09
In reply to this post by xm_zzc
+1
-Regards
Kumar Vishal

On Mar 27, 2017 09:31, "xm_zzc" <[hidden email]> wrote:

> Hi, Liang:
>   Thanks for your reply.
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Re-DISCUSSION-
> Initiating-Apache-CarbonData-1-1-0-incubating-Release-tp9672p9680.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>
kumar vishal

Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

manishgupta88
+1

Regards
Manish Gupta

On Mon, Mar 27, 2017 at 2:41 PM, Kumar Vishal <[hidden email]>
wrote:

> +1
> -Regards
> Kumar Vishal
>
> On Mar 27, 2017 09:31, "xm_zzc" <[hidden email]> wrote:
>
> > Hi, Liang:
> >   Thanks for your reply.
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/Re-DISCUSSION-
> > Initiating-Apache-CarbonData-1-1-0-incubating-Release-tp9672p9680.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>

question about dimension's sort order in blocklet level

Liang Chen
Administrator
In reply to this post by simafengyun
Hi

Can you provide your info in a table? It is hard to read as posted.

A column of high cardinality (>1000000) will not be dictionary encoded.

Regards
Liang

2017-03-27 14:32 GMT+05:30 马云 <simafengyun1984@163.com>:

> Hi DEV,
> [snip]




--
Regards
Liang

Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

Henry Saputra
In reply to this post by manishgupta88
Sure, let's do one more release

+1

On Mon, Mar 27, 2017 at 2:58 AM, manish gupta <[hidden email]>
wrote:

> +1
>
> Regards
> Manish Gupta
>
> On Mon, Mar 27, 2017 at 2:41 PM, Kumar Vishal <[hidden email]>
> wrote:
>
> > +1
> > -Regards
> > Kumar Vishal
> >
> > On Mar 27, 2017 09:31, "xm_zzc" <[hidden email]> wrote:
> >
> > > Hi, Liang:
> > >   Thanks for your reply.
> > >
> > >
> > >
> > > --
> > > View this message in context: http://apache-carbondata-
> > > mailing-list-archive.1130556.n5.nabble.com/Re-DISCUSSION-
> > > Initiating-Apache-CarbonData-1-1-0-incubating-Release-tp9672p9680.html
> > > Sent from the Apache CarbonData Mailing List archive mailing list
> archive
> > > at Nabble.com.
> > >
> >
>