Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release


Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

xm_zzc
Hi, does this version support update and delete with Spark 2.1? It seems it does not; when is support planned?


------------------ Original ------------------
From:  "ravipesala [via Apache CarbonData Mailing List archive]";<[hidden email]>;
Date:  Sun, Mar 26, 2017 01:16 PM
To:  "恩爸"<[hidden email]>;
Subject:  [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

Hi All,

As planned, we are going to release Apache CarbonData-1.1.0. Please discuss
and vote to initiate the 1.1.0 release; I will start preparing the
release after 3 days of discussion. It will have the following features.

 1. Introduced a new data format called V3 (version 3).

   Improves sequential IO by keeping larger blocklets, so more data is read
into memory at once.
   Introduced pages of 32,000 rows each for every column inside a
blocklet, and min/max is maintained for each page to improve filter
queries.
   Improved compression/decompression of row pages.
Overall performance is improved by 50% compared to the old format, as per
TPC-H benchmark results.
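The page-level min/max idea above can be sketched in plain Java (a hypothetical illustration, not CarbonData's actual classes or method names): a reader can skip any page whose [min, max] range cannot contain the filter value.

```java
// Sketch of page pruning via per-page min/max metadata (hypothetical names).
// Each page covers 32000 rows; a page is scanned only if the filter value
// could fall inside its recorded [min, max] range.
public class PagePruneSketch {

  // true if a page whose values lie in [min, max] might contain `value`
  static boolean mightContain(int min, int max, int value) {
    return value >= min && value <= max;
  }

  public static void main(String[] args) {
    // three pages with their min/max column values
    int[][] pageMinMax = { {0, 31999}, {32000, 63999}, {64000, 95999} };
    int filterValue = 40000;
    for (int p = 0; p < pageMinMax.length; p++) {
      boolean scan = mightContain(pageMinMax[p][0], pageMinMax[p][1], filterValue);
      System.out.println("page " + p + " scan=" + scan);
    }
  }
}
```

Only the second page is scanned for the filter value 40000; the other two are skipped without touching their data.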


2. Alter table support in CarbonData. (Only for Spark 2.1)

   Support renaming an existing table.
   Support adding a new column.
   Support removing an existing column.
   Support upcasting a datatype (e.g., from smallint to int).
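For illustration, the alter operations listed above might look like the following DDL (a sketch only; the exact CarbonData syntax may differ by version, and the table and column names here are hypothetical):

```sql
ALTER TABLE t3 RENAME TO t3_new;
ALTER TABLE t3_new ADD COLUMNS (city String);
ALTER TABLE t3_new DROP COLUMNS (city);
ALTER TABLE t3_new CHANGE salary salary BIGINT;  -- upcast to a wider type
```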


3. Supported batch sort to improve data loading performance.

   It makes the sort step non-blocking: each batch is sorted entirely in
memory and then converted to a CarbonData file.
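The batch-sort idea can be sketched as follows (an illustrative toy, not the loader's actual code): each batch is sorted independently in memory, so a sorted batch can be written out while the next batch is still being collected, instead of blocking until the whole load is sorted.

```java
import java.util.Arrays;

// Sketch of batch sort: sort each in-memory batch on its own; in the real
// loader each sorted batch would then be converted to a CarbonData file.
public class BatchSortSketch {

  // sort one batch in memory, leaving the input untouched
  static int[] sortBatch(int[] batch) {
    int[] copy = Arrays.copyOf(batch, batch.length);
    Arrays.sort(copy);
    return copy;
  }

  public static void main(String[] args) {
    int[][] batches = { {5, 3, 9}, {2, 8, 1} };
    for (int[] batch : batches) {
      // each batch is independently sorted and could be flushed immediately
      System.out.println(Arrays.toString(sortBatch(batch)));
    }
  }
}
```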


4. Improved single-pass load by upgrading to the latest Netty framework and
launching a dictionary client for each load.

5. Supported range filters: between filters on the same column are combined
into one filter to improve filter performance.
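The range-filter combination can be illustrated like this (a hypothetical sketch, not CarbonData's filter classes): two separate comparisons on one column, `col >= low` and `col <= high`, are evaluated as a single range check instead of two independent filters.

```java
// Sketch of a combined range filter (hypothetical names): one range check
// replaces a pair of independent >= and <= comparison filters.
public class RangeFilterSketch {

  // equivalent to evaluating `value >= low AND value <= high` in one pass
  static boolean inRange(int value, int low, int high) {
    return value >= low && value <= high;
  }

  public static void main(String[] args) {
    // stands in for `WHERE col >= 10 AND col <= 20`
    int[] values = {5, 10, 15, 25};
    for (int v : values) {
      System.out.println(v + " -> " + inRange(v, 10, 20));
    }
  }
}
```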

6. Apart from these features, many bug fixes and improvements are included in this release.

--
Thanks & Regards,
Ravindra




Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

Liang Chen
Administrator
Hi

Yes, the update and delete feature with Spark 2.x will be supported after 1.1.0.
As planned, 1.2 (or possibly an earlier release) will support it.

Regards
Liang

xm_zzc wrote
Hi, does this version support update and delete with Spark 2.1? It seems it does not; when is support planned?





Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

xm_zzc
Hi, Liang:
  Thanks for your reply.

question about dimension's sort order in blocklet level

simafengyun
CONTENTS DELETED
The author has deleted this message.

Re: question about dimension's sort order in blocklet level

Liang Chen
Administrator
Hi

Please start a new mailing-list discussion for your topic,
and please provide the cardinality of all columns.

For a high-cardinality column, the system does not do dictionary encoding:
-------------------------------------------------------
## threshold to identify a high-cardinality column
#high.cardinality.threshold=1000000
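The threshold rule can be sketched as follows (names here are hypothetical, not CarbonData internals): a column is dictionary-encoded only when its cardinality stays below `high.cardinality.threshold`.

```java
// Sketch of the dictionary decision driven by the cardinality threshold
// above (hypothetical class and method names, not CarbonData's own).
public class DictionaryDecisionSketch {

  // mirrors the default high.cardinality.threshold=1000000 setting
  static final long HIGH_CARDINALITY_THRESHOLD = 1000000L;

  // dictionary-encode only low-cardinality columns
  static boolean useDictionary(long cardinality) {
    return cardinality < HIGH_CARDINALITY_THRESHOLD;
  }

  public static void main(String[] args) {
    System.out.println("country (5): " + useDictionary(5L));
    System.out.println("name (10000000): " + useDictionary(10000000L));
  }
}
```

Under this rule the "name" columns in the table below (cardinality 10000000) would not be dictionary-encoded, which matches what the poster observed.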

Regards
Liang

simafengyun wrote
Hi DEV,

I create table according to the below SQL

    cc.sql("""
           CREATE TABLE IF NOT EXISTS t3
           (ID Int, date Timestamp, country String,
           name String, phonetype String, serialname String, salary Int,
           name1 String, name2 String, name3 String, name4 String, name5 String, name6 String,name7 String,name8 String
           )
           STORED BY 'carbondata'
           """)

After I load data into this table, I found that the dimension columns "name" and "name7" both have no dictionary encoding.
Column "name" has no inverted index but column "name7" has an inverted index.
Questions:
1. Why do they have no dictionary encoding by default, and why do some columns have no inverted index?
2. Is there any document describing these loading strategies?
3. The dimension column "name" has no inverted index; is its data still ordered in the DataChunk2 blocklet?
4. As I understand, dimension column data is usually sorted when stored in the DataChunk2 blocklet.
 In which cases is dimension column data not sorted in the DataChunk2 blocklet, other than when the user specifies the column as no-inverted-index?


5. As I understand, the first column of the MDK key is always sorted in the DataChunk2 blocklet, so why is isExplicitSorted not set to true?

question about dimension's sort order in blocklet level

simafengyun
Hi DEV,

 I created a table using the SQL below:

    cc.sql("""
           CREATE TABLE IF NOT EXISTS t3
           (ID Int, date Timestamp, country String,
           name String, phonetype String, serialname String, salary Int,
           name1 String, name2 String, name3 String, name4 String,
           name5 String, name6 String, name7 String, name8 String
           )
           STORED BY 'carbondata'
           """)

 

data cardinality as below:

| column      | name     | name1    | name2    | name3    | name4    | name5    | name6    | name7    | name8    |
| cardinality | 10000000 | 10000000 | 10000000 | 10000000 | 10000000 | 10000000 | 10000000 | 10000000 | 10000000 |

After I load data into this table, I found that the dimension columns "name" and "name7" both have no dictionary encoding,

but column "name" has no inverted index while column "name7" has an inverted index.

Questions:

1. The dimension column "name" has dictionary encoding but no inverted index; is its data still ordered in the DataChunk2 blocklet?

2. Is there any document describing these loading strategies?


3. If a dimension column has no dictionary encoding and no inverted index, and the user also did not specify the column as no-inverted-index when creating the table,
    is its data still ordered in the DataChunk2 blocklet?

4. As I understand, by default all dimension column data is sorted when stored in the DataChunk2 blocklet, except for columns the user specifies as no-inverted-index, right?

5. As I understand, the first dimension column of the MDK key is always sorted in the DataChunk2 blocklet, so why is isExplicitSorted not set to true?

 

 The attached code is used to generate data.csv:

package test;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.Map;

public class CreateData {

  public static void main(String[] args) {
    FileOutputStream outStr = null;
    BufferedOutputStream buff = null;

    try {
      outStr = new FileOutputStream(new File("data.csv"));
      buff = new BufferedOutputStream(outStr);

      long begin = System.currentTimeMillis();
      buff.write(
          "ID,date,country,name,phonetype,serialname,salary,name1,name2,name3,name4,name5,name6,name7,name8\n"
              .getBytes());

      int idcount = 10000000;
      int datecount = 30;
      int countrycount = 5;
      int phonetypecount = 10000;
      int serialnamecount = 50000;

      Map<Integer, String> countryMap = new HashMap<Integer, String>();
      countryMap.put(0, "canada");
      countryMap.put(1, "usa");
      countryMap.put(2, "uk");
      countryMap.put(3, "china");
      countryMap.put(4, "indian");

      StringBuilder sb;
      for (int i = idcount; i >= 0; i--) {
        sb = new StringBuilder();
        sb.append(4000000 + i).append(",");                                   // ID
        sb.append("2015/8/" + (i % datecount + 1)).append(",");               // date
        sb.append(countryMap.get(i % countrycount)).append(",");              // country
        sb.append("name" + (1600000 - i)).append(",");                        // name
        sb.append("phone" + i % phonetypecount).append(",");                  // phonetype
        sb.append("serialname" + (100000 + i % serialnamecount)).append(","); // serialname
        sb.append(i + 500000).append(",");                                    // salary
        sb.append("name1" + (i + 100000)).append(",");
        sb.append("name2" + (i + 200000)).append(",");
        sb.append("name3" + (i + 300000)).append(",");
        sb.append("name4" + (i + 400000)).append(",");
        sb.append("name5" + (i + 500000)).append(",");
        sb.append("name6" + (i + 600000)).append(",");
        sb.append("name7" + (i + 700000)).append(",");
        sb.append("name8" + (i + 800000)).append(",").append('\n');

        buff.write(sb.toString().getBytes());
      }

      buff.flush();
      long end = System.currentTimeMillis();
      System.out.println("BufferedOutputStream execution time: " + (end - begin) + " ms");
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      try {
        if (buff != null) buff.close();
        if (outStr != null) outStr.close();
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
  }
}

Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

kumarvishal09
In reply to this post by xm_zzc
+1
-Regards
Kumar Vishal

On Mar 27, 2017 09:31, "xm_zzc" <[hidden email]> wrote:

> Hi, Liang:
>   Thanks for your reply.
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Re-DISCUSSION-
> Initiating-Apache-CarbonData-1-1-0-incubating-Release-tp9672p9680.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>
kumar vishal

Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

manishgupta88
+1

Regards
Manish Gupta

On Mon, Mar 27, 2017 at 2:41 PM, Kumar Vishal <[hidden email]>
wrote:

> +1
> -Regards
> Kumar Vishal
>
> On Mar 27, 2017 09:31, "xm_zzc" <[hidden email]> wrote:
>
> > Hi, Liang:
> >   Thanks for your reply.
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/Re-DISCUSSION-
> > Initiating-Apache-CarbonData-1-1-0-incubating-Release-tp9672p9680.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>

question about dimension's sort order in blocklet level

Liang Chen
Administrator
In reply to this post by simafengyun
Hi

Can you provide your info in a table? It is hard to read as posted.

A column of high cardinality (>1000000) will not be dictionary encoded.

Regards
Liang

2017-03-27 14:32 GMT+05:30 马云 <simafengyun1984@163.com>:

> Hi DEV,
> [snip]




--
Regards
Liang

Re: [DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

Henry Saputra
In reply to this post by manishgupta88
Sure, let's do one more release

+1

On Mon, Mar 27, 2017 at 2:58 AM, manish gupta <[hidden email]>
wrote:

> +1
>
> Regards
> Manish Gupta
>
> On Mon, Mar 27, 2017 at 2:41 PM, Kumar Vishal <[hidden email]>
> wrote:
>
> > +1
> > -Regards
> > Kumar Vishal
> >
> > On Mar 27, 2017 09:31, "xm_zzc" <[hidden email]> wrote:
> >
> > > Hi, Liang:
> > >   Thanks for your reply.
> > >
> > >
> > >
> > > --
> > > View this message in context: http://apache-carbondata-
> > > mailing-list-archive.1130556.n5.nabble.com/Re-DISCUSSION-
> > > Initiating-Apache-CarbonData-1-1-0-incubating-Release-tp9672p9680.html
> > > Sent from the Apache CarbonData Mailing List archive mailing list
> archive
> > > at Nabble.com.
> > >
> >
>