Apache CarbonData Dev Mailing List archive

Discussion how to crate the CarbonData table with good performance

Classic

List

Threaded

2 messages Options

bill.zhou

Discussion how to crate the CarbonData table with good performance

Suggestion to create Carbon table

Recently we used CarbonData to do the performance in Telecommunication filed and summarize some of the Suggestions while creating the CarbonData table.
We have tables which range from 10 thousand rows to 10 billion rows and have from 100 columns to 300 columns. Following are some of the columns used in the table.

Column name	Data type	Cardinality	Attribution
msisdn	String	30 million	dimension
BEGIN_TIME	bigint	10 thousand	dimension
HOST	String	1 million	dimension
Dime_1	String	1 thousand	dimension
counter_1	numeric(20,0)	NA	measure
...	...	NA	...
counter_100	numeric(20,0)	NA	measure

We have about more than 50 test cases; according to the test case we summarize some suggestion to create the table which can have a better query performance.
1. Put the frequently-used column filter in the beginning.
For example, MSISDN filter is used in most of the query then put the MSISDN in the first column. The create table command can be as follows, the query which has MSISDN as a filter will be good (because the MSISDN is high cardinality, if create table like this the compress ratio will be decreased)
create table carbondata_table(
msisdn String,
...
)STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,..','DICTIONARY_INCLUDE'='...');

2. If has multiple column which is frequently-use in the filter, put it to the front in the order as low cardinality to high cardinality.
For example if msisdn, host and dime_1 is frequently-used column, the table column order can be like dime_1->host->msisdn, because the dime_1 cardinality is low. Create table command can be as follows. This will increase the compression ratio and good performance for filter on dime_1, host and msisdn.
create table carbondata_table(
Dime_1 String,
HOST String,
MSISDN String,
...
)STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST..','DICTIONARY_INCLUDE'='Dime_1..');

3. If no column is frequent-use in filter, then can put all the dimension column order as from low cardinality to high cardinality. Create table command can be as following:
create table carbondata_table(
Dime_1 String,
BEGIN_TIME bigint
HOST String,
MSISDN String,
...
)STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI..','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME..');

4. For measure that needs no high accuracy, then no need to use numeric(20,0) data type, suggestion is to use double to replace it than will increase the query performance. If one test case uses double to replace the numeric (20, 0) the query improve 5 times from 15 second to 3 second. Create table command can be as follows.
create table carbondata_table(
Dime_1 String,
BEGIN_TIME bigint
HOST String,
MSISDN String,
counter_1 double,
counter_2 double,
...
counter_100 double,
)STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');

5. If the column which is always incremental like start_time. For example one scenario: every day we will load data into carbon and the start_time is incremental for each load. For this scenario you can put the start_time column in the back of dimension, because always incremental value can use the min/max index well always. Create table command can be as following.
create table carbondata_table(
Dime_1 String,
HOST String,
MSISDN String,
counter_1 double,
counter_2 double,
BEGIN_TIME bigint,
...
counter_100 double,
)STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');

One more is for the dimension whether dictionary is needed or not, we suggest if the cardinality higher than 50 thousand do not put it as dictionary column. If high cardinality column put as dictionary will impact the load performance.

Liang Chen

Re: Discussion how to crate the CarbonData table with good performance

Administrator

Hi

Thanks for you shared these experience.
Can you put these FAQ to CWIKI: https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home

Regards
Liang

bill.zhou wrote

Discussion how to crate the CarbonData table with good performance
Suggestion to create Carbon table
Recently we used CarbonData to do the performance in Telecommunication filed and summarize some of the Suggestions while creating the CarbonData table.
We have tables which range from 10 thousand rows to 10 billion rows and have from 100 columns to 300 columns. Following are some of the columns used in the table.
Column name Data type Cardinality Attribution
msisdn String 30 million dimension
BEGIN_TIME bigint 10 thousand dimension
HOST String 1 million dimension
Dime_1 String 1 thousand dimension
counter_1 numeric(20,0) NA measure
... ... NA ...
counter_100 numeric(20,0) NA measure
We have about more than 50 test cases; according to the test case we summarize some suggestion to create the table which can have a better query performance. 
1. Put the frequently-used column filter in the beginning. 
For example, MSISDN filter is used in most of the query then put the MSISDN in the first column. The create table command can be as follows, the query which has MSISDN as a filter will be good (because the MSISDN is high cardinality, if create table like this the compress ratio will be decreased)
create table carbondata_table(
msisdn String,
...
)STORED BY 'org.apache.carbondata.format' 
TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,..','DICTIONARY_INCLUDE'='...');
2. If has multiple column which is frequently-use in the filter, put it to the front in the order as low cardinality to high cardinality.
For example if msisdn, host and dime_1 is frequently-used column, the table column order can be like dime_1->host->msisdn, because the dime_1 cardinality is low. Create table command can be as follows. This will increase the compression ratio and good performance for filter on dime_1, host and msisdn.
create table carbondata_table(
Dime_1 String,
HOST String,
MSISDN String,
...
)STORED BY 'org.apache.carbondata.format' 
TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST..','DICTIONARY_INCLUDE'='Dime_1..');
3. If no column is frequent-use in filter, then can put all the dimension column order as from low cardinality to high cardinality. Create table command can be as following: 
create table carbondata_table(
Dime_1 String,
BEGIN_TIME bigint
HOST String,
MSISDN String,
...
)STORED BY 'org.apache.carbondata.format' 
TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI..','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME..');
4. For measure that needs no high accuracy, then no need to use numeric(20,0) data type, suggestion is to use double to replace it than will increase the query performance. If one test case uses double to replace the numeric (20, 0) the query improve 5 times from 15 second to 3 second. Create table command can be as follows. 
create table carbondata_table(
Dime_1 String,
BEGIN_TIME bigint
HOST String,
MSISDN String,
counter_1 double,
counter_2 double,
...
counter_100 double,
)STORED BY 'org.apache.carbondata.format' 
TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');
5. If the column which is always incremental like start_time. For example one scenario: every day we will load data into carbon and the start_time is incremental for each load. For this scenario you can put the start_time column in the back of dimension, because always incremental value can use the min/max index well always. Create table command can be as following. 
create table carbondata_table(
Dime_1 String,
HOST String,
MSISDN String,
counter_1 double,
counter_2 double,
BEGIN_TIME bigint,
...
counter_100 double,
)STORED BY 'org.apache.carbondata.format' 
TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');

One more is for the dimension whether dictionary is needed or not, we suggest if the cardinality higher than 50 thousand do not put it as dictionary column. If high cardinality column put as dictionary will impact the load performance.