[jira] [Commented] (CARBONDATA-4187) Performance Issue with Materialized views - increased loading time due to full refresh

Akash R Nilugal (Jira)

[ https://issues.apache.org/jira/browse/CARBONDATA-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343932#comment-17343932 ]

Sushant Sammanwar commented on CARBONDATA-4187:
-----------------------------------------------

Below is my carbon.properties file:

carbon.lock.retries=15
spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver
upload.threads=256
spark.deploy.zookeeper.url=zookeeper:2181
carbon.lock.retry.timeout.sec=1
spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:postgresql://postgres:5432/postgres
query.max.parallel=32
data.location=/opt/basecamp/timeseries/diamond/warehouse
spark.files.maxPartitionBytes=16777216
http.port=30014
import.max.parallel=8
carbon.unsafe.working.memory.in.mb=4958
http.max-request-size=1000000
carbon.enable.auto.load.merge=false
schedule.threads=256
spark.master.cores.ratio=1
telnet.port=31008
sort.inmemory.size.inmb=2125
cluster.master.host=timeseries-0.timeseries
max.rest.call=256
database.read.url=diamonddb://diamond-db-read:30110
carbon.lock.path=LogPath
spark.hadoop.javax.jdo.option.ConnectionPassword=postgres
carbon.lock.type=ZOOKEEPERLOCK
mv=false
spark.sql.autoBroadcastJoinThreshold=1024288000
carbon.compaction.level.threshold=10,6
spark.deploy.zookeeper.url=zookeeper:2181
cluster.master.port=30014
carbon.segment.lock.files.preserve.hours=1
database.url=diamonddb-direct://localhost
spark.hadoop.javax.jdo.option.ConnectionUserName=postgres
carbon.push.rowfilters.for.vector=true

Below is the Spark default config:

spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver
spark.deploy.zookeeper.url=zookeeper:2181
spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:postgresql://postgres:5432/postgres
spark.files.maxPartitionBytes=16777216
spark.master.cores.ratio=1
spark.hadoop.javax.jdo.option.ConnectionPassword=postgres
spark.sql.autoBroadcastJoinThreshold=1024288000
spark.hadoop.javax.jdo.option.ConnectionUserName=postgres

> Performance Issue with Materialized views - increased loading time due to full refresh
> --------------------------------------------------------------------------------------
>
> Key: CARBONDATA-4187
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4187
> Project: CarbonData
> Issue Type: Bug
> Components: core
> Affects Versions: 2.1.0
> Reporter: Sushant Sammanwar
> Priority: Major
> Labels: materializedviews, performance
>
> Hi Team,
> We have been doing a POC using Carbon 2.1.0; we created wrapper code around Carbon and deployed it as a Docker container.
> Data is loaded concurrently into many tables.
> Our objective is to get optimal performance for aggregated queries by using materialized views (MVs).
> Our observation is that after creating MVs, data loading is slow and cannot keep pace with the incoming data.
> The process also consumes a lot of memory once MVs are created.
> Data is received continuously and the MVs are refreshed, which results in increased load time.
> Ideally, MVs should perform only an incremental refresh, as old data does not need to be recalculated.
> But it seems that the full refresh is causing high memory usage and increased loading time.
> Testing involved loading data without MVs for 6 hours, then creating MVs and loading data again for 4 hours.
> Loading time with MVs increased, thereby creating a backlog of data (only 1/5th of the expected number of rows was loaded).
> Below are the major bottlenecks observed:
> 1. High memory consumption after creating MVs
> 2. MVs doing a full refresh
> Please find attached the details of testing with the list of tables.
> Below is the definition of the table:
> create table if not exists fact_365_1_eutrancell_1 (ts timestamp, metric STRING, tags_id STRING, value DOUBLE, epoch bigint) partitioned by (ts2 timestamp) STORED AS carbondata TBLPROPERTIES ('SORT_COLUMNS'='metric')
> Below is the definition of the MV:
> create materialized view if not exists fact_365_1_eutrancell_1_hour as select tags_id ,metric,timeseries(ts,'hour') as ts,sum(value),avg(value),min(value),max(value) from fact_365_1_eutrancell_1 group by metric, tags_id, timeseries(ts,'hour')
> Can you suggest why MV creation is slowing down ingestion so much and what can be done to improve it?
> Is there any way to have an incremental refresh of the MV, i.e. refresh only the hour for which we are loading data?
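
To make the requested behaviour concrete, here is a minimal sketch in plain Python of what "refresh only the hour being loaded" would mean for the hourly aggregates in the MV above. It is not CarbonData code and says nothing about how CarbonData refreshes MVs internally; the names `Agg`, `hour_bucket`, `merge_batch` and `average` are hypothetical and exist only for this illustration. The idea is to keep sum, count, min and max per (metric, tags_id, hour) group, so that a new batch updates only the groups it touches and avg can be derived as sum / count without rereading old rows.

```python
from collections import namedtuple
from datetime import datetime

# Running aggregate for one (metric, tags_id, hour) group (hypothetical type).
Agg = namedtuple("Agg", "total count minimum maximum")


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def merge_batch(state, rows):
    """Fold a batch of (ts, metric, tags_id, value) rows into state.

    Only the groups present in the batch are updated; their keys are returned.
    """
    touched = set()
    for ts, metric, tags_id, value in rows:
        key = (metric, tags_id, hour_bucket(ts))
        old = state.get(key)
        if old is None:
            state[key] = Agg(value, 1, value, value)
        else:
            state[key] = Agg(
                old.total + value,
                old.count + 1,
                min(old.minimum, value),
                max(old.maximum, value),
            )
        touched.add(key)
    return touched


def average(agg):
    """Derive avg from the stored sum and count."""
    return agg.total / agg.count


# Usage with made-up sample rows: the second load touches only the 11:00 group.
state = {}
merge_batch(state, [
    (datetime(2021, 5, 13, 10, 5), "cpu", "t1", 2.0),
    (datetime(2021, 5, 13, 10, 40), "cpu", "t1", 4.0),
])
touched = merge_batch(state, [(datetime(2021, 5, 13, 11, 1), "cpu", "t1", 6.0)])
```

In this sketch the cost of each load depends on the size of the incoming batch rather than on the total history, which is the property the question above is asking for.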
