[jira] [Commented] (CARBONDATA-4187) Performance Issue with Materialized views - increased loading time due to full refresh

Akash R Nilugal (Jira)

[ https://issues.apache.org/jira/browse/CARBONDATA-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350753#comment-17350753 ]

Sushant Sammanwar commented on CARBONDATA-4187:
-----------------------------------------------

Team,

If I have an hourly MV and data ingestion happens every 15 minutes, then irrespective of the functions used, it should not recalculate or refresh data for previous hours.
Even for the average function: if the current time is 5:32 PM, it should recalculate the average only for the current hour, 5 PM to 6 PM.
It does not need to recalculate or refresh the previous hours.
Otherwise MVs are not useful and will always be a costly affair.
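
The expectation above can be illustrated with a small sketch. This is plain Python, not CarbonData code; the function name `affected_hours` and the timestamps are hypothetical. It truncates each incoming event timestamp to its hour and reports which hourly buckets a batch touches, which are the only buckets an incremental refresh would need to recompute.

```python
from datetime import datetime


def affected_hours(event_times):
    """Return the sorted hour buckets touched by a batch of event timestamps."""
    return sorted(
        {t.replace(minute=0, second=0, microsecond=0) for t in event_times}
    )


# A 15-minute batch arriving around 5:32 PM (illustrative timestamps only).
batch = [
    datetime(2021, 5, 25, 17, 16),
    datetime(2021, 5, 25, 17, 24),
    datetime(2021, 5, 25, 17, 31),
]
hours = affected_hours(batch)
print(hours)
```

For this batch every row falls in the 5 PM bucket, so earlier hours would not need to be touched. A batch that straddles an hour boundary would report two buckets.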

Please confirm and fix the issue.
If not, we will need to drop Carbon DB from our POC.

> Performance Issue with Materialized views - increased loading time due to full refresh
> --------------------------------------------------------------------------------------
>
> Key: CARBONDATA-4187
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4187
> Project: CarbonData
> Issue Type: Bug
> Components: core
> Affects Versions: 2.1.0
> Reporter: Sushant Sammanwar
> Priority: Major
> Labels: materializedviews, performance
>
> Hi Team,
> We have been doing a POC using Carbon 2.1.0; we created wrapper code around Carbon and deployed it as a Docker container.
> Concurrent data loading is happening into many tables.
> Our objective is to get optimal performance for aggregated queries using materialized views.
> Our observation is that after creating MVs, data loading is slow and cannot keep pace with the incoming data.
> The process also consumes a lot of memory when MVs are created.
> Data is received continuously and the MVs are refreshed, which results in increased load time.
> Ideally, MVs should perform only an incremental refresh, as old data does not need to be calculated again.
> But it seems the full refresh is causing high memory usage and increased loading time.
> Testing involved loading data without MVs for 6 hours, then creating MVs and loading data again for 4 hours.
> Loading time with MVs increased, thereby creating a backlog of data (only 1/5th of the expected number of rows was loaded).
> Below are the major bottlenecks observed:
> 1. High memory consumption after creating MVs
> 2. MVs doing a full refresh
> Please find attached the details of testing with the list of tables.
> Below is the definition of the table:
> create table if not exists fact_365_1_eutrancell_1 (ts timestamp, metric STRING, tags_id STRING, value DOUBLE, epoch bigint) partitioned by (ts2 timestamp) STORED AS carbondata TBLPROPERTIES ('SORT_COLUMNS'='metric')
> Below is the definition of the MV:
> create materialized view if not exists fact_365_1_eutrancell_1_hour as select tags_id ,metric,timeseries(ts,'hour') as ts,sum(value),avg(value),min(value),max(value) from fact_365_1_eutrancell_1 group by metric, tags_id, timeseries(ts,'hour')
> Can you suggest why MV creation slows down ingestion so much and what can be done to improve it?
> Is there any way to have an incremental refresh of the MV, i.e. to refresh only the hour for which we are loading data?
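
The incremental refresh asked about in the description can be sketched in general terms. The following is a minimal Python illustration, not CarbonData's implementation; `HourAgg` and `apply_batch` are hypothetical names. It assumes append-only data and keeps a mergeable state (sum, count, min, max) per (metric, tags_id, hour) group, mirroring the GROUP BY of the MV definition. The average is derived from sum and count, so a new batch only updates the groups it touches.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class HourAgg:
    """Mergeable aggregate state for one (metric, tags_id, hour) group."""
    total: float = 0.0
    count: int = 0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value):
        self.total += value
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def avg(self):
        # avg is not mergeable on its own, so it is derived from sum and count.
        return self.total / self.count if self.count else None


def apply_batch(state, rows):
    """Fold new rows into `state`; return the group keys that changed."""
    touched = set()
    for ts, metric, tags_id, value in rows:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        key = (metric, tags_id, hour)
        state.setdefault(key, HourAgg()).add(value)
        touched.add(key)
    return touched


state = {}
# Earlier load: one row in the 4 PM hour (illustrative values).
apply_batch(state, [(datetime(2021, 5, 25, 16, 50), "m1", "t1", 10.0)])
# New 15-minute load: two rows in the 5 PM hour.
touched = apply_batch(state, [
    (datetime(2021, 5, 25, 17, 5), "m1", "t1", 2.0),
    (datetime(2021, 5, 25, 17, 20), "m1", "t1", 4.0),
])
print(touched)
```

In this sketch the second load changes only the 5 PM group and leaves the 4 PM state as it was. Deletes or updates to already-loaded rows are not handled here, since min and max cannot be reversed from this state alone.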

--
This message was sent by Atlassian Jira
(v8.3.4#803005)