[jira] [Commented] (CARBONDATA-4079) Queries with Date range are taking time

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (CARBONDATA-4079) Queries with Date range are taking time

Akash R Nilugal (Jira)

    [ https://issues.apache.org/jira/browse/CARBONDATA-4079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247041#comment-17247041 ]

suyash yadav commented on CARBONDATA-4079:
------------------------------------------

Below are the complete background of the issue.

 

We created two tables named as fact_10days_data_ts and tags_final and below are the description of these two tables.

 

*Spark.sql("desc extended fact_timestamp_global").show(100,false)*

col_name                     |data_type                                                |comment|
+-----------------------------+---------------------------------------------------------+-------+
|ts                           |timestamp                                                |null   |
|metric                       |string                                                   |null   |
|tags_id                      |string                                                   |null   |
|value                        |double                                                   |null   |
|ts2                          |timestamp                                                |null   |
|                             |                                                         |       |
|## Detailed Table Information|                                                         |       |
|Database                     |default                                                  |       |
|Table                        |fact_timestamp_global                                    |       |
|Owner                        |root                                                     |       |
|Created                      |Wed Dec 09 19:31:48 MYT 2020                             |       |
|Location                     |hdfs://10.212.5.2:7200/Data/default/fact_timestamp_global|       |
|External                     |false                                                    |       |
|Transactional                |true                                                     |       |
|Streaming                    |false                                                    |       |
|Table Block Size             |1024 MB                                                  |       |
|Table Blocklet Size          |64 MB                                                    |       |
|Comment                      |                                                         |       |
|Bad Record Path              |                                                         |       |
|Min Input Per Node Per Load  |0.0B                                                     |       |
|                             |                                                         |       |
|## Index Information         |                                                         |       |
|Sort Scope                   |global_sort                                              |       |
|Sort Columns                 |ts, metric                                               |       |
|Inverted Index Columns       |                                                         |       |
|Cached Min/Max Index Columns |All columns                                              |       |
|Min/Max Index Cache Level    |BLOCK                                                    |       |
|Table page size in mb        |                                                         |       |
|                             |                                                         |       |
|## Encoding Information      |                                                         |       |
|Local Dictionary Enabled     |true                                                     |       |
|Local Dictionary Threshold   |10000                                                    |       |
|Local Dictionary Include     |metric,tags_id                                           |       |
|                             |                                                         |       |
|## Compaction Information    |                                                         |       |
|MAJOR_COMPACTION_SIZE        |1024                                                     |       |
|AUTO_LOAD_MERGE              |false                                                    |       |
|COMPACTION_LEVEL_THRESHOLD   |4,3                                                      |       |
|COMPACTION_PRESERVE_SEGMENTS |0                                                        |       |
|ALLOWED_COMPACTION_DAYS      |0                                                        |       |
|                             |                                                         |       |
|## Partition Information     |                                                         |       |
|Partition Type               |NATIVE_HIVE                                              |       |
|Partition Columns            |ts2:TIMESTAMP                                            |       |
|Number of Partitions         |1                                                        |       |
|                             |                                                         |       |
|## Dynamic Information       |                                                         |       |
|Table Data Size              |861.09MB                                                 |       |
|Table Index Size             |147.63KB                                                 |       |
|Last Update                  |Wed Dec 09 19:46:01 MYT 2020                             |       |
+-----------------------------+---------------------------------------------------------+-------+
  *second table:*

*spark.sql("desc extended tags_10_days_test").show(false)*

 

col_name                     |data_type                                            |comment|
+-----------------------------+-----------------------------------------------------+-------+
|id                           |string                                               |null   |
|resource                     |string                                               |null   |
|                             |                                                     |       |
|## Detailed Table Information|                                                     |       |
|Database                     |default                                              |       |
|Table                        |tags_10_days_test                                    |       |
|Owner                        |root                                                 |       |
|Created                      |Tue Dec 08 16:25:43 MYT 2020                         |       |
|Location                     |hdfs://10.212.5.2:7200/Data/default/tags_10_days_test|       |
|External                     |false                                                |       |
|Transactional                |true                                                 |       |
|Streaming                    |false                                                |       |
|Table Block Size             |1024 MB                                              |       |
|Table Blocklet Size          |64 MB                                                |       |
|Comment                      |                                                     |       |
|Bad Record Path              |                                                     |       |
|Min Input Per Node Per Load  |0.0B                                                 |       |
|                             |                                                     |       |
|## Index Information         |                                                     |       |
|Sort Scope                   |LOCAL_SORT                                           |       |
|Sort Columns                 |id, resource                                         |       |
|Inverted Index Columns       |id,resource                                          |       |
|Cached Min/Max Index Columns |All columns                                          |       |
|Min/Max Index Cache Level    |BLOCK                                                |       |
|Table page size in mb        |                                                     |       |
|                             |                                                     |       |
|## Encoding Information      |                                                     |       |
|Local Dictionary Enabled     |true                                                 |       |
|Local Dictionary Threshold   |10000                                                |       |
|Local Dictionary Include     |id,resource                                          |       |
|                             |                                                     |       |
|## Compaction Information    |                                                     |       |
|MAJOR_COMPACTION_SIZE        |1024                                                 |       |
|AUTO_LOAD_MERGE              |false                                                |       |
|COMPACTION_LEVEL_THRESHOLD   |4,3                                                  |       |
|COMPACTION_PRESERVE_SEGMENTS |0                                                    |       |
|ALLOWED_COMPACTION_DAYS      |0                                                    |       |
|                             |                                                     |       |
|## Dynamic Information       |                                                     |       |
|Table Data Size              |978.79KB                                             |       |
|Table Index Size             |585.0B                                               |       |
|Last Update                  |Tue Dec 08 16:31:39 MYT 2020                         |       |
+-----------------------------+-----------------------------------------------------+-------+


We have 6 metrics, 100000 resources and for every 5 minutes there will be a record loaded in to the table for all the resource and metric combination. So number of records per 5 minute poll is 600000 (6*100000). As we have one day data available in this fact_timestamp_global table the total record count is 172800000 (5*12*24*6*100000).

 

So overall fact_timestamp_global table contains 172800000 records whereas tags_final table contains 100000 records. Now we are trying to fetch the data for a particular resource and two metrics combination for a day using below select query.

 

spark.sql("select ts,tags_id,resource,metric,value from {color:#172b4d}fact_timestamp_global{color} left join tags_final on {color:#172b4d}fact_timestamp_global{color}.tags_id= tags_final.id where metric in ('Outbound Utilization (percent)','Inbound Utilization (percent)') and resource='10.212.7.98_if:<0001>' and  ts>='2020-09-21 00:00:00' and ts<='2020-09-21 23:55:55' group by ts,tags_id,resource,metric,value").show(10000,false)

 

Above query is taking around 13 to 15 seconds to execute.

 

So we would be really thankful if someone can go over all the details shared above and help us in reducing the time taken to less than 5 seconds.

> Queries with Date range are taking time
> ---------------------------------------
>
>                 Key: CARBONDATA-4079
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-4079
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: data-query
>    Affects Versions: 2.1.0
>            Reporter: suyash yadav
>            Priority: Major
>
> Hi Team,
> We are doing a POC to understand how can we improve performance of the query fired against table created in apache carbondata.
> Below is the sample query:
>  
> *spark.sql("select ts,resource,metric,value from fact_timestamp_global left join tags_10_Days_test on fact_timestamp_global.tags_id= tags_10_Days_test.id where metric in ('Outbound Utilization (percent)','Inbound Utilization (percent)') and resource='10.212.7.98_if:<0001>' and  ts between '2020-09-21 00:00:00' and '2020-09-21 12:55:55' group by ts,resource,metric,value").show(10000,false)*
> As you can see above query contains the date range filter.We have noticed that due to this date range filter the query time is coming around 15 seconds which is not proving useful as we have to bring down the query execution time to 3 to 4 seconds.
> Could you please review above query and suggest a better way of framing the above query specially the date range filter which can be  helpful to get the desired query execution time?
>  
> In case you need more details then please do let me know. 
>  
> Regards
> Suyash Yadav



--
This message was sent by Atlassian Jira
(v8.3.4#803005)