App metrics reference

App monitoring gives you deeper insight into what apps are doing in your instance. This can help when troubleshooting issues with a specific app, or when determining whether an app may have contributed to a drop in overall performance or stability.

Learn how to set up app monitoring

Full list of app performance metrics

This is the full list of metrics that are exposed by the app monitoring agent. This is in addition to any JMX beans that are exposed by the application.

index.reindex

Indicates that Confluence has reindexed data, or is currently reindexing.


The metric has the following tags:

  • If indexContent is true, content such as page content was reindexed.
  • If indexAttachments is true, attachments were reindexed.
  • If indexUsers is true, users were reindexed.
  • If limitedWithQuery is false, everything was reindexed.
  • If active is 1, a reindex is in progress, and the duration shows how long it has been running.

Action

Reindexing can degrade your site's performance, so schedule it for off-peak hours where possible.

The invokerPluginKey tag indicates which app started the reindex. If the key starts with com.atlassian, the reindex was most likely triggered by Confluence itself.

Sample query

com_atlassian_confluence_metrics_99thPercentile
  {
   category00="index",
   name="reindex"
  }
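The sample queries on this page are PromQL instant-vector selectors: a metric name followed by label matchers for the metric's tags. If you want to run them outside Grafana, the following Python sketch (standard library only) builds such a selector and turns it into a request URL for the Prometheus HTTP API instant-query endpoint, /api/v1/query. The helper names and the http://localhost:9090 server address are assumptions for illustration, and the helper only builds equality (=) matchers, so queries that use != still need to be written by hand.

```python
from urllib.parse import urlencode


def promql_selector(metric: str, **labels: str) -> str:
    """Build a PromQL selector such as metric{category00="index",name="reindex"}."""

    def quote(value: str) -> str:
        # Escape backslashes and double quotes so the value is a valid PromQL string.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    matchers = ",".join(f"{key}={quote(value)}" for key, value in labels.items())
    return f"{metric}{{{matchers}}}"


def instant_query_url(prometheus_base: str, promql: str) -> str:
    """Return a GET URL for the Prometheus HTTP API instant-query endpoint."""
    return prometheus_base.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


if __name__ == "__main__":
    query = promql_selector(
        "com_atlassian_confluence_metrics_99thPercentile",
        category00="index",
        name="reindex",
    )
    print(query)
    # Hypothetical server address; point this at your own Prometheus instance.
    print(instant_query_url("http://localhost:9090", query))
```

The first line printed is com_atlassian_confluence_metrics_99thPercentile{category00="index",name="reindex"}, which is the same selector as the sample query above with the whitespace removed.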

search.manager

Measures how long a search request takes.

  • methodName indicates the API method that was invoked. It can be one of:

    • search

    • searchWithToken

    • searchWithRequestedFields

    • searchEntities

    • explain

    • searchCategorised

    • convertToEntities

    • scanWithSearchFilter

    • scanWithSearchQuery

    • scanWithIndexesAndSearchQuery

    Searches where methodName is scanWithSearchFilter, scanWithSearchQuery, or scanWithIndexesAndSearchQuery are expected to take longer than the other methods.

  • usedFilter indicates if a filter is used.

  • searchType represents the type of search: SiteSearch, ContentSearch, or CQLSearch.

  • resultSize is the number of documents that match the search. It only applies when methodName=convertToEntities.


Action

Use the plugin key tag to identify which app is calling the search API (com.atlassian.confluence.search.v2.SearchManager).

If an app makes a large number of searches, or consistently takes a long time to process search results, contact the app vendor.

Sample query

com_atlassian_confluence_metrics_99thPercentile
  {
   category00="search",
   name="manager"
  }
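To see which app is responsible for slow or frequent searches, you can rank the series returned by a query like the one above. The following Python sketch parses a response from the Prometheus /api/v1/query endpoint (resultType "vector", where each sample is a [timestamp, "value"] pair) and keeps the highest value per app. The function name is hypothetical, and so is the default label: it assumes the calling app is exposed as invokerPluginKey, the tag name used elsewhere on this page, so pass a different label name if your series use another one.

```python
import json
from typing import Dict, List, Tuple


def top_apps(response_json: str, label: str = "invokerPluginKey",
             limit: int = 5) -> List[Tuple[str, float]]:
    """Rank apps by the highest value among their series in an instant-query response."""
    body = json.loads(response_json)
    if body.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {body.get('error')}")
    worst: Dict[str, float] = {}
    for series in body["data"]["result"]:
        # Series without the label are grouped under a placeholder key.
        app = series["metric"].get(label, "<no label>")
        # Each vector sample is [unix_timestamp, "value as a string"].
        value = float(series["value"][1])
        # One app can own several series (e.g. one per methodName); keep the largest.
        worst[app] = max(worst.get(app, value), value)
    return sorted(worst.items(), key=lambda item: item[1], reverse=True)[:limit]
```

For example, passing it the JSON returned for the search.manager query above lists the five apps with the highest 99th-percentile search time, slowest first.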

db.ao.upgradetask

Measures how long an app takes to upgrade data it stores in the database.

Upgrade tasks can run when an app is updated or enabled. While an upgrade task runs, the app's functionality is unavailable, and the task may temporarily increase load on the database and on the node it is running on.

Action

If an app stores a lot of data in the database, consider scheduling app updates for times when Confluence is less busy.

Sample query

com_atlassian_confluence_metrics_Value
  {
   category00="db",
   category01="ao",
   name="upgradetask"
  }

db.ao.executeInTransaction

Measures how long an Active Objects (AO) transaction takes when executed inside a TransactionCallback. This mechanism is mainly used by Confluence plugins.

Action

A transaction can contain many AO operations. A slow transaction may mean that it contains too many operations, that a query is long-running, or that the database is under load.

Sample query

com_atlassian_confluence_metrics_Value
  {
   category00="db",
   category01="ao",
   name="executeInTransaction"
  }

db.ao.entityManager

Measures how long an Active Objects (AO) operation (create, find, delete, deleteWithSQL, get, stream, count) that uses the entityManager takes.

Action

The operation's query may be long-running, or the database may be under load.

Sample query

com_atlassian_confluence_metrics_95thPercentile
  {
   category00="db",
   category01="ao",
   category02="entityManager"
  }

You can filter further by adding a name="<operation>" label matcher, for example name="find".

db.cluster.lock.held.duration

Measures how long a database cluster lock was held. Used by Confluence in a clustered environment.

Action

Lock contention can degrade performance. However, a thread holding a lock for a long time can be normal if no other threads are waiting for that lock.

See db.cluster.lock.waited.duration to find out if there are any threads waiting for the lock.

Sample query

com_atlassian_confluence_metrics_Value
  {
   category00="cluster",
   category01="lock",
   category02="held"
  }

db.cluster.lock.waited.duration

Measures how long threads waited to acquire a database cluster lock. Used by Confluence in a clustered environment.

Action

If many threads are waiting for the same lock, it can lead to performance degradation. 

Sample query

com_atlassian_confluence_metrics_Value
  {
   category00="cluster",
   category01="lock",
   category02="waited"
  }

db.sal.transactionalExecutor

Measures how long a Shared Access Layer (SAL) transaction takes when executed inside the DefaultTransactionalExecutor.

Action

A transaction can contain many SAL operations. A slow transaction may mean that it contains too many operations, that a query is long-running, or that the database is under load.

Sample query

com_atlassian_confluence_metrics_Value
  {
   category00="db",
   category01="sal",
   name="transactionalExecutor", 
   statistic="active"
  }

web.resource.condition

Measures how long a web resource condition takes to determine whether a resource should be displayed.

Action

Slow web resource conditions can lead to slow page load times, especially if their results are not cached.

Sample query

com_atlassian_confluence_metrics_95thPercentile
  {
   category00="web",
   category01="resource",
   name="condition"
  }

plugin.disabled.counter

Counts how many times an app has been disabled since Confluence started.

Action

Some caches are cleared when an app is disabled or enabled, which can have a performance impact. If this number increases, check UPM (Universal Plugin Manager) or the application logs to find out which app is contributing to it.

Sample query

com_atlassian_confluence_metrics_Count
  {
   category00="plugin",
   category01="disabled"
  }

plugin.enabled.counter

Counts how many times an app has been enabled since Confluence started.

Action

Some caches are cleared when an app is disabled or enabled. This can have a performance impact. If this number increases, check UPM or the application logs to investigate which app is contributing to this number.

Sample query

com_atlassian_confluence_metrics_Count
  {
   category00="plugin",
   category01="enabled"
  }

soyTemplateRenderer

Measures how long a Soy Template web panel takes to render.

Action

A high value may indicate that the web panel's template is slow to render.

Sample query

com_atlassian_confluence_metrics_95thPercentile
  {
   name="webTemplateRenderer",
   templateRenderer="soy"
  }

webTemplateRenderer

Measures how long an Atlassian Template web panel takes to render.

Action

A high value may indicate that the web panel's template is slow to render.

Sample query

com_atlassian_confluence_metrics_95thPercentile
  {
   name="webTemplateRenderer",
   templateRenderer="velocity"
  }

web.fragment.condition

Measures how long a web fragment condition takes to determine whether a web fragment should be displayed.

Action

Web fragment conditions determine whether a link or a section on a page should be displayed. Slow web fragment conditions can lead to slow page load times, especially if their results are not cached.

Sample query

com_atlassian_confluence_metrics_95thPercentile
  {
   category00="web", 
   category01="fragment", 
   name="condition"
  }

cacheManager.flushAll

Indicates that all caches are being flushed by an app. This operation should not be triggered by external apps and can lead to product slowdowns.

Action

Use the invokerPluginKey tag to determine which app invoked the flush.

Sample query

com_atlassian_confluence_metrics_Count
  {
    category00="cacheManager",
    name="flushAll"
  }

cache.removeAll

Indicates that a single cache has had all of its entries removed. This may or may not cause slowdowns in products or apps.

Action

Check how often these cache removals occur, and from which product. Use the pluginKeyAtCreation tag to determine which app created the cache. 

Sample query

com_atlassian_confluence_metrics_Count
  {
   category00="cache",
   name="removeAll",
   invokerPluginKey!="undefined"
  }

cachedReference.reset

Indicates that a single entry in a cache has been reset. This may or may not cause slowdowns in products or apps.

Action

Check how often these cache resets occur, and from which product. Use the pluginKeyAtCreation tag to determine which app created the cache.

Sample query

com_atlassian_confluence_metrics_Count
  {
   category00="cachedReference",
   name="reset",
   invokerPluginKey!="undefined"
  }

rest.request

Measures HTTP requests to REST APIs that use the atlassian-rest module.

Action

Check the frequency and duration of REST requests.

Sample query

com_atlassian_confluence_metrics_95thPercentile
  {
   category00="http", 
   category01="rest", 
   name="request"
  }

Recommended alerts

Automated alerts help you identify issues early, without waiting for end users to bring problems to your attention. Most APM tools provide alerting capabilities.

The following alerts are based on our research into common issues with apps. We've used Prometheus and Grafana, but you may be able to adapt these rules for other APM tools.

To find out how to set up alerting in Prometheus, see Alerting overview in the Prometheus documentation.

Heap memory usage

Excessive heap memory consumption often leads to out-of-memory errors (OOMEs). Fluctuations in heap memory consumption are expected and normal, but a consistent increase, or a failure to release memory, can lead to issues. We suggest creating an alert that is triggered when less than 10% of heap memory is free on a node for a sustained period, such as 2 minutes.

  - alert: OutOfMemory
    expr: 100*(jvm_memory_bytes_used{area="heap"}/jvm_memory_bytes_max{area="heap"}) > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Out of memory (instance {{ $labels.instance }})
      description: "Memory is filling up (< 10% left)"
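The for: 2m clause means the expression must stay true for 2 minutes before the alert fires; a brief spike above 90% only leaves the alert pending. The following Python sketch approximates that behaviour on a list of (timestamp, used_bytes, max_bytes) samples, which stand in for the two JVM heap metrics in the rule. It is a simplified model for reasoning about thresholds, not how Prometheus evaluates rules internally, and the function name is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def heap_alert_fires(samples: Sequence[Tuple[float, float, float]],
                     threshold_pct: float = 90.0,
                     hold_seconds: float = 120.0) -> bool:
    """Simplified model of the OutOfMemory rule: expr > threshold, held for hold_seconds.

    samples are (unix_timestamp, heap_used_bytes, heap_max_bytes), oldest first.
    """
    breach_started: Optional[float] = None
    for timestamp, used, max_bytes in samples:
        # The JVM reports max as -1 when it is undefined; treat that as "no breach".
        used_pct = 100.0 * used / max_bytes if max_bytes > 0 else 0.0
        if used_pct > threshold_pct:
            if breach_started is None:
                breach_started = timestamp  # condition just became true
        else:
            breach_started = None  # any healthy sample resets the clock
    if breach_started is None:
        return False  # condition is not true at the latest sample
    return samples[-1][0] - breach_started >= hold_seconds
```

Because a single sample below the threshold resets the clock, short-lived spikes that the garbage collector clears do not trigger the alert.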

CPU utilization

Consistently high CPU usage can have numerous causes, such as process-intensive jobs, inefficient code (for example, tight loops), or too little memory.

We recommend creating an alert that is triggered when CPU load exceeds 80% for a sustained period, such as 2 minutes.

  - alert: HighCpuLoad
    expr: java_lang_OperatingSystem_ProcessCpuLoad * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: High CPU load (instance {{ $labels.instance }})
      description: "CPU load is > 80%"

Full GC

Full garbage collection (GC) occurs when both the young and old heap generations are collected. It is time-consuming and pauses the application. Full GC can happen for a number of reasons, but a sudden spike can occur when too many large objects are loaded into memory.

We recommend monitoring for any significant increase in the number of full GCs. How you do this depends on the garbage collector in use. For the G1 garbage collector (G1GC), monitor the java_lang_G1_Old_Generation_CollectionCount metric.
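CollectionCount is a cumulative count since the JVM started, so an alert should look at how much it grows over a time window rather than at its absolute value. In PromQL this is typically done with increase(), for example increase(java_lang_G1_Old_Generation_CollectionCount[15m]). The following Python sketch shows the underlying arithmetic on a series of samples, treating any drop as a counter reset (such as a node restart). Unlike PromQL's increase(), it does not extrapolate to the edges of the window; the function name and the 15-minute window are illustrative assumptions.

```python
from typing import Sequence


def counter_increase(values: Sequence[float]) -> float:
    """Total growth of a cumulative counter over ordered samples, allowing for resets."""
    total = 0.0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter went down, so it was reset (e.g. the JVM restarted);
            # everything counted since the reset is new growth.
            total += current
    return total
```

For example, counter_increase([40, 41, 41, 47]) returns 7.0: six of those seven full GCs happened in the last interval, which is the kind of jump worth alerting on.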

Blocked threads

A high number of blocked or stuck threads means there are fewer threads available to process requests. An increase in blocked threads could indicate a problem.

We recommend creating an alert that is triggered when more than 10% of threads are blocked.

  - alert: BlockedThreads
    expr: 100 * sum by(instance) (jvm_threads_state{state="BLOCKED"}) / sum by(instance) (jvm_threads_current) > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Blocked Threads (instance {{ $labels.instance }})
      description: "Blocked Threads are > 10%"

Database connection pool

The database connection pool should be tuned for the size of the instance (for example, the number of users and apps). Its maximum size must also stay within the number of connections the database allows.

We recommend creating an alert that is triggered when the number of active connections stays near the maximum for a sustained period.

Example alert:

  - alert: DatabaseConnections
    expr: 100*(<domain>_BasicDataSource_NumActive{connectionpool="connections"}/<domain>_BasicDataSource_MaxTotal{connectionpool="connections"}) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Database Connections (instance {{ $labels.instance }})
      description: "Database Connections are filling up (< 10% left)"
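Each Confluence node has its own connection pool, so in a cluster the pools together can open up to the sum of their maximum sizes. The following Python sketch checks that total against the database's connection limit, which is the sizing constraint described above. The function name is hypothetical, and the optional reserved parameter is an assumption representing connections you want to keep free for other clients, such as administration tools.

```python
from typing import Sequence


def pool_fits_database(pool_max_per_node: Sequence[int],
                       db_max_connections: int,
                       reserved: int = 0) -> bool:
    """Check that every node's pool can reach its maximum size at the same time.

    reserved is a hypothetical allowance for other database clients.
    """
    # Each node opens its own pool, so the worst case is all pools at their maximum.
    return sum(pool_max_per_node) + reserved <= db_max_connections
```

For example, two nodes with a maximum pool size of 60 each need the database to allow at least 120 connections, plus whatever you reserve for other clients.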

Reacting to alerts

Some issues are transient, or may resolve themselves, while others could be a warning sign of a major performance degradation.

When investigating the source of the problem, the app-specific metrics described earlier on this page can help. If the metrics show that one app is spending more time, or calling an API more frequently, than others, try disabling that app to see whether performance improves. If it's a critical app, raise a support ticket and include any relevant data extracts from your monitoring with the support zip.

Last modified on May 16, 2024
