Troubleshooting performance with Jira Stats
Platform notice: Server and Data Center only. This article only applies to Atlassian products on the Server and Data Center platforms.
Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Intro
Jira writes a number of stats logs (metrics) that can be used to troubleshoot different performance problems. This page gives an overview of the metrics you can use to measure Jira performance or to gauge the performance of the surrounding environment (disk, network, or database).
For more details on the types of logs, see the related article: Jira stats logs
Note about the Healthy state column: its values are based on data from our internal Jira instances and are provided here as a general reference.
Applicable to versions: 8.13+
DC Specific
LOCALQ, DBR, and INDEX-REPLAY logs are present only in the Data Center version.
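To confirm whether a node is producing these Data Center-only stats at all, a quick filter over the application log is often enough. The sketch below assumes each stats line contains its tag verbatim in square brackets, as the tags are written in this article; the sample lines and the helper name `group_dc_stats` are hypothetical.

```python
from collections import defaultdict

# Stats tags that, per this article, appear only on Data Center.
DC_ONLY_TAGS = ("[LOCALQ]", "[DBR]", "[INDEX-REPLAY]")


def group_dc_stats(lines):
    """Group log lines by the first DC-only stats tag they contain.

    Assumption: the tag appears verbatim in each stats log line.
    """
    groups = defaultdict(list)
    for line in lines:
        for tag in DC_ONLY_TAGS:
            if tag in line:
                groups[tag].append(line)
                break  # count each line under one tag only
    return dict(groups)


# Hypothetical, abbreviated log lines.
sample = [
    "... [LOCALQ] [VIA-INVALIDATION] ...",
    "... [index-writer-stats] ...",
    "... [INDEX-REPLAY] [STATS] ...",
]
groups = group_dc_stats(sample)
```

An empty result on a Data Center node suggests you are looking at the wrong log or time window rather than a healthy cluster.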
Disk performance
Why this metric is important to check:
slow IO will affect user requests related to Lucene search (JQL) and updates (issue Create/Update)
slow IO will affect node-to-node cache communication (slow updates of the LocalQ files)
slow IO will affect issue reindexing time: Full Reindex, Background, and Project reindex
Metrics to check:
Stats | Metric | Description | Healthy state |
---|---|---|---|
[LOCALQ] [VIA-INVALIDATION] | timeToAddMillis | Time to store a cache replication message in the local store before sending it to the other node. Reflects LocalQ write performance, which is mostly disk-bound. | ~1 ms |
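As a rough sketch of checking this metric from a log line, the code below pulls the JSON payload out of a stats line and compares the average `timeToAddMillis` with the ~1 ms reference. Several things are assumptions, not a documented format: that the payload is the text between the first `{` and the last `}`, that the metric is an object with an `avg` key, the sample line itself, and the `factor` tolerance (which is not an Atlassian threshold).

```python
import json


def extract_stats(line):
    """Return the JSON object embedded in a stats log line.

    Assumption: the payload spans from the first '{' to the last '}'.
    """
    start, end = line.find("{"), line.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON payload found")
    return json.loads(line[start:end + 1])


def local_q_write_ok(line, baseline_ms=1.0, factor=5.0):
    """Compare avg timeToAddMillis with the ~1 ms reference.

    'avg' as the nested key and 'factor' as the tolerance are hypothetical.
    """
    avg = extract_stats(line)["timeToAddMillis"]["avg"]
    return avg <= baseline_ms * factor, avg


# Hypothetical stats line; real lines may be formatted differently.
line = ('2024-01-01 10:00:00 [LOCALQ] [VIA-INVALIDATION] total stats: '
        '{"timeToAddMillis": {"count": 1200, "avg": 7.5}}')
ok, avg = local_q_write_ok(line)  # 7.5 ms is well above ~1 ms
```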
DB performance
Why this metric is important to check:
slow IO will affect user requests related to issue Create/Update, viewing a single issue, and generating reports
slow IO will affect cache population
slow IO will affect issue reindexing time: Full Reindex, Background, and Project reindex
Metrics to check:
Stats | Metric | Description | Healthy state |
---|---|---|---|
[VERSIONING] | getIssueVersionMillis | Time to read a single row in the version table. We expect this to be a fast operation that includes processing + network + DB read time. | ~5 ms |
[VERSIONING] | incrementIssueVersionMillis | Time to update (increment) a single row in the version table. We expect this to be a fast operation that includes processing + network + DB update time. | ~10 ms |
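Because both VERSIONING metrics touch a single row, expressing each observed average as a multiple of its reference value makes it easier to see which one drifted. A minimal sketch, assuming you have already read the averages (in milliseconds) from a [VERSIONING] snapshot; the observed numbers and the helper name `versioning_slowdown` are hypothetical.

```python
# Reference averages from the DB performance table (milliseconds).
VERSIONING_REFERENCE_MS = {
    "getIssueVersionMillis": 5.0,
    "incrementIssueVersionMillis": 10.0,
}


def versioning_slowdown(observed_avg_ms):
    """Return observed/reference ratio for each VERSIONING metric present."""
    return {
        name: round(observed_avg_ms[name] / ref, 2)
        for name, ref in VERSIONING_REFERENCE_MS.items()
        if name in observed_avg_ms
    }


# Hypothetical averages taken from a [VERSIONING] stats snapshot.
ratios = versioning_slowdown({
    "getIssueVersionMillis": 40.0,        # 8x the ~5 ms reference
    "incrementIssueVersionMillis": 12.0,  # close to the ~10 ms reference
})
```

Since the query itself is trivial, a large ratio on either metric more likely points at network or database latency than at the SQL being executed.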
Internode network latency
Why this metric is important to check:
slow network IO will affect cache replication (cluster consistency)
slow network IO will affect cluster index replication (DBR)
Metrics to check:
Stats | Metric | Description | Healthy state |
---|---|---|---|
[LOCALQ] [VIA-INVALIDATION] | timeToSendMillis | The time to deliver the cache replication message from the current node to the destination node. | ~10 ms |
[LOCALQ] [VIA-INVALIDATION] | queueSize | Size of the cache replication queue | 0 |
[DBR] [RECEIVER] | receiveDBRMessage | Time difference between generating the message and receiving it. Each timestamp comes from its own node's local clock, so this value includes the clock drift between nodes. Includes: serialization/deserialization + time spent in the LocalQ + time to send the message (RTT) | ~100 ms |
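The note that receiveDBRMessage includes clock drift matters when reading it: the message is stamped on the sender's clock and received on the receiver's clock, so any offset between them is added directly to the reported time. The sketch below models that arithmetic; the function name and the numbers are hypothetical.

```python
def apparent_dbr_latency_ms(sent_at_ms, true_transit_ms, receiver_clock_offset_ms):
    """Model the value a cross-node timestamp difference would report.

    sent_at_ms is read from the sender's clock; the receiver stamps the
    arrival on its own clock, which is receiver_clock_offset_ms ahead of
    the sender's (negative if it is behind).
    """
    received_at_ms = sent_at_ms + true_transit_ms + receiver_clock_offset_ms
    return received_at_ms - sent_at_ms


# A 40 ms real delivery looks unhealthy if the receiver's clock is 150 ms ahead...
ahead = apparent_dbr_latency_ms(1_000, 40, 150)   # 190
# ...and suspiciously fast if it is 30 ms behind.
behind = apparent_dbr_latency_ms(1_000, 40, -30)  # 10
```

Keeping node clocks synchronized (for example with NTP) makes this metric comparable against the ~100 ms reference.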
Index reads/writes (Lucene)
Why this metric is important to check:
slow operations will affect user requests related to issue Create/Update/Transition
slow operations will affect issue reindexing time: Full Reindex, Background, and Project reindex
The following table lists the metrics you should check.
In the Healthy value column, the values for [index-writer-stats], [DBR] [RECEIVER], and [LOCALQ] [VIA-COPY] indicate the approximate average around which you can expect a healthy value. Unless these values are greatly exceeded, the system keeps working optimally.
Stats | Metric | Description | Healthy value |
---|---|---|---|
See Disk performance | | | |
See DB performance | | | |
[index-writer-stats] | updateDocumentsWith | Time spent on all steps required to conditionally add/update the document in Lucene (search + Lucene update) | avg ~10 ms |
[DBR] [RECEIVER] | processDBRMessageUpdateWith | Time for the complex object to wait in the Lucene queue and be written to the index. Includes: waiting in the queue + conditionally updating Lucene | avg ~30 ms |
[LOCALQ] [VIA-COPY] | timeToSendMillis | Time to deliver and apply the cache replication message from the current node to the destination node. Includes serialization/deserialization, time to send the message (RTT), acknowledgement from the receiver, and updating Lucene. | avg ~50 ms |
| flushIntervalMillis | The average time between flushes since the last snapshot. This metric is mostly relevant to foreground reindexing and may indicate the need to increase the Lucene buffer. | avg > 1 sec (Note: during foreground indexing) |
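Because the healthy values above are averages, how you combine several snapshots matters. Averaging the per-snapshot averages gives a quiet period with a handful of slow operations the same weight as a busy period with thousands of fast ones. The sketch below pools snapshots by total time over total count instead; it assumes each snapshot exposes a `sum` and a `count` for the metric, and the snapshot values are hypothetical.

```python
def pooled_avg(snapshots):
    """Average across snapshots, weighted by each snapshot's count.

    Assumption: each snapshot provides 'sum' (ms) and 'count' for the metric.
    """
    total = sum(s["sum"] for s in snapshots)
    count = sum(s["count"] for s in snapshots)
    return total / count if count else 0.0


# Hypothetical updateDocumentsWith snapshots: a busy hour and a quiet hour.
snapshots = [
    {"sum": 9_000, "count": 1_000},  # avg 9 ms
    {"sum": 600, "count": 10},       # avg 60 ms
]
naive = sum(s["sum"] / s["count"] for s in snapshots) / len(snapshots)
pooled = pooled_avg(snapshots)
```

Here the naive figure (34.5 ms) looks far above the ~10 ms reference, while the pooled average (about 9.5 ms) shows that almost every update was fast.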
Indexing performance
Why this metric is important to check:
slow operations will affect user requests related to issue Create/Update/Transition
slow operations will affect issue reindexing time: Full Reindex, Background, and Project reindex
slow operations will affect cluster consistency
Metrics to check:
Stats | Metric | Description | Healthy state |
---|---|---|---|
See Disk performance | | | |
See DB performance | | | |
See Index reads/writes (Lucene) | | | |
[indexing-stats] | addIndex.avg | Average time to load the data from the custom field provider | Depends on custom field functionality. This directly affects indexing time, and therefore every request that creates or updates issues or comments, as well as full reindex time. |
[INDEX-REPLAY] [STATS] | timeInMillis | Time to process a batch of index operations; the replay process runs every 5 seconds | ~5 secs |
[INDEX-REPLAY] [STATS] | updateIndexInMillis | Time to perform all index operations still required after checking DBR (document creation + Lucene update) in a specific batch | ~5 secs |
[INDEX-REPLAY] [STATS] | DBR effectiveness | DBR effectiveness = (1 - filterOutAlreadyIndexedAfterCounter.ISSUE.sum / filterOutAlreadyIndexedBeforeCounter.ISSUE.sum) | > 90% |
[INDEX-REPLAY] [STATS] | numberOfRemoteOperations | The number of processed index operations from other nodes; helps check traffic distribution and cluster load | |
[INDEX-REPLAY] [STATS] | numberOfLocalOperations | The number of processed index operations from the current node; helps check traffic distribution and cluster load | |
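The DBR effectiveness formula and the local/remote operation counts can be computed directly from an [INDEX-REPLAY] [STATS] snapshot. The sketch below implements the formula from the table and, as an extra view of traffic distribution, the share of operations that came from other nodes. Reading the counter names, "before" appears to count ISSUE operations entering the already-indexed filter and "after" those still needing indexing; that interpretation, the helper names, and the sample counts are assumptions.

```python
def dbr_effectiveness(after_sum, before_sum):
    """DBR effectiveness = 1 - after / before, as defined in the table above."""
    if before_sum == 0:
        return None  # no ISSUE operations in this snapshot
    return 1 - after_sum / before_sum


def remote_share(remote_ops, local_ops):
    """Fraction of replayed index operations that originated on other nodes."""
    total = remote_ops + local_ops
    return remote_ops / total if total else None


# Hypothetical counter values from one [INDEX-REPLAY] [STATS] snapshot.
eff = dbr_effectiveness(after_sum=30, before_sum=1_000)        # ~0.97
share = remote_share(remote_ops=3_000, local_ops=1_000)        # 0.75
healthy = eff is not None and eff > 0.90
```

Returning `None` for an empty snapshot avoids reporting a misleading 0% or 100% when a node simply had no traffic in that period.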