Indexing inconsistency troubleshooting
In Jira Data Center, nodes share their indexes via the shared home directory. What triggers the creation of an index snapshot, and how the snapshot is used, has changed across Jira versions.
Legacy mode
Until Jira 8.19, every starting node would request an index snapshot from an existing node in the cluster. In this mode, a new node can join the cluster only if all existing nodes have a proper index at the time the new node joins. Many things can go wrong in this scenario, for example:
- the state of the cluster is not up to date and there is no other node that can provide the index
- the node that handles the request to deliver the index has a faulty index
- the node that handles the request to deliver the index fails to create the index snapshot
- the node that handles the request to deliver the index fails to inform the starting node that the index snapshot was created
and other potential problems, which can result in JRASERVER-72125.
Index snapshot - ready on start
JIRA 8.20
In Jira 8.19 we introduced a new way for new nodes to get an index snapshot. When a new node starts, it looks for an index snapshot in the shared home. If the snapshot is fresh enough, the node restores its index from it. Since Jira 8.19, a random node produces an index snapshot every 24 hours (by default).
If the starting node fails to get a usable snapshot from the shared home (there is no snapshot, or the snapshot is not fresh enough), it falls back to legacy mode.
With this change, the chance of running into JRASERVER-72125 was greatly reduced. However, it is still possible that the index snapshot created by the scheduler is inconsistent (for example, when the scheduler runs on a node whose index is currently not consistent).
Index snapshot - quality guaranteed
JIRA 9.0
In Jira 9.0 we made a couple of changes to guarantee the quality of the index snapshots in the shared home.
Index snapshot - location
All index snapshots now use the same file naming scheme regardless of their location:
IndexSnapshot_<unique_number>_<yyMMdd-HHmmss>.<tar.sz|tar|zip>
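The timestamp embedded in the file name tells you when the snapshot was created. A minimal shell sketch to pull it out (the example file name is hypothetical):

# Extract the creation timestamp from a snapshot file name that follows the
# IndexSnapshot_<unique_number>_<yyMMdd-HHmmss> pattern shown above.
snapshot="IndexSnapshot_10123_240115-093000.tar.sz"   # hypothetical example file name
ts="$(echo "$snapshot" | sed -E 's/^IndexSnapshot_[0-9]+_([0-9]{6}-[0-9]{6})\..*$/\1/')"
echo "Snapshot created at: 20${ts:0:2}-${ts:2:2}-${ts:4:2} ${ts:7:2}:${ts:9:2}:${ts:11:2}"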
The index file and snapshot locations have also changed:
<local_home_directory>/caches/indexesV2
stores index files
<local_home_directory>/caches/indexesV2/snapshots
stores index snapshots that were:
- created by scheduled index backups
- retrieved by nodes joining the cluster
- used for snapshot recovery
- replicated to the secondary home directory
<shared_home_directory>/caches/indexesV2/snapshots
stores index snapshots created:
- on the completion of a full reindex (and retrieved by other nodes on reindex detection)
- when a new node joined the cluster
- on administrator request
- on data import
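To see which snapshots exist and how old they are, you can simply list the snapshot directories above (the paths below are examples, not universal defaults):

# List index snapshots, newest first (adjust the two home paths to your environment).
JIRA_LOCAL_HOME=/var/atlassian/application-data/jira   # example path
JIRA_SHARED_HOME=/data/jira/sharedhome                 # example path
ls -lt "$JIRA_LOCAL_HOME/caches/indexesV2/snapshots"
ls -lt "$JIRA_SHARED_HOME/caches/indexesV2/snapshots"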
Index snapshot - quality
Before creating (and sending) an index snapshot to the shared home, the node always checks whether its index is consistent. If the index is not consistent, the operation is not performed, and this is visible only in the logs of the node that was asked to create the index snapshot:
Example log message: any time a node is requested to create an index snapshot and fails the index consistency check
ERROR Index backup failed. Index backup can be done only on consistent index.
Example log message: node1 requested an index snapshot from node2
ERROR Note that node: [node1] is waiting for an index and failed to restore the index from shared and from this node
This state require admin action, Both nodes: [node1] and [node2], must obtain a consistent index.
Please check KB: https://confluence.atlassian.com/x/OYNyQg to find out how can you solve this problem.
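To check whether any node has hit these errors, you can grep each node's application log for the messages quoted above (atlassian-jira.log lives in <local_home_directory>/log by default):

# Search the application log for the snapshot-quality errors quoted above; run this on every node.
grep -E 'Index backup failed|is waiting for an index and failed to restore' atlassian-jira.log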
How to make sure there is a consistent index snapshot in the shared home
Full reindex
Running a full reindex on any node will trigger the creation of an index snapshot and send it to the shared home.
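If you prefer to trigger this from the command line rather than the admin UI, the REST reindex resource can be used; a sketch, with the base URL and credentials as placeholders (the call requires Jira admin permissions):

# Trigger a full (foreground) reindex via REST instead of the admin UI.
# Replace the base URL and credentials with your own.
curl -u admin:admin-password -X POST "https://jira.example.com/rest/api/2/reindex?type=FOREGROUND"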
Index copy
If there is a node in the cluster that has a consistent index, copying this index to any other node via the admin panel (Admin/System/Indexing/Copy the Search Index from another node) will result in an index snapshot being created in the shared home.
With the 9.0 changes, the chance of running into JRASERVER-72125 should be even lower.
Please make sure that the process of starting new nodes includes a check that an index snapshot is available in the shared home (see the sketch after this list):
- make sure that index snapshots are created by the scheduler
- any operation that triggers large-scale indexing (for example, a project import) should be followed by creating an index snapshot
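A minimal pre-start check could look like the sketch below; the shared home path and the freshness threshold are assumptions to adapt to your environment (24 hours matches the default snapshot scheduler interval mentioned above):

#!/usr/bin/env bash
# Sketch: refuse to start a new node unless the shared home contains a recent index snapshot.
SHARED_HOME="/data/jira/sharedhome"   # assumption - set to your shared home path
SNAPSHOT_DIR="$SHARED_HOME/caches/indexesV2/snapshots"
MAX_AGE_HOURS=24                      # assumption - matches the default 24h scheduler
recent_snapshot="$(find "$SNAPSHOT_DIR" -maxdepth 1 -name 'IndexSnapshot_*' -mmin "-$((MAX_AGE_HOURS * 60))" | head -n 1)"
if [ -n "$recent_snapshot" ]; then
  echo "Found fresh index snapshot: $recent_snapshot - OK to start the node."
else
  echo "No index snapshot newer than ${MAX_AGE_HOURS}h in $SNAPSHOT_DIR - create one first." >&2
  exit 1
fi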
Index Analyzer
When only a small number of issues is affected, Jira's index analyzer can list the affected issues and fix them on a specific node. Check How to use Jira's index analyzer to fix index inconsistencies
Troubleshooting
Please use the following grep across all nodes' logs to see log messages related to indexing and index management:
grep 'IndexUtils\|ArchiveUtils\|DefaultIssueIndexer\|DefaultClusterManager\|DefaultIndexCopyService\|DefaultNodeReindexService\|SnapshotDeletionPolicyContributionStrategy\|DefaultIndexManager' atlassian-jira.log
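The log lives on each node, so the grep has to be run on every node; one way to do that in a single pass (node names and log path are placeholders for your environment):

# Run the same grep on every node in one pass (node names and log path are placeholders).
for node in jira-node1 jira-node2 jira-node3; do
  echo "=== $node ==="
  ssh "$node" "grep 'IndexUtils\|ArchiveUtils\|DefaultIssueIndexer\|DefaultClusterManager\|DefaultIndexCopyService\|DefaultNodeReindexService\|SnapshotDeletionPolicyContributionStrategy\|DefaultIndexManager' /var/atlassian/application-data/jira/log/atlassian-jira.log"
done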
Q&A
How does Jira update the index with changes made after the index snapshot was created?
Every time an index snapshot is restored (whether it is a few hours old or was "just" created by another node), we run an "index fixer" after restoring the snapshot. This does not block users from accessing the node (/status may already report that the node is running), so the fixing can happen in the background.
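For example, you can watch when a node starts reporting itself as running (host and port are placeholders); keep in mind that a RUNNING state does not mean the index fixer has finished:

# Poll the node's /status endpoint until it reports RUNNING (host and port are placeholders).
# RUNNING does not mean the index fixer has finished - it keeps working in the background.
until curl -fsS "http://jira-node1:8080/status" | grep -q '"state":"RUNNING"'; do
  sleep 10
done
echo "Node reports RUNNING; check the [INDEX-FIXER] log entries to follow the background fixing."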
In Jira 8.20 we still run two index fixers:
- legacy fixer: compares the max issue update time from the database with the max issue update time from the restored index and re-indexes all issues in that time range
- version-based fixer: uses the version table to determine which issues (and related entities) need to be re-indexed (or deleted from the index)
In Jira 9.3 we removed the legacy fixer, as it is no longer needed now that all entities have versions.
To see all logs related to fixing the index after restoring it from a snapshot, grep the log for: [INDEX-FIXER]
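For example (the brackets are literal, so a fixed-string grep is simplest):

grep -F '[INDEX-FIXER]' atlassian-jira.log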
How do we calculate the time range on which the fixer should run?
If the index has meta information with a timestamp, we use that timestamp as the start of the time range (only snapshots created by a full foreground reindex have this timestamp) and the max issue update time from the database as the end of the range.
If the index has no meta information with a timestamp, we use the max issue update time from the index as the start of the range and the max issue update time from the database as the end.