Confluence Data Center disaster recovery

A disaster recovery strategy is a key part of any business continuity plan. It outlines the processes to follow in the event of a disaster, to ensure that the business can recover and keep operating. For Confluence, this means ensuring Confluence's availability in the event that your primary site becomes unavailable.

Confluence Data Center is the only Atlassian-supported high-availability solution for Confluence.

This page demonstrates how you can use Confluence Data Center 5.9 or later to implement and manage a disaster recovery strategy for Confluence. It doesn't, however, cover the broader business practices, such as setting the key objectives (RTO, RPO and RCO¹) and standard operating procedures.

What's the difference between high availability and disaster recovery?

The terms "high availability", "disaster recovery" and "failover" can often be confused. For the purposes of this page, we've defined them as follows:

High availability – A strategy to provide a specific level of availability. In Confluence's case, this means access to the application and an acceptable response time. Automated correction and failover (within the same location) are usually part of high-availability planning.
Disaster recovery – A strategy to resume operations in an alternate data center (usually in another geographic location) if the main data center becomes unavailable (that is, in a disaster). Failover (to another location) is a fundamental part of disaster recovery.
Failover – When one machine takes over from another machine after that machine fails. This could be within the same data center or from one data center to another. Failover is usually part of both high availability and disaster recovery planning.

Overview

Before you start, configure and set up your Confluence Data Center. See Set up a Confluence Data Center cluster.

This page describes what is generally referred to as a 'cold standby' strategy, which means the standby Confluence instance isn't continuously running and that you need to take some administrative steps to start the standby instance and ensure it's in a suitable state to service the business needs of your organization.

Maintaining a runbook

The detailed steps will vary from organization to organization and, as such, we recommend you keep a full runbook of steps on file, away from the production system it references. Make your runbook detailed enough such that anyone in the relevant team should be able to complete the steps and recover your service, regardless of prior knowledge or experience. We expect any runbook to contain steps that cover the following parts of the disaster recovery process:

Detection of the problem
Isolation of the current production environment and bringing it down gracefully
Synchronization of data between failed production and intended recovery point
Warm up instructions for the recovery instance
Documentation, communication, and escalation guidelines

The major components you need to consider in your disaster recovery plan are:

Confluence installation	Your standby site should have exactly the same version of Confluence installed as your production site.
Database	This is the primary source of truth for Confluence and contains most of the Confluence data (except for attachments, avatars, etc.). You need to replicate your database and continuously keep it up to date to satisfy your RPO¹.
Attachments	All attachments are stored in the Confluence Data Center shared home directory and should be replicated to the standby instance. If Amazon S3 object storage is used for storing attachment data, then it should be handled separately.
Search Index	The search index isn't a primary source of truth, and can always be recreated from the database. For large installations, though, this can be quite time consuming and the functionality of Confluence will be greatly reduced until the index is fully recovered. Confluence Data Center stores Lucene search index backups in the shared home directory, which are covered by the shared home directory replication. If you are using OpenSearch, you will need to employ a different strategy using snapshots. See Step 2. Implement a data replication strategy for more details.
Apps	User-installed apps are stored in the database and are covered by the database replication.
Other data	A few other non-critical items are stored in the Confluence Data Center shared home. Ensure they're also replicated to your standby instance.

Set up a standby system

Step 1. Install Confluence Data Center

Install the same version of Confluence on your standby system. Configure the system to attach to the standby database.

DO NOT start the standby Confluence system

Starting Confluence would write data to the database and shared home, which you do not want to do.

You may want to test the installation, in which case you should temporarily connect it to a different database and different shared home directory and start Confluence to make sure it works as expected. Don't forget to update the database configuration to point to the standby database and the shared home directory configuration to point to the standby shared home directory after your testing.
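
One way to confirm the configuration after testing, without starting Confluence, is to read `<confluencelocalhome>/confluence.cfg.xml` and check where it points. The following is a minimal sketch, not an Atlassian-provided tool. It assumes the file stores the JDBC URL in a property named `hibernate.connection.url` and the shared home in `confluence.cluster.home`; verify both names against your own file. The host names and paths in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_properties(cfg_xml: str) -> dict:
    """Return every <property name="..."> value found in the config text."""
    root = ET.fromstring(cfg_xml)
    return {p.get("name"): (p.text or "").strip() for p in root.iter("property")}


def check_standby_config(cfg_xml: str, expected_db_host: str,
                         expected_shared_home: str) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    props = read_properties(cfg_xml)
    problems = []
    # Assumed property name for the JDBC URL.
    url = props.get("hibernate.connection.url", "")
    if expected_db_host not in url:
        problems.append(f"database URL does not mention {expected_db_host}: {url!r}")
    # Assumed property name for the shared home directory.
    shared_home = props.get("confluence.cluster.home", "")
    if shared_home != expected_shared_home:
        problems.append(f"shared home is {shared_home!r}, expected {expected_shared_home!r}")
    return problems


# Hypothetical sample content, for illustration only.
STANDBY_SAMPLE = """<confluence-configuration>
  <properties>
    <property name="hibernate.connection.url">jdbc:postgresql://standby-db.example.com:5432/confluence</property>
    <property name="confluence.cluster.home">/mnt/standby-shared-home</property>
  </properties>
</confluence-configuration>"""

if __name__ == "__main__":
    for problem in check_standby_config(
            STANDBY_SAMPLE, "standby-db.example.com", "/mnt/standby-shared-home"):
        print("PROBLEM:", problem)
```

The check only reads the file, so it is safe to run against the standby installation while Confluence is stopped.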

Step 2. Implement a data replication strategy

Replicating data to your standby location is crucial to a cold standby failover strategy. You don't want to fail over to your standby Confluence instance and find that it's out of date or that it takes many hours to re-index.

Database	All of the following Confluence-supported database vendors provide their own database replication solutions: Oracle (http://www.oracle.com/technetwork/database/features/data-integration/index.html), PostgreSQL (https://wiki.postgresql.org/wiki/Binary_Replication_Tutorial), MySQL (http://dev.mysql.com/doc/refman/5.7/en/replication.html), Microsoft SQL Server (http://msdn.microsoft.com/en-us/library/ms151198.aspx), Amazon Aurora (https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Replication.html#Aurora.Replication.AuroraPostgreSQL), and Azure PostgreSQL Flexible Server (see "Read replicas - Azure Database for PostgreSQL - Flexible Server"). You need to implement a database replication strategy that meets your RTO, RPO and RCO¹.
Files	You also need to implement a file server replication strategy for the Confluence shared home directory that meets your RTO, RPO and RCO¹.
Object Storage	If you are using Amazon S3 object storage for attachments, you will need to implement a replication strategy for these.
OpenSearch Index	If your Confluence instance is configured to use an OpenSearch cluster that is installed and managed on-premises, you can configure cross-cluster replication. Alternatively, you can take regular snapshots and then use a file server replication strategy to copy them to a cold standby. See the OpenSearch documentation on how to set up automatic or manual snapshots on your OpenSearch cluster. If your Confluence instance is using a managed service such as an AWS OpenSearch cluster, configure the replication as supported by the service provider.
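
As an illustration of the file replication row above, the sketch below builds an rsync invocation that mirrors the shared home directory to a standby host. It is only one possible approach, and your storage platform may offer its own replication. The paths and the `standby.example.com` host are hypothetical, and the sketch assumes rsync and SSH access are already set up between the two sites.

```python
import subprocess


def build_rsync_command(source_dir: str, target: str,
                        delete: bool = True, dry_run: bool = False) -> list:
    """Build an rsync command that mirrors source_dir's contents into target."""
    # A trailing slash tells rsync to copy the directory's contents,
    # not the directory itself.
    source = source_dir.rstrip("/") + "/"
    command = ["rsync", "-a", "-z"]  # archive mode, compress in transit
    if delete:
        # Remove files on the standby that no longer exist on the primary.
        command.append("--delete")
    if dry_run:
        # Report what would change without copying anything.
        command.append("--dry-run")
    command += [source, target]
    return command


def replicate_shared_home(source_dir: str, target: str, dry_run: bool = False) -> None:
    """Run the mirror; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source_dir, target, dry_run=dry_run), check=True)


if __name__ == "__main__":
    # Hypothetical locations; print the command rather than running it.
    print(" ".join(build_rsync_command(
        "/var/atlassian/shared-home",
        "standby.example.com:/var/atlassian/shared-home",
        dry_run=True)))
```

Running with `dry_run=True` first is a cheap way to see what `--delete` would remove before you schedule the job. How often you run it determines how much file data you could lose, so choose an interval that fits your RPO.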

Clustering considerations

For your clustered environment you need to be aware of the following, in addition to the information above:

Standby cluster	There's no need for the configuration of the standby cluster to reflect that of the live cluster. It may contain more or fewer nodes, depending on your requirements and budget. Fewer nodes may result in lower throughput, but that may be acceptable depending on your circumstances.
File locations	Where we mention `<confluencesharedhome>` as the location of files that need to be synchronized, we're referring to the shared home for the cluster. `<confluencelocalhome>` refers to the local home of the node in the cluster.
Starting the standby cluster	It's important to initially start only one node of the cluster, allow it to recover the search index, and check it's working correctly before starting additional nodes.

Disaster recovery testing

You should exercise extreme care when testing any disaster recovery plan. Simple mistakes may corrupt your live instance, for example, if testing updates are inserted into your production database. A careless test can also harm your ability to recover from a real disaster.

The key is to keep the main data center as isolated as possible from the disaster recovery testing.

This procedure will ensure that the standby environment will have all the right data, but as the testing environment is completely separate from the standby environment, possible configuration problems on the standby instance are not covered.

Prerequisites

Before you perform any testing, you need to isolate your production data.

Database

Temporarily pause all replication to the standby database
Replicate the data from the standby database to another database that's isolated from, and has no communication with, the main database

Attachments, apps, and indexes

You need to ensure that no app updates or index backups occur during the test:

Disable index backups
Instruct sysadmins to not perform any updates in Confluence
Temporarily pause all replication to the standby shared home directory
Replicate the data from the standby shared home directory to another directory that's isolated from, and has no communication with, the main shared home directory

Installation folders

Clone your standby installation separate from both the live and standby instances
Change the connection to the database in the <confluencelocalhome>/confluence.cfg.xml file to avoid any conflict
Change the location of the shared home directory in the <confluencelocalhome>/confluence.cfg.xml file to avoid any conflict
If using TCP/IP for cluster setup, change the IP addresses to that of your testing instances in <confluencelocalhome>/confluence.cfg.xml
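
The configuration changes above could be scripted along the following lines. This is a sketch under assumptions rather than a supported procedure. It assumes the cloned `confluence.cfg.xml` stores the JDBC URL in `hibernate.connection.url`, the shared home in `confluence.cluster.home`, and the TCP/IP peer list in `confluence.cluster.peers`; check these names in your own file. All hosts, addresses, and paths shown are hypothetical.

```python
import xml.etree.ElementTree as ET


def retarget_config(cfg_xml: str, overrides: dict) -> str:
    """Return config text with the named properties replaced.

    Raises KeyError if any requested property is missing, so a typo
    cannot silently leave the clone pointing at the standby systems.
    """
    root = ET.fromstring(cfg_xml)
    remaining = dict(overrides)
    for prop in root.iter("property"):
        name = prop.get("name")
        if name in remaining:
            prop.text = remaining.pop(name)
    if remaining:
        raise KeyError(f"properties not found: {sorted(remaining)}")
    return ET.tostring(root, encoding="unicode")


# Hypothetical content of the cloned installation's confluence.cfg.xml.
CLONE_SAMPLE = """<confluence-configuration>
  <properties>
    <property name="hibernate.connection.url">jdbc:postgresql://standby-db.example.com:5432/confluence</property>
    <property name="confluence.cluster.home">/mnt/standby-shared-home</property>
    <property name="confluence.cluster.peers">10.0.1.10,10.0.1.11</property>
  </properties>
</confluence-configuration>"""

# Hypothetical values for the isolated test environment.
TEST_OVERRIDES = {
    "hibernate.connection.url": "jdbc:postgresql://dr-test-db.example.com:5432/confluence",
    "confluence.cluster.home": "/mnt/dr-test-shared-home",
    "confluence.cluster.peers": "10.0.9.10",
}

if __name__ == "__main__":
    print(retarget_config(CLONE_SAMPLE, TEST_OVERRIDES))
```

Apply the result only to the clone, never to the standby or live installation, and keep a copy of the original file.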

After this you can resume all replication to the standby instance, including the database.

Perform disaster recovery testing

Once you have isolated your production data, follow the steps below to test your disaster recovery plan:

Ensure that the new database is ready, with the latest snapshot and no replication
Ensure that the new shared home directory is ready, with the latest snapshot and no replication
Ensure you have a copy of Confluence on a clean server with the right database and shared home directory settings in <confluencelocalhome>/confluence.cfg.xml
Ensure you have confluence.home mapped on the test server as it was on the standby instance
Disable email (See atlassian.mail.senddisabled in Configuring System Properties)
Start Confluence

Handling a failover

In the event your primary site is unavailable, you'll need to fail over to your standby system. The steps are as follows:

Ensure your live system is shut down and no longer updating the database
Ensure the contents of <confluencesharedhome> are synced to your standby instance
Perform whatever steps are required to activate your standby database
Start Confluence on one node in the standby instance
Wait for Confluence to start and check it is operating as expected
Start up other Confluence nodes
Update your DNS, HTTP Proxy, or other front end devices to route traffic to your standby server
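
The node start-up order in the steps above (one node first, then the rest once it is healthy) can be sketched as follows. This is an illustration, not an Atlassian tool. It assumes each node exposes a `/status` endpoint that returns JSON with a `state` field equal to `RUNNING` once the node is up, and `start_node` is a hypothetical callback standing in for however you start Confluence on a host.

```python
import json
import time
import urllib.request


def fetch_state(base_url: str, timeout: float = 5.0) -> str:
    """Read the node's reported state from its (assumed) /status endpoint."""
    url = base_url.rstrip("/") + "/status"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response).get("state", "UNKNOWN")


def wait_until_running(base_url: str, fetch=fetch_state, attempts: int = 60,
                       delay: float = 10.0, sleep=time.sleep) -> bool:
    """Poll the node until it reports RUNNING or the attempts run out."""
    for attempt in range(attempts):
        try:
            state = fetch(base_url)
        except (OSError, ValueError):
            # Connection refused, HTTP error, or a non-JSON response.
            state = "UNREACHABLE"
        if state == "RUNNING":
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


def start_standby_cluster(node_urls: list, start_node, wait=wait_until_running) -> None:
    """Start the first node, confirm it is healthy, then start the others."""
    first, *others = node_urls
    start_node(first)
    if not wait(first):
        raise RuntimeError(f"{first} did not reach RUNNING; not starting other nodes")
    for url in others:
        start_node(url)
```

A `RUNNING` state only shows that the node has started. You should still check manually that Confluence is operating as expected before routing traffic to the standby site.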

Returning to the primary instance

In most cases, you'll want to return to using your primary instance after you've resolved the problems that caused the disaster. This is easiest to achieve if you can schedule a reasonably-sized outage window.

You need to:

Synchronize your primary database with the state of the secondary
Synchronize the primary shared home directory with the state of the secondary

Perform the cut over

Shut down Confluence on the standby instance
Ensure the database is synchronized correctly and configured as required
Use rsync or a similar utility to synchronize the shared home directory to the primary server
Start Confluence
Check that Confluence is operating as expected
Update your DNS, HTTP Proxy, or other front end devices to route traffic to your primary server

Other resources

Troubleshooting

If you encounter problems after failing over to your standby instance, check these FAQs for guidance:

What should I do if my database isn't synchronized correctly?

If your database doesn't have the data available that it should, then you'll need to restore the database from a backup.

Once you've restored your database, the search index will no longer be in sync with the database. You can either do a full re-index, background or foreground, or recover from the latest index snapshot if you have one. This includes the journal id file for each index snapshot. The index snapshot can be older than your database backup; it'll synchronize itself as part of the recovery process.

What should I do if my search index is corrupt?

If the search index is corrupt, you can either do a full re-index, background or foreground, or recover from an earlier index snapshot from the shared home directory if you have one.

What should I do if attachments are missing?

You may be able to recover them from backups if you have them, or recover from the primary site if you have access to the hard drives. Tools such as rsync may be useful in these circumstances. Missing attachments won't stop Confluence performing normally; the missing attachments won't be available, but users may be able to upload them again.

What happens to my application links during failover?

Application links are stored in the database. If the database replica is up to date, then the application links will be preserved.

You do, however, also need to consider how each end of the link knows the address of the other:

If you use host names to address the partners in the link and the backup Confluence server has the same hostname, via updates to the DNS or similar, then the links should remain intact and working.
If the application links were built using IP addresses and these aren't the same, then the application links will need to be re-established.
If you use IP addresses that are valid on the internal company network but your backup system is remote and outside the original firewall, you'll need to re-establish your application links.

Definitions

RPO	Recovery Point Objective	How up-to-date you require your Confluence instance to be after a failure.
RTO	Recovery Time Objective	How quickly you require your standby system to be available after a failure.
RCO	Recovery Cost Objective	How much you are willing to spend on your disaster recovery solution.
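
To make the RPO definition concrete, the sketch below compares how far a replica lags behind with a chosen RPO. It is illustrative only: the timestamp of the last replicated change would come from your own database or file replication tooling, and the 15-minute objective is a hypothetical value.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(last_replicated_at: datetime, now: datetime = None) -> timedelta:
    """How far the standby is behind; never negative, to tolerate clock skew."""
    now = now or datetime.now(timezone.utc)
    return max(now - last_replicated_at, timedelta(0))


def meets_rpo(last_replicated_at: datetime, rpo: timedelta, now: datetime = None) -> bool:
    """True if a failover right now would lose no more data than the RPO allows."""
    return replication_lag(last_replicated_at, now) <= rpo


if __name__ == "__main__":
    rpo = timedelta(minutes=15)  # hypothetical objective
    last_change = datetime.now(timezone.utc) - timedelta(minutes=4)
    print("within RPO" if meets_rpo(last_change, rpo) else "RPO breached")
```

A check like this could feed your monitoring, so that you find out replication has stalled before you need to fail over.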
