Confluence Data Center disaster recovery

A disaster recovery strategy is a key part of any business continuity plan. It outlines the processes to follow in the event of a disaster, to ensure that the business can recover and keep operating. For Confluence, this means ensuring Confluence's availability in the event that your primary site becomes unavailable.

Confluence Data Center is the only Atlassian-supported high-availability solution for Confluence. 

This page demonstrates how you can use Confluence Data Center 5.9 or later to implement and manage a disaster recovery strategy for Confluence. It doesn't, however, cover broader business practices, such as setting the key objectives (RTO, RPO and RCO; see Definitions) or standard operating procedures.

What's the difference between high availability and disaster recovery?

The terms "high availability", "disaster recovery" and "failover" can often be confused. For the purposes of this page, we've defined them as follows:

  • High availability – A strategy to provide a specific level of availability. In Confluence's case, this means access to the application and an acceptable response time. Automated correction and failover (within the same location) are usually part of high-availability planning.
  • Disaster recovery – A strategy to resume operations in an alternate data center (usually in another geographic location) if the main data center becomes unavailable (that is, in a disaster). Failover (to another location) is a fundamental part of disaster recovery.
  • Failover – When one machine takes over from another machine that has failed. This could be within the same data center or from one data center to another. Failover is usually part of both high availability and disaster recovery planning.

Overview

Before you start, configure and set up your Confluence Data Center. See Set up a Confluence Data Center cluster.

This page describes what is generally referred to as a 'cold standby' strategy, which means the standby Confluence instance isn't continuously running and that you need to take some administrative steps to start the standby instance and ensure it's in a suitable state to service the business needs of your organization.

Maintaining a runbook

The detailed steps will vary from organization to organization, so we recommend you keep a full runbook of steps on file, away from the production system it references. Make your runbook detailed enough that anyone in the relevant team can complete the steps and recover your service, regardless of prior knowledge or experience. We expect any runbook to contain steps that cover the following parts of the disaster recovery process:

  1. Detection of the problem
  2. Isolation of the current production environment and bringing it down gracefully
  3. Synchronization of data between failed production and intended recovery point
  4. Warm up instructions for the recovery instance
  5. Documentation, communication, and escalation guidelines

The major components you need to consider in your disaster recovery plan are:

Confluence installation

Your standby site should have exactly the same version of Confluence installed as your production site.

Database

This is the primary source of truth for Confluence and contains most of the Confluence data (except for attachments, avatars, etc.). You need to replicate your database and continuously keep it up to date to satisfy your RPO.

Attachments

All attachments are stored in the Confluence Data Center shared home directory and should be replicated to the standby instance.

If Amazon S3 object storage is used for storing attachment data, then it should be handled separately.

Search Index

The search index isn't a primary source of truth, and can always be recreated from the database. For large installations, though, this can be quite time consuming and the functionality of Confluence will be greatly reduced until the index is fully recovered. Confluence Data Center stores Lucene search index backups in the shared home directory, which are covered by the shared home directory replication.

If you are using OpenSearch, you will need to employ a different strategy using snapshots. See Step 2. Implement a data replication strategy for more details.

Apps

User-installed apps are stored in the database and are covered by the database replication.

Other data

A few other non-critical items are stored in the Confluence Data Center shared home. Ensure they're also replicated to your standby instance.

Set up a standby system

Step 1. Install Confluence Data Center

Install the same version of Confluence on your standby system. Configure the system to attach to the standby database.

DO NOT start the standby Confluence system

Starting Confluence would write data to the database and shared home, which you do not want to do.

You may want to test the installation, in which case you should temporarily connect it to a different database and different shared home directory and start Confluence to make sure it works as expected. Don't forget to update the database configuration to point to the standby database and the shared home directory configuration to point to the standby shared home directory after your testing.

Step 2. Implement a data replication strategy

Replicating data to your standby location is crucial to a cold standby failover strategy. You don't want to fail over to your standby Confluence instance and find that it's out of date or that it takes many hours to re-index.

Database

Each database supplier that Confluence supports provides its own database replication solution.

You need to implement a database replication strategy that meets your RTO, RPO and RCO.
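
One way to keep an eye on whether database replication is meeting your RPO is to compare the time of the last change applied on the standby with the current time. The following is a minimal Python sketch of that comparison, not a supported tool. The function names are hypothetical, and how you obtain the "last replicated" timestamp is an assumption: it depends entirely on your database supplier's replication tooling.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(last_replicated_at: datetime, now: datetime) -> timedelta:
    """Return how far the standby is behind the primary (never negative)."""
    return max(now - last_replicated_at, timedelta(0))


def meets_rpo(last_replicated_at: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the data you would lose by failing over now is within the RPO."""
    return replication_lag(last_replicated_at, now) <= rpo


if __name__ == "__main__":
    # Hypothetical values: in practice the timestamp would come from your
    # database's replication status, and the RPO from your business plan.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    last = datetime(2024, 1, 1, 11, 50, tzinfo=timezone.utc)
    print(meets_rpo(last, now, rpo=timedelta(minutes=15)))
    print(meets_rpo(last, now, rpo=timedelta(minutes=5)))
```

A check like this could feed the "Detection of the problem" part of your runbook, alerting you before a disaster that the standby has fallen too far behind.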

Files

You also need to implement a file server replication strategy for the Confluence shared home directory that meets your RTO, RPO and RCO.
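
If you choose a file-copy tool such as rsync for the shared home directory, the replication job can be as simple as a scheduled command. The sketch below only builds the command line so you can review it first. The host name and paths are hypothetical, and rsync is just one possible option; your file server may offer its own replication mechanism.

```python
import shlex
from typing import List


def build_rsync_command(source_dir: str, standby_host: str, dest_dir: str,
                        dry_run: bool = False) -> List[str]:
    """Build an rsync command that mirrors the shared home to the standby."""
    # A trailing slash on the source copies the directory's contents
    # rather than creating a nested directory on the destination.
    source = source_dir.rstrip("/") + "/"
    command = ["rsync", "--archive", "--compress", "--delete"]
    if dry_run:
        # Report what would change without copying or deleting anything.
        command.append("--dry-run")
    command += [source, f"{standby_host}:{dest_dir}"]
    return command


if __name__ == "__main__":
    # Hypothetical host and paths, for illustration only.
    cmd = build_rsync_command(
        "/var/atlassian/confluence-shared-home",
        "standby.example.com",
        "/var/atlassian/confluence-shared-home",
        dry_run=True,
    )
    print(shlex.join(cmd))
```

Note that --delete removes files from the destination that no longer exist on the source, so it is worth reviewing a --dry-run before passing the list to subprocess.run(cmd, check=True).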

Object Storage

If you are using Amazon S3 object storage for attachments, you will need to implement a replication strategy for these.

OpenSearch Index

If your Confluence instance is configured to use an OpenSearch cluster that is installed and managed on-premises, you can configure cross-cluster replication.

Alternatively, you can take regular snapshots and then use a file server replication strategy to copy them to a cold standby. See the OpenSearch documentation on how to set up automatic or manual snapshots on your OpenSearch cluster.

If your Confluence instance is using a managed service, such as an AWS OpenSearch cluster, configure the replication as supported by the service provider.

Clustering considerations

For your clustered environment you need to be aware of the following, in addition to the information above:

Standby cluster

There's no need for the configuration of the standby cluster to reflect that of the live cluster. It may contain more or fewer nodes, depending on your requirements and budget. Fewer nodes may result in lower throughput, but that may be acceptable depending on your circumstances.

File locations

Where we mention <confluencesharedhome> as the location of files that need to be synchronized, we're referring to the shared home for the cluster. <confluencelocalhome> refers to the local home of each node in the cluster.

Starting the standby cluster

It's important to initially start only one node of the cluster, allow it to recover the search index, and check it's working correctly before starting additional nodes.

Disaster recovery testing

You should exercise extreme care when testing any disaster recovery plan. Simple mistakes may corrupt your live instance, for example, if test updates are inserted into your production database. A careless test may also harm your ability to recover from a real disaster.

The key is to keep the main data center as isolated as possible from the disaster recovery testing.

This procedure will ensure that the standby environment will have all the right data, but as the testing environment is completely separate from the standby environment, possible configuration problems on the standby instance are not covered.

Prerequisites

Before you perform any testing, you need to isolate your production data.  

Database

  1. Temporarily pause all replication to the standby database
  2. Replicate the data from the standby database to another database that's isolated and has no communication with the main database

Attachments, apps, and indexes

You need to ensure that no app updates or index backups occur during the test:

  1. Disable index backups
  2. Instruct sysadmins to not perform any updates in Confluence
  3. Temporarily pause all replication to the standby shared home directory
  4. Replicate the data from the standby shared home directory to another directory that's isolated and has no communication with the main shared home directory

Installation folders

  1. Clone your standby installation to a location separate from both the live and standby instances
  2. Change the database connection in the <confluencelocalhome>/confluence.cfg.xml file to avoid any conflict
  3. Change the location of the shared home directory in the <confluencelocalhome>/confluence.cfg.xml file to avoid any conflict
  4. If using TCP/IP for cluster setup, change the IP addresses to those of your testing instances in <confluencelocalhome>/confluence.cfg.xml
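
The confluence.cfg.xml changes in the steps above can be scripted so that the test clone is never accidentally left pointing at the standby database or shared home. This is a minimal Python sketch under stated assumptions: it assumes the file stores settings as <property name="..."> elements, and that the database URL and shared home are held in properties named hibernate.connection.url and confluence.cluster.home. Check the property names in your own file before relying on it. The function name and example values are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def set_properties(cfg_xml: str, updates: Dict[str, str]) -> str:
    """Return a copy of the config XML with the named properties replaced."""
    root = ET.fromstring(cfg_xml)
    remaining = dict(updates)
    for prop in root.iter("property"):
        name = prop.get("name")
        if name in remaining:
            prop.text = remaining.pop(name)
    if remaining:
        # Fail loudly rather than silently leaving a setting unchanged.
        raise KeyError(f"properties not found: {sorted(remaining)}")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical, heavily trimmed example of the file's contents.
    sample = (
        "<confluence-configuration><properties>"
        '<property name="hibernate.connection.url">'
        "jdbc:postgresql://standby-db:5432/confluence</property>"
        '<property name="confluence.cluster.home">/mnt/standby-shared</property>'
        "</properties></confluence-configuration>"
    )
    print(set_properties(sample, {
        "hibernate.connection.url": "jdbc:postgresql://test-db:5432/confluence",
        "confluence.cluster.home": "/mnt/test-shared",
    }))
```

The sketch works on a string rather than editing the file in place, so you can inspect the result and keep a backup of the original before writing anything to disk.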



After this you can resume all replication to the standby instance, including the database. 

Perform disaster recovery testing

Once you have isolated your production data, follow the steps below to test your disaster recovery plan:

  1. Ensure that the new database is ready, with the latest snapshot and no replication
  2. Ensure that the new shared home directory is ready, with the latest snapshot and no replication
  3. Ensure you have a copy of Confluence on a clean server with the right database and shared home directory settings in <confluencelocalhome>/confluence.cfg.xml
  4. Ensure you have confluence.home mapped, as it was in the standby instance, in the test server
  5. Disable email (See atlassian.mail.senddisabled in Configuring System Properties)
  6. Start Confluence  

Handling a failover

In the event your primary site is unavailable, you'll need to fail over to your standby system. The steps are as follows:  

  1. Ensure your live system is shut down and no longer updating the database
  2. Ensure the contents of <confluencesharedhome> are synced to your standby instance
  3. Perform whatever steps are required to activate your standby database
  4. Start Confluence on one node in the standby instance
  5. Wait for Confluence to start and check it is operating as expected
  6. Start up other Confluence nodes
  7. Update your DNS, HTTP proxy, or other front-end devices to route traffic to your standby server
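
The node start-up order in the steps above (one node first, verify it, then the rest) is easy to get wrong under pressure, so it can help to encode it in your runbook tooling. The following Python sketch shows only the sequencing logic. The start_node and is_healthy callables are hypothetical placeholders that you would implement for your own environment (for example, a service start command and whatever health check you already use); they are not part of Confluence.

```python
import time
from typing import Callable, List, Sequence


def start_standby_cluster(
    nodes: Sequence[str],
    start_node: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Start the first node, wait until it is healthy, then start the rest."""
    if not nodes:
        raise ValueError("at least one node is required")
    first, rest = nodes[0], list(nodes[1:])

    start_node(first)
    deadline = clock() + timeout_s
    while not is_healthy(first):
        if clock() >= deadline:
            # Stop here: starting more nodes would not help a broken first node.
            raise TimeoutError(f"{first} did not become healthy; other nodes not started")
        sleep(poll_s)

    for node in rest:
        start_node(node)
    return [first, *rest]
```

Routing traffic to the standby (the last step) is deliberately left out of the sketch, since it should only happen after someone has confirmed that Confluence is operating as expected.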

Returning to the primary instance

In most cases, you'll want to return to using your primary instance after you've resolved the problems that caused the disaster. This is easiest to achieve if you can schedule a reasonably sized outage window.

You need to: 

  • Synchronize your primary database with the state of the secondary
  • Synchronize the primary shared home directory with the state of the secondary

Perform the cut over 

  1. Shut down Confluence on the standby instance
  2. Ensure the database is synchronized correctly and configured as required
  3. Use rsync or a similar utility to synchronize the shared home directory to the primary server
  4. Start Confluence on the primary instance
  5. Check that Confluence is operating as expected
  6. Update your DNS, HTTP proxy, or other front-end devices to route traffic to your primary server

Other resources 

Troubleshooting 

If you encounter problems after failing over to your standby instance, check these FAQs for guidance:

What should I do if my database isn't synchronized correctly?

If your database doesn't have the data available that it should, then you'll need to restore the database from a backup.

Once you've restored your database, the search index will no longer be in sync with the database. You can either do a full re-index (background or foreground) or recover from the latest index snapshot, if you have one. This includes the journal id file for each index snapshot. The index snapshot can be older than your database backup; it'll synchronize itself as part of the recovery process.

 

What should I do if my search index is corrupt?

If the search index is corrupt, you can either do a full re-index, background or foreground, or recover from an earlier index snapshot from the shared home directory if you have one. 

 

What should I do if attachments are missing?

You may be able to recover them from backups if you have them, or recover from the primary site if you have access to the hard drives. Tools such as rsync may be useful in these circumstances. Missing attachments won't stop Confluence performing normally; the missing attachments won't be available, but users may be able to upload them again.
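
Before copying anything back, it can help to know exactly which files are missing. The following Python sketch lists files that exist under a reference copy of the shared home (for example, a mounted backup) but not under the standby's copy. The function name and directory layout are hypothetical, and the sketch compares file names only, not file contents.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_files(reference_root: str, standby_root: str) -> List[str]:
    """List files present under reference_root but absent under standby_root."""
    reference = Path(reference_root)
    standby = Path(standby_root)
    missing = []
    for path in sorted(reference.rglob("*")):
        if path.is_file():
            relative = path.relative_to(reference)
            if not (standby / relative).is_file():
                missing.append(relative.as_posix())
    return missing


if __name__ == "__main__":
    # Self-contained demonstration using throwaway directories.
    with tempfile.TemporaryDirectory() as ref, tempfile.TemporaryDirectory() as sb:
        (Path(ref) / "attachments").mkdir()
        (Path(ref) / "attachments" / "1.bin").write_bytes(b"a")
        (Path(ref) / "attachments" / "2.bin").write_bytes(b"b")
        (Path(sb) / "attachments").mkdir()
        (Path(sb) / "attachments" / "1.bin").write_bytes(b"a")
        print(missing_files(ref, sb))
```

The resulting list can then be passed to rsync or a similar tool to restore only the files that are actually missing.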

 

What happens to my application links during failover?

Application links are stored in the database. If the database replica is up to date, then the application links will be preserved.

You do, however, also need to consider how each end of the link knows the address of the other:

  • If you use host names to address the partners in the link and the backup Confluence server has the same hostname, via updates to the DNS or similar, then the links should remain intact and working. 
  • If the application links were built using IP addresses and these aren't the same, then the application links will need to be re-established. 
  • If you use IP addresses that are valid on the internal company network but your backup system is remote and outside the original firewall, you'll need to re-establish your application links.

Definitions

  • RPO (Recovery Point Objective) – How up to date you require your Confluence instance to be after a failure.
  • RTO (Recovery Time Objective) – How quickly you require your standby system to be available after a failure.
  • RCO (Recovery Cost Objective) – How much you are willing to spend on your disaster recovery solution.

Last modified on Dec 6, 2017
