Confluence Data Center sample deployment and monitoring strategy
At Atlassian, we use several enterprise-scale development tools. Our Confluence Data Center instance is used by about 2,400 full-time Atlassian employees around the world.
How we use Confluence
Collaboration and open communication are vital to our culture, and much of this collaboration happens in Confluence. As of April 2018, we had 14,930,000 content items across 6,500 total spaces. In a six-hour snapshot of the instance's traffic, we saw an average of 341,000 HTTP calls per hour (with one hour peaking at 456,000 HTTP calls). Based on our Confluence Data Center load profiles, we'd categorise this instance as Large for both content and traffic.
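For rough capacity planning, it can help to translate those hourly figures into requests per second. A minimal sketch of the arithmetic, using the April 2018 numbers quoted above:

```python
# Convert the hourly HTTP call counts quoted above into requests per second.
average_calls_per_hour = 341_000
peak_calls_per_hour = 456_000
seconds_per_hour = 3600

print(f"Average: {average_calls_per_hour / seconds_per_hour:.0f} requests/second")  # ~95
print(f"Peak:    {peak_calls_per_hour / seconds_per_hour:.0f} requests/second")     # ~127
```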
Our Confluence Data Center instance is used heavily during working hours in each of our locations, so to keep all Atlassians happy and productive, we need to make sure it performs. The instance is highly available, so an individual node failure won't be fatal to the instance. This means we can focus more on maintaining acceptable performance, and less on keeping the site running.
Infrastructure and setup
Our Confluence Data Center instance is hosted on an Amazon Web Services (AWS) Virtual Private Cloud, and is made up of the following:
Function | Instance type | Number |
---|---|---|
Load balancer | AWS Application Load Balancer | 2 |
Confluence application | c5.2xlarge (running Amazon Linux) | 4 |
Synchrony application (for collaborative editing) | c5.large (running Amazon Linux) | 2 |
Database (Amazon RDS for PostgreSQL) | m4.xlarge | 1 |
Shared home directory | Elastic File System (2.1 TiB) | 1 |
Load distribution is managed by a proprietary Virtual Traffic Manager (VTM) and an AWS Application Load Balancer. The Atlassian VTM performs two functions:
- Routing traffic between different Atlassian instances on the same domain, and
- Terminating SSL for the Confluence Data Center instance.
The Synchrony cluster has two nodes, has XHR fallback enabled, and does not use the internal Synchrony proxy. The instance's shared home directory sits on 2.1 TiB of Amazon Elastic File System storage.
Refer to the AWS documentation on Instance Types (specifically, General Purpose Instances and Compute-Optimized Instances) for details on each node type.
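If you run a similar topology on AWS, you can sanity-check it from the API. The sketch below is purely illustrative rather than part of our tooling: it assumes the cluster's EC2 instances carry a `Cluster` tag with the value `confluence-dc` (both the tag key and value are assumptions) and simply counts running instances per instance type.

```python
# Hypothetical sketch: count running EC2 instances per instance type for a
# tagged Confluence Data Center cluster. The tag key and value are assumptions.
from collections import Counter

import boto3

ec2 = boto3.client("ec2")
counts = Counter()

paginator = ec2.get_paginator("describe_instances")
pages = paginator.paginate(
    Filters=[
        {"Name": "tag:Cluster", "Values": ["confluence-dc"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)
for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            counts[instance["InstanceType"]] += 1

# With the deployment described above, we'd expect something like:
# c5.2xlarge: 4 (Confluence), c5.large: 2 (Synchrony)
for instance_type, count in sorted(counts.items()):
    print(f"{instance_type}: {count}")
```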
Integrated services
This Confluence Data Center instance has 70 user-installed apps enabled, and is linked to the following mix of Atlassian applications:
- 5 x Jira Software and Service Desk
- 7 x Confluence
- 2 x Bitbucket
- 11 x Bamboo
- 2 x Fisheye / Crucible
This instance is also connected to Crowd, for user management.
Monitoring strategy
Our performance monitoring strategy is built around a target Apdex of 0.7, and we treat anything below 0.4 as unacceptable. We calculate the index with a Tolerating threshold (T) of 1 second: responses within 1 second count as satisfied, responses between 1 and 4 seconds as tolerating, and anything beyond 4 seconds as frustrated.
With the Apdex index, maintaining general satisfaction in the instance involves managing the ratio of "happy" and "unhappy" users. Refer to Apdex overview for more information.
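To make those targets concrete, here's a minimal sketch of the standard Apdex calculation using the thresholds above (T = 1 second, frustrated beyond 4T = 4 seconds). The sample response times are made up for illustration:

```python
# Standard Apdex: (satisfied + tolerating / 2) / total samples,
# with T = 1.0 s, so tolerating covers 1-4 s and frustrated is > 4 s.
def apdex(response_times_seconds, t=1.0):
    satisfied = sum(1 for r in response_times_seconds if r <= t)
    tolerating = sum(1 for r in response_times_seconds if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times_seconds)

# Hypothetical sample: seven fast responses, two tolerable, one frustrating.
samples = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.5, 2.5, 6.0]
print(f"Apdex: {apdex(samples):.2f}")  # 0.80 -- above our 0.7 target
```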
Keeping the Apdex at an acceptable level means actively monitoring the instance for potential problems that could cause major slowdowns. Many of these alerts, monitoring strategies, and action plans are based on previous incidents that we've since learned to resolve quickly or avoid.
The following tables list our monitoring alerts, and what we do when they're triggered.
General load
Metric we track | Alerting level | What we do when alert is triggered |
---|---|---|
Long-running tasks | Our monitoring tools send us an alert when a user starts certain long-running tasks (a large space export, for example), and another alert for each hour the task continues to run. | If the task appears to be stuck and starts triggering other alerts, we'll usually restart the node and kill the task. We'll also contact the user to discuss other options. For space exports, this could mean reducing the size of the exported space, or exporting from a staging server. |
Network throughput | 20Mbps or higher (as of April 2018) | We investigate other metrics to see if anything (other than high user traffic) caused the increased throughput. Over time, we check how many times the throughput triggers the alert to see whether we need to tweak the infrastructure (and the alerting level) again. |
Number of active database connections. Having too many active database connections can slow down Confluence. | More than 1,000 connections | The m4.xlarge node type supports a maximum of 1,320 connections. If a node triggers the alert and the connection count keeps rising, we'll perform a rolling restart. We'll also raise a ticket against Confluence, as a bug (specifically, a database connection leak) could also trigger this alert. To date, our instance has never triggered this alert. |
Node CPU usage | We set two CPU usage alerts for each application node (the lower one acts as a warning), and a similar pair of alerts for the database node. | When an application node triggers its warning alert, we perform a heap dump and thread dump and investigate further. We perform a rolling restart if we think the node is about to crash. To date, the database node has always recovered on its own whenever it triggered its alerts. |
Garbage Collector pauses. We also track this metric for development feedback. Pauses help us identify areas of Confluence that unnecessarily create a lot of objects. We analyze those areas for ways to improve Confluence. | Any Garbage Collector pause that lasts longer than 5 seconds | Usually this alert requires no action, but it can help warn us of possible outages. The data we collect here also helps us diagnose root causes of other outages. If the Garbage Collector triggers this alert frequently, we check if the instance requires heap tuning. |
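As a rough illustration of how the Garbage Collector pause alert could be implemented, here's a sketch that scans a GC log for long pauses. It assumes unified JVM GC logging where pause lines end in a millisecond duration such as `12.345ms`; the log path and the exact line format are assumptions, so adjust the pattern to your own GC log flavour.

```python
# Rough sketch: flag GC pauses longer than 5 seconds in a unified GC log.
# Assumes pause lines end with a millisecond duration like "... 12.345ms";
# the log path and pattern are illustrative, not our production tooling.
import re

PAUSE_THRESHOLD_MS = 5_000
PAUSE_LINE = re.compile(r"Pause.*?(\d+(?:\.\d+)?)ms\s*$")

def long_pauses(gc_log_path, threshold_ms=PAUSE_THRESHOLD_MS):
    with open(gc_log_path) as log:
        for line in log:
            match = PAUSE_LINE.search(line)
            if match and float(match.group(1)) > threshold_ms:
                yield line.rstrip()

# The GC log location depends on your JVM arguments; this path is a placeholder.
for line in long_pauses("/var/atlassian/application-data/confluence/logs/gc.log"):
    print("Long GC pause:", line)
```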
Load balancer
Metric we track | Alerting level | What we do when alert is triggered |
---|---|---|
Number of timeouts | 300 timeouts within a 1-hour period | A burst of timeouts like this is typically followed by other alerts. We investigate any triggered alerts and other metrics to see if an outage or similar problem is imminent. |
Node health. See Load balancer configuration options for related details. | Whenever the load balancer disconnects or re-connects a node | When a node disconnects, we check its state and restart it if necessary. We also check for other triggered alerts to see what could have caused the node to disconnect. |
Internal server errors. When the load balancer returns an internal server error (error code 500), it usually indicates that there are no back-end nodes available to process a request. | More than 10 internal server errors from the load balancer in a second | We check for other triggered alerts to see if there's a problem in a specific subsystem (for example, database or storage). |
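One way to approximate the internal server error alert is a CloudWatch alarm on the Application Load Balancer's 5XX count. This is a hedged sketch: the `LoadBalancer` dimension value and SNS topic ARN are placeholders, and because standard ALB metrics have one-minute granularity, it approximates the per-second rule as a per-minute sum.

```python
# Hypothetical sketch: alarm when the ALB itself returns too many HTTP 500s.
# The LoadBalancer dimension value and SNS topic ARN below are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="confluence-alb-5xx",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_ELB_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/confluence/0123456789abcdef"}],
    Statistic="Sum",
    Period=60,                      # standard ALB metrics are per-minute
    EvaluationPeriods=1,
    Threshold=600,                  # rough per-minute equivalent of >10/second
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:confluence-alerts"],
)
```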
Storage
Metric we track | Alerting level | What we do when alert is triggered |
---|---|---|
I/O on the shared home file system. High I/O slows down file access, which can also lead to timeouts. | PercentIOLimit is greater than 98, meaning the shared home's file system I/O is over 98% of its limit. See Monitoring with Amazon CloudWatch for more details. | We investigate whether we need to increase the I/O limit. |
Disk space. We need to know early whether we should expand disk space to accommodate usage, or investigate storage-related issues. | We set two alerts at different levels of free disk space. | If free space is running low but the rate of consumption remains normal, we expand the available storage. If the rate of disk consumption spikes abnormally, we check for misbehaving processes. This also involves checking the amount of disk space consumed within the last 24 hours. |
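A minimal sketch of a two-level free disk space check. The mount point and both thresholds are hypothetical placeholders, since the actual alert levels aren't stated here:

```python
# Hypothetical sketch: warn at two levels of free disk space.
# The mount point and thresholds are placeholders, not our real alert levels.
import shutil

MOUNT_POINT = "/var/atlassian/application-data/confluence"
WARNING_FREE_PERCENT = 20    # placeholder "start planning" level
CRITICAL_FREE_PERCENT = 10   # placeholder "act now" level

usage = shutil.disk_usage(MOUNT_POINT)
free_percent = usage.free / usage.total * 100

if free_percent < CRITICAL_FREE_PERCENT:
    print(f"CRITICAL: only {free_percent:.1f}% free on {MOUNT_POINT}")
elif free_percent < WARNING_FREE_PERCENT:
    print(f"WARNING: {free_percent:.1f}% free on {MOUNT_POINT}")
```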
Stability monitoring
The following alerts relate to the instance's overall stability. They don't get triggered very often, as most of the work we do to address performance problems also prevents outages.
Metric we track | Events we alert for | What we do when alert is triggered |
---|---|---|
Common error conditions | A set of common errors we've learned to watch for; the list isn't exhaustive. | Most of the errors we alert for are documented in the Confluence Knowledge Base, along with how to prevent them from causing an outage. |
Hazelcast cluster membership. By default, Hazelcast removes and re-adds a node if it doesn't send a heartbeat within 30 seconds. We configured ours to do this within 60 seconds instead. | When a node is removed from and re-added to the cluster | We perform a log analysis, thread dump, and heap dump on the affected node. If any of these show a cluster panic or outage is imminent, we perform a rolling restart. |
Cluster panics. These are severe failures, and usually require a full Confluence restart. | Every time a cluster panic occurs. Our monitoring tools check the application logs for events that have correlated with panics the instance experienced in the past. | We perform a full Confluence restart. This means shutting down all application nodes, and starting them up one by one. See Data Center Troubleshooting and Confluence Data Center Cluster Troubleshooting for related information. |
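As a rough, hypothetical illustration of the log matching described above (the log path and the search strings are assumptions, not the exact patterns our monitoring tools use), a scan for panic-related events could look like this:

```python
# Hypothetical sketch: scan the Confluence application log for phrases that
# have accompanied cluster panics in past incidents. The path and the phrase
# list are illustrative assumptions, not our production patterns.
SUSPECT_PHRASES = [
    "cluster panic",
    "removed from cluster",
    "HazelcastInstanceNotActiveException",
]

def suspicious_lines(log_path):
    with open(log_path, errors="replace") as log:
        for line in log:
            lowered = line.lower()
            if any(phrase.lower() in lowered for phrase in SUSPECT_PHRASES):
                yield line.rstrip()

for line in suspicious_lines(
    "/var/atlassian/application-data/confluence/logs/atlassian-confluence.log"
):
    print("Possible cluster stability event:", line)
```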
Non-alerted metrics
We don't set any alerts for the following metrics, but we monitor them regularly for abnormal spikes or dips. We also check them whenever other alerts get triggered, or if the Apdex starts to drop.
Metric we track | Monitoring practice |
---|---|
JVM memory | When the JVM starts running low on memory, we perform a heap dump and thread dump. We perform a rolling restart if either dump shows that a crash is imminent. |
Number of active HTTP threads. Having too many threads in constant use can indicate an application deadlock. | If abnormal spikes in this metric coincide with deadlocks or similar problems, we restart each affected node. If they don't, we just tune thread limits or take thread dumps for further investigation. See Health Check: Thread Limit for related information. |
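When JVM memory or thread metrics look unhealthy, the dumps mentioned above can be captured with the standard JDK tools. A hedged sketch follows; how the Confluence process ID is found and the output directory are assumptions for illustration.

```python
# Hypothetical sketch: capture a heap dump and a thread dump of the Confluence
# JVM with the standard JDK tools (jmap and jstack). The way the PID is supplied
# and the output directory are illustrative assumptions.
import subprocess
import time

def capture_dumps(pid, out_dir="/tmp"):
    stamp = time.strftime("%Y%m%d-%H%M%S")
    heap_file = f"{out_dir}/confluence-heap-{stamp}.hprof"
    thread_file = f"{out_dir}/confluence-threads-{stamp}.txt"

    # Heap dump of live objects only, in binary HPROF format.
    subprocess.run(["jmap", f"-dump:live,format=b,file={heap_file}", str(pid)], check=True)

    # Thread dump written to a text file for later analysis.
    with open(thread_file, "w") as out:
        subprocess.run(["jstack", str(pid)], stdout=out, check=True)

    return heap_file, thread_file

# Example: capture_dumps(12345)  # replace with the Confluence JVM's PID
```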