Jira Data Center cache replication

The cache replication in Jira Data Center 7.9 and later is asynchronous, which means that cache modifications aren’t replicated to other nodes immediately after they occur, but are instead added to local queues. They are then replicated in the background based on their order in the queue. 

This approach helps us to:

  • Improve the scalability of the cluster
  • Reduce the amount of cache inconsistencies between the nodes
  • Separate the replication from any cache modifications, simplifying and speeding up the whole process

Replicating cache modifications

Local queues

Before we can queue cache modifications, we need to create local queues on each of your nodes. There are separate queues created for each node, so that modifications are properly grouped and ordered.

The queues are created automatically. Whenever you add a node to the cluster, the remaining nodes detect it and create a new queue on their file system. The nodes retrieve information about the whole cluster from the database by using the following query:

select node_id, node_state from clusternode


The output of this query looks similar to this (you'll see more information, like port number or IP address):

node_idnode_state
Node 1ACTIVE
Node 2ACTIVE
Node 3ACTIVE
Node 4OFFLINE


Each node creates 10 separate queues on its file system for each node that is in the active state. We've chosen 10 queues to increase the throughput—we can't speed up the replication of a single modification, because there are a number of actions that must be completed, but we can replicate 10 modifications at the same time.

In a 3-node cluster with Node 1, Node 2, and Node 3, the following queues will be created:

On Node 1:

  • 10 queues for replicating modifications to Node 2
  • 10 queues for replicating modifications to Node 3

On Node 2:

  • 10 queues for replicating modifications to Node 1
  • 10 queues for replicating modifications to Node 3

On Node 3:

  • 10 queues for replicating modifications to Node 1
  • 10 queues for replicating modifications to Node 2


How does this look on the file system?

The queues are created in the <local-home-directory>/localq directory on each node. The contents of this directory look similar to this:

/localq directory on Node 1
queue_Node2_0_f5f366263dcc357e2720042f33286f8f
..
queue_Node2_9_f5f366263dcc357e2720042f33286f8f
queue_Node3_0_bbb747b1516b5225f1ec1c65887b39fc
..
queue_Node3_9_bbb747b1516b5225f1ec1c65887b39fc 


Once the queues are created, they're ready to queue cache modifications. Whenever you add a node to the cluster, another 10 queues for this node will be created on each existing node. The new node will create queues for all existing active nodes.

Adding cache modifications to local queues

After all the queues are in place, each caching event that occurs on a specific node can be added to the right queue.

Let’s use Node 1 as an example.

On Node 1:

  1. A caching event occurs (for example, you make changes to the permission scheme).
  2. The changes are made in the database, and the local cache related to this change is removed.
  3. A request to remove the cache is added to the following local queues, from which it will be replicated to the right nodes.
    • queue_Node2 for Node 2
    • queue_Node3 for Node 3.

Modifications occurring in a single thread will always be added to the same queue. This is to keep the order in which they'll be replicated in case the two events within a thread are dependant on each other.

Replicating cache modifications from local queues to other nodes

After the cache modifications are added to local queues, they're being handled by another thread. Each queue has a single thread responsible for:

  • Reading a cache modification request from the queue.
  • Delivering the cache modification request to the destination node over RMI.
  • Removing the cache modification request from the queue.


To use the same example where the cache modification was added to queue_Node2 and queue_Node3 on Node 1, the next steps for this modification are the following:

  • A thread responsible for handling queued modifications reads the modification from local queue queue_Node2, and tries to deliver it over RMI to Node 2.
  • Another thread responsible for handling queued modifications reads the modification from local queue queue_Node3, and tries to deliver it over RMI to Node 3.

If modifications are successfully replicated, they're removed from the queue. If the replication failed, errors will be written in the logs.

Monitoring cache replication

You can monitor cache replication by reviewing statistics that are written in the log file.  They’ll show you the size of the local queues, and whether cache modifications are successfully replicated or persisted in the queues for too long. In most cases, monitoring just a few parameters will tell you if the replication is working properly.

For more info, see Monitoring the cache replication.

Configuring cache replication

You can configure some options of cache replication, such as the maximum number of modifications in the queue or the frequency of saving the statistics.

Note that these are system properties. 


Show configuration options...
OptionDefault valueDescription
jira.cache.replication.localq.time.sync.tolerance.millis900 000 (15 minutes)

Sets the time sync tolerance between the nodes (milliseconds.)


jira.cache.replication.localq.queue.max.size100 000

Specifies the maximum number of cache modifications in a single local queue.

Deprecated since JIRA 8.16. Replaced with: com.atlassian.jira.stats.logging.interval
jira.cache.replication.localq.queue.stats.logging.interval.seconds

N

Changes the frequency of saving the snapshot statistics (in seconds).

More details

Expand the sections below to understand the cache replication in more detail.


Skipping cache modifications...

Cache modifications won't be added to the local queue if the queue has already reached the maximum size (100,000 modifications by default). This shouldn't happen if the replication is working properly, because after the cache modifications are replicated they're immediately removed from the queue.

If a modification isn't added to the queue, you can recognize it by the following entry in the log file:

Not enough space left for persisting localQCacheOp, usableSpaceInBytes: {}, estimateObjectSizeInBytes: {}. Skipping replication of this cache event.

You can also increase the default size of the queue. See Configuring cache replication above.

Serializing cache modifications...

Cache modification requests are serialized (converted into bytes) before being added to the queue, and deserialized when being read from the queue. We're using a standard Java serialization, the same mechanism is used when sending the requests over RMI.

Recoverable errors...

Recoverable errors are generally temporary problems, such as a node being offline for a short period of time or some network problems that stop the replication. All of these should pass once you resolve the problem, in which case the replication will resume and deliver the delayed modifications. 

You can recognize these errors by the following entry in the log file: 

Checked exception: {exception name} occurred when processing: {cache replication request} from cache replication queue: {queue name}, failuresCount: {}. Will retry indefinitely.

You can also see how many recoverable errors occurred by checking the sendCheckedExceptionCounter parameter in the statistics.

Unrecoverable errors...

Unrecoverable errors may result in losing some cache modifications. They're usually caused by the cache modification requests themselves. An example of such a request is when the destination node is unable to deserialize the modification because of a missing plugin or a different Jira version. 

To keep the cache replication going, these modifications are skipped after a few unsuccessful replications and aren't delivered to other nodes.

You can recognize these errors by the following entries in the log file:

Runtime exception: {exception-name} occurred when processing: {cache-replication-message} from cache replication queue: {queue-name}, failuresCount: {}/{}, error: {exception}
Abandoning sending: {cache-replication-message} from cache replication queue: {queue-name}, failuresCount: {}/{}. Removing from queue. Error: {exception}"

You can also see how many unrecoverable errors occurred by checking the sendRuntimeExceptionCounter parameter in the statistics.

If these problems persist...

It might result in your cluster getting out of sync. To solve this problem, you can clear the cache and rebuild it from scratch. To do this, shut down your nodes one by one and then start them again. The nodes will load all the values from the database and start building the new cache.

Sending requests over RMI...

To make the replication faster, we’re also caching the Ehcache’s CachePeer (object representing the cache on a different node), so we don’t have to create it every time a cache modification is needed, but rather reuse the cached version for all modifications.

When looking up the CachePeer in the remote RMI registry, the connection details are taken from the clusternode table in the database. They’re accurate, so the connection will work. However, when later replicating through its cached version, the connection details are set by the remote node, and it might happen that both the hostname and the port number are incorrect, and modifications can’t be replicated.

You can recognize this problem by the following entries in the log file:

Cache {cache-name} requested bootstrap but a CacheException occured.
java.rmi.ConnectIOException: Exception creating connection to: {destination_hostname}

To solve the problem, access each of your nodes and enter their connection details in the following properties:

System property

java.rmi.server.hostname

Cluster.properties file

ehcache.object.port

Last modified on Jan 15, 2021

Was this helpful?

Yes
No
Provide feedback about this article
Powered by Confluence and Scroll Viewport.