Crowd freezes when multiple applications make API calls at the same time
Symptoms
- Crowd server thread dumps show that more than one application is making calls to Crowd's API, and these calls are causing database congestion by not allowing efficient access to the LDAP-DB cache in Crowd. You can confirm this by identifying API calls that are in "WAIT for DB" state.
- Crowd 2.1 or newer versions are being used
Cause
Some API calls made by the applications return all of the cached data (for example, findAllGroupRelationships). For large LDAP directories cached in the database, several such calls arriving together can cause Crowd to "freeze" for some minutes.
Resolution
Crowd will always use all the resources provided to it. The memory, database connections, and CPU assigned to the server can be increased. However, this is not a good solution, since Crowd may reach the resource limit again.
Keeping this in mind, the correct approach is to make sure that no more than one application makes heavy API requests to Crowd at the same time. Also, ensure that the LDAP directory polling intervals never coincide.
Since all the applications use the Crowd integration client, which uses Ehcache, you can set different cache timeouts in each application's crowd-ehcache.xml file.
1. The LDAP-DB cache polling interval
Suggestion: stagger the cache intervals so that each differs from the next by 7 minutes.
Example:
Directory-1: 60 minutes
Directory-2: 67 minutes
Directory-3: 74 minutes
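The staggering above can be generated for any number of directories. The following is a minimal sketch, not part of Crowd itself; the function name staggered_intervals is hypothetical, and the base of 60 minutes and step of 7 minutes are simply the values from the example.

```python
def staggered_intervals(base_minutes, step_minutes, count):
    """Return `count` polling intervals, each `step_minutes` longer than the previous one."""
    return [base_minutes + i * step_minutes for i in range(count)]


# Reproduce the example above: three directories, 7 minutes apart.
for number, minutes in enumerate(staggered_intervals(60, 7, 3), start=1):
    print(f"Directory-{number}: {minutes} minutes")
```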
2. The application cache intervals defined in the <App>/WEB-INF/classes/crowd-ehcache.xml file
Suggestion: stagger the intervals so that each application differs from the next by 7 minutes.
Examples:
Application-1:
- timeToIdleSeconds="3600"
- timeToLiveSeconds="3600"
Application-2:
- timeToIdleSeconds="4020"
- timeToLiveSeconds="4020"
Application-3:
- timeToIdleSeconds="4440"
- timeToLiveSeconds="4440"
Within each application, all caches in the crowd-ehcache.xml file must have the exact same timeToIdleSeconds and timeToLiveSeconds values.
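The values in the example are the staggered minute intervals (60, 67, 74) converted to seconds. As a rough sketch, the conversion can be written as follows; the helper name ehcache_timeouts is hypothetical, and it only computes the two attribute values without reading or writing crowd-ehcache.xml.

```python
def ehcache_timeouts(interval_minutes):
    """Return matching timeToIdleSeconds and timeToLiveSeconds for one application."""
    seconds = interval_minutes * 60
    # Both attributes get the same value, as required for every cache in the file.
    return {"timeToIdleSeconds": seconds, "timeToLiveSeconds": seconds}


for number, minutes in enumerate([60, 67, 74], start=1):
    values = ehcache_timeouts(minutes)
    print(f"Application-{number}:")
    for attribute, value in values.items():
        print(f'- {attribute}="{value}"')
```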
This KB suggests seven minutes. However, you can use any offset that makes the intervals coincide less often, that is, one that gives the intervals a large least common multiple.
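To compare candidate offsets, you can compute the least common multiple of the intervals. The sketch below assumes, for simplicity, that all timers start at the same moment and fire exactly on schedule, which a real deployment will only approximate; the function name lcm_all is hypothetical.

```python
from functools import reduce
from math import gcd


def lcm_all(intervals):
    """Return the least common multiple of a list of positive integer intervals."""
    return reduce(lambda a, b: a * b // gcd(a, b), intervals)


# Identical intervals line up on every cycle.
print(lcm_all([60, 60, 60]))   # 60 minutes

# Staggered by 7 minutes: any two of them line up far less often,
# and all three line up only after 148740 minutes (about 103 days).
print(lcm_all([60, 67]))       # 4020 minutes
print(lcm_all([60, 67, 74]))   # 148740 minutes
```

Under these assumptions, a larger result means the refreshes overlap less frequently.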
In later versions of Confluence and JIRA, the cache interval can be changed via Confluence/JIRA Admin >> User Directories >> Crowd Server, using the Synchronisation Interval (minutes) field.