Builds running on Elastic agents start failing after upgrading to Bamboo 8+ versions
Platform notice: Server and Data Center only. This article only applies to Atlassian products on the Server and Data Center platforms.
Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Summary
After upgrading to Bamboo 8.0.0 or later, some builds running on Elastic Agents start being canceled, with the following error logged in the atlassian-bamboo.log file:
2022-05-18 17:06:59,091 ERROR [scheduler_Worker-4] [OrphanedBuildMonitorJob] Build PROJ-PLAN-JOB1-4 had to be cancelled: it was marked as in progress in DB but no agents were assigned to it.
Environment
- Bamboo 8+
- Elastic agents
Diagnosis
You should be able to find the following messages for the affected build plans in the server-side log file (atlassian-bamboo.log):
2022-05-18 17:06:59,091 ERROR [scheduler_Worker-4] [OrphanedBuildMonitorJob] Build PROJ-PLAN-JOB2-4 had to be cancelled: it was marked as in progress in DB but no agents were assigned to it.
2022-05-18 17:31:59,100 ERROR [scheduler_Worker-7] [OrphanedBuildMonitorJob] Build PROJ-PLAN2-JOB1-3 had to be cancelled: it was marked as in progress in DB but no agents were assigned to it.
2022-05-18 18:43:59,080 ERROR [scheduler_Worker-8] [OrphanedBuildMonitorJob] Build PROJ-TSTPLAN-TST-16 had to be cancelled: it was marked as in progress in DB but no agents were assigned to it.
2022-05-18 18:56:59,081 ERROR [scheduler_Worker-10] [OrphanedBuildMonitorJob] Build PROJ-TST-JOB1-1 had to be cancelled: it was marked as in progress in DB but no agents were assigned to it.
2022-05-18 19:03:59,102 ERROR [scheduler_Worker-7] [OrphanedBuildMonitorJob] Build PROJ-PRD-AUX-4 had to be cancelled: it was marked as in progress in DB but no agents were assigned to it.
On the Elastic agent side (bamboo-elastic-agent.out), you should see that the JVM is restarted in the middle of the build because it was slow to respond to the supervisor's ping requests:
INFO | jvm 1 | 2022/05/18 19:33:37 | 2022-05-18 19:33:37,931 INFO [0-BAM::Elastic Agent on i-0fbde9f53e6967XXX::Agent:pool-3-thread-1] [ProcessServiceImpl] Beginning to execute external process for build 'Plan - Plan - Test plan for troubleshooting purposes #4 (PROJ-PLAN-JOB1-4)'
... running command line:
C:\Users\Bamboo\bamboo-agent-home\temp\PROJ-PLAN-JOB1-4-ScriptBuildTask-2387492739842798374.bat
... in: C:\build\PROJ-PLAN-JOB1-4\zone\
INFO | jvm 1 | 2022/05/18 19:35:01 | 2022-05-18 19:35:01,318 INFO [remoteEventRebroadcasterMessageListenerContainer-1]
(...)
INFO | jvm 1 | 2022/05/18 20:00:02 | 2022-05-18 20:00:02,871 INFO [remoteEventRebroadcasterMessageListenerContainer-1] [RemoteEventRebroadcasterMessageListener] ExecutableQueueUpdate: addToQueue, agents known to be affected: []
INFO | jvm 1 | 2022/05/18 20:01:14 | 2022-05-18 20:01:14,859 INFO [remoteEventRebroadcasterMessageListenerContainer-1] [RemoteEventRebroadcasterMessageListener] ExecutableQueueUpdate: addToQueue, agents known to be affected: []
INFO | jvm 1 | 2022/05/18 20:01:17 | 2022-05-18 20:01:17,812 INFO [remoteEventRebroadcasterMessageListenerContainer-1] [RemoteEventRebroadcasterMessageListener] ExecutableQueueUpdate: addToQueue, agents known to be affected: []
STATUS | wrapper | 2022/05/18 20:04:41 | Pinging the JVM took 29 seconds to respond.
STATUS | wrapper | 2022/05/18 20:04:41 | Pinging the JVM took 25 seconds to respond.
STATUS | wrapper | 2022/05/18 20:04:41 | Pinging the JVM took 20 seconds to respond.
STATUS | wrapper | 2022/05/18 20:04:41 | Pinging the JVM took 15 seconds to respond.
STATUS | wrapper | 2022/05/18 20:04:41 | Pinging the JVM took 10 seconds to respond.
INFO | jvm 1 | 2022/05/18 20:05:16 | 2022-05-18 20:05:15,972 INFO [remoteEventRebroadcasterMessageListenerContainer-1] [RemoteEventRebroadcasterMessageListener] ExecutableQueueUpdate: addToQueue, agents known to be affected: []
INFO | jvm 1 | 2022/05/18 20:05:33 | 2022-05-18 20:05:33,097 INFO [remoteEventRebroadcasterMessageListenerContainer-1] [RemoteEventRebroadcasterMessageListener] ExecutableQueueUpdate: addToQueue, agents known to be affected: []
(...)
INFO | jvm 1 | 2022/05/18 20:18:55 | 2022-05-18 20:18:55,590 INFO [remoteEventRebroadcasterMessageListenerContainer-1] [RemoteEventRebroadcasterMessageListener] ExecutableQueueUpdate: invalidate executables: Elastic Agent on i-0c4fe028111ddd334, agents known to be affected: [928392832]
STATUS | wrapper | 2022/05/18 20:19:40 | Pinging the JVM took 11 seconds to respond.
INFO | jvm 1 | 2022/05/18 20:20:03 | 2022-05-18 20:20:03,143 INFO [remoteEventRebroadcasterMessageListenerContainer-1] [RemoteEventRebroadcasterMessageListener] ExecutableQueueUpdate: addToQueue, agents known to be affected: []
(...)
INFO | jvm 1 | 2022/05/18 20:30:18 | 2022-05-18 20:30:18,396 INFO [remoteEventRebroadcasterMessageListenerContainer-1] [RemoteEventRebroadcasterMessageListener] ExecutableQueueUpdate: addToQueue, agents known to be affected: []
STATUS | wrapper | 2022/05/18 20:31:13 | JVM appears hung: Timed out waiting for signal from JVM. Restarting JVM.
STATUS | wrapper | 2022/05/18 20:31:13 | JVM exited after being requested to terminate.
INFO | wrapper | 2022/05/18 20:31:33 | Wrapper Process has not received any CPU time for 13 seconds. Extending timeouts.
STATUS | wrapper | 2022/05/18 20:31:33 | Launching a JVM...
INFO | jvm 2 | 2022/05/18 20:31:37 | WrapperManager: Initializing...
Please see Viewing an elastic instance for more details on how to check the logs for Elastic Agents.
Cause
This happens because, starting from Bamboo 8.0, Elastic Agents use the agent supervisor to monitor the agent process. If an elastic agent crashes, the agent supervisor automatically attempts to restart it, and the Elastic Agent then tries to reconnect to a Bamboo node.
As a side effect, if your build requires the JVM to perform a long blocking task, the supervisor may wrongly conclude that the agent JVM is stuck and restart it in the middle of build execution, causing those builds to end with inconsistent results.
You can find more details about the agent supervisor in the Bamboo documentation.
Solution
To work around this situation, change the agent supervisor's default action for when the JVM fails to respond to its ping requests by adding a startup script to the elastic image (AMI) configuration.
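The snippet below is a minimal sketch for a Linux-based image; the installation path of the elastic agent (and therefore the location of its wrapper configuration file) is an assumption and may differ on your AMI, so adjust it to match your image:

#!/bin/bash
# Change the supervisor's ping-timeout action from RESTART to DUMP so that
# a JVM that looks hung produces a thread dump instead of being restarted.
# NOTE: the wrapper.conf path below is an assumption for a Linux-based AMI;
# adjust it to wherever the elastic agent is installed on your image.
echo "wrapper.ping.timeout.action=DUMP" >> /opt/bamboo-elastic-agent/conf/wrapper.conf

For Windows-based images, append the same wrapper.ping.timeout.action=DUMP line to the agent's wrapper configuration file in the equivalent location.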
The default action for wrapper.ping.timeout.action is RESTART, which means the agent supervisor restarts the JVM whenever it suspects it is hung. Changing it to DUMP, as suggested above, makes the supervisor request a thread dump instead, which is written to the bamboo-elastic-agent.out log file. This is useful if you need to investigate why the JVM is failing to respond to the supervisor's ping requests.
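To confirm that the startup script took effect, you can inspect the wrapper configuration on a running elastic instance, for example (the path is again an assumption for a Linux-based image):

# Should print the line appended by the startup script
grep "wrapper.ping.timeout.action" /opt/bamboo-elastic-agent/conf/wrapper.conf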