EC2 Build Slave Monitoring

We have been experiencing various issues with hoarded and stuck Jenkins build slave EC2 instances. Because these cases recur regularly, they have a direct monetary impact on Linaro. After various attempts to resolve them, it was decided to add slave monitoring at a level completely external to Jenkins. A deliberately simple Python script was therefore developed. It regularly queries the active EC2 instances, selects those that represent build slaves, and then applies a series of checks to decide whether any slave is stuck. If such a slave is detected, it is reported via email so that the Infrastructure team engineers in charge can review the situation and take action. Note that not every stuck (or, conversely, not stuck) slave can be detected reliably, so there may be gaps in coverage as well as false positives. It is expected that human monitoring will help elaborate criteria for robust detection of stuck slaves, at which point such slaves will be shut down automatically by the script.
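
The select-then-check flow described above can be sketched roughly as follows. This is an illustration only, not the deployed script: the instance records are plain dictionaries standing in for whatever the EC2 API returns, and the name prefix, the uptime threshold, and all function names are assumptions made for the example. The real criteria for a stuck slave are still being worked out, as noted above.

```python
from datetime import datetime, timedelta, timezone

# Both values are hypothetical; the real script's criteria may differ.
SLAVE_NAME_PREFIX = "jenkins-slave"
MAX_UPTIME = timedelta(hours=6)


def is_build_slave(instance):
    """Select instances whose name marks them as Jenkins build slaves."""
    return instance.get("name", "").startswith(SLAVE_NAME_PREFIX)


def is_stuck(instance, now):
    """One example check: a running slave that has been up for too long."""
    uptime = now - instance["launch_time"]
    return instance["state"] == "running" and uptime > MAX_UPTIME


def find_stuck_slaves(instances, now=None):
    """Return the build slaves that fail the checks."""
    if now is None:
        now = datetime.now(timezone.utc)
    return [i for i in instances if is_build_slave(i) and is_stuck(i, now)]


def format_report(stuck, now):
    """Build the plain-text body of the notification email."""
    lines = ["Possibly stuck build slaves:"]
    for inst in stuck:
        uptime = now - inst["launch_time"]
        lines.append("  %s (%s) up for %s" % (inst["id"], inst["name"], uptime))
    return "\n".join(lines)
```

In a sketch like this, each further check would be one more small predicate combined in find_stuck_slaves, so criteria can be tightened as human review of the email reports shows which ones are reliable.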

The script is here and is deployed on android-build.linaro.org as a cron job running every 20 minutes.

Platform/Systems/EC2SlaveMonitoringScript (last modified 2014-06-24 17:47:35)