[cross-project-issues-dev] Build server outage


While investigating *why* the build server keeled over at 1:52am ET, I found these log entries interesting:

1.  CRON: (pkimlach) CMD /opt/public/technology/higgins/site-build/
-- that job runs every minute, and apparently looks for (and launches) a build

2. 01:33:05 build: Accepted publickey for nickb from

3. 01:40:08 build: (mknauer) CMD (/bin/bash /shared/technology/epp/epp_build/34/org.eclipse.epp/releng/org.eclipse.epp.config/

Although none of those killed the server, the combined efforts of many probably led to death by a thousand cuts.  At that time, the server was likely also busy signing, running a build for WTP, serving a few web pages, NNTP news articles, updating, etc.

To ensure the build server provides adequate service at all times, please consider the following:

1. Make sure your cron jobs detect the presence of an unfinished job!  This is especially true of those jobs that run every minute, 5 minutes, etc.  Although your job may only take seconds to run, when the server is very busy it could take minutes.  By then, you have 6 jobs running, which slows the server even more, spawning more jobs, until 300 jobs are running and the server explodes.
if [ ! -f "$LOCKFILE" ]; then
        touch "$LOCKFILE"
        # do the stuff
        rm -f "$LOCKFILE"
else
        echo "Another job is running, and I'm so confused.  Aborting this one."
fi

2. The motd specifically states to not run builds between 00:00 and 2:00am local time.  Everyone assumes that our servers are perfectly idle at night, which is not true.

3. Before setting a cron job at some random time, observe the servers' load average over a 24-hour period, and choose your time accordingly.  Looking at the 24-hour graph below, it seems 6am - 8am local time, and 6:00pm - midnight are quite good.

4. Our servers are bored on Saturdays and Sundays!  Perfect time to run those CVS cleanup tasks, weekly builds, etc.

5. When running continuous builds, be considerate to others -- set your jobs as lower priority.  Props to AspectJ for doing this already by launching:
nohup nice ../cc271/

6. Ask the server if now is a good time to do something:
while [ $(awk -F. '{print $1}' /proc/loadavg) -gt  8 ]; do echo "Too busy to build.  Going to sleep."; sleep 60; done
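
Tips 1, 5 and 6 can be combined into one small wrapper.  What follows is only a sketch under a few assumptions, not something we run on the build server: the names LOCKDIR, MAX_LOAD, MAX_WAITS, SLEEP_SECS, LOADAVG_FILE and the helper functions are all made up for illustration.  It swaps the lock file for a lock directory, because mkdir creates-or-fails in a single step, and it gives up after a bounded number of sleeps instead of waiting forever.

```shell
#!/bin/bash
# Hypothetical names throughout: none of these variables or functions
# come from the original message.
LOCKDIR="${LOCKDIR:-/tmp/my-build.lock}"        # assumed lock location
MAX_LOAD="${MAX_LOAD:-8}"                       # same threshold as the loop above
MAX_WAITS="${MAX_WAITS:-30}"                    # give up after this many sleeps
SLEEP_SECS="${SLEEP_SECS:-60}"
LOADAVG_FILE="${LOADAVG_FILE:-/proc/loadavg}"

# Integer part of the 1-minute load average, e.g. "3" for "3.27 ...".
current_load() {
    awk -F. '{print $1}' "$LOADAVG_FILE"
}

# Sleep while the load is above MAX_LOAD; fail once MAX_WAITS is exceeded.
wait_for_quiet_server() {
    local tries=0
    while [ "$(current_load)" -gt "$MAX_LOAD" ]; do
        tries=$((tries + 1))
        if [ "$tries" -gt "$MAX_WAITS" ]; then
            return 1
        fi
        echo "Too busy to build.  Going to sleep."
        sleep "$SLEEP_SECS"
    done
    return 0
}

# mkdir either creates the directory or fails, so two jobs starting at
# the same moment cannot both take the lock.
acquire_lock() {
    mkdir "$LOCKDIR" 2>/dev/null
}

release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}

# Run the given command at the lowest priority, but only if no other copy
# holds the lock and the server calms down within the allowed wait.
run_politely() {
    if ! acquire_lock; then
        echo "Another job is running.  Aborting this one."
        return 1
    fi
    local status=0
    if wait_for_quiet_server; then
        nice -n 19 "$@" || status=$?
    else
        echo "Server stayed busy.  Giving up."
        status=1
    fi
    release_lock
    return "$status"
}
```

A cron script would source these functions and call something like `run_politely ./build.sh`, where build.sh stands in for your own job.  One caveat: if the job is killed outright, the lock directory is left behind and has to be removed by hand (or by a trap you add yourself).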

If you have any questions, or if you'd like more tips on avoiding server demolition, please don't hesitate to ask.  We're here to help.

Denis Roy
Manager, IT Infrastructure
Eclipse Foundation, Inc.
