Re: [cross-project-issues-dev] Build server outage

David M Williams wrote:

I must say I am very disappointed that WTP is not higher on the list of suspects ... we'll try harder in the future. :)
Well, I see you log into the build server exactly every minute, but I usually pin the world's problems on WTP, so I figured I'd cut you some slack this time.

This time, I'm simply blaming it on Nick Boldt  :)



But, seriously ... I appreciate the list of tips.

And, as someone who does his best work between midnight and 2 AM, I am constantly amazed at how many people think that is a good time to schedule something on the server, under the assumption that everyone's asleep :)






From: Denis Roy <denis.roy@xxxxxxxxxxx>
To: Cross project issues <cross-project-issues-dev@xxxxxxxxxxx>
Date: 06/05/2008 11:41 AM
Subject: [cross-project-issues-dev] Build server outage





Folks,

In investigating *why* the build server keeled over at 1:52am ET, I found these artifacts interesting:

1.  CRON: (pkimlach) CMD /opt/public/technology/higgins/site-build/finder.sh
-- that job runs every minute, and apparently looks for (and launches) a build

2. 01:33:05 build: Accepted publickey for nickb from 209.217.126.109

3. 01:40:08 build: (mknauer) CMD (/bin/bash /shared/technology/epp/epp_build/34/org.eclipse.epp/releng/org.eclipse.epp.config/startEPP34.sh


Although none of those killed the server, the combined efforts of many probably led to death by a thousand cuts.  At that time, the server was likely also busy signing, running a build for WTP, serving a few web pages, NNTP news articles, updating PlanetEclipse.org, etc.

To ensure the build server provides adequate service at all times, please consider the following:

1. Make sure your cron jobs detect the presence of an unfinished job!  This is especially true of those jobs that run every minute, 5 minutes, etc.  Although your job may only take seconds to run, when the server is very busy it could take minutes.  By then, you have 6 jobs running, which slows the server even more, spawning more jobs, until 300 jobs are running and the server explodes.

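#!/bin/bash
LOCKFILE=/tmp/technology.babel.minutejob
if [ ! -f "$LOCKFILE" ]; then
        touch "$LOCKFILE"
        # Remove the lock when this run exits, even if the job fails.
        # Only the run that created the lock may remove it.
        trap 'rm -f "$LOCKFILE"' EXIT
        # do the stuff
else
        echo "Another job is running, and I'm so confused.  Aborting this one."
fi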
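
The test-then-touch check above leaves a small window in which two jobs can both see no lock file. A minimal sketch of one way to close that window, assuming a lock directory is acceptable: mkdir either creates the directory or fails, in a single step. The LOCKDIR path and the run_exclusively function name are made up for illustration.

```shell
#!/bin/bash
# Hypothetical lock location; override LOCKDIR to suit your project.
LOCKDIR="${LOCKDIR:-/tmp/technology.babel.minutejob.lock}"

# run_exclusively COMMAND [ARGS...]
# Runs COMMAND only if this shell manages to create the lock directory.
run_exclusively() {
    if mkdir "$LOCKDIR" 2>/dev/null; then
        "$@"
        local status=$?
        # Release the lock; only the holder ever gets here.
        rmdir "$LOCKDIR"
        return "$status"
    else
        echo "Another job is running.  Aborting this one." >&2
        return 75
    fi
}
```

A cron script could then call something like `run_exclusively /path/to/your-job.sh`, where the job path is a placeholder. If the machine dies mid-run, the directory stays behind and has to be removed by hand.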

2. The motd specifically states to not run builds between 00:00 and 2:00am local time.  Everyone assumes that our servers are perfectly idle at night, which is not true.

3. Before setting a cron job at some random time, observe the servers' load average over a 24-hour period, and choose your time accordingly.  Looking at the 24-hour graph below, it seems 6am - 8am local time, and 6:00pm - midnight are quite good.

https://dev.eclipse.org/committers/loadstats/showmonthstats.php?server=/home/data/common/monitor/loadstats/build&year=2008&month=6&day=4

4. Our servers are bored on Saturdays and Sundays!  Perfect time to run those CVS cleanup tasks, weekly builds, etc.

5. When running continuous builds, be considerate to others -- set your jobs as lower priority.  Props to AspectJ for doing this already by launching :
nohup nice ../cc271/cruisecontrol.sh
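
For jobs that are not launched by hand, the same idea can be wrapped in a small helper. This is only a sketch: the run_low_priority name is made up, and the niceness of 19 (the lowest priority on Linux) is an assumption, not a server policy.

```shell
#!/bin/bash
# run_low_priority COMMAND [ARGS...]
# Runs COMMAND at the lowest CPU priority so interactive users and
# other builds get the processor first.
run_low_priority() {
    nice -n 19 "$@"
}
```

For example, `run_low_priority ./build.sh` (with build.sh standing in for your own script) behaves like the plain command, just at lower priority.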

6. Ask the server if now is a good time to do something:
while [ $(awk -F. '{print $1}' /proc/loadavg) -gt  8 ]; do echo "Too busy to build.  Going to sleep."; sleep 60; done
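
The loop above waits forever if the server never calms down. A sketch of a bounded variant follows, assuming you would rather skip a build than queue indefinitely. The function names, the LOADAVG_FILE override and the thresholds are illustrative, and /proc/loadavg exists on Linux only.

```shell
#!/bin/bash
# Print the integer part of the 1-minute load average.
# LOADAVG_FILE can point at another file for testing.
load_int() {
    awk -F. '{print $1}' "${LOADAVG_FILE:-/proc/loadavg}"
}

# wait_for_quiet MAX_LOAD MAX_TRIES SLEEP_SECONDS
# Returns 0 once the load is at or below MAX_LOAD, or 1 after
# MAX_TRIES checks that all found the server too busy.
wait_for_quiet() {
    local max_load=$1 tries=$2 pause=$3
    while [ "$(load_int)" -gt "$max_load" ]; do
        tries=$((tries - 1))
        if [ "$tries" -le 0 ]; then
            echo "Server still busy.  Giving up." >&2
            return 1
        fi
        echo "Too busy to build.  Going to sleep."
        sleep "$pause"
    done
}
```

Something like `wait_for_quiet 8 30 60 && ./build.sh` would check once a minute for up to half an hour before giving up; build.sh is a placeholder for your own script.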


If you have any questions, or if you'd like more tips on avoiding server demolition, please don't hesitate to ask.  We're here to help.

--
Denis Roy
Manager, IT Infrastructure
Eclipse Foundation, Inc.  --  
http://www.eclipse.org/

_______________________________________________
cross-project-issues-dev mailing list
cross-project-issues-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/cross-project-issues-dev



-- 
Denis Roy
Manager, IT Infrastructure
Eclipse Foundation, Inc.  --  http://www.eclipse.org/
Office: 613.224.9461 x224 (Eastern time)
Cell: 819.210.6481
denis.roy@xxxxxxxxxxx
