David, thanks for the Linux 101. Let's graduate to
Linux 102 -- other bottlenecks. The build server CPU is only one
of four potential reasons a build may be slow:
Reason 2: RAM bottleneck
If everyone maintains a 2GB Java process, the build server will a) have
less memory for disk cache and b) eventually begin to swap. With top,
you can determine available memory by looking at these two lines:
Mem: 15924728k total, 15422960k used, 501768k free, 599732k buffers
Swap: 4196344k total, 180k used, 4196164k free, 9080944k cached
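In this case, although about 15.4G is "used" and only 500M is "free",
9.08G of that is file cache, which the kernel hands back as soon as
processes need the memory, and swap is essentially untouched (180k).
This is good.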
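A minimal sketch of that arithmetic, assuming the older procps layout of top's Mem:/Swap: lines quoted above. The helper name reclaimable_kb is made up for illustration and is not a standard tool. It adds "free", "buffers" and "cached", which is roughly the memory the kernel could give to new processes:

```shell
# reclaimable_kb (hypothetical helper): read top's "Mem:" and "Swap:" lines
# on stdin and print free + buffers + cached, in kilobytes.
reclaimable_kb() {
    awk '{
        for (i = 2; i <= NF; i++) {
            label = $i
            sub(/,$/, "", label)                 # drop the trailing comma
            # "free" also appears on the Swap: line, so only count it for Mem:
            if (label == "free" && $1 == "Mem:") sum += $(i - 1)
            if (label == "buffers" || label == "cached") sum += $(i - 1)
        }
    } END { print sum + 0 }'
}

# The two lines quoted above: 501768 + 599732 + 9080944
printf '%s\n' \
    'Mem: 15924728k total, 15422960k used, 501768k free, 599732k buffers' \
    'Swap: 4196344k total, 180k used, 4196164k free, 9080944k cached' |
    reclaimable_kb    # prints 10182444
```

Newer versions of top format these lines differently, so treat the parsing as tied to the layout shown here.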
Reason 3: Disk bottleneck
When you run a build, you're using all kinds of disk resources: CVS/SVN
disks, workspace disks, download.eclipse.org disks, /shared (or temp)
disks, etc. The 'nice' command is not aware of how busy any particular
disk subsystem is, so even if you nice +19, you are still using many
disk resources.
If you're using top, you can determine how busy the disks are
by looking at these two clues:
Cpu(s): 22.9%us, 2.9%sy, 0.0%ni, 68.6%id, 0.9%wa, 0.1%hi, 0.3%si, 4.2%st
CPU time spent in IO Wait-----------------^^^^^^
16944 hudsonbu 17 0 818m 23m 6076 D 40 0.2 0:00.81 /opt/public/common/ibm-java2-p
Look here ------------------------^
Processes in "D" state are completely blocked, waiting for I/O <-- bad
Processes in "R" state are Running, and using CPU cycles
Processes in "S" state are in interruptible Sleep
The higher the %wa value is, the more the build
server's CPUs are wasting their time waiting for I/O. In this case,
since our Gigabit LAN is far from saturated, you can be assured that
the IO Wait is related to one (or more) disk subsystems.
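The two clues can also be pulled out with a couple of small filters. This is only a sketch: iowait_pct and blocked_procs are hypothetical helper names, the first assumes the "Cpu(s):" layout quoted above, and the second assumes rows of "PID USER STATE COMMAND".

```shell
# iowait_pct (hypothetical helper): print the %wa figure from a top
# "Cpu(s):" line read on stdin.
iowait_pct() {
    awk '/^Cpu\(s\):/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /%wa,?$/) {
                v = $i
                sub(/%wa,?$/, "", v)    # strip the "%wa," suffix
                print v
            }
    }'
}

# blocked_procs (hypothetical helper): from "PID USER STATE COMMAND" rows
# on stdin, keep the processes whose state starts with D (blocked on I/O).
blocked_procs() {
    awk '$3 ~ /^D/ { print $1, $2, $4 }'
}

printf '%s\n' \
    'Cpu(s): 22.9%us, 2.9%sy, 0.0%ni, 68.6%id, 0.9%wa, 0.1%hi, 0.3%si, 4.2%st' |
    iowait_pct        # prints 0.9

printf '%s\n' '16944 hudsonbu D java' '12252 egwin S java' |
    blocked_procs     # prints 16944 hudsonbu java
```

On a Linux box with procps, feeding blocked_procs from `ps -e -o pid= -o user= -o stat= -o comm=` should give a live list, though command names containing spaces would be cut at the first word.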
Reason 4: Network bottleneck
Since build.eclipse.org shares a Gigabit switch with everything
else at Eclipse.org, our internal network is not a source of
bottleneck. Yet.
This concludes today's Linux 102 lesson. There won't be a quiz.
Denis
On 03/17/2010 05:00 PM, David M Williams wrote:
> Ah, so wtpBuild is in fact the only one who is nice ....
Thanks for pointing this out ... we will strive to live up to community
norms and end this aberrant behavior at once! :)
Well, we are probably leading the way because we were (are?) leading the
way in hogging the build machine in the first place. Do let me know if
you see "wtpBuild" misbehaving.
More constructively, for a little Linux 101: when I started using 'nice'
I had a very hard time getting all the "commands" and "arguments" to
pass through as expected, but finally discovered the magic arguments
string ("$@") ... it is somehow treated "special" by the interpreter and
ends up "reconstructing" the right arguments with various spaces and
quotes all preserved as intuitively expected.
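A small sketch of why "$@" behaves this way: when quoted, the shell expands it to one word per original argument, whereas an unquoted $* is split again on whitespace. The function names below are made up for the demonstration.

```shell
# count_args: print how many arguments it received.
count_args() { echo "$#"; }

# Forward the caller's arguments in two different ways.
with_quoted_at()     { count_args "$@"; }   # one word per original argument
with_unquoted_star() { count_args $*; }     # re-split on whitespace

with_quoted_at     "-Dname=two words" build    # prints 2
with_unquoted_star "-Dname=two words" build    # prints 3
```

The second call sees three arguments because "-Dname=two words" is broken apart at the space, which is exactly the kind of surprise a wrapper script wants to avoid.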
In our "runAnt" script, I end with
    exec nice --adjustment 15 "${directory_variable}/ant.sh" "$@"
It runs the ant script (as before) but at the lower priority (and
anything ant spawns is at that same lower priority).
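The inheritance is easy to check, since 'nice' with no arguments prints the niceness it is running at. A sketch, with niceness_seen_by_children as a made-up helper name:

```shell
# niceness_seen_by_children ADJ (hypothetical helper): start a shell with the
# given niceness adjustment and print the niceness seen by that shell's child
# and by its grandchild, one per line.
niceness_seen_by_children() {
    nice -n "$1" sh -c 'nice; sh -c nice'
}

# Started from a shell at niceness 0, this prints 15 twice: the adjustment
# applied to the outer shell carries over to everything it spawns.
niceness_seen_by_children 15
```

If the calling shell is already niced, the adjustment is added to its current value and capped at 19, so the numbers printed will differ accordingly.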
We actually run our "build server" at normal priority, but it
doesn't do much and (most) jobs it kicks off are at the lower priority.
In case anyone finds that helpful.
I'd strongly urge all "tests" (at least) to be run at lower priority
like +15 (yes, higher numbers mean lower priority ... increasing
niceness to others, I guess). I settled on "15" because anything lower
(like 5, 10) didn't seem to make any difference at all, and things
higher (e.g. 20) seemed to make a really noticeable difference. If
you're worried you'll run too slow: under "normal" load, we still
complete in the same time ... but, under heavy load, we take maybe 25%
longer, which is the way it should be (especially for "tests").
Offhand, I'd say anything that takes over 60 minutes to complete should
be run at lower priority, and let those little 10-20 minute jobs (still)
finish quickly. [All based on informal observations ... I'm sure others
might have different, better advice.]
HTH
Ah, so wtpBuild is in fact the only one who is nice (aside from the
jarsigner)?
- thomas
On 03/17/2010 06:51 PM, Denis Roy wrote:
You're seeing "15" as the nice value, not -15.
Only root can lower the nice value below zero.
Denis
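For anyone reading top rows like the ones quoted in this thread, the third and fourth columns are PR and NI. A sketch that labels them, assuming the default column order (PID USER PR NI ...); show_nice is a made-up helper name:

```shell
# show_nice (hypothetical helper): label the USER, PR and NI columns of top
# process rows read on stdin.
show_nice() {
    awk '{ printf "%s PR=%s NI=%s\n", $2, $3, $4 }'
}

printf '%s\n' \
    '21771 wtpBuild 30 15 583m 89m 7408 S 145 0.6 0:23.77 java' \
    '12252 egwin 20 0 668m 168m 11m S 115 1.1 0:09.27 java' |
    show_nice
# prints:
#   wtpBuild PR=30 NI=15
#   egwin PR=20 NI=0
```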
On 03/17/2010 01:38 PM, Thomas Hallgren wrote:
While I'm at it, I should also complain about this:
21771 wtpBuild 30 15 583m 89m 7408 S 145 0.6 0:23.77 java
Perhaps wtpBuild was inadvertently started with a negative nice value,
i.e., nice -n -15? The effect of that is that it's not so nice :-). It
tries to steal all resources that are available.
- thomas
On 03/17/2010 06:32 PM, Thomas Hallgren wrote:
Over the last couple of weeks, I've done a 'top' from time to time when
I feel that my builds take longer than they should. Very often, I see
this at the top:
12252 egwin 20 0 668m 168m 11m S 115 1.1 0:09.27 java
A hint to egwin and others that run very heavy builds. The message of
the day on the build machine states:
"If you run continuous builds, you should start your shell processes
with nice -n 10 (command) to be kind to others."
Another entry that isn't that uncommon at the top is:
29702 hudsonbu 17 0 584m 101m 7372 R 124 0.7 0:25.09 java
which of course raises the question, why isn't Hudson running its jobs
with nice -n 10?
The jarsigner seems to be one of the few that actually does this, and my
builds always wait a _very_ long time for it to complete.
Regards,
Thomas Hallgren
_______________________________________________
cross-project-issues-dev mailing list
cross-project-issues-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/cross-project-issues-dev