COSMOS for high-performance computing [message #570658]
Thu, 15 November 2007 00:07
Originally posted by: randal.lanl.gov
I lead a project at Los Alamos National Laboratory charged with revamping
(replacing) our current monitoring infrastructure for HPC systems. Our
environment consists of several Linux clusters of 1,000 to 10,000 nodes each, and our definition
of monitoring is real-time alerting, system event investigation, and
reporting of system interrupts in some detail.
Our requirements documentation identifies several concepts in common with
the COSMOS project--the importance of a system model, for instance.
However, we're having trouble pulling out the details from current
documentation, and the June 2008 general release date is problematic for us.
We're currently talking with GroundWork and Zenoss (only one of whom seems
to be involved with COSMOS) about our extension of one of their
infrastructures to meet our needs. Is COSMOS release 0.4 something we
should consider as a basis for a project that needs to provide software
used in a production HPC environment, or should we just not spend the time?
Regardless of the answer to that question, what is the proper mechanism
for understanding the core COSMOS principles (other than what we glean
from the Eclipse site)? For instance, is an HPC environment an eventual
potential target, or is the focus on networks and application servers?
There seem to be some biases (each piece of data is atomically relevant,
with little room for higher-level correlations, for instance) in all the
products/infrastructures we've surveyed, and I personally would like to
understand whether the biases are real or if we just misunderstand some