[ptp-dev] Runtime System Status Update
I committed some code earlier which brings me up to this status point.
The changelog includes some of these comments, but I'm adding some more
here about what needs to be done next.
Where we are: You can now run OMPI jobs through PTP. When the job
starts we are notified of the job identifier, which I then use to
populate the runtime model - signifying that our Universe knows of one
new Job. I have changed the Job viewer - it now lists the jobs on the
left and, when you click on one, shows some statistics for that Job on
the right. The information right now is minimal, but we'll obviously add
more in the future. You can also select a job and click the
Terminate icon to send a kill message to the Control System to kill that
job. This identifies the job and sends it down to the subsystem
correctly; however, at this time I'm not actually doing a kill in OMPI.
What we aren't: There are a few immediate problems and things to do.
* I am capturing the correct jobID at the control layer but am not
yet actually killing the job with OMPI. I believe I know how to
do this, so it should be easy and is likely the next step.
* I can't yet get information about the Job that I've just started
except its JobID. I believe I know how to get some basic
information, but what I really need is the ID of each process that
started and, for each process, the ID of the Node it is running
on. It's unclear to me at this time whether this is just
something I don't know how to do yet or whether it's
unimplemented in ORTE. I'll be determining this in the near
future.
* (BIGGEST PROBLEM/ANNOYANCE) I can start the ORTE daemon from PTP
through the JNI interface and I can also tell it to clean itself
up and stop. When I do this I can see the process start with the
correct args, and when I stop it I can see it clean itself up
perfectly. Once it's started I can communicate with it - talking
to the registry. However, I cannot spawn an MPI job. What's
worse is that I also cannot spawn an MPI job at the command line
if I've started the daemon this way. I've tried several exec()
calls as well as the cheesy system() function - all with the
same results. The MPI programs just return immediately. If I
debug the program I can see that it gets assigned a JobID by ORTE,
but ORTE immediately sends me an ABORT and a TERMINATED message,
saying the job is done. Frankly, I don't believe the job ever starts.
* Greg has written some code that does state-of-health monitoring on
bproc. I need to integrate this with the JNI library and,
subsequently, into the UI so the UI actually reflects the
cluster it sees.
What you can do: You can test these improvements if you want. Sadly,
you have to start the orted on the console before you run PTP. If
you do this, it works perfectly. You can start MPI jobs all day,
concurrently, watch the JobViewer fill up, and look at the messages
coming out of the processes' stdout/stderr. You can pretend to kill a
job and notice in the print statements that it finds the right job. You
can play around with the preferences and configure the ORTEd, and if you
do things wrong or get errors from the OMPI/JNI layer you'll even get
nice, helpful popup error boxes.
--
-- Nathan
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndebard@xxxxxxxx
---------------------------------------------------------------------