[ptp-dev] Runtime System Status Update
I committed some code earlier which brings me up to this status point.
The changelog includes some of these comments, but I'm adding some more
here about what needs to be done next.
Where we are: You can now run OMPI jobs through PTP. When the job
starts we are notified of the job identifier, which I then use to
populate the runtime model - signifying that our Universe knows of one
new Job. I have changed the Job viewer - it now lists the jobs on the
left and, when you click on one, shows some statistics for that Job on
the right. The information right now is minimal, but we'll obviously add
more in the future. You can also select a job and click the
Terminate icon to send a kill message to the Control System to kill that
job. This identifies the job and sends it down to the subsystem
correctly; however, at this time I'm not actually doing a kill in OMPI.
What we aren't: There are a few immediate problems and things to do.
* I am capturing the correct jobID at the control layer but am not
yet actually killing the job with OMPI. I believe I know how to
do this, so it should be easy and is likely the next step.
* I can't yet get information about the Job that I've just started
except its JobID. I believe I know how to get some basic
information, but what I really need is the ID of each process that
started and, for each process, the ID of the Node it is running
on. It's unclear to me at this time whether this is just
something I don't know how to do yet or whether it's
unimplemented in ORTE. I'll be determining this in the near
future.
* (BIGGEST PROBLEM/ANNOYANCE) I can start the ORTE daemon from PTP
through the JNI interface and I can also tell it to clean itself
up and stop. When I do this I can see the process start with the
correct args, and when I stop it I can see it clean itself up
perfectly. Once it's started I can communicate with it - talking
to the registry. However, I cannot spawn an MPI job. What's
worse is that I also cannot spawn an MPI job at the command line
if I've started the daemon this way. I've tried several exec()
calls as well as the cheesy system() function - all with the
same results. The MPI programs just return immediately. If I
debug the program I can see that it gets assigned a JobID by ORTE,
but ORTE immediately sends me an ABORT and a TERMINATED message,
saying the job is done. Frankly, I don't believe the job ever starts.
* Greg has written some code that does state-of-health monitoring on
bproc. I need to integrate this with the JNI library and,
subsequently, into the UI so the UI actually reflects the
cluster it sees.
What you can do: You can test these improvements if you want. Sadly,
you have to start the orted on the console before you run PTP. If
you do this, it works perfectly. You can start MPI jobs all day,
concurrently, watch the JobViewer fill up, and look at the messages
coming out of the processes' stdout/stderr. You can pretend to kill a
job and notice in the print statements that it finds the right job. You
can play around with the preferences and configure the ORTEd, and if you
do things wrong or get errors from the OMPI/JNI layer you'll even get
nice, helpful popup error boxes.
--
-- Nathan
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndebard@xxxxxxxx
---------------------------------------------------------------------