
RE: [ptp-dev] Using PTP with SLURM on a BlueGene/P


 Simon,
 
By now, all options except "-mapfile" are clear.
 
With SLURM, the user can't know which nodes will be allocated at job request time,
except by using the "-w" option of srun, which specifies nodes that must be included.
Only with this option can the user know the to-be-allocated nodes in advance and thus prepare the mapfile.
But this has a side effect: even if there are enough idle nodes for the job request,
if the specified nodes are already allocated to other jobs, the user's job will remain PENDING
until those nodes become free.
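
For example, a user could pin the allocation to known nodes so the mapfile can be prepared in advance (node names here are hypothetical):

   srun -N 4 -w bgp[000-003] mpirun -mode SMP -mapfile my.map -exe ./a.out

If any of bgp000-bgp003 is busy, the job waits for exactly those nodes, even when other idle nodes could satisfy "-N 4".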
 
Regards,
Jie

To: yangtzj@xxxxxxxxxxx
Subject: RE: [ptp-dev] Using PTP with SLURM on a BlueGene/P
From: simon.wail@xxxxxxxxxxx
Date: Fri, 16 Jul 2010 10:23:20 +1000

Jie,
You are absolutely correct that the number of nodes (-N) can be calculated by taking the number of tasks (-np) and dividing it according to the specified mode.  Therefore we could just use the two options (-np, -mode) in the SLURM RM configuration.

The -mapfile is VERY important for users.  Unlike a cluster, with the BG, the nodes you are allocated by SLURM are always adjacent, and the BG has a torus network that connects nearest-neighbour nodes.  Therefore how your problem domain is distributed to the nodes is very important and can have a dramatic effect on performance.  That is what the -mapfile option is used for.
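
A minimal mapfile sketch, assuming the BG/P per-rank "X Y Z T" coordinate format (the file name my.map is hypothetical):

   0 0 0 0
   1 0 0 0
   0 1 0 0
   1 1 0 0

Line n places MPI rank n on the given torus coordinate, so here ranks 0-3 land on neighbouring nodes; it would be passed as "mpirun -np 4 -mapfile my.map -exe ./a.out".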

We are using V2.1.x of SLURM.  Currently this is 2.1.9 and the BG will remain on the 2.1.x branch for the foreseeable future.

Regards,
Simon Wail, Ph.D
HPC Specialist
IBM Research Collaboratory for Life Sciences - Melbourne


phone:
+61 3 9035-4341  fax: +61 3 8344-9130
address:
VLSCI, Gnd Floor, 187 Grattan St
Carlton   VIC   3010   Australia
email:
simon.wail@xxxxxxxxxxx





From: JiangJie <yangtzj@xxxxxxxxxxx>
To: Simon Wail/Australia/IBM@IBMAU
Cc: <ptp-dev@xxxxxxxxxxx>
Date: 15/07/2010 17:55
Subject: RE: [ptp-dev] Using PTP with SLURM on a BlueGene/P






Simon,

The task number (-np) of an MPI job is determined by the user,
and the number of nodes required is determined by both the task number
and the "mode".  Say, if the task number is 8, the required node number would be 8, 4,
and 2 for SMP mode, DUAL mode, and VN mode respectively.  Right?
So given the -np and -mode options, the required node number can be calculated.
The "-N" option of srun gives the minimum number of nodes required;
the value of N should be greater than or equal to np divided by the tasks per node of the mode.
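
For example (illustrative command lines):

   srun -N 8 mpirun -np 8 -mode SMP  -exe ./a.out   # 1 task/node -> 8 nodes
   srun -N 4 mpirun -np 8 -mode DUAL -exe ./a.out   # 2 tasks/node -> 4 nodes
   srun -N 2 mpirun -np 8 -mode VN   -exe ./a.out   # 4 tasks/node -> 2 nodes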

Is the "-mapfile" option important for users?
I mean, the user usually doesn't know which nodes SLURM will allocate,
so it is quite possible that the user can't provide the correct mapfile when submitting a job.

Last, and very important: which version of SLURM are you using?
The current ptp slurm proxy is implemented against the SLURM 2.2/2.1 API,
and earlier versions of SLURM are no longer supported because of
the large API differences.  If your SLURM version is 1.x, it would be impossible
to support it on the basis of the current implementation, and a new implementation would be required.

Regards,
Jie

To: yangtzj@xxxxxxxxxxx
Subject: RE: [ptp-dev] Using PTP with SLURM on a BlueGene/P
From: simon.wail@xxxxxxxxxxx
Date: Thu, 15 Jul 2010 13:55:43 +1000

Jie,

I've done some further investigation of SLURM with BG and you are right - the "-N x" option should be passed to srun to allocate the necessary nodes.  It also supports the "-n x" option in a weird way.  Each BG processor has 4 cores and can run a maximum of 4 MPI tasks.  Therefore when using the -n option, the value is divided by 4 to determine the number of nodes/processors to allocate.  But this might not be how you want to run your program.  You might want to have only one MPI task per node (SMP mode - the default), or 2 (DUAL mode) or 4 (VN mode).  If using anything other than SMP mode, then the -mode option must be passed to mpirun, and the -np option is also required if you want to run fewer MPI tasks than the maximum allowed for the different modes on the available nodes - i.e. SMP = 1 x nodes, DUAL = 2 x nodes, VN = 4 x nodes.  Therefore it might make the most sense for the BG SLURM RM to not use the "-n" option (as currently available), but use the existing "-N" option, and add the "-mode" and "-np" options.
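
A sketch with illustrative values: "srun -n 16" would allocate 16/4 = 4 nodes, while the explicit form below allocates 4 nodes and controls the tasks per node directly:

   srun -N 4 mpirun -np 8 -mode DUAL -exe ./a.out   # 4 nodes, 2 tasks per node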


I believe SLURM works the same way on both the BG/P and the older BG/L.  I also believe mpirun works similarly on both systems, except BG/L only has 2 modes - DUAL and VN as it only has 2 cores per node.  I imagine any SLURM RM interface we build could be applicable to both BG models.  As to future BG systems, we can't be sure until they are available!


Regards,
Simon Wail, Ph.D
HPC Specialist
IBM Research Collaboratory for Life Sciences - Melbourne


phone:
+61 3 9035-4341  fax: +61 3 8344-9130
address:
VLSCI, Gnd Floor, 187 Grattan St
Carlton   VIC   3010   Australia
email:
simon.wail@xxxxxxxxxxx





From: JiangJie <yangtzj@xxxxxxxxxxx>
To: Simon Wail/Australia/IBM@IBMAU
Date: 15/07/2010 00:39
Subject: RE: [ptp-dev] Using PTP with SLURM on a BlueGene/P







Simon,

In SLURM, the job launch process actually consists of two steps, as sketched below:
1. allocate node resources;
2. spawn the job step.
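
On the command line the two steps would look roughly like this (illustrative values; the proxy does the equivalent through the SLURM API):

   salloc -N 4                        # step 1: allocate node resources
   srun mpirun -np 4 -exe ./a.out     # step 2: spawn the job step inside the allocation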

Since the ptp_slurm_proxy is implemented entirely via the SLURM API,
it first needs to know the number of requested nodes.  However,
from your example, it can't infer how many nodes should be allocated.
Maybe you omitted the "-N x" option of srun?
That is, the submission command should be "srun -N x mpirun -np xx -exe ./a.out".
I'm not entirely sure about this.
After node resources are allocated, the remaining work is similar to the non-BG case,
except that the executable is replaced with mpirun,
and the options of mpirun can be handled easily.

Does this job submission command apply to all BG machines with SLURM RMs?
I mean, it would be better if our implementation were not bound to a specific machine.

Regards,
Jie


To: ptp-dev@xxxxxxxxxxx
CC: yangtzj@xxxxxxxxxxx
Subject: RE: [ptp-dev] Using PTP with SLURM on a BlueGene/P
From: simon.wail@xxxxxxxxxxx
Date: Wed, 14 Jul 2010 11:29:50 +1000


Jie,

When submitting an MPI job on the BG with SLURM, you need to use the BG version of mpirun as your executable.  Therefore it would look like:


srun mpirun -np x -exe ./a.out


The -np option is the number of processors on which to run, -exe specifies the executable (or it can just be at the end with its arguments) and there are many other options to mpirun on the BG.  For example:


-mode SMP | DUAL | VN - option to specify how many MPI tasks per node (SMP = 1, DUAL = 2, VN = 4 - the BG node has 4 cores)

-mapfile <map file | mapping> - option to specify how MPI tasks are assigned to nodes in the BG torus network

-exe <executable> - sets MPI program to run

-args "<program arguments>" - sets arguments to the MPI program

-cwd <dir> - sets current working directory

-env <exp=val> - sets environment variable

And there are many more.  Some of these would need to be supported in the SLURM RM user interface - particularly -mode and -mapfile - while others, such as -env, -exe, -args, -cwd, could be implemented by the proxy code.  Others might not be needed at all.
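
Putting several of these together (illustrative values; the paths and file names are hypothetical):

   srun -N 4 mpirun -np 16 -mode VN -mapfile my.map -cwd /home/user/run -env OMP_NUM_THREADS=1 -exe ./a.out -args "-in data.txt"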


The BG also supports mpiexec and allows different binaries to be executed on different groups of nodes.  I'm not sure if we need to support this option as it probably goes beyond the scope of what PTP can handle.


Let me know what you think and how we should proceed from here.


Regards,
Simon Wail, Ph.D
HPC Specialist
IBM Research Collaboratory for Life Sciences - Melbourne


phone:
+61 3 9035-4341  fax: +61 3 8344-9130
address:
VLSCI, Gnd Floor, 187 Grattan St
Carlton   VIC   3010   Australia
email:
simon.wail@xxxxxxxxxxx





From: JiangJie <yangtzj@xxxxxxxxxxx>
To: <ptp-dev@xxxxxxxxxxx>
Cc: Simon Wail/Australia/IBM@IBMAU
Date: 13/07/2010 18:56
Subject: RE: [ptp-dev] Using PTP with SLURM on a BlueGene/P







Hi Simon,

Sorry for the delayed response.
It would be great to port the SLURM proxy to support BG machines.

Since I have no BG machine available, could you please give an example of
the command for submitting an MPI job on a BG machine with the SLURM RM?
On a non-BG machine, it usually looks like this:
srun -n x -N x ./a.out
What is it on a BG machine?

Regards,
Jie


To: ptp-dev@xxxxxxxxxxx
From: simon.wail@xxxxxxxxxxx
Date: Mon, 28 Jun 2010 14:34:22 +1000
Subject: [ptp-dev] Using PTP with SLURM on a BlueGene/P


I'm using the latest version of Eclipse (Helios) with PTP 4.0.  I've been able to configure a resource manager talking to SLURM that is simulating an IBM BlueGene system.  The issue I have is that SLURM on the BlueGene seems to be different from other systems.  On other systems, when you provide SLURM with the number of nodes and/or tasks to run on, it automatically executes your MPI program on the number of nodes specified.  For a BlueGene it is different, as you need to execute the "mpirun" command with your MPI program as an argument, as well as specify the number of tasks as the "-np" argument to mpirun.  Now of course the SLURM implementation for PTP does not do this.

So the question is how do we get the SLURM resource manager to work for BlueGene?  I've looked at the PTP-SLURM proxy code and it seems one way would be to change the executable for the job to always be "mpirun", with the user-specified program added as an argument.  This seems a bit of a hack, and it doesn't account for other options to mpirun needed for the BlueGene, such as "mode" and processor mapping.

I suppose a better method would be to change/extend the existing SLURM resource manager configuration in the UI to allow the specification of the mpirun command if you're using a BlueGene - like the PBS configuration allows the user to select the MPI command.  Alternatively we could create a new SLURM resource manager specifically for BlueGene, but probably reusing lots of the existing SLURM code.  Either approach is probably a lot more work and would need some proper design specs.  Is there currently any plan to add BlueGene support to the SLURM RM, or has anyone tried it?  Am I on my own here, or does this sound like a reasonable enhancement to PTP and maybe something we can work on together?
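
A sketch of what that hack would amount to (illustrative values, not actual proxy code):

   user-specified job:  ./a.out arg1 arg2
   proxy would submit:  srun -N x mpirun -np y -exe ./a.out -args "arg1 arg2"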


Your feedback is much appreciated.


Thanks,
Simon Wail, Ph.D
HPC Specialist
IBM Research Collaboratory for Life Sciences - Melbourne


phone:
+61 3 9035-4341  fax: +61 3 8344-9130
address:
VLSCI, Gnd Floor, 187 Grattan St
Carlton   VIC   3010   Australia
email:
simon.wail@xxxxxxxxxxx






