Cluster computing with Torque

This entry was posted on Monday, 30 May, 2011.

Running computation jobs on a cluster is great – it speeds up your work and reduces the temperature in your office. But you need to craft your *.pbs file carefully, or you will see some strange results. This post is meant as a reference/FAQ and will probably grow over time.

Intended audience: users running OMNeT++ discrete event simulation experiments, though you can substitute any application.

Alternatively, you can use MPI to achieve similar results. Read about that here.

First: read how to get your code on the cluster.

Second: compile your code on the head node.

Third: throw your job on the cluster to run (the subject of the remainder of this text).

Fourth: get your results. You may either opt for log files (.sca, .elog, .vec, etc. are written to the networked filesystem) or have a look at how to output OMNeT++ results to a database.

Sample PBS:

1 #!
2 #PBS -j oe
3 #PBS -V
4 #PBS -m e
5 #PBS -l nodes=10:ppn=8:E5520
6 #PBS -l pmem=2gb
7 #PBS -l walltime=336:00:00
8 cd /sim
9 opp_runall -j80 ./sim -u Cmdenv -c CSMAtest -r 0..599

This does the following:

  1. Line 1: #! – use the default shell. You could also specify #!/bin/bash, #!/bin/sh, etc.
  2. Line 2: -j oe joins the job's error output (stderr) into its regular output (stdout), so you get a single output file.
  3. Line 3: -V exports all environment variables from your submission shell to the job.
  4. Line 4: -m tells Torque how you want to be notified by mail. It has three options: a – notify when aborted, b – notify at the beginning of the job, e – notify after termination (see the example after this list).
  5. Line 5: requests 10 nodes (with E5520 CPUs) and 8 cores per node.
  6. Line 6: requests a maximum per-process (or per-thread) memory size of 2 GiB.
  7. Line 7: sets a maximum wall time of 336 hours. Two weeks should be enough, but see below.
  8. Line 8: cd to the right directory.
  9. Line 9: executes the command (in this case, spawning OMNeT++ processes, 80 at a time).
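For example, to be notified at the beginning, on abort, and after termination, you can combine the options; the -M directive sets the destination address (the address below is just a placeholder):

#PBS -m abe
#PBS -M you@example.org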

An alternative to line 5 could be:

#PBS -l nodes=10:ppn=8

This way you tell the Torque scheduler you don't care what type of CPU your jobs run on. You can also explicitly run on specific machines. Say your nodes are called nX, where X is a number in [1..20]:

#PBS -l nodes=n14:ppn=4+n15:ppn=4+n20:ppn=8

Now you are using 4 cores on n14 and n15, and 8 on n20, so you can run 16 processes in total (for OMNeT++’s opp_runall, the -j value should not exceed the number of available cores or you’ll see decreased performance).
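The matching run line for that node list would then be something like:

opp_runall -j16 ./sim -u Cmdenv -c CSMAtest -r 0..599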

Submitting, deleting, and checking status
Once you’ve got your PBS file, you can send it from the head node to the cluster like so:
qsub myJob.pbs
It will return a job identifier. You can then check the state of the queue:
qstat
Or get more info with the “-f” argument. The job will initially be in the “Q” state: it is queued. Once the job is distributed over the cluster and the nodes start working, it goes to “R” (which, obviously, stands for “Running”).
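The exact output differs between installations, but a session might look roughly like this (the job identifier, user, and queue names are made up):

$ qsub myJob.pbs
4711.headnode
$ qstat
Job id          Name        User   Time Use S Queue
--------------- ----------- ------ -------- - -----
4711.headnode   myJob.pbs   alice  00:00:00 Q batch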

You can get more info on the state of your job with:
checkjob [job identifier]

If you want to remove your job (heck, we all make mistakes!) that’s quite simple:
qdel [job identifier]
And qstat will show you the job is gone (or rather, it will *not* show the job). Note that the transition from Q to R may take some time: there may be other jobs on the cluster which need to finish first, but it also simply takes time to get everything in place. Removing a job can also take a while, as its processes need to be killed.

Torque’s WallTime not always what you’d expect
Though ‘WallTime’ suggests it runs at a second per second, as does ye olde grandfather's clock, our version of Torque sees WallTime as the time spent per core, summed over all cores. So if you set 14 days of WallTime, it will be used up in 7 days if you run on 2 cores; in fact, using 80 cores (and assuming no overhead), it will be used up in 336 h / 80 = 4.2 hours. This could be a configuration issue or a bug, I don't know; I just encountered the issue and found a workaround. Note that Torque will terminate any job which exceeds the WallTime. This will show up in the job output, which arrives in the shape of [jobName].pbs.o[job identifier], so you can spot the termination and fix the WallTime.
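If your cluster shows the same behaviour, the workaround is to scale the requested WallTime by the number of cores. A sketch for the example job above, assuming the per-core accounting described here:

# 14 days of real time on 80 cores, with per-core WallTime accounting:
# 336 h x 80 cores = 26880 h
#PBS -l walltime=26880:00:00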

Torque’s memory setting
There are two ways to set the memory limit: ‘mem’ and ‘pmem’. Using ‘mem’ you set the limit for the whole job, so ‘mem=24gb’ means that the 80 processes together should not consume more than 24 GiB of memory. Note that you can easily reach this limit: it comes down to only slightly over 300 MiB per process! With ‘pmem’ you specify the memory per process, so ‘pmem=2gb’ allocates 2 GiB per simulation process in the example PBS, which should be enough for this scenario.
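Side by side, for the 80-process example above:

#PBS -l mem=24gb     # whole job: 24 GiB / 80 processes is roughly 307 MiB each
#PBS -l pmem=2gb     # per process: 2 GiB each, i.e. 16 GiB per 8-core node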

References:
Some good Torque references:

ClusterResources.com

University of Florida High Performance Computing Center wiki

Boston College Torque user guide

Useful PBS commands for the Purdue physics cluster, by Preston M. Smith

Yang Zhang Lab (University of Michigan, Department of Computational Medicine and Bioinformatics and Department of Biological Chemistry): a document on how to get and run jobs on a Torque/MOAB cluster. http://zhanglab.ccmb.med.umich.edu/docs/cluster_doc2.html

Responses to “Cluster computing with Torque”

  1. In the PBS file you can specify that you have a job which requires a certain number of CPUs. opp_runall -j60 basically spawns 60 independent processes, so if one fails the others continue. These processes are distributed among the available CPUs (note that -j should not be larger than the number of CPUs specified in the PBS file, or you'll get a lot of context switches, which is detrimental to performance).

    An alternative is, as you mention, to create a number of PBS files. You could create one for each simulation run (then you do not need opp_runall). This may give the Torque/MAUI batch scheduler a finer granularity to work with: if I claim 60 CPUs for 60 processes, and there are 10 which take considerably longer than the other 50, I still keep all those CPUs claimed. That's something to keep in mind on busy clusters.
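    For example, a minimal sketch of that approach: suppose template.pbs (a made-up name) contains the usual #PBS headers plus ./sim -u Cmdenv -c CSMAtest -r RUNNR, where RUNNR is a placeholder. Then:

    for r in $(seq 0 599); do
        sed "s/RUNNR/$r/" template.pbs > run_$r.pbs
        qsub run_$r.pbs
    done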

  2. Milos

    I am creating a number of .pbs files which I let the cluster load via qsub, so if something fails I can stay cool because it does not affect the other runs. (I know that opp_runall should do as well, but...)

    One thing I am uncertain about is the -j60, because Torque distributes .pbs files, not threads. Will these 60 threads run on one node of the cluster, or is Torque smart enough to distribute them itself?

