Using PBS at the CAC

CAC resources are available to members of the University of Michigan; however, priority is given to users or groups who have contributed to the CAC (for example, by purchasing machines or time on those machines). If you have a deadline, please contact the CAC to arrange a temporary special dispensation. For a full explanation of our queuing policies, see our policy section.

An important consideration when creating your PBS script is file input and output. If your program reads its input files frequently, it is worth your time to add commands to your PBS script that create a local directory in /tmp (mkdir /tmp/$PBS_JOBID) and copy your files to that directory before starting your program.

PBS collects STDOUT and STDERR from your program, and we highly recommend letting PBS take care of your output rather than redirecting it to your home directory. When PBS handles the output, it writes your files to the local disk while the job is running, rather than to the remote disk mounted on /home; the files are copied to your home directory at the end of the job. Writing to the local disk improves the performance of your program (/home is accessed over the network, which is almost always slower than the disk in the compute node), but it does mean that you cannot see the output from your program while it is running. Also, if there is a problem with the remote file system while your program is running, your program will be unaffected and will continue to run.

With parallel programs, keep in mind that the local disk is local to each compute node, so if each task needs to read and write files, you need to distribute them to or gather them from all of the compute nodes as necessary. Finally, if your program uses scratch files, it is very worthwhile to set the scratch directory to a local disk (with parallel programs, make sure that the scratch data isn't shared between nodes, or this won't work).
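
The staging pattern described above can be sketched in sh as follows. This is a minimal sketch under stated assumptions, not a CAC-provided script: the helper name stage_and_run is hypothetical, and the fallback directory name used when PBS_JOBID is unset exists only so the sketch can be tried outside of a job.

```shell
#!/bin/sh
# Sketch: run a command in a node-local scratch directory, then copy the
# results back. stage_and_run is a hypothetical helper, not part of PBS.
stage_and_run() {
    src_dir=$1      # directory holding the input files (e.g. under /home)
    shift           # the remaining arguments are the command to run

    # Outside of PBS, PBS_JOBID is unset; fall back to a placeholder (assumption).
    work_dir="/tmp/${PBS_JOBID:-nojob.$$}"

    mkdir -p "$work_dir" || return 1
    cp -r "$src_dir"/. "$work_dir"/ || return 1

    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$work_dir" && "$@" )
    status=$?

    # Copy everything back to the source directory, then remove the scratch copy.
    cp -r "$work_dir"/. "$src_dir"/
    rm -rf "$work_dir"
    return $status
}
```

Inside a job script this might be called as stage_and_run "$HOME/your_stuff" ./your-program. For parallel jobs the same steps would have to happen on every compute node, which this single-node sketch does not attempt.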

The Queues

In general, on every cluster you should submit to the route queue; it will feed your job into the most appropriate queue, so you don't have to worry about queue definitions changing. However, if you want to submit to a specific queue, look at the output of qstat -q to see the queue definitions and limits.

Resources

You can request specific attributes, such as the number of nodes, memory, or job runtime. Memory requests should be set per process using pmem. Set the walltime to a value slightly longer than you expect the job to run. To request 2 nodes with 2 processors each, with each process using a maximum of 2gb of memory, for 1 hour:
#PBS -l nodes=2:ppn=2,pmem=2000mb,walltime=1:00:00

Request memory in mb rather than gb, and treat 1gb as 1000mb. For example, our 2gb nodes have only about 2025mb available after system usage, so a request for 2gb (2048mb) will never be honored. In a nutshell, round down a little when estimating your memory needs.
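
As a small illustration of the rounding advice above, the following sketch builds a pmem request from a nominal size in gb using 1gb = 1000mb, and compares two request sizes against the roughly 2025mb quoted above. The helper name pmem_request is hypothetical, and the comparison only considers a single process on one node.

```shell
#!/bin/sh
# Sketch: turn a nominal per-process size in gb into a pmem request in mb,
# using 1gb = 1000mb so the request stays below what a node can really offer.
# pmem_request is a hypothetical helper name.
pmem_request() {
    echo "pmem=$(( $1 * 1000 ))mb"
}

pmem_request 2    # prints pmem=2000mb

# Approximate usable memory on a 2gb node, taken from the text above.
avail_mb=2025
for request_mb in 2048 2000; do
    if [ "$request_mb" -le "$avail_mb" ]; then
        echo "${request_mb}mb fits"
    else
        echo "${request_mb}mb will never be honored"
    fi
done
```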

You can see all other attributes by reading the pbs_resources man page.

 

The Importance of Estimating Your Job's Runtime

You need to estimate how long your job will run. If you do not specify the wall clock time required by your run (e.g. walltime=45:00), PBS will terminate your job after 15 minutes. However, if you specify an excessively long runtime, your job may be delayed in the queue longer than it should be. Therefore, please try to estimate your wall clock runtime accurately. A modest amount of overestimation (10-20%) is probably ideal.
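
One way to apply the 10-20% margin is to pad an expected runtime before writing it into the walltime request. The sketch below does this with integer arithmetic; padded_walltime is a hypothetical helper name, and the 15% default is simply a value inside the suggested range.

```shell
#!/bin/sh
# Sketch: pad an expected runtime (in whole minutes) by a percentage and
# print it as H:MM:SS for a walltime request.
# padded_walltime is a hypothetical helper name.
padded_walltime() {
    minutes=$1
    pad_percent=${2:-15}    # default 15%, inside the 10-20% range above
    # Integer arithmetic, rounding up to the next whole minute.
    total=$(( (minutes * (100 + pad_percent) + 99) / 100 ))
    printf '%d:%02d:00\n' $(( total / 60 )) $(( total % 60 ))
}

padded_walltime 60       # prints 1:09:00
padded_walltime 45 20    # prints 0:54:00
```

The result would then be used as, for example, #PBS -l walltime=1:09:00.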

How to Write a PBS Batch Script

PBS scripts are rather simple. An MPI example for user your-user-name (using 14 processes):

Example: MPI Code

#!/bin/sh
#PBS -S /bin/sh
#PBS -N your-mpi-job
#PBS -l nodes=7:ppn=2,walltime=1:00:00
#PBS -q route
#PBS -M your-email-address
#PBS -m abe
#
echo "I ran on:"
cat $PBS_NODEFILE
# Create a local directory to run and copy your files to local.
# Let PBS handle your output
mkdir /tmp/${PBS_JOBID}
cd /tmp/${PBS_JOBID}
cp ~/your_stuff .


# Use mpirun to run with 7 nodes for 1 hour
mpirun -np 14 ./your-mpi-program

cd
/bin/rm -rf /tmp/${PBS_JOBID}

 
The PBS script parameters are as follows:
#PBS -N your-mpi-job
     The name of the job in the queue is "your-mpi-job". This can be anything as long as it is less than 13 characters long; make it descriptive so you know which of your jobs are running and queued.
#PBS -l nodes=7:ppn=2,walltime=1:00:00
     Reserve 7 machines w/ 2 processors each (14 processors) for 1 hour. To also limit each process to 1GB of memory, append ,pmem=1000mb. Note that pmem (per-process memory) is different from the previous usage of mem.
#PBS -S /bin/sh
     Run the script under /bin/sh (see below).
#PBS -q route
     Submit to the queue named route.
#PBS -M your-email-address
     Email me at this address.
#PBS -m abe
     Email me when the job aborts, begins, and ends.
#PBS -j oe
     Join your stdout and stderr output into one file, to be placed in your home directory. (This flag is not used in the example script.)

For complete information on PBS flags, use "man qsub". For further information on PBS, use "man pbs".

The MPI (mpirun) parameters are as follows:
-np    Number of processes.
-stdin <filename>    Use "filename" as standard input.
-t   Test but do not execute.
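
Rather than hard-coding the value passed to -np, a job script can derive it from $PBS_NODEFILE, which lists one line per allocated processor (14 lines for nodes=7:ppn=2). The following is a minimal sketch; count_slots is a hypothetical helper name, and the mpirun line is shown only as a comment because it needs a real job to run.

```shell
#!/bin/sh
# Sketch: count the processor slots listed in a PBS node file.
# count_slots is a hypothetical helper name.
count_slots() {
    # wc -l prints the number of lines; tr strips the padding some systems add.
    wc -l < "$1" | tr -d ' '
}

# Inside a job script one might then write (not executed here):
#   NP=$(count_slots "$PBS_NODEFILE")
#   mpirun -np "$NP" ./your-mpi-program
```

Deriving the count this way keeps the mpirun line correct if the nodes or ppn request is changed later.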

Example: OpenMP Code

If you're running OpenMP code (w/ 1 or 2 processes on these machines):

#!/bin/sh
#PBS -S /bin/sh
#PBS -N your-openmp-job
#PBS -l nodes=1:ppn=2,walltime=90:00
#PBS -q route
#PBS -M your-email-address
#PBS -m abe
#
echo "I ran on:"
cat $PBS_NODEFILE
#
# Create a local directory to run and copy your files to local.
# Let PBS handle your output
mkdir /tmp/${PBS_JOBID}
cd /tmp/${PBS_JOBID}
cp ~/your_stuff .


./your-openmp-program

#Clean up your files
cd
/bin/rm -rf /tmp/${PBS_JOBID}

You may find it necessary to add the following to OpenMP jobs, should you run low on stack space due to the default stack size of 2 MB:

export MPSTKZ=8M

Example: Serial Code

If you have a serial code (e.g. octave) just set 'nodes=1'.
For example:

#PBS -N your-serial-job
#PBS -l nodes=1,walltime=24:00,pmem=1gb
#PBS -q route
#PBS -M your-email-address
#PBS -m abe
#
# Create a local directory to run and copy your files to local.
# Let PBS handle your output
mkdir /tmp/${PBS_JOBID}
cd /tmp/${PBS_JOBID}
cp ~/your_stuff .

octave < input.m > out.mat

#Clean up your files
cd
# Retrieve your output
cp /tmp/${PBS_JOBID}/* ~/your_stuff
/bin/rm -rf /tmp/${PBS_JOBID}

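In this script, stdout will be directed into the file JobName.o## and stderr into JobName.e##, where ## is the job number; add #PBS -j oe to join them into one file. JobName is the name specified by the -N flag in the script file.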

How to Submit a PBS Batch Script

To submit a PBS script, simply type:
qsub your-scriptname

where your-scriptname is the name of your PBS script. Note that PBS runs your script under your login shell unless told otherwise with the -S flag. One benefit of running under /bin/sh is that csh is arguably broken in how it handles terminal-disconnected jobs (the same goes for tcsh). Using csh or tcsh is fine, but you will receive these warnings at the beginning of your output file:

Warning: no access to stty (Bad file descriptor).
Thus no job control in this shell.
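
When qsub accepts a script, it prints the new job's ID, which the status and cancel commands described on this page take as an argument. The sketch below extracts the numeric part of such an ID; job_number is a hypothetical helper, and the server name in the sample ID is made up for illustration.

```shell
#!/bin/sh
# Sketch: extract the numeric part of a PBS job ID of the form NUMBER.SERVER.
# job_number is a hypothetical helper name.
job_number() {
    # ${1%%.*} removes everything from the first dot onward.
    echo "${1%%.*}"
}

# In a real session the ID would come from qsub (not executed here):
#   JOBID=$(qsub your-scriptname)
job_number "203.example-server"    # prints 203
```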

 

How to Check the Status of a PBS Batch Job

To check the status of your job in the queue, type:

qstat your-job-id

To see all jobs in the queue, type:

qstat -a

To see detailed info on each job, type:

qstat -f

To see the number of idle nodes, type:

freenodes
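
If a follow-up step should start only after a job has left the queue, the status check can be wrapped in a polling loop. This sketch assumes that qstat exits with a non-zero status once it no longer knows the job; that behaviour and the helper name wait_for_job are assumptions, so check it against your cluster before relying on it.

```shell
#!/bin/sh
# Sketch: poll qstat until it no longer reports the given job.
# wait_for_job is a hypothetical helper; it assumes qstat exits non-zero
# for a job that is no longer known to the server.
wait_for_job() {
    job_id=$1
    interval=${2:-60}    # seconds between checks
    while qstat "$job_id" > /dev/null 2>&1; do
        sleep "$interval"
    done
}

# Example (not executed here): wait_for_job 203 && echo "job 203 is done"
```

A long interval keeps the load on the PBS server low; there is rarely a reason to poll more often than once a minute.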

Scheduling

We use the Maui scheduler to implement various scheduling requirements. To learn about the Maui commands, read our Maui page.

How to Cancel a PBS Batch Job

If you realize that you made a mistake in your script file, or you have modified your program since you submitted your job, and you want to cancel the job, first get the "Job ID" by typing qstat, then pass it to qdel. If you encounter an error while using qdel on aon, eliza or nyx, add the -W force flag. If you can't delete your job on morpheus, send us email and we'll delete the job for you.

For example:

qdel [-W force] 203 - if running on aon, nyx or eliza

qdel 203 - if running on the morpheus cluster
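
The retry described above can be wrapped in a small function for the clusters where -W force is available (aon, nyx and eliza, per the text). This is only a sketch; cancel_job is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: try a normal qdel first and retry with -W force if it fails.
# cancel_job is a hypothetical helper; use it only on clusters where
# qdel accepts -W force.
cancel_job() {
    job_id=$1
    qdel "$job_id" || qdel -W force "$job_id"
}

# Example (not executed here): cancel_job 203
```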

How to Query the PBS Queues

To see the names of the available queues and their current parameters, type:
qstat -q

The notable parameters in the output are the queue names (in the Queue column) and the wall clock time limits (in the Walltime column).

Queuing Policy

At the CAC we strive to promote equitable access to our resources. Because all jobs run on the CAC systems are submitted to a batch queuing system, we enforce this fairness by controlling several parameters of the scheduling algorithm used by the queuing system.

When a job is submitted to the queuing system, the queuing system looks for free nodes on which to run it. If it can't find any nodes that are suitable for your job, your job stays in the queued state (in PBS this is denoted by the letter "Q"). While your job is queued, its position in the queue is adjusted relative to the other jobs in the queue based on two primary factors: limits and priority.

In the general access partition on each cluster, we limit the number of nodes that any one person can use at a time, though that number changes depending on the cluster. However, to get the maximum use out of the CAC systems, these limits are soft, so if no one else is waiting, it is possible for one person to use more nodes than the soft limit.

We also limit the number of jobs that are considered for scheduling. We will schedule 1 job per person at a time. This means that if USER-A submits 50 jobs with job IDs 101 through 150, and USER-B later submits a job with job ID 151, the scheduler will consider only jobs 101 and 151 for scheduling. When USER-A's job is started, her next job will become eligible for scheduling; while it is waiting, it is not accumulating priority.
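
The one-job-per-person rule in the example above can be illustrated with a short filter. This is only a model of the rule as described here, not how the scheduler is implemented; eligible_jobs is a hypothetical helper, and the "job-id user" input format is assumed for the illustration.

```shell
#!/bin/sh
# Sketch: given "job-id user" lines in submission order, print the jobs that
# would be considered under a one-eligible-job-per-person rule.
# eligible_jobs is a hypothetical helper name.
eligible_jobs() {
    # Keep only the first line seen for each user (second field).
    awk '!seen[$2]++'
}

# USER-A submits jobs 101 through 150, then USER-B submits job 151.
awk 'BEGIN {
    for (i = 101; i <= 150; i++) print i, "USER-A"
    print 151, "USER-B"
}' | eligible_jobs
# prints:
# 101 USER-A
# 151 USER-B
```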

To further promote fair use of the CAC resources, jobs in the queued state are ordered by their priority. The priority of a job is computed from several factors:

  • The amount of time the job has been in the queue; the longer the time, the greater the priority. However, only one of your jobs at a time accumulates priority based on how long it has been in the queue.
  • Your usage over the past 30 days. If you have used a large amount of wallclock time on the cluster in the past month, people who have used less will receive a higher priority. This is known as "fairshare" and attempts to ensure that the widest possible range of users will have access to the CAC resources.

There are exceptions to these rules. The largest is that the limits do not apply to people who have purchased nodes or dedicated time on the cluster. Fairshare still applies, to promote fair use within the group of people with access to the private set of nodes.

If you have questions about this policy or feel that it is not fairly enforced, please contact us at cac-support@umich.edu.