PBS notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Intro

The Portable Batch System (PBS) handles batch job queues - scheduling jobs, monitoring them, and reporting progress. It is used for distributed batch jobs on clusters with dozens of nodes or more.


As of this writing there are two main PBS implementations:

  • torque — a fork of OpenPBS [1]
(OpenPBS was the original open source version; it is no longer actively developed [2], and the fork had to use a different name)
  • PBS Professional, a.k.a. PBSPro, which is paid-for software [3]


There are three main components to a PBS / torque setup:

  • pbs_mom - runs on every compute node, executes jobs there and reports resource use
  • pbs_server (what they report to and you interact with)
a server will also have a list of compute nodes that it can contact
  • pbs_sched (resource-aware scheduler)


The default scheduler (internally called fifo, though it's cleverer than that suggests) can be replaced. From this perspective,

PBS manages the queue and the compute resources,
Maui is the scheduler: based on the above (job and node information), it can opt to implement fancy policies, priorities, reservations, and whatnot.


The most common external scheduler seems to be Maui (free), apparently more common than the default (verify)

There's also Moab (paid-for)

And a bunch of more custom things people have written.



Install

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Note: $TORQUEHOME is often something like /var/spool/torque/ or /var/lib/torque

Before you start: If you have more than a few nodes, consider network-booting them (probably from the head node). This makes it much easier to change the configuration and have a node reboot effectively be a rollout.


Before a cluster setup, you may want to do a successful single-host install, just to get comfortable with the basic concepts, and get the distro-specific details out of the way.

See e.g. https://www.discngine.com/blog/2014/6/27/install-torque-on-a-single-node-centos-64


Head node install

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
  • set up trqauthd (authorizes connections to pbs_server)
  • set up pbs_server (the thing commands like qsub, qstat, and qmgr connect to; it knows which compute nodes to use, and directs the pbs_mom on each of them (verify))
  • set up pbs_sched

edit the nodes file, $TORQUEHOME/server_priv/nodes, which tells the server which hosts can be used for work and what resources (mainly cores) each has usable (a sketch follows below).

  • can be manually edited, or changed through qmgr
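
A minimal sketch of such a nodes file (node01/node02 and the core counts are hypothetical; extra words like bigmem are arbitrary node properties that jobs can later ask for with -l):

node01 np=12
node02 np=12 bigmem

The qmgr route would be something along the lines of qmgr -c "create node node03" followed by qmgr -c "set node node03 np = 12" (again a hypothetical hostname); pbs_server then maintains the same nodes file for you.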

run torque.setup (before starting pbs_server as a service; there are more manual ways of doing this, and a few details related to hostname and FQDN)
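
For example (the path is wherever you unpacked the torque source; root is the account that torque.setup will make queue manager/operator):

cd /path/to/torque-source
./torque.setup root
qmgr -c 'print server'     # shows the server and default queue it just created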

start trqauthd, pbs_server, and pbs_sched, and configure them to start at boot
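
On a systemd distro this might look like the following, assuming your packaging ships units with these names (they vary between distros and versions):

systemctl enable --now trqauthd pbs_server pbs_sched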


Note that if you want to use Maui/Moab rather than a simple scheduler, this is the time. See e.g. http://docs.adaptivecomputing.com/mwm/7-0/Content/pbsintegration.html

compute nodes install

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
  • set up pbs_mom (runs on each node, maintains job state, does the monitoring, does some init and cleanup)
  • edit $TORQUEHOME/server_name
  • edit $TORQUEHOME/mom_priv/config on each node (a sketch follows below)
    • most importantly $pbsserver node1, telling it where pbs_server is running
    • also often $logevent 255, a bitmask that tells it to log everything it can (handy for debugging; see also $loglevel)
  • start it (and configure to start at boot)
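
A minimal sketch of that mom_priv/config, assuming the head node is called headnode (a hypothetical hostname):

$pbsserver headnode
$logevent 255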


test

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
pbsnodes -a
# you should see a list of nodes, which ideally have state = free, not down
echo "date" | qsub
# you should see a STDIN.[eo]*
echo "sleep 30" | qsub ; sleep 2 ; qstat
# you should see a list including this running (status=R) job and the date one as C (recently completed)
# if it stays queued (Q) indefinitely, you probably forgot to install or start the scheduler component

mom.layout

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

On NUMA systems, a single MOM process will typically report multiple MOM nodes, one for what works out as each sub-computer (even the correct terminology gets confusing here).

...because NUMA means different CPU nodes, different memory controllers, different latency to different memory, and different bus speeds depending on which CPU core you're scheduling between. Think blade systems -- but also just dual-Xeon computers.


You can decide to make it pretend to be a single node -- and this is frequently good enough on e.g. a dual Xeon.


If you know it does matter to the efficiency of your job (e.g. blade servers share a backplane that tends to be faster than QPI, but will be the bottleneck if there are many nodes that need to continuously talk), you should configure the physical layout for each so that jobs can sit on individual NUMA nodes.

Basically, /sys/devices/system/node/ shows you how things are enumerated and what belongs to what, and there are scripts to help you convert that into a mom.layout.
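
A minimal sketch for a dual-socket host where /sys/devices/system/node/ shows node0 and node1 (whether that matches your hardware is an assumption; the server side also wants num_node_boards=2 on that host's line in the nodes file):

$TORQUEHOME/mom_priv/mom.layout:
nodes=0
nodes=1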

Jobs

Given environment

A job inherits environment from the queue management when it is run, including:

  • PBS_JOBID - job identifier that was assigned
often sequentially numbered, so it can double as an implicitly unique temporary directory name
  • PBS_O_HOST - hostname that ran qsub
  • PBS_O_WORKDIR - directory from which the job was submitted (an absolute path)
logs go here (verify)
  • PBS_SERVER - hostname of the pbs_server that was submitted to
  • PBS_O_QUEUE - queuename that was submitted to
  • PBS_QUEUE - queuename that is being executed in (usually the same)


  • PBS_JOBNAME - user-supplied job name


  • PBS_JOBDIR - path job is staged and executed in


  • PBS_ENVIRONMENT - execution mode; "PBS_INTERACTIVE" if interactive via -I, "PBS_BATCH" otherwise
  • PBS_NODEFILE - file containing nodes assigned to this job
  • PBS_O_PATH - PATH from submission environment
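
A small sketch of a job script using a few of these (envdemo is an arbitrary name):

#!/bin/bash
#PBS -N envdemo
cd "$PBS_O_WORKDIR"                # start where the job was submitted from
echo "job $PBS_JOBID on $(hostname), submitted from $PBS_O_HOST"
cat "$PBS_NODEFILE"                # the nodes assigned to this job, one per line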

Controlling further environment

shell environment:

  • have a shell file that .pbs files will source.
may be the most controlled and admin-healthy
  • -v namelist - environment variables to be exported from the qsub environment to the job environment (variable), or defined there (variable=value)
  • -V requests that qsub copy all of its environment to the job's environment. Be careful with this.


Similarly, when working directory matters, the most controlled way is to do it yourself. The given environment variables (see above) help here.
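
For example, combining -v with handling the directory yourself (RUNDIR is a hypothetical name that your script would read):

qsub -v RUNDIR=/scratch/run1 myscript.pbs
# and inside myscript.pbs:
cd "$RUNDIR"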


Management and status

qsub your_script

submits a job
often a script, e.g. one that starts an MPI job.
You can hand in resource requests to qsub, or, more commonly, have them read from comments in the script file.


qstat (torque)

list jobs in queue
by default not so verbose, though has XML output (qstat -x) that can be useful for tooling
the most useful job states:
Q, queued, eligible to run
H, held (usually means a dependency on another job, but can have other reasons)
W, waiting for its execution time
R, running
E, exiting after its run
C, completed after having run (sticks around for a few minutes, just meant to report stuff that recently finished)
qstat docs
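
Some common invocations (the username and job ID here are hypothetical):

qstat                   # brief list of jobs
qstat -u alice          # jobs belonging to one user
qstat -f 7058.master    # full details of one job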

pbsnodes (torque)

gives the status of individual nodes, including what job(-id)s they are running
pbsnodes docs
it also has XML output

showq (Maui)

overview of queues, jobs, how much of the (known) cluster is occupied


qdel

remove a job you've given up on
most commonly with the job ID
qdel docs




TODO: read:

Planning - job properties, resource requests

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Most of these can be specified on the qsub, e.g.

qsub -l walltime=10:00 myscript.pbs

or can be parsed out of comments in the script you hand to qsub, e.g.

#!/bin/bash
#PBS -l walltime=10:00
do_stuff



-l resource details

Mostly reserved/guaranteed resources; the job is held back in the queue until the request can be met, so these are most useful as a "don't run on a node that cannot easily handle this" or "don't run until I can have a lot of nodes". Some specs can be platform specific. See man pbs_resources for detail.

Some of these are restrictions which you promise your work will observe. Some act only as an indicator for people watching the queue, others as an "assume it's gone haywire and kill it automatically" limit (so always overestimate those a bit).


Examples:

  • a number of nodes (nodes=3) and processors per node (ppn=12) (default is 1 CPU on 1 node)
  • host(s) by name (e.g. if you know it has favourable resources)
  • hosts that have:
    • arch - an architecture
    • disk -
    • maximum size of single file (file=)
    • host - ...a certain name
    • memory:
      • mem - physical memory (collective maximum we would use)
      • pmem - physical memory (per-process maximum we would use)
      • pvmem - virtual memory (per-process maximum we would use)
      • vmem - virtual memory (collective maximum we would use)
    • time =hh:mm:ss
      • walltime - wallclock time
      • cput - CPU time (collective)
      • pcput - CPU time (single process)
    • a certain interconnect (e.g. infiniband or hippi)
    • specific software installed
    • a certain architecture


  • -W additional job properties - you can e.g. ask jobs to start after another finishes, to start at the same time, and such.
For example, to queue two jobs and have the second wait for the first to finish (successfully or with an error):
# qsub first_job
7058.master
# qsub -W depend=afterany:7058.master  second_job


  • -a
execute only after a particular date'n'time
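For example, assuming the usual [[[[CC]YY]MM]DD]hhmm[.SS] format:

qsub -a 202601021330 myscript.pbs
# don't start before 2026-01-02 13:30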


  • -q destinationqueue
    • queue (named queue on default server)
    • @server (default queue on named server)
    • queue@server (named queue on named server)
  • -N jobname
some restrictions on length and characters apply (the name ends up in e.g. the default output filenames)
  • -p priority (-1024..1023, default is 0), controls which jobs to favour within a queue


Execution environment

  • -w working dir (sets $PBS_O_WORKDIR)
  • -u allows you to control under which user a job is run
can be specified per host
defaults to the user running qsub


Changing your mind

For example, if you discover your job will run longer than you thought:

qalter jobid -l walltime=24:00:00

...though increasing the allocated time may be disabled for fairness, in which case overestimating beforehand is a better idea.



Node management

Interesting commands: pbsnodes and qmgr


List nodes with interesting states (e.g. down, offline): pbsnodes -l

Mark as offline: pbsnodes -o nodename

Return to service:

  • Nicely: pbsnodes -r nodename (clears offline, marks down, then marks as free only when it pings fine)
  • Forcefully: pbsnodes -c nodename

You can add a note to nodes, most useful when they are offline, down, or unknown (example below).
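
For example (the node name and note text are made up):

pbsnodes -o -N "PSU replacement pending" node07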


Node states include:

  • offline - marked as unavailable for jobs
  • down - doesn't seem to be reachable
  • up states:
    • job-exclusive
    • job-sharing
    • reserve
    • free - available for jobs
    • busy
    • time-shared
  • state-unknown - e.g. happens when there are communication errors. (seen with down?)


Additionally valid when filtering:

  • up (combination of: job-exclusive, job-sharing, reserve, free, busy, time-shared)
  • active (combination of: job-exclusive, job-sharing, busy)
  • all





Unsorted

It seems that jobs are sometimes shown in showq as using fewer processors than they really are. I haven't figured out why yet.


problems

Invalid credential MSG=Hosts do not match

...probably due to the way the trqauthd daemon works: it authenticates against the hostname it reads from server_name.

Specifically, if it's different from the FQDN, auth will fail.


The typical fix is to edit server_name to reflect the FQDN and restart trqauthd.

This may require doing your name resolution differently.
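
A sketch of that fix (the FQDN is hypothetical, and the restart command depends on your init system):

hostname -f                                      # check what the FQDN actually is
echo head.example.org > $TORQUEHOME/server_name
systemctl restart trqauthd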

Server has no node list MSG=node list is empty - check 'server_priv/nodes' file

That.

Edit it and restart pbs_server.

May be due to name resolution issues (verify)


Node nodename isn't declared to be NUMA, but mom is reporting

The node has multiple physical CPUs (e.g. dual Xeon, blade, etc), and pbs_server isn't configured to see it as such.

You probably want to check that that was intentional.


pbsnodes shows state = down

Nothing ever queued

qdel: Server could not connect to MOM

This error usually means the master can't contact the governing process on the compute node (node is down, there is a networking problem, or such), so it cannot request it to stop.

If you think it is a transient problem, wait and retry.

If you know the job really is completely gone from the node, e.g. because that node (or all of them) rebooted, you can do qdel -p jobid - this basically clears the master's knowledge of what is happening on the nodes.

Note: If there are active nodes with processes still working, they will happily run on without the queue manager knowing about it. At best these nodes will be slower, at worst your jobs will fail.

ssh errors in PBS epilogue

You can check what your specific epilogue script is doing by finding it and reading it.

If it's the standard script, then it'll be the "Killing leftovers..." bit, which for these purposes is just running a command on nodes using ssh (from a script and non-interactively, which can help cause this; debugging may be interesting, as ssh and (pbs)psh may act differently).


In our case, the problem was that root did not have a keypair set up (the netboot image lacked a /root/.ssh directory completely, in fact).

There are subtler reasons.