PBS notes
Intro
Portable Batch System (PBS) handles batch job queues: scheduling, monitoring, and reporting progress. It is used for distributed batch jobs on clusters with dozens of nodes or more.
As of this writing there are two main PBS implementations:
- torque — a fork of OpenPBS [1]
- (OpenPBS was the original open source version, not actively developed anymore [2], forked and had to use a different name)
- PBS Professional, a.k.a. PBSPro, is paid-for software [3]
There are three main components of a PBS / torque setup:
- pbs_mom - runs on every compute node, basically to report resource use
- pbs_server (what they report to and you interact with)
- a server will also have a list of compute nodes that it can contact
- pbs_sched (resource-aware scheduler)
The default scheduler (internally called fifo, though it's cleverer than that suggests) can be replaced. From this perspective,
- PBS manages the queue and the compute resources,
- Maui is the scheduler, based on the above (job and node information), and can opt to implement fancy policies, priorities, reservations, and whatnot.
The most common external scheduler seems to be Maui (free), apparently more common than the default(verify)
There's also Moab (paid-for)
And a bunch of more custom things people have written.
Install
Note: $TORQUEHOME is often something like /var/spool/torque/ or /var/lib/torque
Before you start: If you have more than a few nodes, consider network-booting them (probably from the head node). This makes it much easier to change the configuration and have a node reboot effectively be a rollout.
Before a cluster setup, you may want to do a successful single-host install, just to get comfortable with the basic concepts, and get the distro-specific details out of the way.
See e.g. https://www.discngine.com/blog/2014/6/27/install-torque-on-a-single-node-centos-64
Head node install
- set up trqauthd (authorizes connections to pbs_server)
- set up pbs_server (the thing many commands like qsub, qstat, and qmgr connect to; it knows which compute nodes to use, and directs pbs_mom on each of them(verify))
- set up pbs_sched
edit the nodes file, $TORQUEHOME/server_priv/nodes, which tells the server which hosts report / can be used for work, and which of their resources (mainly cores) are usable.
- can be manually edited, or changed through qmgr
run torque.setup (before starting pbs_server as a service. There are more manual ways of doing this, and a few details related to hostname and FQDN)
start trqauthd, pbs_server, and pbs_sched, and configure them to start at boot
Note that if you want to use Maui/Moab rather than a simple scheduler, this is the time. See e.g. http://docs.adaptivecomputing.com/mwm/7-0/Content/pbsintegration.html
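As a rough illustration of the nodes file mentioned above: it is typically one line per compute host, with the hostname followed by np= and the number of usable cores. The helper name write_nodes_file, the hostnames and the core counts below are made up for the example. The function writes to whatever path you give it, so you can inspect the result before copying it to $TORQUEHOME/server_priv/nodes.

```shell
# Sketch: write a Torque-style nodes file ("hostname np=cores" per line).
# write_nodes_file is a hypothetical helper; hostnames and counts are examples.
write_nodes_file() (
    out=$1; shift
    : > "$out"                      # start with an empty file
    for spec in "$@"; do            # each spec looks like host:cores
        host=${spec%%:*}
        cores=${spec##*:}
        printf '%s np=%s\n' "$host" "$cores" >> "$out"
    done
)

# Example: two nodes with 12 cores each, written to a scratch location
# write_nodes_file /tmp/nodes.example node1:12 node2:12
```

pbs_server reads this file at startup, so restart it after changing the real file by hand.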
compute nodes install
- set up pbs_mom (runs on each node, maintains job state, does the monitoring, does some init and cleanup)
- edit $TORQUEHOME/server_name
- edit $TORQUEHOME/mom_priv/config on each node
- most importantly $pbsserver node1 telling it where pbs_server is running
- also often $logevent 255 which is a bitmap that tells it to log everything it can (handy for debug, see also $loglevel)
- start it (and configure to start at boot)
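The mom_priv/config described above can be very short. The following sketch generates one containing just the two directives mentioned, $pbsserver and $logevent. The helper name write_mom_config is hypothetical, and node1 is only the example server hostname used in these notes.

```shell
# Sketch: generate a minimal mom_priv/config.
# write_mom_config is a hypothetical helper; pass the output path and the
# hostname where pbs_server runs.
write_mom_config() (
    out=$1
    server=$2
    # \$ keeps the shell from expanding pbs_mom's own $-prefixed directives
    cat > "$out" <<EOF
\$pbsserver $server
\$logevent 255
EOF
)

# Example: write_mom_config /tmp/mom_config.example node1
```

Writing to a scratch path first and copying the result to $TORQUEHOME/mom_priv/config on each node (or into the netboot image) keeps the rollout step separate from the generation step.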
test
pbsnodes -a
# you should see a list of nodes, which ideally have state = free, not down
echo "date" | qsub
# you should see a STDIN.[eo]*
echo "sleep 30" | qsub ; sleep 2 ; qstat
# you should see a list including this running (status=R) job and the date one as C (recently completed)
# if it stays queued (status=Q) and never runs, you probably forgot to install or start the scheduler component
mom.layout
On NUMA systems, a single MOM process will typically report multiple MOM nodes, one for what works out as each sub-computer (even the correct terminology gets confusing here).
...because NUMA means different CPU nodes, different memory controllers, different latency to different memory, and different bus speeds depending on which CPU core you're scheduling between. Think blade systems -- but also just dual-Xeon computers.
You can decide to make it pretend to be a single node -- and this is frequently good enough on e.g. a dual Xeon.
If you know it does matter to the efficiency of your job (e.g. blade servers share a backplane that tends to be faster than QPI, but will be the bottleneck if there are many nodes that need to continuously talk), you should configure the physical layout for each so that jobs can sit on individual NUMA nodes.
Basically /sys/devices/system/node/ shows you how things are enumerated, and what belongs to what, and there are scripts to help you convert that
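To get a quick look at that enumeration, something like the following sketch prints each NUMA node together with the CPUs the kernel assigns to it, read from the cpulist file in each node directory. The helper name list_numa_nodes is made up. The optional argument only exists so you can point it at a copy of the sysfs tree rather than the live one. It does not produce a mom.layout file; it only shows the raw information you would base one on.

```shell
# Sketch: print each NUMA node and the CPUs that belong to it.
# list_numa_nodes is a hypothetical helper; the optional argument lets you
# point it at a copy of the sysfs tree instead of the live one.
list_numa_nodes() (
    root=${1:-/sys/devices/system/node}
    for dir in "$root"/node[0-9]*; do
        [ -d "$dir" ] || continue          # nothing matched: no NUMA info here
        printf '%s cpus=%s\n' "${dir##*/}" "$(cat "$dir/cpulist")"
    done
)

# Example (on the compute node itself): list_numa_nodes
```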
Jobs
Given environment
A job inherits environment from the queue management when it is run, including:
- PBS_JOBID - job identifier that was assigned
- often sequentially numbered, so an implicitly unique temporary directory name
- PBS_O_HOST - hostname that ran qsub
- PBS_O_WORKDIR - directory from which the job was submitted (an absolute path)
- logs go here(verify)
- PBS_SERVER - hostname of the pbs_server the job was submitted to
- PBS_O_QUEUE - queuename that was submitted to
- PBS_QUEUE - queuename that is being executed in (usually the same)
- PBS_JOBNAME - user-supplied job name
- PBS_JOBDIR - path job is staged and executed in
- PBS_ENVIRONMENT - execution mode; "PBS_INTERACTIVE" if interactive via -I, "PBS_BATCH" otherwise
- PBS_NODEFILE - file containing nodes assigned to this job
- PBS_O_PATH - PATH from submission environment
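A job script can use these variables directly. The sketch below prints a one-line summary at the start of a job; job_summary is a made-up helper name. It assumes the usual Torque behaviour where PBS_NODEFILE has one line per assigned processor slot, so a host appears once for each slot it contributes.

```shell
# Sketch of a job prologue that uses the PBS-provided variables.
# job_summary is a hypothetical helper name.
job_summary() (
    # PBS_NODEFILE usually has one line per assigned processor slot,
    # so hosts repeat; sort -u collapses them to distinct nodes.
    slots=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
    hosts=$(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')
    printf 'job %s: %s slots on %s node(s), submitted from %s:%s\n' \
        "$PBS_JOBID" "$slots" "$hosts" "$PBS_O_HOST" "$PBS_O_WORKDIR"
)

# Example (inside a job script): job_summary
```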
Controlling further environment
shell environment:
- have a shell file that .pbs files will source.
- may be the most controlled and admin-healthy
- -v namelist - environment variable to be exported from qsub environment to job environment (variable), or defined there (variable=value)
- -V requests that qsub copy all of its environment to the job's environment. Be careful with this.
Similarly, when working directory matters,
the most controlled way is to do it yourself. The given environment variables (see above) help here.
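A minimal sketch of that do-it-yourself approach: source a shared settings file, then change into the submission directory. The helper name enter_job_env and the settings file path are hypothetical. Falling back to the current directory is just a convenience for trying the script outside PBS.

```shell
# Sketch: set up environment and working directory explicitly.
# enter_job_env and the settings-file path are hypothetical.
enter_job_env() {
    env_file=$1
    # Source the shared settings only if the file is readable
    if [ -r "$env_file" ]; then
        . "$env_file"
    fi
    # Fall back to the current directory when not run under PBS
    cd "${PBS_O_WORKDIR:-.}" || return 1
}

# Example (near the top of a .pbs file):
# enter_job_env /path/to/shared/cluster_env.sh
```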
Management and status
qsub your_script
- submits a job
- often a script, e.g. one that starts an MPI job.
- You can hand in resource requests to qsub, or, more commonly, have them read from comments in the script file.
qstat (torque)
- list jobs in queue
- by default not so verbose, though has XML output (qstat -x) that can be useful for tooling
- the most useful job states:
- Q, queued, eligible to run
- H, held - (usually means dependency on another job, but can have other reasons)
- W, waiting, for its execution time
- R, running
- E, exiting after its run
- C, completed after having run (sticks around for a few minutes, just meant to report stuff that recently finished)
- qstat docs
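When writing tooling around qstat, a small lookup for the state letters listed above can make reports easier to read. This is only a sketch: describe_job_state is a made-up name, and it covers just the states mentioned in these notes.

```shell
# Sketch: translate a qstat job-state letter into a short description.
# describe_job_state is a hypothetical helper covering only the common states.
describe_job_state() {
    case $1 in
        Q) echo "queued, eligible to run" ;;
        H) echo "held" ;;
        W) echo "waiting for its execution time" ;;
        R) echo "running" ;;
        E) echo "exiting after its run" ;;
        C) echo "completed" ;;
        *) echo "unknown state: $1" ;;
    esac
}

# Example: describe_job_state R
```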
pbsnodes (torque)
- gives the status of individual nodes, including what job(-id)s they are running
- pbsnodes docs
- it also has XML output
showq (Maui)
- overview of queues, jobs, how much of the (known) cluster is occupied
qdel
- remove job you've given up on
- most commonly with the job ID
- qdel docs
TODO: read:
- http://wiki.ibest.uidaho.edu/index.php/Tutorial:_Submitting_a_job_using_qsub
- http://www.clusterresources.com/torquedocs21/users/2.1jobsubmission.shtml
- http://www.clusterresources.com/torquedocs21/users/2.2files.shtml
- http://www-theor.ch.cam.ac.uk/IT/servers/maui/maui-admin.html
Planning - job properties, resource requests
Most of these can be specified on the qsub, e.g.
qsub -l walltime=10:00 myscript.pbs
or can be parsed out of comments in the script you hand to qsub, e.g.
#!/bin/bash
#PBS -l walltime=10:00
do_stuff
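If you submit many similar jobs, it can help to generate the script, #PBS comment lines included. The following sketch prints such a script to stdout. make_job_script and its argument order are invented for the example, and the directives it emits (-N, and -l with nodes, ppn and walltime) are the ones described in these notes.

```shell
# Sketch: print a job script whose #PBS comment lines carry the requests.
# make_job_script and its argument order are hypothetical.
make_job_script() (
    name=$1; nodes=$2; ppn=$3; walltime=$4; shift 4
    # Unquoted heredoc: our variables expand now, \$ defers to job run time
    cat <<EOF
#!/bin/bash
#PBS -N ${name}
#PBS -l nodes=${nodes}:ppn=${ppn}
#PBS -l walltime=${walltime}
cd "\$PBS_O_WORKDIR"
$*
EOF
)

# Example: make_job_script demo 1 4 00:10:00 ./do_stuff > demo.pbs
#          qsub demo.pbs
```

Options given on the qsub command line generally take precedence over the ones in the script comments, so a generated script can still be adjusted at submit time.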
-l resource details
Mostly reserved/guaranteed resources, which will also hold back a job in the queue until this can be met, so most useful as a "don't run on a node that cannot easily deal." or "don't run until I can have a lot of nodes." Some specs can be platform specific. See man pbs_resources for detail.
Some of these are restrictions which you promise your work will observe. Some serve only as an indicator for people watching the queue, others as an "assume it's gone haywire and kill it automatically" limit (so always overestimate this a bit).
Examples:
- a number of nodes (nodes=3), number of processors per node (ppn=12) (default is 1 CPU on 1 node)
- host(s) by name (e.g. if you know it has favourable resources)
- hosts that have:
- arch - an architecture
- disk -
- maximum size of single file (file=)
- host - ...a certain name
- memory:
- pmem - physical memory used by any single process (maximum we would use)
- pvmem - process's virtual memory (maximum we would use)
- vmem - collective virtual memory (maximum we would use)
- time =hh:mm:ss
- walltime - wallclock time
- cput - CPU time (collective)
- pcput - CPU time (single process)
- a certain interconnect (e.g. infiniband or hippi)
- specific software installed
- a certain architecture
- -W additional job properties - you can e.g. ask jobs to start after another finishes, to start at the same time, and such.
- For example, to queue two jobs, and have the second wait for the first to finish (successfully or with an error)
# qsub first_job
7058.master
# qsub -W depend=afterany:7058.master second_job
- -a
- execute only after a particular date'n'time
- -q destinationqueue
- queue (named queue on default server)
- @server (default queue on named server)
- queue@server (named queue on named server)
- -N jobname
- some restrictions, because this
- -p priority (-1024..1023, default is 0), controls which jobs to favour within a queue
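The -W depend example above can be scripted, since qsub prints the new job's ID on stdout. The sketch below submits a list of scripts so that each waits for the previous one to finish, successfully or not. submit_chain is a made-up helper name, and the script names in the example are the placeholders from the example above.

```shell
# Sketch: submit scripts so each starts only after the previous one ends.
# submit_chain is a hypothetical helper; it relies on qsub printing the new
# job's ID on stdout.
submit_chain() (
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            prev=$(qsub "$script") || exit 1
        else
            prev=$(qsub -W "depend=afterany:$prev" "$script") || exit 1
        fi
        echo "$script -> $prev"     # report which ID each script got
    done
)

# Example: submit_chain first_job second_job
```

If a submission fails partway, the jobs already queued stay queued, so check qstat before resubmitting.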
Execution environment
- -w working dir (sets $PBS_O_WORKDIR)
- -u allows you to control under which user a job is run
- can be specified per host
- defaults to user running qsub
Changing your mind
For example, if you discover your job will run longer than you thought:
qalter -l walltime=24:00:00 jobid
...though increasing the allocated time may be disabled for fairness, in which case overestimating beforehand is a better idea.
Node management
Interesting commands: pbsnodes and qmgr
List nodes with interesting states (e.g. down, offline): pbsnodes -l
Mark as offline: pbsnodes -o nodename
Return to service:
- Nicely: pbsnodes -r nodename (clears offline, marks down, then marks as free only when it pings fine)
- Forcefully: pbsnodes -c nodename
You can add a note on nodes, most useful when they are offline, down or unknown
Node states include:
- offline - marked as unavailable for jobs
- down - doesn't seem to be reachable
- up states:
- job-exclusive
- job-sharing
- reserve
- free - available for jobs
- busy
- time-shared
- state-unknown - e.g. happens when there are communication errors. (seen with down?)
Additionally valid when filtering:
- up (combination of: job-exclusive, job-sharing, reserve, free, busy, time-shared)
- active (combination of: job-exclusive, job-sharing, busy)
- all
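The filter groups listed above are just sets of states, which is easy to mirror in your own tooling. A sketch, with node_state_in_group as a made-up helper name: it returns success when the given state belongs to the given group, and uses only the groupings from these notes.

```shell
# Sketch: test whether a node state belongs to one of the filter groups.
# node_state_in_group is a hypothetical helper.
# Returns 0 if the state is in the group, 1 if not, 2 for an unknown group.
node_state_in_group() {
    state=$1; group=$2
    case $group in
        up)     members="job-exclusive job-sharing reserve free busy time-shared" ;;
        active) members="job-exclusive job-sharing busy" ;;
        all)    return 0 ;;
        *)      return 2 ;;
    esac
    # Pad with spaces so only whole-word matches count
    case " $members " in
        *" $state "*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Example: node_state_in_group free up && echo "node can take work"
```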
Unsorted
It seems that jobs will sometimes be shown in showq as using fewer processors than they really are. I haven't figured out why yet.
problems
Invalid credential MSG=Hosts do not match
...probably due to the way the trqauthd daemon works, which works with the value it reads from server_name.
Specifically, if it's different from the FQDN, auth will fail.
The typical fix is to edit server_name to reflect the FQDN and restart trqauthd.
This may require doing your name resolution differently.
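A quick way to check for this mismatch is to compare the first line of server_name with what the host reports as its FQDN. The sketch below does that. check_server_name is a made-up helper; the optional second argument defaults to the output of hostname -f and is mainly there so the comparison can be tried with an explicit name.

```shell
# Sketch: compare server_name against this host's FQDN.
# check_server_name is a hypothetical helper; run it on the head node.
check_server_name() (
    file=$1
    fqdn=${2:-$(hostname -f)}       # only calls hostname if no name is given
    configured=$(head -n 1 "$file")
    if [ "$configured" = "$fqdn" ]; then
        echo "ok: $configured"
    else
        echo "mismatch: server_name has '$configured', FQDN is '$fqdn'"
        exit 1
    fi
)

# Example: check_server_name "$TORQUEHOME/server_name"
```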
Server has no node list MSG=node list is empty - check 'server_priv/nodes' file
That.
Edit it and restart pbs_server.
May be due to name resolution issues (verify)
Node nodename isn't declared to be NUMA, but mom is reporting
The node has multiple physical CPUs (e.g. dual Xeon, blade, etc), and pbs_server isn't configured to see it as such.
You probably want to check that that was intentional.
pbsnodes shows state = down
Nothing ever queued
qdel: Server could not connect to MOM
This error usually means the master can't contact the governing process on the compute node (node is down, there is a networking problem, or such), so it cannot request it to stop.
If you think it is a transient problem, wait and retry.
If you know the job really is completely gone from the node, e.g. because this node (or all the nodes) rebooted, you can do qdel -p jobid, which basically clears the master's knowledge of what is happening on the nodes.
Note: If there are active nodes with processes still working, they will happily run on without the queue manager knowing about it. At best these nodes will be slower, at worst your jobs will fail.
ssh errors in PBS epilogue
You can check what your specific epilogue script is doing by finding it and reading it.
If it's the standard script then it'll be the "Killing leftovers..." bit, which for these purposes is just running a command on nodes using ssh. (from a script and non-interactively, which can help cause this. Debugging may be interesting as ssh and (pbs)psh may act differently)
In our case, the problem was that root did not have a keypair set up
(the netboot image lacked a /root/.ssh directory completely, in fact).
There are subtler reasons.