4. Job Submission System

This section provides information on more advanced usage of the job submission system on ARCHER. It covers:

  • How to run multiple, concurrent parallel jobs in a single job submission script;
  • How to run multiple, concurrent parallel jobs using job arrays;
  • How to use Perl or Python to write job submission scripts.

The basics of running jobs through the job submission system on ARCHER can be found in the ARCHER User Guide.

4.1 Multiple 'aprun' commands in a single job script

One of the most efficient ways of running multiple simulations in parallel on ARCHER is to use a single job submission script. This is achieved by placing multiple 'aprun' commands in the script and requesting enough resources from the batch system to run them all in parallel.

The examples in this section all assume you are using the bash shell for your job submission script but the principles are easily adapted to Perl, Python or other shells.

This technique is particularly useful if you have many jobs that each use a small number of cores and that you want to run simultaneously, as the whole set appears to the batch system as a single large job and is thus easier to schedule.

Note: each 'aprun' command must run on a separate compute node as ARCHER only allows exclusive node access. This means you cannot use this technique to run multiple instances of a program on a single compute node.

Note: the maximum number of 'aprun' commands you can run in parallel in a single job is 500. (The maximum number of processes in a single job is 600: if you have a complicated script that has many other background processes in addition to 500 'aprun' commands then you may exceed this process limit.)

Note: when you run multiple simulations in parallel, the job must wait until the longest simulation finishes. This will cause some wasted allocation unless your simulations each take exactly the same time. For example, with 64 nodes per simulation and 16 apruns in parallel, if 15 simulations take 1 hour and one takes 2 hours, you will do (15 * 1 + 1 * 2 ) * 64 = 1088 node hours of useful work and waste (15 * 1 * 64) = 960 node hours. Similarly, if any of your simulations fail, you will waste the remaining allocation for that simulation while the other simulations run to completion.

4.1.1 Requesting the correct number of cores

The total number of cores requested for a job of this type is the sum of the number of cores required by all the simulations in the script. For example, if you have 16 simulations which each run using 64 nodes (1536 cores), then you would need to ask for 1024 nodes (24576 cores in total).
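
As a minimal sketch, the resource request line in the job script for this hypothetical 16-simulation example would therefore be:

# 16 simulations x 64 nodes each = 1024 nodes in total
#PBS -l select=1024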

4.1.2 Multiple 'aprun' syntax

The difference between specifying a single 'aprun' command and specifying multiple 'aprun' commands in your job submission script is that each aprun command must be run in the background (i.e. appended with an &) and there must be a 'wait' command after the final aprun command. For example, to run 4 CP2K simulations which each use 64 nodes (256 nodes in total) and 24 cores per node:

cd $basedir/simulation1/
aprun -n 1536 cp2k.popt < input1.cp2k > output1.cp2k &
cd $basedir/simulation2/
aprun -n 1536 cp2k.popt < input2.cp2k > output2.cp2k &
cd $basedir/simulation3/
aprun -n 1536 cp2k.popt < input3.cp2k > output3.cp2k &
cd $basedir/simulation4/
aprun -n 1536 cp2k.popt < input4.cp2k > output4.cp2k &

# Wait for all simulations to complete
wait

Of course, this could have been achieved more concisely using a loop:

for i in {1..4}; do
  cd $basedir/simulation${i}/
  aprun -n 1536 cp2k.popt < input${i}.cp2k > output${i}.cp2k &
done

# Wait for all simulations to complete
wait

4.1.3 Example job submission script

This job submission script runs 16 64-node CP2K simulations in parallel, with the input in the directories 'simulation1', 'simulation2', etc.:

#!/bin/bash --login
# The jobname
#PBS -N your_job_name

# The total number of nodes for your job.
#    This is the sum of the number of nodes required by each
#    of the aprun commands you are using. In this example we have
#    16 * 64 = 1024 nodes
#PBS -l select=1024

# Specify the wall clock time required for your job.
#    In this example we want 6 hours 
#PBS -l walltime=6:0:0

# Specify which budget account that your job will be charged to.
#PBS -A your_budget_account               

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# The base directory is the directory that the job was submitted from.
# All simulations are in subdirectories of this directory.
basedir=$PBS_O_WORKDIR

# Load the cp2k module
module add cp2k

# Set the number of threads to 1
#   This prevents any system libraries from automatically 
#   using threading.
export OMP_NUM_THREADS=1

# Loop over simulations, running them in the background
for i in {1..16}; do
   # Change to the directory for this simulation
   cd $basedir/simulation${i}/
   aprun -n 1536 cp2k.popt < input${i}.cp2k > output${i}.cp2k &
done

# Wait for all jobs to finish before exiting the job submission script
wait
exit 0

In this example, it is assumed that all of the input for the simulations has been set up prior to submitting the job. Of course, in reality, you may find it more useful for the job submission script to prepare the input for each simulation programmatically before the corresponding aprun command.
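
For example, a minimal sketch of generating each input file from a template inside the loop (the template file 'template.cp2k' and the '@TEMP@' placeholder are hypothetical and purely for illustration):

# Loop over simulations, generating the input then running in the background
for i in {1..16}; do
   # Change to the directory for this simulation
   cd $basedir/simulation${i}/
   # Create this simulation's input by substituting a value into the template
   sed "s/@TEMP@/${i}00/" $basedir/template.cp2k > input${i}.cp2k
   aprun -n 1536 cp2k.popt < input${i}.cp2k > output${i}.cp2k &
done

# Wait for all simulations to complete
wait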

4.1.4 Oversubscribing nodes

Using the ALPS scheduler to queue up aprun commands (oversubscription) is no longer recommended. Instead, we recommend that you use a simple polling loop to work through a large set of simulations on a smaller set of nodes.

Remember that you cannot use aprun to run more than one application on a single node at the same time, and no single aprun command can request more nodes than the job has in total (doing so will crash the job with an apsched error).

This can be useful for efficiently using resources if you have many simulations that all have different run times. Without additional commands to run, the nodes that finished work first would be idle for the rest of the job, wasting resources. However, once there are no more simulations queued, you will wait until the longest remaining simulation finishes and this may waste resources, as described above.

As always, if you are setting up a workflow like this, we strongly recommend that you test the setup exhaustively before using it in production due to the potential to waste large amounts of compute time if the scripting is incorrect.

This is an example PBS job submission script to use a polling loop for 1000 simulations, each running on one node (24 MPI processes), and with 100 nodes requested. This script is for the simplest case, where each job requires the same number of nodes.

#!/bin/bash --login

# An alternative to using the aprun scheduler to queue up the waiting
# apruns is to execute a polling loop in the script (the version of
# bash we have on ARCHER does not have wait -n).

# This is an example PBS job submission script to use a polling loop
# for 1000 simulations, each running on one node (24 MPI processes),
# and with 100 nodes requested.

# PBS job options (name, compute nodes (each node has 24 cores), job time)
# PBS -N is the job name (e.g. Polling_Job)
#PBS -N Polling_Job
# PBS -l select is the number of nodes requested (e.g. 100 nodes=2400
# cores).  This should be p * number of nodes used in each simulation.
# In this example p = 100 and number of nodes used in each simulation
# = 1
#PBS -l select=100
# PBS -l walltime, maximum walltime allowed (e.g. 20 minutes)
#PBS -l walltime=00:20:00

# Replace [budget code] below with your project code (e.g. t01)
#PBS -A [budget code]

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)               

# Change to the directory that the job was submitted from
# (remember this should be on the /work filesystem)
cd $PBS_O_WORKDIR

# Set the number of threads to 1.  This prevents any system libraries
# from automatically using threading.
export OMP_NUM_THREADS=1

n=1000
p=100 # Maximum value 500.  "PBS -l select" = p * number of nodes used
      # in each simulation
if ((p > 500)); then
    echo >&2
    echo >&2 "$p apruns in parallel were requested."
    echo >&2 'The maximum number of apruns in parallel is 500.'
    echo >&2 'Please re-submit with p <= 500 and adjust the number of nodes'
    echo >&2 'in the "PBS -l select" statement.'
    echo >&2
    exit 1
fi
poll=60 # Seconds.  Minimum value 1.  Choose this to be a few percent
	# of your typical execution time, e.g. if each executable
	# takes about an hour to run, set poll to 60 seconds.

# The polling loop has to be in a sub-shell (i.e., between "(" and ")"
# ) so that only the background jobs associated with the apruns are
# counted.  And this means that you should not have any other
# background jobs in the sub-shell.
(
    for ((i=0; i<n; i++)); do
	while (($(jobs -r | wc -l) >= p)); do
	    sleep $poll
	done

# This version runs everything in the current directory. The
# executables are called job0, job1, ..., job999.
	aprun -n 24 job$i &
	# If the executable reads standard input, then standard input
	# must be redirected, e.g.
	# aprun -n 24 job$i < input$i
	# If the executable writes standard output or standard error,
	# then these should be redirected, e.g.
	# aprun -n 24 job$i > output$i 2> error$i

## This version has one directory per aprun, running an executable
## called job in that directory (called directory0, directory1, ...,
## directory999).  The directories must already exist.
#	(cd directory$i; aprun -n 24 job) &
#	# If the executable reads standard input, then standard input
#	# must be redirected, e.g.
#	# aprun -n 24 job < input
#	# If the executable writes standard output or standard error,
#	# then these should be redirected, e.g.
#	# aprun -n 24 job > output 2> error

    done
    wait # for the last p apruns to finish
)

4.2 Job arrays

Often, you will want to run the same job submission script multiple times in parallel for many different input parameters. Job arrays provide a mechanism for doing this without the need to issue multiple 'qsub' commands and without the penalty of having large numbers of jobs appearing in the queue.

The number of array elements has been capped at 32 per job. This still allows any user to submit 512 jobs at a time.

4.2.1 Example job array submission script

Each job instance in the job array is able to access its unique array index through the environment variable $PBS_ARRAY_INDEX.

This can be used to programmatically select which set of input parameters you want to use. One common way to use job arrays is to place the input for each job instance in a separate subdirectory which has a number as part of its name. For example, if you have 10 sets of input in ten subdirectories called job01, job02, …, job10 then you would be able to use the following script to run a job array that runs each of these jobs:

#!/bin/bash --login
#PBS -r y
#PBS -N your_job_name
#PBS -l select=64
#PBS -l walltime=00:20:00
# Specify which budget account that your job will be charged to.
#PBS -A your_budget_account               

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from.
cd $PBS_O_WORKDIR

# Set the number of threads to 1
#   This prevents any system libraries from automatically 
#   using threading.
export OMP_NUM_THREADS=1

# Get the subdirectory name for this job instance in the array
jobid=`printf "%02d" $PBS_ARRAY_INDEX`
jobdir="job$jobid"

# Change to the subdirectory for this job instance in the array
cd $jobdir

# Run this job instance in its subdirectory
echo "Running $jobname"
aprun -n 1536  ./my_mpi_executable.x arg1 arg2 > my_stdout.txt 2> my_stderr.txt

4.2.2 Submitting job arrays

The '-J' option to the 'qsub' command is used to submit a job array under PBSPro. For example, to submit a job array consisting of 10 instances, numbered from 1 to 10, you would use the command:

qsub -J 1-10 array_job_script.pbs

Note: the array indices must lie in the range 1, ..., 32, as there is an upper limit of 32 array elements per job.

You can also specify a stride other than 1 for array jobs. For example, to submit a job array consisting of 5 instances, numbered 2, 4, 6, 8, and 10, you would use the command:

qsub -J 2-10:2 array_job_script.pbs

4.2.3 Interacting with individual job instances in an array

You can refer to individual job instances in a job array by using their array index. For example, to delete just the job instance with array index 5 from the batch system (assuming your job ID is 1234), you would use:

qdel 1234[5]
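
You can also expand a job array into its individual instances when querying the queue; under PBSPro this is typically done with the '-t' option to 'qstat' (the job ID 1234 here is illustrative):

qstat -t 1234[]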

4.3 Writing job submission scripts in Perl and Python

It can often be useful to be able to use the features of Perl and/or Python to write more complex job submission scripts. The richer programming environment, compared to standard shell scripts, can make it easier to dynamically generate input for jobs or to put complex workflows together.

Please note that the examples provided in this section are simple enough that they could easily have been written in bash, but they provide the information needed to write your own, more complex, job submission scripts in Perl and Python.

You submit Perl and Python job submission scripts using 'qsub' as for standard jobs.
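
For example (the script names here are purely illustrative):

qsub my_perl_job.pl
qsub my_python_job.py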

4.3.1 Example Perl job submission script

This example script shows how to run a CP2K job using Perl. It illustrates the necessary system calls to change directories and load modules within a Perl script but does not contain any program complexity.

#!/usr/bin/perl

# The jobname
#PBS -N your_job_name

# The total number of nodes for your job.
# The example requires 64 nodes
#PBS -l select=64

# The walltime needed
#PBS -l walltime=24:00:00

# Set the budget to charge the job to. Change to your code
#PBS -A budget

# Set the number of MPI tasks
my $mpiTasks = 1536;

# Set the executable name and input and output files
my $execName = "cp2k.popt";
my $inputName = "input";
my $outputName = "output";
my $runCode = "$execName < $inputName > $outputName";

# Set up the string to run our job
my $aprunString = "aprun -n $mpiTasks $runCode";

# Set the command to load the cp2k module
#   This is more complicated in Perl as we cannot access the 
#   'module' command directly so we need to use a set of commands
#   to make sure the subshell that runs the aprun process has the 
#   correct environment setup. This string will be prepended to the
#   aprun command
my $moduleString = "source /etc/profile; module load cp2k;";

# Change to the directory the job was submitted from
chdir($ENV{'PBS_O_WORKDIR'});

# Run the job
#    This is a combination of the module loading string and the
#    actual aprun command. Both of these are set above.
system("$moduleString  $aprunString");

# Exit the job
exit(0);

4.3.2 Example Python job submission script

This section has been updated with instructions based on material from iVEC.

This example script shows how to run a CP2K job using Python. It illustrates the necessary system calls to change directories and load modules within a Python script but does not contain any program complexity.

#!/usr/bin/env python

# The jobname
#PBS -N template
# The total number of nodes for your job.
#PBS -l select=1
# The walltime needed
#PBS -l walltime=00:20:00
# Set the budget to charge the job to (change "budget" to your code)
#PBS -A budget
# Queue to use (e.g. short queue for testing).  This line is not
# necessary if the standard queue is used
#PBS -q short

# Import the Python modules required for system operations
# To get environment variables
import os
# To run shell commands
import subprocess

# This short function adds a module to your environment
def load_module(moduleloc, modulename):
  p = subprocess.Popen(
        "{0}/bin/modulecmd python load {1}".format(moduleloc, modulename),
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True
      )
  stdout,stderr = p.communicate()
  exec stdout
  
# This short function removes a module from your environment
def unload_module(moduleloc, modulename):
  p = subprocess.Popen(
        "{0}/bin/modulecmd python unload {1}".format(moduleloc, modulename),
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True
      )
  stdout,stderr = p.communicate()
  exec stdout
  
# This short function switches modules in your environment
def switch_module(moduleloc, modulename_from, modulename_to):
  p = subprocess.Popen(
        "{0}/bin/modulecmd python switch {1} {2}".format(moduleloc, modulename_from, modulename_to),
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True
      )
  stdout,stderr = p.communicate()
  exec stdout
  
# Get the location of the modules installation
moduleloc = os.environ["MODULESHOME"]

# Make sure that the modules environment is setup correctly
execfile("{0}/init/python.py".format(moduleloc))

# Set any modules required for the run (note, these are for
# illustration and may not work correctly for cp2k)
switch_module(moduleloc, "PrgEnv-cray", "PrgEnv-intel")
#unload_module(moduleloc, "some_module_to_unload")
load_module(moduleloc, "cp2k")

# Print a summary of the loaded modules
print os.environ['LOADEDMODULES']

# Change to the directory the job was submitted from
os.chdir(os.environ["PBS_O_WORKDIR"])

# Set the number of MPI tasks
mpiTasks = 24

# This section sets up a list with the aprun command
runcommand = []
# aprun to launch jobs and command line options
runcommand.append("aprun")
runcommand.append("-n {0}".format(mpiTasks))

# Set the executable name and input and output files
execName = "cp2k.popt"
inputName = "input"

# Add executable name to the command
runcommand.append(execName)

print runcommand
p = subprocess.Popen(runcommand, stdin=subprocess.PIPE,
            stdout=subprocess.PIPE, stderr=subprocess.PIPE)

inputFile = open(inputName, 'r')
inputData = inputFile.read()
inputFile.close()
stdout, stderr = p.communicate(input=inputData)

# Write the captured standard output to the 'output' file
outputFile = open("output", 'w')
outputFile.write(stdout)
outputFile.close()

4.4 Job chaining

Often users want to chain multiple jobs together to produce a coherent run made up of multiple job instances, for example, if you are running a simulation that generates a long trajectory.

There are a number of different ways to do this, each with its own advantages and disadvantages. We will cover two techniques that are commonly used on ARCHER:

  • 4.4.1 Submit next job from end of preceding job
  • 4.4.2 Using PBS job dependencies

4.4.1 Submit next job from end of preceding job

You do this by simply inserting a qsub command at the end of your job submission script that submits the next job in the chain, e.g.:

cd ../job2
qsub job2.pbs

You should be aware that if the job is stopped due to running over its walltime, or if it crashes, then the qsub command will never be reached and hence the chain will be broken.

To avoid the problem of running out of walltime you can use the leave_time utility to stop a running application a certain amount of time before the end of the job, giving you time to submit the next job in the chain.

Note: this mechanism has the potential to produce a never-ending stream of jobs unless it is carefully controlled. You should ensure that your scripts include checks that the jobs are working correctly before submitting the next job in the chain.
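
A minimal sketch of such a check, assuming the application writes a recognisable marker line (here the hypothetical string 'JOB DONE') to its output file when it completes cleanly:

# Only submit the next job in the chain if this run finished cleanly
if grep -q "JOB DONE" output.cp2k; then
    cd ../job2
    qsub job2.pbs
else
    echo "Run did not complete cleanly - job chain stopped" >&2
fi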

4.4.2 Using PBS job dependencies

PBS allows you to specify dependencies between particular jobs, so a job could be submitted that will not be eligible to run until another specified job has completed successfully.

You use the "-W" option to qsub to specify job dependencies. Any jobs submitted with dependencies will be placed into a Hold state until the dependencies are met.

There are a number of different dependency options but the most common dependency used is "afterok"; this executes the current job only after the listed jobs have terminated without error.

For example, to submit a job that is dependent on job 113567.sdb completing OK, you would use:

qsub -W depend=afterok:113567 submit.pbs

(where submit.pbs is your job submission script). The submitted job would go into the queue and be placed in a Hold state until job 113567.sdb had completed successfully. If job 113567.sdb fails, then you would need to manually delete the dependent job (or modify the dependency using qalter).
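
Since 'qsub' prints the ID of the job it has just submitted on standard output, you can also build a short chain from the command line by capturing that ID; a minimal sketch (the script names are illustrative):

JOBID=$(qsub first_job.pbs)
qsub -W depend=afterok:$JOBID second_job.pbs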

You can find more information on PBS job dependencies and the full list of dependency options in the man page for qsub. This can be accessed on ARCHER with the command:

man qsub

Note: you should always test any mechanism that uses job dependencies before using it in production runs as the potential exists to waste large amounts of computational resource if dependent jobs do not execute as expected.