4. Job Submission System
This section provides information on more advanced usage of the job submission system on ARCHER. It covers:
- How to run multiple, concurrent parallel jobs in a single job submission script;
- How to run multiple, concurrent parallel jobs using job arrays;
- How to use Perl or Python to write job submission scripts.
The basics of running jobs through the job submission system on ARCHER can be found in the ARCHER User Guide.
4.1 Multiple 'aprun' commands in a single job script
One of the most efficient ways of running multiple simulations in parallel on ARCHER is to use a single job submission script to run multiple simulations. This can be achieved by having multiple 'aprun' commands in a single script and requesting enough resources from the batch system to run them in parallel.
The examples in this section all assume you are using the bash shell for your job submission script but the principles are easily adapted to Perl, Python or other shells.
This technique is particularly useful if you have many jobs, each using a small number of cores, that you want to run simultaneously: the collection looks to the batch system like a single large job and is thus easier to schedule.
Note: each 'aprun' command must run on a separate compute node as ARCHER only allows exclusive node access. This means you cannot use this technique to run multiple instances of a program on a single compute node.
Note: the maximum number of 'aprun' commands you can run in parallel in a single job is 500. (The maximum number of processes in a single job is 600: if you have a complicated script that has many other background processes in addition to 500 'aprun' commands then you may exceed this process limit.)
Note: when you run multiple simulations in parallel, the job must wait until the longest simulation finishes. This will cause some wasted allocation unless your simulations each take exactly the same time. For example, with 64 nodes per simulation and 16 apruns in parallel, if 15 simulations take 1 hour and one takes 2 hours, you will do (15 * 1 + 1 * 2 ) * 64 = 1088 node hours of useful work and waste (15 * 1 * 64) = 960 node hours. Similarly, if any of your simulations fail, you will waste the remaining allocation for that simulation while the other simulations run to completion.
4.1.1 Requesting the correct number of cores
The total number of cores requested for a job of this type is the sum of the number of cores required by all the simulations in the script. For example, if you have 16 simulations which each run using 64 nodes (1536 cores), then you would need to ask for 1024 nodes (24576 cores in total).
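Expressed in PBS terms (using the numbers above):

# 16 simulations x 64 nodes each = 1024 nodes (24576 cores)
#PBS -l select=1024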
4.1.2 Multiple 'aprun' syntax
The difference between specifying a single aprun command and specifying multiple 'aprun' commands in your job submission script is that each aprun command must be run in the background (i.e. appended with an &) and there must be a 'wait' command after the final aprun command. For example, to run 4 CP2K simulations which each use 64 nodes (256 nodes in total) and 24 cores per node:
cd $basedir/simulation1/
aprun -n 1536 cp2k.popt < input1.cp2k > output1.cp2k &
cd $basedir/simulation2/
aprun -n 1536 cp2k.popt < input2.cp2k > output2.cp2k &
cd $basedir/simulation3/
aprun -n 1536 cp2k.popt < input3.cp2k > output3.cp2k &
cd $basedir/simulation4/
aprun -n 1536 cp2k.popt < input4.cp2k > output4.cp2k &

# Wait for all simulations to complete
wait
Of course, this could be achieved more concisely using a loop:
for i in {1..4}; do
  cd $basedir/simulation${i}/
  aprun -n 1536 cp2k.popt < input${i}.cp2k > output${i}.cp2k &
done

# Wait for all simulations to complete
wait
4.1.3 Example job submission script
This job submission script runs sixteen 64-node CP2K simulations in parallel, with the input in the directories 'simulation1', 'simulation2', etc.:
#!/bin/bash --login

# The jobname
#PBS -N your_job_name

# The total number of nodes for your job.
# This is the sum of the number of nodes required by each
# of the aprun commands you are using. In this example we have
# 16 * 64 = 1024 nodes
#PBS -l select=1024

# Specify the wall clock time required for your job.
# In this example we want 6 hours
#PBS -l walltime=6:0:0

# Specify which budget account that your job will be charged to.
#PBS -A your_budget_account

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# The base directory is the directory that the job was submitted from.
# All simulations are in subdirectories of this directory.
basedir=$PBS_O_WORKDIR

# Load the cp2k module
module add cp2k

# Set the number of threads to 1
# This prevents any system libraries from automatically
# using threading.
export OMP_NUM_THREADS=1

# Loop over simulations, running them in the background
for i in {1..16}; do
  # Change to the directory for this simulation
  cd $basedir/simulation${i}/
  aprun -n 1536 cp2k.popt < input${i}.cp2k > output${i}.cp2k &
done

# Wait for all jobs to finish before exiting the job submission script
wait

exit 0
In this example, it is assumed that all of the input for the simulations has been set up prior to submitting the job. In reality, you may find it more useful for the job submission script to programmatically prepare the input for each simulation before its aprun command, as in the sketch below.
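For example, a minimal sketch of preparing input on the fly (the template file 'template.cp2k', the %TEMP% placeholder, and the sed substitution are illustrative assumptions, not CP2K conventions):

for i in {1..16}; do
  cd $basedir/simulation${i}/
  # Hypothetical input preparation: fill a placeholder in a shared template
  sed "s/%TEMP%/${i}00/" $basedir/template.cp2k > input${i}.cp2k
  aprun -n 1536 cp2k.popt < input${i}.cp2k > output${i}.cp2k &
done

# Wait for all simulations to complete
wait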
4.1.4 Oversubscribing nodes
Using the ALPS scheduler to queue up aprun commands (oversubscription) is no longer recommended. Instead, we recommend you use a simple polling loop to work through a large set of simulations being processed on a smaller set of nodes.
Remember that you cannot use aprun to run more than one application on a single node at the same time, and no single aprun command can ask for more nodes than the job has in total (doing so will crash the job with an apsched error).
This can be useful for efficiently using resources if you have many simulations that all have different run times. Without additional commands to run, the nodes that finished work first would be idle for the rest of the job, wasting resources. However, once there are no more simulations queued, you will wait until the longest remaining simulation finishes and this may waste resources, as described above.
As always, if you are setting up a workflow like this, we strongly recommend that you test the setup exhaustively before using it in production due to the potential to waste large amounts of compute time if the scripting is incorrect.
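For example, one cheap way to test the polling logic itself (a sketch: the backgrounded sleep commands stand in for the real aprun commands) is:

n=8; p=2; poll=1
(
  for ((i=0; i<n; i++)); do
    while (($(jobs -r | wc -l) >= p)); do
      sleep $poll
    done
    # Stand-in for "aprun -n 24 job$i &": a short random sleep
    sleep $((RANDOM % 5 + 1)) &
  done
  wait
)
echo "All stand-in jobs finished"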
This is an example PBS job submission script to use a polling loop for 1000 simulations, each running on one node (24 MPI processes), and with 100 nodes requested. This script is for the simplest case, where each job requires the same number of nodes.
#!/bin/bash --login

# An alternative to using the ALPS scheduler to queue up the waiting
# apruns is to execute a polling loop in the script (the version of
# bash we have on ARCHER does not have wait -n).
#
# This is an example PBS job submission script to use a polling loop
# for 1000 simulations, each running on one node (24 MPI processes),
# and with 100 nodes requested.

# PBS job options (name, compute nodes (each node has 24 cores), job time)

# PBS -N is the job name (e.g. Polling_Job)
#PBS -N Polling_Job

# PBS -l select is the number of nodes requested (e.g. 100 nodes = 2400
# cores). This should be p * the number of nodes used in each simulation.
# In this example p = 100 and the number of nodes used in each
# simulation = 1
#PBS -l select=100

# PBS -l walltime, maximum walltime allowed (e.g. 20 minutes)
#PBS -l walltime=00:20:00

# Replace [budget code] below with your project code (e.g. t01)
#PBS -A [budget code]

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
# (remember this should be on the /work filesystem)
cd $PBS_O_WORKDIR

# Set the number of threads to 1. This prevents any system libraries
# from automatically using threading.
export OMP_NUM_THREADS=1

n=1000
p=100   # Maximum value 500. "PBS -l select" = p * number of nodes used
        # in each simulation

if ((p > 500)); then
  echo >&2
  echo >&2 "$p apruns in parallel were requested."
  echo >&2 'The maximum number of apruns in parallel is 500.'
  echo >&2 'Please re-submit with p <= 500 and adjust the number of nodes'
  echo >&2 'in the "PBS -l select" statement.'
  echo >&2
  exit 1
fi

poll=60 # Seconds. Minimum value 1. Choose this to be a few percent
        # of your typical execution time, e.g. if each executable
        # takes about an hour to run, set poll to 60 seconds.

# The polling loop has to be in a sub-shell (i.e. between "(" and ")")
# so that only the background jobs associated with the apruns are
# counted. This means that you should not have any other background
# jobs in the sub-shell.
(
  for ((i=0; i<n; i++)); do
    while (($(jobs -r | wc -l) >= p)); do
      sleep $poll
    done

    # This version runs everything in the current directory. The
    # executables are called job0, job1, ..., job999.
    aprun -n 24 job$i &

    # If the executable reads standard input, then standard input
    # must be redirected, e.g.
    #   aprun -n 24 job$i < input$i &
    # If the executable writes standard output or standard error,
    # then these should be redirected, e.g.
    #   aprun -n 24 job$i > output$i 2> error$i &

    ## This version has one directory per aprun, running an executable
    ## called job in that directory (the directories are called
    ## directory0, directory1, ..., directory999 and must already exist).
    # (cd directory$i; aprun -n 24 job) &
    ## If the executable reads standard input, then standard input
    ## must be redirected, e.g.
    ##   (cd directory$i; aprun -n 24 job < input) &
    ## If the executable writes standard output or standard error,
    ## then these should be redirected, e.g.
    ##   (cd directory$i; aprun -n 24 job > output 2> error) &
  done

  wait # for the last p apruns to finish
)
4.2 Job arrays
Often, you will want to run the same job submission script multiple times in parallel for many different input parameters. Job arrays provide a mechanism for doing this without the need to issue multiple 'qsub' commands and without the penalty of having large numbers of jobs appearing in the queue.
The number of array elements has been capped at 32 per job. This still allows any user to submit 512 jobs at a time.
4.2.1 Example job array submission script
Each job instance in the job array is able to access its unique array index through the environment variable $PBS_ARRAY_INDEX.
This can be used to programmatically select which set of input parameters you want to use. One common way to use job arrays is to place the input for each job instance in a separate subdirectory which has a number as part of its name. For example, if you have 10 sets of input in ten subdirectories called job01, job02, …, job10 then you would be able to use the following script to run a job array that runs each of these jobs:
#!/bin/bash --login
#PBS -r y
#PBS -N your_job_name
#PBS -l select=64
#PBS -l walltime=00:20:00

# Specify which budget account that your job will be charged to.
#PBS -A your_budget_account

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from.
cd $PBS_O_WORKDIR

# Set the number of threads to 1
# This prevents any system libraries from automatically
# using threading.
export OMP_NUM_THREADS=1

# Get the subdirectory name for this job instance in the array
jobid=$(printf "%02d" $PBS_ARRAY_INDEX)
jobdir="job$jobid"

# Change to the subdirectory for this job instance in the array
cd $jobdir

# Run this job instance in its subdirectory
echo "Running $jobdir"
aprun -n 1536 ./my_mpi_executable.x arg1 arg2 > my_stdout.txt 2> my_stderr.txt
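For instance, a minimal sketch of setting up the ten subdirectories beforehand (the 'template' directory and shared 'input.dat' file are illustrative assumptions):

# Create job01 ... job10 and copy a shared input file into each
for i in $(seq -w 1 10); do
  mkdir -p job$i
  cp template/input.dat job$i/
done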
4.2.2 Submitting job arrays
The '-J' option to the 'qsub' command is used to submit a job array under PBSPro. Remember that the index range passed to '-J' can cover at most 32 values, as there is an upper limit of 32 array elements per job. For example, to submit a job array consisting of 10 instances, numbered from 1 to 10, you would use the command:
qsub -J 1-10 array_job_script.pbs
You can also specify a stride other than 1 for array jobs. For example, to submit a job array consisting of 5 instances, numbered 2, 4, 6, 8, and 10 you would use the command:
qsub -J 2-10:2 array_job_script.pbs
4.2.3 Interacting with individual job instances in an array
You can refer to individual job instances in a job array by using their array index. For example, to delete just the job instance with array index 5 from the batch system (assuming your job ID is 1234), you would use:
qdel 1234[5]
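You can also inspect the state of the individual instances in an array. Assuming standard PBSPro behaviour (the -t option expands array subjobs in the qstat listing), you could use:

qstat -t 1234[]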
4.3 Job chaining
Often users want to chain multiple jobs together to produce a coherent run made up of multiple job instances, for example, if you are running a simulation that generates a long trajectory.
There are a number of different ways to do this, each with its own advantages and disadvantages. We will cover two techniques that are commonly used on ARCHER:
- 4.3.1 Submit next job from end of preceding job
- 4.3.2 Using PBS job dependencies
4.3.1 Submit next job from end of preceding job
You do this by simply inserting a qsub command at the end of your job submission script that submits the next job in the chain, e.g.:
cd ../job2
qsub job2.pbs
You should be aware that if the job is stopped because it overruns its walltime, or if it crashes, then the qsub statement will never be reached and the chain will be broken.
To avoid the problem of running out of walltime you can use the leave_time utility to stop a running application a certain amount of time before the end of the job, giving you time to submit the next job in the chain.
Note: this mechanism has the potential to produce a never-ending stream of jobs unless it is carefully controlled. You should ensure that your scripts include checks that a job has completed correctly before submitting the next job in the chain, as in the sketch below.
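For example, a minimal sketch of such a check (the file restart.chk is a hypothetical marker that the run completed successfully):

# Only continue the chain if this run produced its expected output file
if [ -f restart.chk ]; then
  cd ../job2
  qsub job2.pbs
else
  echo "Run did not complete cleanly; chain stopped." >&2
fi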
4.3.2 Using PBS job dependencies
PBS allows you to specify dependencies between particular jobs, so a job could be submitted that will not be eligible to run until another specified job has completed successfully.
You use the "-W" option to qsub to specify job dependencies. Any jobs submitted with dependencies will be placed into a Hold state until the dependencies are met.
There are a number of different dependency options but the most common dependency used is "afterok"; this executes the current job after listed jobs have terminated without error.
For example, to submit a job that is dependent on job 113567.sdb completing OK, you would use:
qsub -W depend=afterok:113567 submit.pbs
(where submit.pbs is your job submission script). The submitted job would go into the queue and be placed in a Hold state until job 113567.sdb had completed successfully. If job 113567.sdb fails, then you would need to manually delete the dependent job (or modify the dependency using qalter).
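For example, a whole chain can be submitted up front by capturing each job ID from the qsub output (a sketch; submit1.pbs etc. are hypothetical script names):

# qsub prints the ID of the newly submitted job to standard output
JOB1=$(qsub submit1.pbs)
JOB2=$(qsub -W depend=afterok:$JOB1 submit2.pbs)
JOB3=$(qsub -W depend=afterok:$JOB2 submit3.pbs)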
You can find more information on PBS job dependencies and the full list of dependency options in the man page for qsub. This can be accessed on ARCHER with the command:
man qsub
Note: you should always test any mechanism that uses job dependencies before using it in production runs as the potential exists to waste large amounts of computational resource if dependent jobs do not execute as expected.