CASTEP

Both CASTEP 7 and CASTEP 8 are installed and available on ARCHER, along with serial builds of the program. The different versions can be accessed by loading the appropriate module.

Licensing and Access

CASTEP is licensed software. Please see the CASTEP web page for details. Users who wish to access the CASTEP package should submit a request via SAFE.

Running CASTEP

To run CASTEP you need to add the correct module:

module add castep

The current default CASTEP module loads CASTEP 8.0.0, but modules are also available for CASTEP 7.0.3, for earlier versions of CASTEP 7, and for serial builds of CASTEP.

Once the module has been added, the main CASTEP executable is available as castep.mpi, along with executables for the tools distributed with CASTEP.

An example CASTEP job submission script is shown below.

#!/bin/bash --login
#PBS -N castep_job
#PBS -V

# Select 128 nodes (maximum of 3072 cores)
#PBS -l select=128
#PBS -l walltime=03:00:00

# Make sure you change this to your budget code
#PBS -A budget

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Load the CASTEP module
module add castep

# This line sets the temporary directory - without it CASTEP will fail
export TMPDIR=$PBS_O_WORKDIR
export CASTEP_TMPDIR=$PBS_O_WORKDIR


# Change the name of the input file to match your own job
aprun -n 3072 castep.mpi my_job
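The seedname argument (my_job above) tells CASTEP to read my_job.cell and my_job.param from the working directory. As an illustration, a minimal parameters file might look like the following (the keywords are standard CASTEP, but the values are purely illustrative and not a recommendation for your system):

```
! my_job.param -- illustrative values only
TASK            : SINGLEPOINT
XC_FUNCTIONAL   : PBE
CUT_OFF_ENERGY  : 400 eV
```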

Compiling

Hints and Tips

When setting the $GFORTRAN_TMPDIR environment variable you must use the absolute /work path rather than a symbolic link to /work from the /home filesystem; otherwise your calculation will fail. The example above handles this by resetting the PBS_O_WORKDIR environment variable with the readlink command.
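As a minimal sketch of what readlink -f does, using hypothetical temporary paths in place of the real /home and /work filesystems:

```shell
# Hypothetical stand-ins for a /home symlink pointing at /work on ARCHER.
BASE=$(mktemp -d)
mkdir -p "$BASE/work_dir"
ln -sfn "$BASE/work_dir" "$BASE/home_link"

# readlink -f resolves the symlink to the absolute target path,
# which is what the tmpdir variables in the script above require.
RESOLVED=$(readlink -f "$BASE/home_link")
echo "$RESOLVED"
```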

CASTEP has an option to balance its run between speed and memory saving. This is controlled either by the parameter:

OPT_STRATEGY_BIAS [=-3..3]

or equivalently

OPT_STRATEGY=MEMORY/DEFAULT/SPEED

We normally recommend that users choose OPT_STRATEGY=SPEED (equivalently OPT_STRATEGY_BIAS=3) unless the run attempts to use more memory than is available. (Even in that case, the first choice should be to increase the number of processors used and distribute the memory, if possible.)
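For example, to request the speed-optimised strategy, add one of the following (equivalent) lines to your <seedname>.param file:

```
OPT_STRATEGY      : SPEED
! or, equivalently:
OPT_STRATEGY_BIAS : 3
```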

Optimising for Memory Use

Hybrid MPI/OpenMP Approach

In addition to MPI, CASTEP supports shared memory parallelism through OpenMP threading within a node. This allows for reductions in data duplication, resulting in a smaller memory footprint for a given system and enabling larger systems to be run within the memory confines of an ARCHER node.

To make use of this approach, CASTEP should be run with the nodes "unpacked" or "underpopulated", i.e. not with one process for each physical core. General documentation on the appropriate aprun flags for accomplishing this is given in Section 5.4.2 of the User Guide, with a generic script for a hybrid MPI/OpenMP job in Section 5.4.7.

A CASTEP-specific example script is provided below which details a 40-node job with 6 MPI processes per node and 4-way threading. See the "Shared Memory parallel optimization of the FFT" section below for documentation on the NUM_PROC_IN_SMP environment variable.

#!/bin/bash --login
#PBS -A [acct]
#PBS -N [Jobname]
#PBS -l select=40
#PBS -l walltime=24:00:00
#PBS -j eo

# As in above example script
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)
cd $PBS_O_WORKDIR
module add castep
export TMPDIR=$PBS_O_WORKDIR
export CASTEP_TMPDIR=$PBS_O_WORKDIR


# 40-node job with 6 MPI processes per node = 240 processes in total
# 4 OpenMP threads per MPI process
# 3 MPI processes per NUMA region
export KMP_AFFINITY=disabled
export OMP_NUM_THREADS=4
export NUM_PROC_IN_SMP=6
aprun -n 240 -N 6 -d 4 -S 3 ${CASTEP_EXE} ${SEEDNAME}

The numbers for this example were chosen for illustrative purposes and, for best results, users should tune the MPI-OpenMP ratio for their particular problem.
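The arithmetic behind the aprun flags in the script above can be sketched as follows (assuming ARCHER's 24-core compute nodes with 2 NUMA regions per node; tune NODES and THREADS for your own problem):

```shell
# Derive the aprun flags from the node count and MPI/OpenMP split.
# Assumes ARCHER compute nodes: 24 cores, 2 NUMA regions per node.
NODES=40
THREADS=4                           # OpenMP threads per MPI process (-d)
CORES_PER_NODE=24
NUMA_PER_NODE=2

PPN=$((CORES_PER_NODE / THREADS))   # MPI processes per node (-N)     = 6
NPROCS=$((NODES * PPN))             # total MPI processes (-n)        = 240
PER_NUMA=$((PPN / NUMA_PER_NODE))   # processes per NUMA region (-S)  = 3

echo "aprun -n $NPROCS -N $PPN -d $THREADS -S $PER_NUMA castep.mpi my_job"
```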

For further details, please see the eCSE01-017 project page and accompanying technical report.

Shared Memory parallel optimization of the FFT

The environment variable NUM_PROC_IN_SMP (see Note 1) controls the use of shared-memory to optimize interconnect use by bundling up inter-node communication. This improves parallel scaling of the 3D FFT, particularly at large MPI process counts. The value should be in the range 1 to the number of MPI processes per node, although best performance is usually obtained with a value of 6.

Note 1: Support for setting NUM_PROC_IN_SMP via the environment variable was added in CASTEP 17.2; earlier versions used a parameter of the same name in the <SEEDNAME>.param input file.

Tuning k-point Parallelism

For calculations with large k-point counts, it may be appropriate to reduce the level of k-point parallelism with the following addition to your .param input file:

%BLOCK devel_code
    PARALLEL:kpoint=<n>:ENDPARALLEL
%ENDBLOCK devel_code

which explicitly sets the limit to <n>-way k-point parallelism.

CASTEP's default strategy is to maximize k-point parallelism, as this is much less latency- and bandwidth-sensitive than G-vector (FFT) parallelism. Unfortunately, k-point parallelism is more memory hungry than G-vector parallelism, and can easily exceed the node memory. By tuning the kpoint parameter, memory use can be reduced at the cost of some latency and bandwidth sensitivity.
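To make the tradeoff concrete, here is a small Python sketch. It is illustrative only, not CASTEP's actual scheduling algorithm, and it assumes the k-point group count must divide the total process count; it shows how a process count splits into k-point groups times G-vector groups, and how capping kpoint shifts work onto G-vector parallelism:

```python
def decomposition(nprocs, nkpoints, kpoint_limit=None):
    """Return (kpoint_ways, gvector_ways), favouring k-point parallelism.

    Illustrative sketch: picks the largest divisor of nprocs that does not
    exceed the k-point count (or the explicit kpoint limit, if given).
    """
    cap = nkpoints if kpoint_limit is None else min(nkpoints, kpoint_limit)
    kways = max(d for d in range(1, nprocs + 1)
                if nprocs % d == 0 and d <= cap)
    return kways, nprocs // kways

# 96 processes, 12 k-points: default maximises k-point parallelism.
print(decomposition(96, 12))                  # (12, 8)
# With PARALLEL:kpoint=4 in devel_code: fewer wavefunction copies,
# more G-vector (FFT) parallelism per k-point group.
print(decomposition(96, 12, kpoint_limit=4))  # (4, 24)
```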