Optimal parallelisation in CASTEP

eCSE08-010

Key Personnel

PI/Co-Is: Dr Phil Hasnip - University of York, Prof Keith Refson - Royal Holloway, University of London, Dr Ian Bush - University of Oxford and Dr Filippo Spiga - University of Cambridge

Technical: Arjen Tamerus, University of Cambridge

Relevant Documents

eCSE Technical Report: Optimal parallelisation in CASTEP

Project summary

Methods for performing first-principles simulations of materials (i.e. those that solve the quantum-mechanical Schrödinger equation via parameter-free approximations) have had a profound and pervasive impact on science and technology, spreading from physics, chemistry and materials science to diverse areas including electronics, geology and medicine. Methods based on density functional theory (DFT) have led the way in this success, offering a favourable balance of computational cost and accuracy compared to competitor methods. Nevertheless, as larger and larger systems are studied, the time and resources required to perform a DFT calculation grow, and in order to study scientifically-relevant systems it is essential to optimise DFT codes to allow large, efficient parallel calculations.

CASTEP is a DFT computer program developed in the UK and is commonly used to model systems of up to a few thousand atoms on ARCHER, using up to a few thousand cores (with larger systems scaling well to larger numbers of cores). It has consistently been in ARCHER’s top 5 codes by core hours, typically using 5-10% of the machine over the course of a year. CASTEP was rewritten between 1999 and 2001 in modern Fortran, following good software engineering practices, and has been actively developed since then; this project paves the way for CASTEP to run efficiently on next-generation and future exascale supercomputers.

When running a parallel CASTEP calculation using many threads, some tasks were performed on only a single thread. While most of this work takes a small portion of the runtime, as larger systems are studied the time taken by some of these routines becomes significant. This project aimed to distribute the work in these costly tasks across multiple threads, thereby substantially reducing the time taken.

CASTEP has a large user-base, currently over 850 academic research groups and many companies worldwide. Within the UK, it is used frequently by the UKCP and the Materials Chemistry HEC consortia and by members of CCP9 and CCP-NC for crystalline NMR simulations. This user-base spans a wide range of materials research in Physics, Chemistry, Materials Science, Earth Sciences and Engineering departments. The success of this project means that CASTEP can now be used more efficiently than ever before, and that calculations can now scale to larger core counts for an even greater speed-up. The development of an extensible parallel model also reduces the need to tune input parameters, and provides a sustainable framework for improving the parallelism further in the future. The net result of this work is a new science capability, running larger system sizes in less time.

Achievement of objectives

CASTEP is a UK-developed and widely-used computer program for the quantum mechanical modelling of materials. This project contained two major work packages to improve the performance of CASTEP for large calculations on ARCHER:

1. Optimisation and extension of the OpenMP within CASTEP

Optimise and extend the OpenMP regions, to allow higher thread-counts to be used efficiently. Success will be measured by the improvement to the scaling of the calculation with threads; e.g. for the crambin (protein residue) or al3x3 (2D sapphire slab) benchmarks.

Objective 1: improvement in strong scaling with threads of a factor of at least 2, without increasing the memory footprint per node.

Achievements: OpenMP threading was extended to several further subroutines, in particular specialist matrix-matrix multiplications and the FFTs. Performance is improved for all OpenMP thread counts > 1, and for most benchmarks the lowest computation time is now achieved with 2-3 times the previous best number of threads (see Technical Report, Figure 3).
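
As an illustration of how strong scaling with threads can be quantified, the sketch below computes speed-up and parallel efficiency from wall-clock timings taken at a fixed problem size, and reports the thread count with the lowest time. The function names are hypothetical and the timings are invented for the example; they are not results from this project.

```python
def strong_scaling(timings):
    """Map each thread count to (speed-up, efficiency).

    `timings` maps thread count -> wall-clock seconds for the SAME problem.
    Both quantities are measured relative to the smallest thread count given.
    """
    base_threads = min(timings)
    base_time = timings[base_threads]
    result = {}
    for threads, seconds in sorted(timings.items()):
        speedup = base_time / seconds
        # Ideal speed-up would be threads / base_threads, so efficiency is
        # the achieved fraction of that ideal.
        efficiency = speedup * base_threads / threads
        result[threads] = (speedup, efficiency)
    return result


def best_thread_count(timings):
    """Return the thread count that gave the lowest wall-clock time."""
    return min(timings, key=timings.get)


# Invented timings (seconds) for illustration only.
example_timings = {1: 100.0, 2: 55.0, 4: 32.0, 8: 25.0, 16: 28.0}

if __name__ == "__main__":
    for threads, (speedup, eff) in strong_scaling(example_timings).items():
        print(f"{threads:3d} threads: speed-up {speedup:5.2f}, efficiency {eff:4.2f}")
    print("lowest time at", best_thread_count(example_timings), "threads")
```

In this picture, the achievement described above corresponds to the minimum of the timing curve moving to a higher thread count.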

2. Development of a parallel performance model for CASTEP

Develop a parallel performance model for CASTEP to enable CASTEP to better optimise its parallelisation strategy automatically. Success will be measured by the parallel performance improvement compared to the present parallelisation choice.

Objective 2: creation and implementation of a parallel model of CASTEP performance, and automatic choice of parallelism parameters.

Achievements: A micro-benchmark was created to time the performance of the FFT for a variety of process groupings. This has been integrated into CASTEP, allowing automatic selection of the shared-memory usage of MPI processes and a reduction in run-time of over 25% compared to using the default setting (and up to 10% compared to using the recommended setting for ARCHER; see Technical Report, Figure 4).
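
The general idea of choosing a setting from a micro-benchmark can be sketched as follows: time a short, representative piece of work for each candidate setting and keep the fastest. This is a minimal sketch under stated assumptions, not CASTEP's implementation. The names `choose_fastest` and `stand_in_workload` and the candidate group sizes are hypothetical, and the workload is an artificial placeholder rather than an FFT.

```python
import time


def time_once(work, setting, clock=time.perf_counter):
    """Time a single call of work(setting) using the supplied clock."""
    start = clock()
    work(setting)
    return clock() - start


def choose_fastest(settings, work, repeats=3, clock=time.perf_counter):
    """Return (best_setting, timings).

    Each candidate is run `repeats` times and its best (minimum) time is
    kept, which reduces the influence of one-off system noise.
    """
    timings = {}
    for setting in settings:
        timings[setting] = min(
            time_once(work, setting, clock) for _ in range(repeats)
        )
    best = min(timings, key=timings.get)
    return best, timings


def stand_in_workload(group_size):
    """Placeholder workload (assumption): artificial cost, cheapest at 4."""
    total = 0
    for i in range(1000 * abs(group_size - 4) + 1000):
        total += i * i
    return total


if __name__ == "__main__":
    best, timings = choose_fastest([1, 2, 4, 8], stand_in_workload)
    print("selected setting:", best)
```

Taking the minimum over a few repeats is one common design choice for short timings; a real code would time the actual operation on the actual machine at start-up.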

Summary of the software

CASTEP (www.castep.org) is a UK-based state-of-the-art implementation of DFT and a flagship code for UK HPC. It was rewritten in 1999-2001 according to sound software engineering principles and with HPC in mind. It is available as a system-installed binary on ARCHER, and versions 18.1 and 18.2 (provisional numbering) - which include the developments from this project - will be available following their release.

CASTEP and its source code are available generally under a free-of-charge licence to all academics in the UK, from CCPForge.

CASTEP source code is also available (for a fee; currently €1800) to academics working elsewhere in the EU. Pre-compiled CASTEP programs are marketed worldwide by BIOVIA Inc. along with their GUI; for more information, see Getting CASTEP.

The threading optimisations performed during this project have already been merged into the codebase for CASTEP 18.1, to be released later this year. The parallel model will be merged and released in the following release. Both sets of developments will be fully integrated into the main codebase, and hence available to all academic and commercial users worldwide.