Optimising CASTEP on Intel's Knights Landing Platform
eCSE11-17
Key Personnel
PI/Co-I: Phil Hasnip (University of York)
Technical: Arjen Tamerus (University of Cambridge), Edward Higgins (University of York)
Relevant documents
eCSE Technical Report: Optimising CASTEP on Intel's Knight's Landing Platform
Project summary
CASTEP is a widely-used, UK-developed software package, capable of predicting the properties of materials from "first-principles"; that is, by solving quantum mechanical equations to determine what the behaviour is, without the need for adjustable parameters. CASTEP was designed from the beginning to run well on conventional parallel HPC machines, but in recent years a number of new computer architectures have emerged which do not follow the conventional trends for CPUs. One such architecture is Intel's Knights Landing (KNL).
Knights Landing's theoretical performance is very high, but its performance profile differs considerably from that of a conventional CPU. The differences arise mostly because KNL comprises many low-power cores, rather than a few high-power cores. It is also designed to work most efficiently when it can perform the same operation on many data items at once, using so-called "vector" instructions. To obtain good performance on KNL, a program must be able to make use of this vector capability.
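To illustrate what "vector-friendly" means in practice, the following is a minimal sketch in C. It is not taken from CASTEP (which is written in Fortran); the function names axpy and prefix_sum are hypothetical and chosen only for illustration. The first loop applies the same operation independently to every element, which is the pattern a compiler can usually turn into vector instructions. The second has a dependence between successive iterations, so a compiler will generally not vectorise it as written.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += a * x[i] for i in [0, n).
 * Every iteration performs the same operation, memory is accessed with
 * unit stride, and no iteration depends on another. `restrict` promises
 * the compiler that x and y do not overlap. Under these conditions a
 * compiler can typically emit vector (SIMD) instructions for the loop. */
void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}

/* Running sum: y[i] = y[i-1] + x[i].
 * Each iteration needs the result of the previous one, so this loop
 * is generally not vectorised as written. */
void prefix_sum(size_t n, const double *x, double *y)
{
    assert(n > 0);
    y[0] = x[0];
    for (size_t i = 1; i < n; ++i) {
        y[i] = y[i - 1] + x[i];
    }
}
```

On KNL the vector registers are 512 bits wide, so a fully vectorised double-precision loop such as axpy can process eight elements per instruction, whereas a loop like prefix_sum processes one.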
In this project the ability of CASTEP to use KNL efficiently was measured, with particular attention to which of its operations were not using the vector instructions effectively. The least efficient parts of the program were rewritten to make better use of vector instructions and so boost performance on KNL, taking care to ensure that performance on more conventional CPUs did not suffer. The speed of several of CASTEP's subroutines was more than doubled, and the overall performance of CASTEP on KNL was increased by a factor of 1.3. For example, on the "TiN" benchmark, running on a single KNL node, the original version of CASTEP took 279 s to complete 20 iterations and the optimised version took 179 s, a speed-up of more than a factor of 1.5.
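For reference, the speed-up quoted for the TiN benchmark follows directly from the ratio of the two run times; the symbols S and t are introduced here only for this calculation:

```latex
\[
  S = \frac{t_{\text{original}}}{t_{\text{optimised}}}
    = \frac{279\ \text{s}}{179\ \text{s}} \approx 1.56,
  \qquad
  1 - \frac{t_{\text{optimised}}}{t_{\text{original}}}
    = \frac{100}{279} \approx 36\%\ \text{less run time.}
\]
```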
CASTEP has a large user-base, currently over 850 academic research groups and many companies worldwide. Within the UK, it is used frequently by the UKCP and the Materials Chemistry HEC consortia and by members of CCP9 and CCP-NC for crystalline NMR simulations. This user-base spans a wide range of materials research in Physics, Chemistry, Materials Science, Earth Sciences and Engineering departments. The success of this project means that CASTEP can now be used far more efficiently on KNL machines, such as the KNL nodes of ARCHER and the much larger Tier-2 KNL machine at CSD3.
Achievement of objectives
Work Package 1: Optimise identified computational bottlenecks.
Objective 1: Improvement in KNL performance of bottlenecks by at least 25% without compromising performance on conventional CPUs.
This objective was comfortably exceeded, with each of the identified bottlenecks sped up by a factor of 2.5 or more.
Work Package 2: Investigate and improve CASTEP's vectorisation (2 months).
Objective 2a: Identify compute-intensive operations which are not vectorised at present.
Objective 2b: Improve the vectorisation of these bottlenecks to reduce their runtime on KNL by at least 25% (without compromising performance on conventional CPUs).
These objectives were met. Two routines were identified which performed poorly with long vector lengths. The operations were reworked, resulting in a speed-up of between 1.24 and 1.67, depending on compiler and problem size. A further subroutine was identified which is only called in DFT+U calculations; this subroutine was reworked to replace explicit Fortran code with BLAS calls, giving a speed-up of more than a factor of 2 on both KNL and conventional CPUs.
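The general idea of replacing hand-written loops with a BLAS call can be sketched as follows. This is an illustration in C, not the CASTEP code: the report does not say which BLAS routines were used, so the choice of a matrix product (DGEMM) and the names matmul_loops and matmul are assumptions. The BLAS path is only compiled when USE_CBLAS is defined, and assumes a CBLAS implementation that provides cblas.h (some vendor libraries use a different header name).

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_CBLAS
#include <cblas.h>   /* assumption: a CBLAS implementation is installed */
#endif

/* Explicit loops: C = A * B for row-major matrices,
 * where A is m x k, B is k x n and C is m x n. */
void matmul_loops(size_t m, size_t n, size_t k,
                  const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

/* Same product, delegated to the BLAS routine DGEMM when built with
 * -DUSE_CBLAS; otherwise it falls back to the explicit loops above. */
void matmul(size_t m, size_t n, size_t k,
            const double *a, const double *b, double *c)
{
    assert(m > 0 && n > 0 && k > 0);
#ifdef USE_CBLAS
    /* C = 1.0 * A * B + 0.0 * C; leading dimensions are the row lengths. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                (int)m, (int)n, (int)k,
                1.0, a, (int)k, b, (int)n,
                0.0, c, (int)n);
#else
    matmul_loops(m, n, k, a, b, c);
#endif
}
```

The attraction of this approach is that an optimised BLAS library is already tuned for the vector units and memory hierarchy of each platform, which is consistent with the reported gain on both KNL and conventional CPUs.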
Work Package 3: Investigate use of explicit memory allocation (2 months).
Objective 3a: Create benchmark data for various memory modes (flat, cache, hybrid).
Objective 3b: Optimise flat-mode performance by defining what data will be placed in MCDRAM explicitly.
These objectives were met. However, despite a significant improvement in flat-mode performance, it still fell short of the performance in cache mode.
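As a rough illustration of explicit placement in flat mode, the sketch below uses the hbwmalloc interface of the memkind library to request high-bandwidth memory for a chosen array. This is an assumption made for illustration only: the report does not state which mechanism was used in CASTEP, and the names alloc_hot_array and free_hot_array are hypothetical. The memkind path is compiled only when USE_MEMKIND is defined; otherwise ordinary malloc is used.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef USE_MEMKIND
#include <hbwmalloc.h>   /* assumption: the memkind library is installed */
#endif

/* Allocate an array of n doubles for bandwidth-critical data.
 * On return *in_hbw is 1 if the array came from the high-bandwidth
 * allocator and 0 if it came from ordinary malloc. */
double *alloc_hot_array(size_t n, int *in_hbw)
{
    assert(in_hbw != NULL);
    *in_hbw = 0;
#ifdef USE_MEMKIND
    /* hbw_check_available() returns 0 when high-bandwidth memory exists. */
    if (hbw_check_available() == 0) {
        double *p = hbw_malloc(n * sizeof *p);
        if (p != NULL) {
            *in_hbw = 1;
            return p;
        }
    }
#endif
    return malloc(n * sizeof(double));
}

/* Release an array with the allocator that produced it. */
void free_hot_array(double *p, int in_hbw)
{
#ifdef USE_MEMKIND
    if (in_hbw) {
        hbw_free(p);
        return;
    }
#endif
    (void)in_hbw;
    free(p);
}
```

In cache mode no such code is needed, because the hardware manages MCDRAM as a cache in front of main memory; this is one reason cache mode is a demanding baseline for hand-placed data to beat.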
Work Package 4: Investigate use of multithreading (HT; 1 month).
Objective 4a: Create benchmark data for CASTEP on KNL in a variety of parallel modes.
Objective 4b: Identify key bottlenecks in the optimal parallel mode.
These objectives were met.
Summary of the Software
CASTEP (http://www.castep.org) is a UK-based state-of-the-art implementation of density functional theory and a flagship code for UK HPC. It was rewritten in 1999-2001 according to sound software engineering principles and with HPC in mind. It is available as a system-installed binary on ARCHER, and version 20.1, which includes the developments from this project, will be available following its release.
CASTEP and its source code are available generally under a free-of-charge licence to all academics (see http://www.castep.org/). Pre-compiled CASTEP programs are marketed worldwide by BIOVIA Inc. along with their GUI; for more information, see http://castep.org/CASTEP/GettingCASTEP.
The refactored code has already been merged into the codebase for CASTEP 20.1, to be released later in 2019. Some of the early improvements, such as those to the DFT+U code, were merged in time for the 19.1 release (Dec 2018).
Scientific Benefits
CASTEP is currently used by over 850 academic research groups and many companies worldwide. Within the UK, it is used extensively by the UKCP HEC Consortium and many other users, including members of the Materials Chemistry HEC consortium, and the Collaborative Computational Programmes CCP5, CCP9 and CCP-NC (for crystalline NMR simulations). This user base spans a wide range of materials research in Physics, Chemistry, Materials Science, Earth Sciences and Engineering departments.
The value of this eCSE is that it allows CASTEP to be used much more efficiently on KNL hardware, with a speed-up of around 1.5 times compared to the original CASTEP code. Naturally, to benefit from this a user must have access to such hardware, but KNL machines are readily available in the UK via ARCHER or Tier-2 HPC resources.