Hybrid parallelisation for the CRYSTAL code
eCSE09-19
Key Personnel
PI/Co-I: Barry G. Searle (STFC), Leonardo Bernasconi (STFC), Ian J. Bush (University of Oxford), Nicholas M. Harrison (Imperial College London)
Technical: Barry G. Searle
Relevant documents
eCSE Technical Report: Hybrid parallelisation for the CRYSTAL code
Project summary
This project implements hybrid parallel programming concepts to improve the performance of the ab initio quantum-mechanical software CRYSTAL on modern supercomputers.
CRYSTAL is a world-leading program for the quantum-mechanical simulation of the electronic and vibrational properties of crystals, polymers, surfaces and molecules. It offers a rare combination of accuracy and computational efficiency, and is the code of choice for many academic and commercial groups worldwide working on complex extended systems, strongly correlated materials and magnetic crystals. These capabilities derive from the use of a local basis set of (non-orthogonal) atomic orbitals, expressed as linear combinations of Gaussian basis functions. This choice of basis set, together with the extremely efficient algorithms implemented in the code for the analytical calculation of two-electron repulsion integrals, makes CRYSTAL an almost unique tool in solid-state physics in terms of both accuracy and computational efficiency for large systems. The use of Gaussian basis functions also makes CRYSTAL particularly efficient for hybrid density functional calculations compared with plane-wave based codes.
Modern supercomputers are composed of multiple nodes of multi-core CPUs. To exploit these architectures, programs need to be written so that multiple parts of the problem are solved in parallel. Earlier computer designs used many nodes with one or only a few cores each, and the distribution of parallel tasks between nodes was implemented with the MPI programming library.
As the number of cores per node has increased in recent years, an alternative parallelisation strategy has evolved. This is a hybrid model, with MPI used for the distribution of work between nodes and a directive-based approach (OpenMP) for the parallelisation over the cores within a node. In a pure MPI parallelisation the memory on a node is partitioned between the processes, whereas in the hybrid model the memory of each process is shared between its OpenMP threads. The hybrid model therefore allows fewer processes to be run per node while still exploiting all the cores, so calculations that require more memory per process can still be run.
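As a minimal illustration of this hybrid model (a generic sketch, not CRYSTAL code; the program name and output are invented), the usual Fortran pattern is to initialise MPI with a requested threading level and then open an OpenMP parallel region within each MPI process, whose threads share that process's memory:

    program hybrid_hello
       use mpi
       use omp_lib
       implicit none
       integer :: ierr, rank, provided, nthreads

       ! Ask for a threading level where only the main thread makes MPI calls
       call mpi_init_thread(MPI_THREAD_FUNNELED, provided, ierr)
       call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)

       !$omp parallel
       !$omp master
       nthreads = omp_get_num_threads()
       print '(a,i0,a,i0,a)', 'MPI rank ', rank, ' is running ', nthreads, ' OpenMP threads'
       !$omp end master
       !$omp end parallel

       call mpi_finalize(ierr)
    end program hybrid_hello

A program of this kind would typically be built with an MPI compiler wrapper plus the compiler's OpenMP flag (e.g. mpif90 -fopenmp) and launched with one MPI process per group of cores, with the standard OMP_NUM_THREADS variable controlling the number of threads per process.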
The core part of the code is a self-consistent iteration to solve for the density. Each iteration involves constructing the matrix of interactions between the particles from the current density, diagonalising the matrix and then constructing the updated density from the eigenvectors. The matrix construction can be further partitioned into the two-particle interactions between electrons, the exchange-correlation functional integrals and the one-particle interactions.
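The structure of that iteration can be sketched with a toy model (the matrix below is an invented model Hamiltonian, not CRYSTAL's, and LAPACK's dsyev stands in for the ScaLAPACK diagonalisation used in the real code):

    program toy_scf
       implicit none
       integer, parameter :: n = 4, max_iter = 50
       real(8), parameter :: tol = 1.0d-8
       real(8) :: h(n,n), evals(n), density(n), new_density(n), work(3*n)
       integer :: i, j, iter, info

       density = 1.0d0 / n                        ! initial guess for the density
       do iter = 1, max_iter
          ! 1) construct the interaction matrix from the current density
          do j = 1, n
             do i = 1, n
                h(i,j) = merge(-2.0d0 + density(i), 1.0d0 / (1 + abs(i-j)), i == j)
             end do
          end do
          ! 2) diagonalise it (eigenvectors are returned in h)
          call dsyev('V', 'U', n, h, n, evals, work, size(work), info)
          if (info /= 0) stop 'dsyev failed'
          ! 3) rebuild the density from the lowest eigenvector
          new_density = h(:,1)**2
          if (maxval(abs(new_density - density)) < tol) exit
          density = new_density
       end do
       print '(a,i0,a,f12.6)', 'finished after ', min(iter, max_iter), &
             ' iterations, lowest eigenvalue = ', evals(1)
    end program toy_scf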
CRYSTAL currently has an efficient implementation of the MPI-based parallelisation but makes no use of OpenMP directives within a node. This project has begun the process of updating the code to exploit the hybrid parallelisation model. The matrix diagonalisation part of the self-consistent iteration is implemented by calls to the ScaLAPACK library, and modern versions of this library already support the hybrid model, so the project concentrated on updating the core parts of CRYSTAL that construct the matrices.
Within the time constraints of the project we chose to focus on the two most expensive parts of the matrix construction: the two-particle interactions and the exchange-correlation integrals. These routines make extensive use of global variables, held in Fortran modules and COMMON blocks. To make the routines thread safe it was necessary to identify how these variables were used and convert them to local, stack-based, variables. In quite a few cases this meant declaring them on the stack of the top-level driver routine and passing them as arguments through the subroutine calls. In the process the older FORTRAN77-style code, which implicitly declared variables, has been converted to declare them explicitly, and with this modernisation the code is now much cleaner and more accessible to developers. The largest items are allocated within a per-thread data type, so that users of the code do not have to increase the stack size. With the code thread safe, the core parts of the two-particle and exchange-correlation driver routines were converted to OpenMP tasks to parallelise the work. In the exchange-correlation integrals there are potential conflicts between threads in the final step, when they update the matrix, so atomic memory operations are used.
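A schematic of this pattern is sketched below, with invented names (workspace, fock) and a stand-in for the integral evaluation rather than CRYSTAL's actual routines: the scratch storage lives in a per-thread derived-type workspace instead of globals, the work loop generates OpenMP tasks, and the update of the shared matrix is protected with an atomic operation.

    program threaded_accumulate
       use omp_lib
       implicit none
       ! Per-thread scratch storage held in a derived type on the heap,
       ! replacing module/COMMON globals that were not thread safe
       type :: workspace
          real(8), allocatable :: buf(:)
       end type workspace
       integer, parameter :: n = 100, nwork = 1000
       real(8) :: fock(n,n)                       ! shared result matrix
       type(workspace), allocatable :: ws(:)
       integer :: item, t, nthreads

       fock = 0.0d0
       !$omp parallel
       !$omp single
       nthreads = omp_get_num_threads()
       allocate(ws(0:nthreads-1))
       do t = 0, nthreads - 1
          allocate(ws(t)%buf(n))                  ! one workspace per thread
       end do
       do item = 1, nwork                         ! one task per batch of work
          !$omp task firstprivate(item) shared(fock, ws)
          block
             integer :: i, j, me
             real(8) :: contrib
             me = omp_get_thread_num()            ! use this thread's workspace
             ws(me)%buf(:) = 1.0d0 / item         ! stand-in for computed integrals
             contrib = sum(ws(me)%buf)
             i = mod(item, n) + 1                 ! target matrix element
             j = mod(7*item, n) + 1
             !$omp atomic update
             fock(i,j) = fock(i,j) + contrib      ! guarded update of shared matrix
          end block
          !$omp end task
       end do
       !$omp end single
       !$omp end parallel
       print '(a,f14.6)', 'sum of matrix elements: ', sum(fock)
    end program threaded_accumulate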
The updated routines were benchmarked on ARCHER using rutile (TiO2) crystal cells of various sizes and performed equivalently to the existing MPI version. The two-particle interactions can sometimes perform better than in the pure MPI version because they can achieve a better dynamic load balance. The final test was an Al2O3 grain boundary, which with the pure MPI version can only run on half the cores of an ARCHER node because of its memory requirements. The new hybrid code is able to exploit all the cores and runs more than 40% faster than the MPI version.
Achievement of objectives
The main objective of the project is to create OpenMP threaded versions of the two-electron integral and DFT exchange-correlation functional routines in CRYSTAL. In the vast majority of calculations these are the most expensive routines.
- Objective 1: the new code must produce the same total energy up to the converged energy tolerance on the standard CRYSTAL test cases as the original code, and in all versions of the code (serial, replicated memory parallel, distributed memory (MPP) parallel).
- Result: This task is complete and successful. The code passes the tests, reproducing the original results.
- Objective 2: The new code will be shown to exploit more cores within a node for test cases which are currently limited by a large per MPI process memory requirement.
- Result: The 3360 benchmark case, which runs on only half the cores with the pure MPI version, can now use the remaining cores via OpenMP threads, giving a >40% performance improvement.
The secondary objective arising from the second point above is to demonstrate a performance improvement on calculations that currently require under-populated nodes on ARCHER due to excessive memory per MPI process.
- Objective 3: Achieving a >=10% performance improvement on half-populated nodes would be a success, as this would result in more productive use of the resource and hence of the AUs allocated to users.
- Result: The performance of the modified routines with two threads per MPI process is competitive with using two MPI processes, and the two-electron integrals in some cases perform better due to improved load balance.
- Objective 4: Optimal success would be for the two-electron integral and DFT xc functional routines on half populated nodes to match the performance of the original code on fully populated nodes.
- Result: As noted above, the performance improvement is near optimal.
The work will also benefit the maintainability and sustainability of the code. The routines focused on are amongst the oldest in the entire application, and as such contain some practices deprecated in modern Fortran.
- Objective 5: As a measure of success, the new code will eliminate the use of COMMON blocks in the routines of interest.
- Result: Almost all COMMON blocks have been removed, which has allowed the use of derived types and clearer naming of related data items (a sketch of this kind of conversion is shown below). The one remaining COMMON block is read-only in these routines and therefore 'safe'; because the array bounds within it are defined in multiple places, leaving it in its current form is less likely to create new bugs than converting it.
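As a hypothetical illustration of this kind of modernisation (the names and contents below are invented, not CRYSTAL's actual COMMON blocks or variables), a loosely typed COMMON block is replaced by a module holding a derived type with explicit declarations and descriptive names:

    ! Old style: implicitly typed items grouped only by a COMMON statement, e.g.
    !    COMMON /SHELLS/ NSHL, LMAX, EXPO(MAXSHL), CNTR(3,MAXSHL)
    !
    ! New style: explicit types with descriptive names collected in a derived type
    module shell_data
       implicit none
       type :: shell_set
          integer :: n_shells = 0
          integer :: l_max = 0
          real(8), allocatable :: exponents(:)    ! Gaussian exponents
          real(8), allocatable :: centres(:,:)    ! shell centres (3, n_shells)
       end type shell_set
    end module shell_data

    program demo
       use shell_data
       implicit none
       type(shell_set) :: shells
       shells%n_shells = 2
       shells%l_max = 1
       allocate(shells%exponents(shells%n_shells))
       allocate(shells%centres(3, shells%n_shells))
       shells%exponents = [0.5d0, 1.5d0]
       shells%centres   = 0.0d0
       print '(a,i0,a)', 'stored ', shells%n_shells, ' shells'
    end program demo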
Summary of the Software
The code modifications from this project are included in the development version of CRYSTAL17, which was available to developers in the UK through the CCPForge repository and is currently being moved. A module can be made available to ARCHER users on request. The development will likely be included in an upcoming release of CRYSTAL17.
Scientific Benefits
CRYSTAL is used by a very wide and varied scientific community in the UK (currently 53 research groups). Research programmes in universities that make use of CRYSTAL include energy generation and storage (e.g. the development of new materials for photovoltaics), catalysis, magnetism, excited states and UV spectroscopy, IR and Raman spectroscopy, low-dimensional systems (graphene and nanotubes), materials discovery, and, more recently, homogeneous catalysis and biological systems. A large component of the research activities of the Materials Chemistry Consortium depends crucially on the availability of up-to-date releases of CRYSTAL on ARCHER, and the improvements in the code's use of memory developed in this eCSE work will benefit the larger-scale jobs that this community can run.
The immediate impact of this work will be on the existing CRYSTAL user community in the UK, particularly those who make use of ARCHER through the Materials Chemistry and UKCP consortia. As the science run by these groups expands to larger and more realistic models, the memory limitations of HPC nodes with increasing numbers of cores will become ever more critical. This work modernises the code to remove these restrictions and improves the parallelisation on modern HPC designs. It will allow the community to make more efficient use of the current ARCHER service for the largest calculations and updates CRYSTAL to better exploit the next generation of machines.