Distributed Hamiltonian build and diagonalisation in UKRMol+
eCSE08-007

Key Personnel
PI/Co-I: Professor Jonathan Tennyson - University College London
Technical: Ahmed Al-Refaie - University College London
Relevant Documents
eCSE Technical Report: Distributed Hamiltonian build and diagonalisation in UKRMol+
Project summary
Electron collisions with molecules are a key process in many areas - from lightning to car ignitions, from radiation damage in living systems to the processes that initiate and govern most technological plasmas. The UKRMol codes use the R-matrix method to simulate these collisions by partitioning space into two regions: an inner region where the electron is close to the molecule, and an outer region where the electron is far from the molecule.
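For orientation, the two regions are matched at the boundary radius a: diagonalizing the inner-region Hamiltonian yields eigenvalues E_k and surface amplitudes w_ik(a), from which the R-matrix is built in the standard formulation (atomic units) as

    R_{ij}(E) = \frac{1}{2a} \sum_k \frac{w_{ik}(a)\, w_{jk}(a)}{E_k - E}

This quantity is then propagated through the outer region to obtain the scattering observables, which is why the inner-region build and diagonalization sit at the heart of the pipeline.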
One of the most computationally demanding stages of the simulation is the building and diagonalization of the inner-region scattering Hamiltonian, handled in the UKRMol pipeline by the Fortran 77 code SCATCI. The serial code has hit a computational limit: demands for more accurate and more complicated simulations, and methods such as R-matrix with pseudostates (RMPS) which capture many physically important resonances, now require matrices of order 1,000,000x1,000,000 to be built. At the start of the project, such a build would have taken almost 500 hours.
The goal of the project was to completely rewrite SCATCI, the inner-region Hamiltonian build and diagonalization code, to enable its use on distributed computing clusters. A full rewrite was chosen rather than a simple port because the Fortran 77 of the original limits its readability and expandability for future work. Modern Fortran (2003+) was chosen so that current users can continue to maintain the code whilst leveraging the powerful features of object-oriented programming (OOP).
One example of this is the symbolic matrix build. SCATCI defers the evaluation of integrals by representing each integral with a symbolic integer. Previously this required manually managing several arrays per calculation, and several more whenever multiple symbols were needed. The OOP paradigm has eliminated this bookkeeping entirely by wrapping it in a class that manages the symbols for the user. Its usage closely matches the mathematical formalism and has allowed the symbols to be easily parallelised. A user can simply include the class, knowing what its procedures do and that it is appropriately parallel in nature.
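As a minimal sketch of the idea (the module, type and procedure names here are illustrative, not the actual MPI-SCATCI API), such a container might look like the following in Modern Fortran:

module symbol_list_mod
   implicit none
   private
   public :: Symbol, SymbolList

   ! One deferred integral: a packed index identifying the integral plus
   ! the Slater-Condon coefficient that multiplies it.
   type :: Symbol
      integer :: label
      real(8) :: coefficient
   end type Symbol

   ! Container that hides the bookkeeping previously done by hand with
   ! several parallel arrays.
   type :: SymbolList
      type(Symbol), allocatable :: symbols(:)
      integer :: n = 0
   contains
      procedure :: append
      procedure :: evaluate
   end type SymbolList

contains

   ! Grow-on-demand append, replacing manual array resizing.
   subroutine append(self, label, coefficient)
      class(SymbolList), intent(inout) :: self
      integer, intent(in) :: label
      real(8), intent(in) :: coefficient
      type(Symbol), allocatable :: tmp(:)
      if (.not. allocated(self%symbols)) allocate(self%symbols(16))
      if (self%n == size(self%symbols)) then
         allocate(tmp(2*size(self%symbols)))
         tmp(1:self%n) = self%symbols(1:self%n)
         call move_alloc(tmp, self%symbols)
      end if
      self%n = self%n + 1
      self%symbols(self%n) = Symbol(label, coefficient)
   end subroutine append

   ! Contract the stored symbols against the actual integral values to
   ! produce one Hamiltonian matrix element.
   real(8) function evaluate(self, integrals)
      class(SymbolList), intent(in) :: self
      real(8), intent(in) :: integrals(:)   ! integral values looked up by label
      integer :: k
      evaluate = 0.0d0
      do k = 1, self%n
         evaluate = evaluate + self%symbols(k)%coefficient &
                             * integrals(self%symbols(k)%label)
      end do
   end function evaluate

end module symbol_list_mod

All of the resizing and indexing that previously had to be repeated by hand is hidden behind the append and evaluate procedures, which is also what makes distributing the symbol lists across MPI ranks straightforward.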
The Hamiltonian build is split into several classes according to which matrix elements are calculated. Each class can require up to three steps in its evaluation: prototyping, where a set of symbols is generated from the Slater-Condon rules for a single matrix element; contraction, where the prototypes are contracted against the target states, reducing the matrix size; and expansion, where the contracted set of symbols is used to generate the remaining matrix elements of that class. Each of these stages has been heavily parallelized, with only a small number of synchronization steps required to complete the first two. This has achieved almost linear scaling with core count: the build time for a 100,000x100,000 Hamiltonian drops from 2.2 hours to 6 minutes on a single node. Using more nodes reduces the time further, to the point where writing the elements to file constitutes the majority of the computational time, and larger problems scale even better.
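Why the expansion stage scales so well can be illustrated with a toy sketch: a round-robin distribution of the rows of a symmetric matrix over MPI ranks, with a single reduction standing in for the synchronization and writing step. This is an illustration only, not the actual MPI-SCATCI decomposition:

program build_sketch
   ! Illustrative only: round-robin distribution of matrix rows over MPI
   ! ranks for the expansion stage.
   use mpi
   implicit none
   integer, parameter :: n = 1000            ! toy matrix dimension
   integer :: ierr, rank, nprocs, i, j
   real(8) :: local_sum, total_sum

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   ! Each rank owns rows i = rank+1, rank+1+nprocs, ..., so the expansion
   ! itself needs no communication; only the prototyped and contracted
   ! symbols have to be synchronised beforehand.
   local_sum = 0.0d0
   do i = rank + 1, n, nprocs
      do j = 1, i                             ! lower triangle of the symmetric matrix
         local_sum = local_sum + matrix_element(i, j)
      end do
   end do

   ! Single synchronisation point, standing in for the final gather/write.
   call MPI_Reduce(local_sum, total_sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                   MPI_COMM_WORLD, ierr)
   if (rank == 0) print *, 'checksum of all elements =', total_sum

   call MPI_Finalize(ierr)

contains

   real(8) function matrix_element(i, j)
      integer, intent(in) :: i, j
      ! Placeholder standing in for symbol expansion plus integral lookup.
      matrix_element = 1.0d0 / real(i + j, 8)
   end function matrix_element

end program build_sketch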
Finally, the OOP nature of the new code has made it easy to support a variety of serial and parallel diagonalizers: LAPACK and ScaLAPACK for standard R-matrix calculations, and ARPACK and SLEPc for partitioned R-matrix calculations. The diagonalizers can be mixed and matched in a single run, so a calculation may, for example, begin with SLEPc and end with ScaLAPACK. A user can easily add a new diagonalizer or matrix post-processor through the powerful new DistributedMatrix and Diagonalizer classes. These classes require the user to define only two subroutines, one describing which matrix element belongs to which processor and one describing how to call the library routines; the code then automatically distributes and diagonalizes the matrix without any further user input.
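A sketch of what such an interface can look like is given below, collapsing the two roles into a single abstract type for brevity; the names DistributedMatrixSketch, owner_of_element and call_diagonalizer are illustrative placeholders rather than the actual class and procedure names:

module diagonalizer_api_sketch
   implicit none
   private
   public :: DistributedMatrixSketch

   type, abstract :: DistributedMatrixSketch
   contains
      ! Decide which MPI rank stores element (i, j).
      procedure(owner_iface),       deferred :: owner_of_element
      ! Hand the locally stored block to the chosen library routine
      ! (LAPACK, ScaLAPACK, ARPACK, SLEPc, ...).
      procedure(diagonalize_iface), deferred :: call_diagonalizer
   end type DistributedMatrixSketch

   abstract interface
      integer function owner_iface(self, i, j)
         import :: DistributedMatrixSketch
         class(DistributedMatrixSketch), intent(in) :: self
         integer, intent(in) :: i, j
      end function owner_iface

      subroutine diagonalize_iface(self, n_eigenpairs, eigenvalues)
         import :: DistributedMatrixSketch
         class(DistributedMatrixSketch), intent(inout) :: self
         integer, intent(in)  :: n_eigenpairs
         real(8), intent(out) :: eigenvalues(:)
      end subroutine diagonalize_iface
   end interface

end module diagonalizer_api_sketch

A concrete backend, say a ScaLAPACK or SLEPc wrapper, then extends the abstract type and supplies only these two procedures, leaving the Hamiltonian build code untouched.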
Previously the RMPS methodology for large molecules was computationally expensive, as it requires building and diagonalizing matrices in excess of 1,000,000x1,000,000. Building such a matrix with SCATCI would take almost a month, and a standard simulation requires several matrices of this type. With MPI-SCATCI this is not only possible within a day but has been demonstrated: a recent RMPS calculation of this size took only 4 hours to build on 8 ARCHER nodes, saving the user a significant amount of time.
Achievement of objectives
The project had four main objectives:
Rewrite of the SCATCI using Modern Fortran:
SCATCI was fully rewritten and makes extensive use of the object-oriented features of Modern Fortran. The code is not only more readable, matching the look of the mathematics much more closely, but can also be extended easily through inheritance to include new integrals, diagonalizers and matrices, and can adapt its inputs and outputs to changes in the UKRMol+ pipeline with minimal effort.
Implementation of parallel Hamiltonian build:
By all metrics this objective was achieved. The parallel build speed-up scales almost linearly with increasing core count, up to the point where the build time approaches the system overhead (e.g. I/O).
One of the unexpected benefits of this implementation was a significant improvement in the single-core performance of the build phase (a 1.5-4x speed-up).
Integral gather-scatter algorithm:
This was dropped, as performance would most likely have suffered from excessive communication. Instead, MPI-3.0 shared memory is used, allowing integral sets of up to 128 GB to be shared by the processes on an ARCHER node with no measured drop in computational efficiency. This feature proved so useful that it was also added to other UKRMol+ codes in the pipeline.
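The pattern is the standard MPI-3.0 one: allocate the integral array once per node in a shared window and let every rank on that node map it. The sketch below illustrates this under assumed names (allocate_shared_integrals, node_comm, n_integrals); it is not the exact routine used in MPI-SCATCI:

subroutine allocate_shared_integrals(node_comm, n_integrals, integrals, win)
   use mpi
   use iso_c_binding, only: c_ptr, c_f_pointer
   implicit none
   ! node_comm is typically created with MPI_Comm_split_type and
   ! MPI_COMM_TYPE_SHARED, so it contains exactly the ranks on one node.
   integer, intent(in) :: node_comm
   integer(kind=8), intent(in) :: n_integrals
   real(8), pointer, intent(out) :: integrals(:)
   integer, intent(out) :: win

   integer :: rank, disp_unit, ierr
   integer(kind=MPI_ADDRESS_KIND) :: window_size
   type(c_ptr) :: base_ptr

   call MPI_Comm_rank(node_comm, rank, ierr)

   ! Only rank 0 on each node contributes memory; everyone else attaches.
   window_size = 0
   if (rank == 0) window_size = int(n_integrals, MPI_ADDRESS_KIND) * 8_MPI_ADDRESS_KIND
   call MPI_Win_allocate_shared(window_size, 8, MPI_INFO_NULL, node_comm, &
                                base_ptr, win, ierr)

   ! Every rank queries rank 0's segment and maps it to a Fortran pointer,
   ! so the full integral set is stored only once per node.
   call MPI_Win_shared_query(win, 0, window_size, disp_unit, base_ptr, ierr)
   call c_f_pointer(base_ptr, integrals, [n_integrals])
end subroutine allocate_shared_integrals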
Hamiltonian diagonalization:
MPI-SCATCI implements ARPACK, LAPACK, ScaLAPACK and SLEPc diagonalizers and can switch between them easily within a single run. New serial and MPI diagonalizers can be added by inheriting from the provided matrix and diagonalizer classes, without touching any code related to the Hamiltonian build.
Summary of the software
MPI-SCATCI is an R-matrix inner region code that builds and diagonalizes the electron-molecule scattering Hamiltonian for both the UKRMol and UKRMol+ pipelines.
The code is available via Subversion from CCPForge (registration required). The revised module, which is much quicker than the old version even for single-core applications, has been fully tested and is now part of the trunk distribution of these codes.