Introducing Thread and Instruction Level Parallelism into Ludwig

eCSE01-002

Key Personnel

PI/Co-I: Dr Alan Gray and Dr Kevin Stratford - EPCC, University of Edinburgh

Technical: Dr Alan Gray - EPCC, University of Edinburgh

Relevant documents

eCSE Technical Report (paper submitted to the International Journal of High Performance Computing Applications): A Lightweight Approach to Performance Portability with targetDP

Project summary

Computer simulations are now a vital part of scientific endeavour, alongside the more traditional practices of theory and experiment. When computers were first invented, they contained just a single compute core responsible for performing the necessary arithmetic. This was not sufficient for the most demanding of problems, so parallel computers comprising many cores emerged. Programming models were developed to make use of them, such as the Message Passing Interface (MPI), which allows the processes running on each core to communicate with one another. Applications that rely solely on MPI have been tremendously successful over the years, but due to further developments in hardware, more sophisticated models are now required alongside MPI. Cores are no longer getting faster because of fundamental physical limits, so parallelism is now being exploited as far as possible. Computer hardware is rapidly evolving towards hierarchical parallelism: a machine comprises multiple nodes, each node contains multiple cores (which can all access the same memory space), and each core can perform multiple operations at once. Furthermore, new chips such as NVIDIA Graphics Processing Units (GPUs) and Intel Xeon Phi manycore CPUs, which offer tremendous performance advantages, also feature more complex hierarchical memory systems than traditional systems.

New sophisticated programming models are required, in combination with MPI, to allow software applications to keep up. This project was concerned with applying a new custom-made programming model, targetDP, to the Ludwig scientific application to allow it to perform as well as possible not just on ARCHER but across the range of emerging hardware platforms. Ludwig is a software package designed to simulate a wide variety of soft matter substances, with examples including everyday items such as foodstuffs, cosmetic items, oils and liquid crystals (LCs), the last of which is an important topic of current research. LCs are widespread in technology (including displays and other optical devices) and also in nature, but much is yet to be understood about the range of possible LC configurations. Simulations are vital in paving the way to improved knowledge and exciting new applications. Particularly interesting is the inclusion of relatively large particles within LC systems, where we aim to tune system properties such that the LC can act as a template to guide particles in self-assembly, allowing for new substances with special optical properties. However, we need to properly resolve the structure whilst having a simulation large enough to include enough particles: this is extremely computationally demanding.

A polarising optical microscopy image from an experiment showing 3 colloidal particles interacting with cholesteric LC.

In this project we adapted Ludwig so that it uses not only MPI but also targetDP, which is designed to target Data Parallel hardware in a platform-agnostic manner. It does this by abstracting the hierarchy of hardware parallelism and memory systems in a way that maps suitably onto each of the main choices of modern hardware system.

The same application source code can now be compiled not only for ARCHER, but also for the other modern architectures mentioned above, where performance is optimal in all cases. Furthermore, the code now runs 1.5 times as fast on ARCHER itself, and for certain system types it allows previously impossible simulations (in particular those involving large particles), since the software more closely matches the hierarchical parallelism of the underlying hardware. This work will allow soft matter researchers to make faster progress than previously possible, to further the understanding of LC systems and beyond, on ARCHER and other emerging hardware resources. We have also extended the use of this new model beyond Ludwig to other applications.

The targetDP programming model allows the same application source code to be performance portable across the range of modern hardware platforms.

Achievement of objectives

Our proposal stated:

"We therefore propose to adapt Ludwig to partition the inherent lattice-based data-parallelism into 3 distinct levels: task-level, thread-level and instruction-level, an abstraction close to the hardware. We do this by introducing a new clearly defined software level, which we call targetDP (DP for "Data Parallel"). The new abstraction promotes optimal mapping of code to hardware thread-level and instruction-level parallelism through introduction of OpenMP or CUDA for threads and SIMD parallel loops. targetDP will improve sustainability of our code by allowing us to maintain a single source code with portable performance on present and future supercomputers. The work will be made available to the community via a forthcoming public release of Ludwig, and we will strive to facilitate the use of our newly developed techniques in other data parallel applications."

We have fully met these objectives: targetDP is now implemented in Ludwig such that it not only performs better on ARCHER but is also performance-portable to other modern architectures. We have now enabled:

  • Performance around 50% faster on ARCHER due to better vectorization and removal of cache bandwidth bottlenecks through loop unrolling (see section 2.1).
  • OpenMP threading (abstracted through targetDP) within each node on ARCHER (retaining MPI for inter-node communication). This allows previously impossible simulations, due to the fact that much larger subdomains (than with pure MPI) are now possible.
  • Performance portability of the same source code to both NVIDIA GPU and Intel Xeon Phi architectures.
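
As a rough illustration of the thread-level and instruction-level split described above, the following minimal C sketch applies a simple operation to every site of a lattice field. It is not taken from Ludwig or targetDP: the function name scale_field, the constant VVL (an assumed "virtual vector length") and the data layout are hypothetical choices made for this example. The outer loop is shared among OpenMP threads, while the short fixed-length inner loop over contiguous sites is the kind of loop a compiler can map to SIMD instructions. On a GPU, the outer loop would instead correspond to CUDA threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only; this is not the targetDP API. */
#define VVL 4   /* assumed virtual vector length: sites per ILP chunk */

/* Multiply every component of a lattice field by a constant.
 * Assumed layout: field[comp * nsites + site], so that the sites of one
 * component are contiguous in memory.
 * In this simplified sketch nsites must be a multiple of VVL. */
void scale_field(double *field, size_t nsites, size_t ncomp, double a)
{
    assert(field != NULL);
    assert(nsites % VVL == 0);

    /* Thread-level parallelism: chunks of VVL sites are shared among
     * threads. If the code is compiled without OpenMP, the pragma is
     * ignored and the loop simply runs serially. */
    #pragma omp parallel for
    for (long base = 0; base < (long) nsites; base += VVL) {
        for (size_t c = 0; c < ncomp; c++) {
            /* Instruction-level parallelism: a short, fixed-length loop
             * over contiguous sites that the compiler can vectorize. */
            for (int v = 0; v < VVL; v++) {
                field[c * nsites + (size_t) base + (size_t) v] *= a;
            }
        }
    }
}
```

Calling scale_field(f, 8, 2, 2.0) on a 16-element array doubles every entry, whether or not OpenMP is enabled at compile time. Keeping the chunk length a compile-time constant is one way to let the same source be tuned per architecture, for example a longer chunk on hardware with wider vector units.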

We have created a new Ludwig website (ludwig.epcc.ed.ac.uk), which includes instructions on how to obtain and use the software (which is openly available on CCPForge). We have (through effort funded by an external PRACE project) extended use of targetDP to a separate Lattice QCD benchmark code (see the attached paper), and have supported others in exploratory work adopting the model in other codes (including acoustic modeling and another lattice Boltzmann code).

Summary of the Software

The Ludwig software is openly available at https://ccpforge.cse.rl.ac.uk/gf/project/ludwig.

As part of this project, we have created a new Ludwig website (http://ludwig.epcc.ed.ac.uk) to act as a hub for existing and new Ludwig users. This website gives information (or links to information) on:

  • How to obtain the software
  • How to use the software
  • The new targetDP model
  • How to get in contact for more information.

We have created a "Quickstart on ARCHER" guide (http://ludwig.epcc.ed.ac.uk/software) which allows users to very quickly obtain, build and run the code on ARCHER.