ARCHER KNL Performance Reports

This page contains a summary of the findings from the KNL Performance Reports, together with the individual reports themselves.

The KNL Performance Reports were written by ARCHER users and the ARCHER CSE Service at EPCC and compare the performance of standard ARCHER compute nodes with ARCHER KNL nodes for a variety of applications and benchmarks.

Summary of results and advice

Luis Cebamanos, ARCHER CSE Team, EPCC

The recent addition of the ARCHER Knights Landing (KNL) testing and development platform opens new opportunities for optimising applications for one of the most advanced manycore devices available today. Here we present and summarise the performance evaluation of a group of applications run on the ARCHER-Xeon and ARCHER-KNL systems. This group of applications comprises CFD codes (ICOMPACT3D, COSA, SENGA2, OpenSBLI and HLBM), molecular dynamics (MD) codes (LAMMPS, NAMD and CP2K), forward dynamics modelling codes (GaitSym) and plasma modelling codes (GS2).

Objectives

The main objectives of this study are:

  • Analyse the performance of different applications run on both the ARCHER-Xeon and ARCHER-KNL systems.
  • Compare the results obtained for different user test cases.
  • Provide configuration and optimisation advice based on the performance results obtained here.

Hyperthreading

Although it is not a general rule, using multiple hyperthreads on the ARCHER-KNL system often improves performance. We have seen this effect in SENGA2, OpenSBLI, LAMMPS and NAMD. In other applications, such as COSA and CP2K, hyperthreading has not given any boost in performance. Although not always the best option, using hyperthreads does not appear to degrade performance. Where hyperthreading does improve an application's performance, the best choice is normally 2 or 4 hyperthreads per core. The choice between 2 and 4 mostly depends on the user test case employed, although we have also seen it vary with the number of nodes, as in NAMD and LAMMPS.
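As a sketch of how the hyperthread count is chosen in practice: on the Cray systems discussed here, the number of hyperthreads used per physical core is selected with the `-j` flag of `aprun`, with the total MPI process count scaled accordingly. The 64-core node size and the executable name `./my_app` below are illustrative assumptions, not taken from the reports.

```shell
#!/bin/bash --login
# Illustrative ARCHER-KNL job fragments (hypothetical executable ./my_app).
# -j selects how many hyperthreads are used on each physical core.

# 1 hyperthread per core: 64 MPI processes on a 64-core KNL node
aprun -n 64 -j 1 ./my_app

# 2 hyperthreads per core: 128 MPI processes on the same node
aprun -n 128 -j 2 ./my_app

# 4 hyperthreads per core: 256 MPI processes
aprun -n 256 -j 4 ./my_app
```

Since the best setting varies by application and test case, a short scaling test across `-j 1`, `-j 2` and `-j 4` is usually worthwhile before production runs.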

Hybridisation

Hybrid codes are more likely to perform well on the ARCHER-KNL system. With the exception of CP2K, the codes that implement a hybrid model (MPI + threads) have been shown to perform better on the ARCHER-KNL system than on the Xeon system, particularly the CFD applications. We have seen this effect in LAMMPS, OpenSBLI and NAMD. The boost in performance most likely depends on the user case, being greatest for problems where fewer MPI processes allow more memory per process. The right number of threads is also highly application-dependent, but traditional hybrid MPI+OpenMP models seem to reach peak performance with 2 or 4 threads per process.
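A hybrid MPI+OpenMP launch of the kind described above can be sketched as follows. This is a minimal illustration, assuming a 64-core KNL node and a hypothetical executable `./my_hybrid_app`; the `-d` flag reserves cores for each process's threads.

```shell
# Illustrative hybrid MPI+OpenMP launch on a 64-core ARCHER-KNL node
# (hypothetical executable ./my_hybrid_app).
export OMP_NUM_THREADS=4                 # 4 OpenMP threads per MPI process

# -n: total MPI processes; -d: cores reserved per process for its threads
# 16 processes x 4 threads = 64 cores, one thread per physical core
aprun -n 16 -d $OMP_NUM_THREADS -j 1 ./my_hybrid_app
```

Fewer, larger processes also reduce the per-node MPI memory footprint, which is the memory benefit noted above.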

Energy consumption

The ARCHER-KNL system has been shown to consume considerably less energy than the ARCHER-Xeon system. With applications such as OpenSBLI and LAMMPS, the KNL system consumed roughly half the energy, as a result of completing the simulation faster. However, it would also be interesting to compare these figures with more recent hardware than the ARCHER Xeon Ivy Bridge processors.

Cache

The cache effect has been seen in several of the applications benchmarked here, for example COSA and HLBM, where superlinear scaling has been achieved. In general, applications that use the cache effectively on KNL systems, or whose test-case data fits into the MCDRAM, should see a significant performance benefit.

Drop in performance

Although a considerable number of the tested applications showed a performance boost running on a single ARCHER-KNL node compared to a single Xeon node, a few also showed falling performance gains as the number of nodes was increased. We have seen this in applications such as NAMD and SENGA2. A possible reason is that certain test cases do not provide enough computation per node to use the additional compute resource available as the node count increases.

Performance comparison

As previously indicated, most applications have shown some level of performance benefit running on the ARCHER-KNL system compared to the Xeon system. That said, this benefit appears only when the comparison is made node to node, i.e. reporting performance figures for a given number of nodes. Core-to-core comparisons, on the other hand, consistently show the ARCHER-Xeon system to be faster by almost a factor of 2, most likely because of the difference in clock rate: 1.3 GHz for the KNL cores versus 2.7 GHz for the Xeon cores.

Peak performance

Performance as a percentage of peak has been measured for the LAMMPS application. It was considerably higher on the ARCHER-Xeon system than on the ARCHER-KNL system. The most likely reason is that vectorisation was not fully exploited by the user cases employed. This highlights the importance of vectorisation on the KNL system for making the most of the computational performance on offer.

KNL MCDRAM configuration mode

The dominant performance preference for the configuration of the KNL nodes seems to be quad_100, where all 16 GB of MCDRAM is used as cache memory. The HLBM and COSA applications demonstrated that quad_0 nodes were roughly twice as slow as quad_100 nodes when the MCDRAM was not used explicitly (for example via numactl).
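For completeness, the explicit use of MCDRAM mentioned above can be sketched as follows. On a KNL node booted in a flat configuration such as quad_0, the 16 GB of MCDRAM is exposed as a separate NUMA node (typically node 1), which numactl can target; the executable name `./my_app` and the process count are illustrative assumptions.

```shell
# Illustrative explicit use of MCDRAM on a flat-mode (e.g. quad_0) KNL node.
numactl --hardware                            # list NUMA nodes; MCDRAM usually appears as node 1

# Bind all allocations to MCDRAM (fails if the 16 GB is exceeded)
aprun -n 64 numactl --membind=1 ./my_app

# Prefer MCDRAM but fall back to DDR when MCDRAM is full
aprun -n 64 numactl --preferred=1 ./my_app
```

In cache mode (quad_100) no such step is needed, since the MCDRAM acts transparently as a cache, which is consistent with its strong default performance reported here.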

Individual KNL Performance Reports