7. Debugging

Note that the usefulness and accuracy of the information within any debugger depends on your compilation options. If you have optimisation switched on then you may find that the line numbers listed in the debugging information do not correspond with the statements in your source code file. For debugging code we always recommend that you compile with optimisation switched off and the -g flag enabled to provide the most accurate information.

You may want to use an interactive session whilst debugging, in which case you are advised to also consult the section in the user guide on interactive jobs and productivity tips.

7.1 Available Debuggers

ARCHER has Cray ATP, STAT, GDB (through lgdb) and DDT installed.

7.2 Cray ATP

Cray ATP (Abnormal Termination Processing) is a tool that monitors your application and, in the event of an abnormal termination, it will collate the failure information from all the running processes into files for analysis.

With ATP enabled, in the event of abnormal termination, all of the stacktraces are gathered from the dying processes, analysed and collated into a single file called atpMergedBT.dot. In addition the stacktrace from the first process to die (hence the probable cause for the failure) is delivered to stderr.

The atpMergedBT.dot file can be viewed using the stat-view command that is accessible by loading the stat module.

7.2.1 ATP Example

To enable ATP, load the atp module in your job submission script and set the ATP_ENABLED environment variable to 1, i.e. include the following commands in your (bash) job submission script:

module load atp
export ATP_ENABLED=1

and then run your job using aprun as usual. Once your application has terminated abnormally you need to log into the service while exporting the X display back to your local machine (you must have an X server running locally) with:

ssh -Y username@archer.ac.uk

Load the stat module with:

module add stat

and view the merged stacktrace with:

stat-view atpMergedBT.dot

The stderr from your job should also contain useful information that has been processed by ATP.
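
A job script can also check for the merged backtrace once aprun has returned. The following is a minimal sketch rather than an official recipe: the function name atp_trace_path is hypothetical, and it assumes that ATP writes atpMergedBT.dot into the directory the job ran in.

```shell
#!/bin/bash
# Hypothetical helper: report whether ATP left a merged backtrace behind.
# Assumption: atpMergedBT.dot is written to the directory the job ran in.
atp_trace_path() {
    atp_dir="${1:-.}"
    if [ -f "$atp_dir/atpMergedBT.dot" ]; then
        printf '%s\n' "$atp_dir/atpMergedBT.dot"
    else
        return 1
    fi
}

# Example use after the aprun line in a job script (not run here):
#   if trace=$(atp_trace_path "$PBS_O_WORKDIR"); then
#       echo "ATP trace written to $trace" >&2
#   fi
```

The function prints the path and succeeds only when the file exists, so it can be used directly in an if test.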

7.3 STAT

The Stack Trace Analysis Tool (STAT) is a cross-platform debugging tool from the University of Wisconsin-Madison. ATP is based on the same technology as STAT; both are designed to gather and merge stack traces from a running application's parallel processes. STAT can be useful when an application seems to be deadlocked or stuck, i.e. it does not crash but does not progress as expected, and it has been designed to scale to a very large number of processes. Full information on STAT, including use cases, is available at the STAT website.

STAT will attach to a running program and query that program to find out where all the processes in that program currently are. It will then process that data and produce a graph displaying the unique process locations (i.e. where all the processes in the running program currently are). To make this easily understandable it collates together all processes that are in the same place providing only unique program locations for display.

7.3.1 STAT Example

To use the STAT tool you need to run an interactive job. To do this, we recommend using

qsub -I 

In particular, do not use the '-V' option: this exports the login environment to the interactive job, which can cause problems when connecting STAT to the running job. Add '-X' if you intend to use 'stat-view' from within the interactive job.

Once you have launched your interactive job and navigated to the /work directory where you will run your code, load the STAT module as follows:

module load stat

Then launch your job as normal, but run it as a background task. For example, the following command runs an executable called my_exe using 512 processes; the & symbol runs the application in the background:

aprun -n 512 -N 24 ./my_exe &

Now you need to discover the process ID (PID) of the job you have just launched. Use the following command to do this:

ps 

This should present you with a set of text that looks something like this:

  PID TTY          TIME CMD
21704 pts/0    00:00:00 bash
21868 pts/0    00:00:00 aprun
21871 pts/0    00:00:00 aprun
21879 pts/0    00:00:00 aprun
21884 pts/0    00:00:00 ps

When your application has reached the point at which it hangs, issue the following command (replacing PID below with the PID of the second aprun task listed by the ps command above):

stat-cl PID
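
Picking the PID out of the ps listing by hand is easy to get wrong, so it can be scripted. This is a minimal sketch: the function name second_aprun_pid is hypothetical, and it assumes the default PID/TTY/TIME/CMD column layout shown above.

```shell
#!/bin/bash
# Hypothetical helper: print the PID of the second 'aprun' entry in ps output.
# Reads ps output on stdin; assumes the command name is the last column.
second_aprun_pid() {
    awk '$NF == "aprun" { n++; if (n == 2) { print $1; exit } }'
}

# Example use in the interactive session (not run here):
#   pid=$(ps | second_aprun_pid)
#   stat-cl "$pid"
```

Fed the example listing above, the function prints 21871. It prints nothing if fewer than two aprun entries are present.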

Once STAT has finished working you can kill your aprun job using the following command (again replacing PID as you did for the STAT command):

kill -9 PID

Now you can view the result that STAT has produced using the following command (exe is replaced with the name of the executable you ran):

stat-view stat_results/exe.0000/exe.0000.3D.dot

This should produce a graph displaying all the different places in the program that the parallel processes were at when you queried them. If you have problems viewing the graph it is likely you have not exported your X display when you logged into ARCHER or when you submitted your interactive job. Viewing the graph does not need to be done through an interactive job so you can quit the interactive job at this point and view the graph from the normal ARCHER login nodes.
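
If you query the same executable more than once, several result directories may accumulate. The sketch below prints the graph file from the highest-numbered one. The function name latest_stat_dot is hypothetical, and the sketch assumes that every run follows the stat_results/exe.NNNN/exe.NNNN.3D.dot layout shown above with an increasing four-digit number.

```shell
#!/bin/bash
# Hypothetical helper: print the .3D.dot path from the highest-numbered
# STAT result directory for a given executable name.
# Assumption: results follow stat_results/exe.NNNN/exe.NNNN.3D.dot.
latest_stat_dot() {
    stat_exe="$1"
    stat_base="${2:-stat_results}"
    stat_latest=""
    for d in "$stat_base/$stat_exe".[0-9][0-9][0-9][0-9]; do
        # The glob expands in sorted order, so the last match is the newest.
        if [ -d "$d" ]; then
            stat_latest="$d"
        fi
    done
    [ -n "$stat_latest" ] || return 1
    printf '%s/%s.3D.dot\n' "$stat_latest" "${stat_latest##*/}"
}

# Example use (not run here):
#   stat-view "$(latest_stat_dot my_exe)"
```

The function only builds the expected path; it does not check that the .dot file itself exists.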

7.4 GDB (GNU Debugger)

The standard GNU debugger, GDB, is available on Cray XC systems. The debugger currently only supports the command line interface.

There are two components that you must use to debug your parallel program using GDB:

  • The 'lgdb' program which launches gdbserver processes on the login nodes. This is available by loading a cray-lgdb module on the system.
  • The 'gdb' program which connects to the remote program instances (started using 'lgdb') and provides the debugging command line interface.

When you execute your program using 'lgdb' the system will provide instructions on how to connect to the gdbserver process to debug your program. By default, on ARCHER, the output from STDOUT is only delivered once the job is completed so you must redirect the output so that you have the required information to connect to the process. See the example below for details on this.

7.4.1 Launching your program using 'lgdb'

The 'lgdb' command (available after loading a cray-lgdb module) is used to launch your program and attach a gdbserver process to enable debugging. If you are running interactively, then the syntax for launching a 48-task job and debugging parallel process 0 would be:

lgdb --pes=0 --command="aprun -n 48 my_parallel_program.x"

This command will yield instructions on how to connect the 'gdb' process that will look something like:

user@login1:/work/user/debug> less stdout.txt 
sending /opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/lgdbd... completed
sending /opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/gdbserver... completed

*** create a new window and load the correct lgdb module for each target
*** run gdb from the following path:
/opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/gdb [PATH-TO-YOUR-APPLICATION]

*** the following gdb target commands should be used in separate windows
*** [Pe=0] to debug this Pe type the following in gdb
target remote nid00003:10000

If you do not have interactive access and need to run in batch mode, simply replace the normal aprun command in your job submission script with the call to 'lgdb' and redirect STDOUT to a file. For example:

lgdb --pes=0 --command="aprun -n 48 my_parallel_program.x" > stdout.txt

You must redirect STDOUT to a file in this way so you can access the information printed above on how to connect to the gdbserver from the 'gdb' program.
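
In batch mode the connection details only appear once the job has started, so it can help to poll the redirected file from the login node. The following is a minimal sketch: both function names are hypothetical, and it assumes the 'target remote' line format shown in the example output above.

```shell
#!/bin/bash
# Hypothetical helper: print the 'target remote' lines from lgdb output.
# Assumption: one such line per debugged Pe, at the start of a line.
lgdb_targets() {
    grep '^target remote ' "${1:-stdout.txt}"
}

# Poll once a second until lgdb has written its connection details,
# giving up after the given number of attempts (default 60).
wait_for_lgdb_target() {
    lgdb_file="${1:-stdout.txt}"
    lgdb_tries="${2:-60}"
    while [ "$lgdb_tries" -gt 0 ]; do
        if [ -f "$lgdb_file" ] && lgdb_targets "$lgdb_file"; then
            return 0
        fi
        lgdb_tries=$((lgdb_tries - 1))
        sleep 1
    done
    return 1
}

# Example use on the login node (not run here):
#   wait_for_lgdb_target stdout.txt 300
```

On success the matching lines are printed, ready to be typed at the (gdb) prompt.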

7.4.2 Debugging the remote gdbserver using 'gdb'

Once you have your compute process running with an associated gdbserver using the 'lgdb' command as specified above then you can start the GNU debugger on the command line on the login node with a command such as:

user@login1:/work/user/debug> \
  /opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/gdb my_parallel_program.x

This will give you the '(gdb)' prompt where you can enter the command to link to the gdbserver process to start debugging. For example:

(gdb) target remote nid00003:10000

Now you can use gdb in the same way as you would if you were debugging a local program.
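
To avoid retyping the connection command, it can be written to a gdb command file and passed to gdb with its -x option. This is a sketch only: the function name make_gdb_commands and the file name attach.gdb are hypothetical, and only the first 'target remote' line is used, which matches the --pes=0 case described above.

```shell
#!/bin/bash
# Hypothetical helper: emit gdb commands that attach to the gdbserver named
# in lgdb's redirected output and optionally set one breakpoint.
# Assumption: the first 'target remote' line is the one wanted (--pes=0).
make_gdb_commands() {
    gdb_target=$(grep '^target remote ' "$1" | head -n 1)
    [ -n "$gdb_target" ] || return 1
    printf '%s\n' "$gdb_target"
    if [ -n "$2" ]; then
        printf 'break %s\n' "$2"
    fi
}

# Example use (not run here); use the gdb binary named in the lgdb output:
#   make_gdb_commands stdout.txt my_function > attach.gdb
#   gdb -x attach.gdb my_parallel_program.x
```

The function fails without printing anything if no connection line is found, so an empty command file signals that the job has not yet started.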

7.4.3 Useful GDB commands

Please see the GNU debugger documentation for a full list of the gdb commands. Some of the most often used commands are listed below.

Note that pressing 'ctrl-c' while the program is running in GDB will interrupt the program and report where it stopped; you can then use 'bt' to print a backtrace. You can use this to identify problematic areas of the code.

  • break function_name - (or b) insert breakpoint at start of specified function
  • break file:line_number - insert breakpoint at line number in specified file
  • continue - (or c) continue running program until next breakpoint is reached
  • next - (or n) step to next line of program (steps over subroutine calls; use step, or s, to step into them)
  • list - (or l) list source code around current position
  • list start_line,end_line - list source code from start_line to end_line in current function.
  • print variable_name - (or p) print the value of the specified variable
  • print array_name(index) - print value at specified index of 1D array
  • print array_name(index1,index2) - print value at specified index of 2D array
  • print array_name(index)@elements - print elements values from the array starting at index.
  • ptype variable_name - print information on the variable type and array dimensions (if this is an array).
  • quit - (or q) quit gdb and halt the running program.

7.4.4 Example: debugging an MPI program using GDB

This example illustrates the debugging of the VASP 5 code.

First, you must compile your program with debugging symbols (-g flag). You should also usually ensure that optimisation is turned off (-O0 flag) so that reordering of source code lines does not take place. (Of course, it may sometimes be necessary to include optimisation if this is the cause of the problems.)

In this example we will assume that you are running without interactive access to the compute nodes. Write a job submission script for your job in the usual way but with the following changes: you should load the 'cray-lgdb' module and you replace the standard aprun line with a call to 'lgdb' that contains your aprun command and which redirects STDOUT. For example:

#!/bin/bash --login
#PBS -N vasp_debug

# Number of nodes (2 nodes x 24 cores = 48 MPI processes)
#PBS -l select=2

# Walltime for the debug job
#PBS -l walltime=1:0:0

# Your account code
#PBS -A z01

# Add the Cray GDB module
module add cray-lgdb

# Location of the VASP 5 executable
export VASP_EXEDIR=/work/user/software/VASP/bin

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory the job was submitted from
cd $PBS_O_WORKDIR

# Start the gdbserver with our parallel job.
#   We make sure we redirect STDOUT (to stdout.txt) so we can access
#    the information needed to attach to the remote gdbserver
#   We also use the --pes=0 option to start a single gdbserver instance
#    attached to the first MPI task
lgdb --pes=0 --command="aprun -n 48 $VASP_EXEDIR/vasp" > stdout.txt
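
The select=2 value in the script above follows from running 48 MPI tasks on nodes with 24 cores each. As a small worked example, the arithmetic can be sketched as below; the function name nodes_needed is hypothetical, and the default of 24 cores per node is taken from the node size quoted elsewhere in this guide.

```shell
#!/bin/bash
# Hypothetical helper: nodes needed for a number of MPI tasks, rounding up.
# Assumption: 24 cores per node unless a second argument says otherwise.
nodes_needed() {
    mpi_tasks="$1"
    per_node="${2:-24}"
    echo $(( (mpi_tasks + per_node - 1) / per_node ))
}

nodes_needed 48    # prints 2, matching select=2 above
nodes_needed 512   # prints 22
```

Adding per_node - 1 before the integer division rounds any partly filled node up to a whole node.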

You should then submit this job in the usual way. Once the job is running, you will be able to inspect the contents of the 'stdout.txt' file to get the ID of the server to attach to using GDB. For example:

user@login1:/work/user/debug> less stdout.txt 
sending /opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/lgdbd... completed
sending /opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/gdbserver... completed

*** create a new window and load the correct lgdb module for each target
*** run gdb from the following path:
/opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/gdb [PATH-TO-YOUR-APPLICATION]

*** the following gdb target commands should be used in separate windows
*** [Pe=0] to debug this Pe type the following in gdb
target remote nid00003:10000

This tells us the 'gdb' binary to use and indicates that we should use GDB to target the remote gdbserver at 'nid00003:10000'. On the login node command line run the specified 'gdb' executable:

user@login1:/work/user/debug> \
  /opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/gdb $VASP_EXEDIR/vasp

dlopen failed on 'libthread_db.so.1' - /lib64/libthread_db.so.1: undefined symbol: ps_lgetfpregs
GDB will not be able to debug pthreads.

GNU gdb 6.8
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
(gdb) 

and then target the remote gdbserver with the command specified in the 'stdout.txt' file:

(gdb) target remote nid00003:10000
Remote debugging using nid00003:10000
[New Thread 22131]
0x00000000012aed60 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:82
82      ../sysdeps/unix/syscall-template.S: No such file or directory.
        in ../sysdeps/unix/syscall-template.S
Current language:  auto; currently asm

Now we can add a breakpoint at one of our program functions and proceed to it. For example:

(gdb) b force_and_stress_
Breakpoint 1 at 0x87f168: file ./force.f, line 1160.
(gdb) c
Continuing.

Once the program has reached the specified breakpoint we can start debugging. To see the current backtrace of where we are in the program, use the 'bt' command:

Breakpoint 1, force_and_stress_ (kineden=Cannot access memory at address 0x0
) at ./force.f:1160
1160          CALL START_TIMING("G")
Current language:  auto; currently fortran
(gdb) bt
#0  force_and_stress_ (kineden=Cannot access memory at address 0x0
) at ./force.f:1160
#1  0x000000000041ad48 in vamp () at ./main.f:2665
#2  0x00000000004008e0 in main ()
#3  0x0000000001374d14 in __libc_start_main (main=0x4008a0 <main>, argc=1, ubp_av=0x7fffffffb548, 
    init=0x1375200 <__libc_csu_init>, fini=0x13751c0 <__libc_csu_fini>, rtld_fini=0, stack_end=0x7fffffffb538)
    at libc-start.c:226
#4  0x00000000004007a9 in _start () at ../sysdeps/x86_64/elf/start.S:113

We can list the source code lines and add another breakpoint further into the routine by line number:

(gdb) l 1160,1180
1160          CALL START_TIMING("G")
1161    
1162          DO ISP=1,WDES%NCDIJ
1163             CALL RC_ADD(CHTOT(1,ISP),1.0_q,CHTOT(1,ISP),0.0_q,CHTOTL(1,ISP),GRIDC)
1164          ENDDO
1165          IF (LDO_METAGGA().AND.LMIX_TAU()) THEN
1166             DO ISP=1,WDES%NCDIJ
1167                CALL RC_ADD(KINEDEN%TAU(1,ISP),1.0_q,KINEDEN%TAU(1,ISP),0.0_q,KINEDEN%TAUL(1,ISP),GRIDC)
1168             ENDDO
1169          ENDIF
1170          RHOLM_LAST=RHOLM
1171    
1172          IF (INFO%LCHCON .OR. INFO%LCORR) THEN
1173             CALL SET_CHARGE(W, WDES, INFO%LOVERL, &
1174                  GRID, GRIDC, GRID_SOFT, GRIDUS, C_TO_US, SOFT_TO_C, &
1175                  LATT_CUR, P, SYMM, T_INFO, &
1176                  CHDEN, LMDIM, CRHODE, CHTOT, RHOLM, N_MIX_PAW, IRDMAX)
1177    
1178             CALL STOP_TIMING("G",IO%IU6,'CHARGE')
1179          ENDIF
1180    !----------------------- FORCES ON IONS    -----------------------------
(gdb) b ./force.f:1172
Breakpoint 2 at 0x87f37a: file ./force.f, line 1172.

and then proceed to this breakpoint:

(gdb) c
Continuing.

Breakpoint 2, force_and_stress_ (kineden=Cannot access memory at address 0x0
) at ./force.f:1172
1172          IF (INFO%LCHCON .OR. INFO%LCORR) THEN

Now we can examine the types and values of some of the variables:

(gdb) ptype info%lchcon
type = logical
(gdb) p info%lchcon
$1 = .FALSE.
(gdb) ptype rholm
type = double precision (0,0)
(gdb) p rholm(1,1)
$2 = 0.051804883959039337
(gdb) p rholm(1,1)@3
$3 = (0.051804883959039337, 0.0083683781999898572, -0.0018751730313048671)

The last expression prints 3 consecutive array elements of rholm, starting at (1,1).

Once you have finished debugging you can kill the running program and quit the debugger with the 'quit' command:

(gdb) q
The program is running.  Exit anyway? (y or n) y

7.5 DDT Debugger

DDT is a debugging tool for scalar, multi-threaded and large-scale parallel applications.

For more information on using DDT, see Version 4.2.2_39977 of the DDT User Guide, which matches the default version of DDT on ARCHER.

7.5.1 Download and install the remote client

The recommended way to use DDT on ARCHER is to install the free Allinea Forge remote client on your workstation or laptop using these instructions.

Once you have installed the remote client, the instructions below describe how to compile and debug a simple executable.

7.5.2 Compile the code for debugging

To compile the code to be debugged you should install the source code on the /work filesystem and compile the executable into a location on /work to ensure that the running job can access all of the required files.

You will also usually want to specify the -O0 option to turn off all code optimisation (as this can produce a mismatch between source code line numbers and debugging information) and -g to include debugging information in the compiled executable.

For example, using the simple MPI code from the ARCHER Quick Start Guide we would compile with:

auser@eslogin01:/work/x01/x01/auser> ftn -O0 -g -o hello_world.x hello_world.f90

7.5.3 Set up the debugger to submit jobs to ARCHER

We must now tell the remote client how to submit jobs to the ARCHER job submission system. You should only need to configure this once and the client will remember for future debugging sessions.

On the main DDT interface, click "Options" and on the dialog box that appears, select "Job Submission" from the list on the left. Ensure that the settings are set up as illustrated below and click "OK":

(The path to the Submission template file is /home/y07/y07/cse/allinea/templates/archer.qtf.)

Submit command: qsub
Cancel command: qdel
Display command: qstat

7.5.4 Run your debugging session on your program

Now everything is configured we can debug our program. On the main DDT interface click "Run". This will bring up a dialogue where you can specify the path to your executable and other options such as the number of processors to use and the walltime for the job. An example of the dialog is shown below with dummy values completed for the executable name and the working directory. For our small example we are just using a single node (24 cores) and running for just 10 minutes (so we can use the "short" queue).

Note: to use the short queue your job must have a maximum run time of 20 minutes. If you wish to run for longer you should remove the queue specification so that you run in the standard ARCHER queue.
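
The 20 minute limit can be checked before a walltime is entered in the dialog. The following is a minimal sketch; the function name fits_short_queue is hypothetical and it assumes the walltime is written in the H:M:S form used for PBS walltimes.

```shell
#!/bin/bash
# Hypothetical helper: succeed if an H:M:S walltime is within the
# 20 minute (1200 second) limit quoted for the short queue.
fits_short_queue() {
    wall_secs=$(printf '%s\n' "$1" |
        awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }')
    [ "$wall_secs" -le 1200 ]
}

# Example use (not run here):
#   fits_short_queue 0:10:0 && echo "short queue is fine"
```

The conversion is done in awk so that fields with leading zeros, such as 08, are read as decimal numbers.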

Once all the options have been set up you can submit your debugging session to the ARCHER queues by clicking "Submit".

A dialog showing the ARCHER queue will appear while the tool waits for your job to start. Note: you may see the warning message below which may be safely ignored.

pbs_iff: cannot connect to host
pbs_iff: cannot connect to host
No Permission.
qstat: cannot connect to server sdb (errno=15007)

Once the job starts, a dialog will appear while the debugger connects to your running processes.

Finally, the debugging interface will appear, allowing you to interactively debug your program.

7.5.5 Finishing your debugging session

To finish the debugging session, just quit the remote client on your workstation or laptop; DDT will ensure that the session is cleaned up properly.

7.5.6 Using DDT directly on the compute nodes

If you intend to use DDT directly on the compute nodes instead of using the remote client, you will need to load the allinea module before compiling and linking your program, and before executing your program on the compute nodes:

module load allinea

The User Guide gives instructions on how to compile and execute your program, and the command

ddt -help

lists the options for the ddt command. Please contact the ARCHER helpdesk for assistance with using DDT directly on the compute nodes.

7.5.7 Memory debugging of statically-linked programs

When using memory debugging with statically-linked programs, the debugging version of the malloc library needs to be included when the program is linked.

Load the allinea module

module load allinea

and add the following arguments to the command line when linking the program with the compiler wrapper

-L $ALLINEA_TOOLS_DIR/lib/64 -Wl,--whole-archive -ldmallocthcxx -Wl,--no-whole-archive -Wl,--allow-multiple-definition

The standard malloc library is usually linked by the compiler wrapper. This is replaced with the debugging version. Using whole-archive ensures that any libraries automatically loaded by the compiler wrapper use the debugging version. Using allow-multiple-definition ensures that the standard malloc library is ignored.

With the current version of DDT on ARCHER (Version 4.2.2_39977), care needs to be taken to avoid preloading the dynamic version of the dmalloc library when debugging a statically-linked program: if DDT attempts to preload the library, the session will hang.

In ~/.allinea/system.config make sure that the line

preload =

is just that, with nothing assigned to preload. Edit the file if required (or it is possible to change this using the DDT GUI, see below).
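
The check and the edit can both be done from the command line. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes the setting occupies a single line beginning with 'preload' in the file. It is worth keeping a copy of the file before changing it.

```shell
#!/bin/bash
# Hypothetical helpers for the 'preload =' line in a DDT system.config file.
# Assumption: the setting is a single line of the form 'preload = value'.

# Succeed only if a preload line exists and nothing is assigned to it.
preload_is_empty() {
    grep -q '^preload[[:space:]]*=[[:space:]]*$' "$1"
}

# Rewrite any preload line so that nothing is assigned to it.
clear_preload() {
    sed 's/^preload[[:space:]]*=.*$/preload =/' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}

# Example use (not run here):
#   cfg=~/.allinea/system.config
#   preload_is_empty "$cfg" || clear_preload "$cfg"
```

Other lines in the file are copied through unchanged.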

Start your DDT session. After ticking the 'Memory Debugging' box, the configuration screen will look like

If you decide to change some of the settings in the 'Memory Debugging' entry by pressing 'Details...' in that entry, ALWAYS untick the 'Preload the memory debugging library' box before pressing 'OK'.

You can use this to reset the preload= line in ~/.allinea/system.config rather than editing the file.