7. Debugging

Note that the usefulness and accuracy of the information within any debugger depends on your compilation options. If you have optimisation switched on then you may find that the line numbers listed in the debugging information do not correspond with the statements in your source code file. For debugging code we always recommend that you compile with optimisation switched off and the -g flag enabled to provide the most accurate information.

You may want to use an interactive session whilst debugging, in which case you are advised to also consult the section in the user guide on interactive jobs and productivity tips.

7.1 Available Debuggers

ARCHER has Cray ATP, STAT, GDB (via lgdb) and DDT installed.

7.2 Cray ATP

Cray ATP (Abnormal Termination Processing) is a tool that monitors your application and, in the event of an abnormal termination, collates the failure information from all the running processes into files for analysis.

With ATP enabled, in the event of abnormal termination, all of the stacktraces are gathered from the dying processes, analysed and collated into a single file called atpMergedBT.dot. In addition the stacktrace from the first process to die (hence the probable cause for the failure) is delivered to stderr.

The atpMergedBT.dot file can be viewed using the stat-view command that is accessible by loading the stat module.

7.2.1 ATP Example

To enable ATP you should load the atp module in your job submission script and set the "ATP_ENABLED" environment variable to 1, i.e. you should include the following commands in your (bash) job submission script:

module load atp
export ATP_ENABLED=1

and then run your job using aprun as usual. Once your application has terminated abnormally you need to log into the service while exporting the X display back to your local machine (you must have an X server running locally) with:

ssh -Y username@archer.ac.uk

Load the stat module with:

module add stat

and view the merged stacktrace with:

stat-view atpMergedBT.dot

The stderr from your job should also contain useful information that has been processed by ATP.
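The commands above can be combined into a complete job submission script. A minimal sketch is shown below; the job name, node count, walltime, budget code and executable name are all placeholders that you should replace with your own values:

```shell
#!/bin/bash --login
#PBS -N atp_debug
#PBS -l select=1
#PBS -l walltime=0:20:0
#PBS -A budget            # replace with your budget code

cd $PBS_O_WORKDIR

# Enable Abnormal Termination Processing for this run
module load atp
export ATP_ENABLED=1

# Run as usual; on abnormal termination ATP writes atpMergedBT.dot
# and sends the first failing process's stacktrace to stderr
aprun -n 24 ./my_exe
```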

Please note that Cray ATP only captures failures where the application is forcibly aborted, such as by a segmentation fault. Aborts initiated from within the application code itself will not be captured by Cray ATP and so no atpMergedBT.dot file will be generated.

7.3 STAT

The Stack Trace Analysis Tool (STAT) is a cross-platform debugging tool from the University of Wisconsin-Madison. ATP is based on the same technology as STAT; both are designed to gather and merge stack traces from a running application's parallel processes. The STAT tool can be useful when an application seems to be deadlocked or stuck, i.e. it does not crash but does not progress as expected, and it has been designed to scale to a very large number of processes. Full information on STAT, including use cases, is available at the STAT website.

STAT will attach to a running program and query that program to find out where all of its processes currently are. It will then process that data and produce a graph of the unique process locations. To make this easily understandable, it collates together all processes that are in the same place, so that only unique program locations are displayed.
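The collation idea can be sketched with ordinary shell tools. The one-line 'stack traces' and process counts below are invented purely for illustration (this is not the STAT implementation, just the grouping concept):

```shell
# Six fake per-process 'stack traces', one line per process
traces='main>compute>mpi_wait
main>compute>mpi_wait
main>io_write
main>compute>mpi_wait
main>io_write
main>init'

# Group identical traces and count them, most common location first
printf '%s\n' "$traces" | sort | uniq -c | sort -rn
# The most common location (3 processes in mpi_wait) is listed first
```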

7.3.1 STAT Example

A YouTube video from the ARCHER CSE team demonstrating STAT is available.

To use the STAT tool you need to run an interactive job. To do this, we recommend using

qsub -I 

In particular, do not use the '-V' option; this exports the login environment to the interactive job, which can cause problems when connecting STAT to the running job. Add '-X' if using 'stat-view' from within the interactive job.

Once you've launched your interactive job and navigated to the /work directory where you will run your code, load the STAT module as follows:

module load stat

Then you simply launch your job as normal, but run it as a background task. For example, the following command runs an executable called my_exe using 512 processes; the & symbol runs the application in the background:

aprun -n 512 -N 24 ./my_exe &

Now you need to discover the process ID (PID) of the job you have just run. Use the following command to do this:

ps 

This should present you with a set of text that looks something like this:

  PID TTY          TIME CMD
21704 pts/0    00:00:00 bash
21868 pts/0    00:00:00 aprun
21871 pts/0    00:00:00 aprun
21879 pts/0    00:00:00 aprun
21884 pts/0    00:00:00 ps
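The second aprun PID can also be picked out of output like this programmatically. A small sketch using the sample output above (the PIDs are purely illustrative; on a live session you would pipe ps directly into awk):

```shell
# Sample ps output, mirroring the example above
ps_output='  PID TTY          TIME CMD
21704 pts/0    00:00:00 bash
21868 pts/0    00:00:00 aprun
21871 pts/0    00:00:00 aprun
21879 pts/0    00:00:00 aprun
21884 pts/0    00:00:00 ps'

# Select lines whose command is aprun, take the PID of the second one
pid=$(printf '%s\n' "$ps_output" | awk '$4 == "aprun" { print $1 }' | sed -n '2p')
echo "$pid"    # → 21871
```

On a live system the equivalent would be `ps | awk '$4 == "aprun" { print $1 }' | sed -n '2p'`.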

When your application has reached the point at which it hangs, issue the following command (replacing PID with the PID of the second aprun process from the ps output above):

stat-cl PID

Once STAT has finished working you can kill your aprun job using the following command (again replacing PID as you did for the STAT command):

kill -9 PID

Now you can view the result that STAT has produced using the following command (replace exe with the name of the executable you ran):

stat-view stat_results/exe.0000/exe.0000.3D.dot

This should produce a graph displaying all the different places in the program that the parallel processes were at when you queried them. If you have problems viewing the graph it is likely you have not exported your X display when you logged into ARCHER or when you submitted your interactive job. Viewing the graph does not need to be done through an interactive job so you can quit the interactive job at this point and view the graph from the normal ARCHER login nodes.

7.4 GDB (GNU Debugger)

The standard GNU debugger, GDB, is available on Cray XC systems. The debugger currently supports only a command line interface. For the latest Cray information on 'lgdb', search for lgdb in the Cray Technical Documentation and choose the XC Series Programming Environment User Guide result. Cray's own documentation corresponds to the output of the lgdb 'help' command.

There are two components that you must use to debug your parallel program using GDB:

  • The 'lgdb' program, which launches your application and the associated gdbserver processes. This is available by loading a cray-lgdb module on the system.
  • The 'gdb' program, which connects to the remote program instances (started using 'lgdb') and provides the debugging command line interface.

7.4.1 Launching your program using 'lgdb'

The 'lgdb' command (available after loading a cray-lgdb module) is used to launch your program and attach a gdbserver process to enable debugging. Use an interactive session whilst debugging (see the section in the user guide on interactive jobs and productivity tips). At the lgdb command prompt (dbg all>), use the launch command to launch an application. For example, to run the application my_parallel_program.x that is in the current directory:

launch $job{48} my_parallel_program.x

$job{48} defines a 'process set', here called $job, with 48 processes. You choose the name of the process set, and there can be several process sets with different names. Additional options for aprun can be passed using --launch-args="...". For example, to launch a hybrid MPI-OpenMP job, use

export OMP_NUM_THREADS=8
lgdb
launch $job{4} --launch-args="-j 1 -N 2 -d 8 -S 1 -ss" ./my_parallel_program.x

Environment variables for the application can also be set within lgdb. Please see the man page for lgdb and also the 'help' and 'help launch' commands in lgdb for detailed information on how to use lgdb.

7.4.2 Useful GDB commands

Please see the GNU debugger documentation for a full list of the gdb commands. Some of the most often used commands are listed below.

Note: pressing 'ctrl-c' while the program is running under GDB will cause the program to halt and print a backtrace. You can use this to identify problematic areas of the code.

  • break function_name - (or b) insert breakpoint at start of specified function
  • break file:line_number - insert breakpoint at line number in specified file
  • continue - (or c) continue running program until next breakpoint is reached
  • next - (or n) step to next line of program (steps over subroutine calls)
  • step - (or s) step to next line of program, stepping into subroutine calls
  • list - (or l) list source code around current position
  • list start_line,end_line - list source code from start_line to end_line in current function
  • print variable_name - (or p) print the value of the specified variable
  • print array_name(index) - print value at specified index of 1D array
  • print array_name(index1,index2) - print value at specified index of 2D array
  • print array_name(index)@elements - print elements values from the array starting at index
  • quit - (or q) quit gdb and halt the running program
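As an illustration, a typical session using these commands might look like the following sketch (the program, subroutine and variable names compute_step, istep and density are hypothetical):

```
(gdb) break compute_step          # stop at entry to compute_step
(gdb) continue                    # run until the breakpoint is hit
(gdb) print istep                 # inspect a scalar variable
(gdb) print density(10)           # inspect element 10 of a 1D array
(gdb) print density(10)@5         # inspect 5 elements starting at index 10
(gdb) list                        # show source around the current line
(gdb) quit
```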

7.5 DDT Debugger

DDT is a debugging tool for scalar, multi-threaded and large-scale parallel applications.

Check the DDT website for the latest information:

  • DDT Support Page (including the latest User Guide)
  • For the latest Cray information re 'DDT' search for ddt in The Cray Technical Documentation, and choose the XC Series Programming Environment User Guide result.
    7.5.1 Download and install the remote client

    The recommended way to use DDT on ARCHER is to install the free Allinea Forge remote client on your workstation or laptop using these instructions.

    Once you have installed the remote client, the instructions below describe how to compile and debug a simple executable.

    7.5.2 Compile the code for debugging

    To compile the code to be debugged you should install the source code on the /work filesystem and compile the executable into a location on /work to ensure that the running job can access all of the required files.

    You will also usually want to specify the -O0 option to turn off all code optimisation (as this can produce a mismatch between source code line numbers and debugging information) and -g to include debugging information in the compiled executable.

    For example, using the simple MPI code from the ARCHER Quick Start Guide we would compile with:

    auser@eslogin01:/work/x01/x01/auser> ftn -O0 -g -o hello_world.x hello_world.f90
    

    7.5.3 Set up the debugger to submit jobs to ARCHER

    We must now tell the remote client how to submit jobs to the ARCHER job submission system. You should only need to configure this once and the client will remember for future debugging sessions.

    On the main DDT interface, click "Options" and on the dialog box that appears, select "Job Submission" from the list on the left. Ensure that the settings are set up as illustrated below and click "OK":

    (The path to the Submission template file is /home/y07/y07/cse/allinea/templates/archer.qtf.)

    Submit command:   qsub
    Cancel command:   qdel
    Display command:  qstat

    7.5.4 Run your debugging session on your program

    Now everything is configured we can debug our program. On the main DDT interface click "Run". This will bring up a dialog where you can specify the path to your executable and other options such as the number of processors to use and the walltime for the job. An example of the dialog is shown below with dummy values completed for the executable name and the working directory. For our small example we are just using a single node (24 cores) and running for just 10 minutes (so we can use the "short" queue).

    Note: to use the short queue your job must have a maximum run time of 20 minutes. If you wish to run for longer you should remove the queue specification so that you run in the standard ARCHER queue.

    Once all the options have been set up you can submit your debugging session to the ARCHER queues by clicking "Submit".

    A dialog showing the ARCHER queue will appear while the tool waits for your job to start. Note: you may see the warning message below which may be safely ignored.

    pbs_iff: cannot connect to host
    pbs_iff: cannot connect to host
    No Permission.
    qstat: cannot connect to server sdb (errno=15007)

    Once the job starts, a dialog will appear while the debugger connects to your running processes.

    Finally, the debugging interface will appear, allowing you to interactively debug your program.

    7.5.5 Finishing your debugging session

    To finish the debugging session, just quit the remote client on your workstation or laptop; DDT will ensure that the session is cleaned up properly.

    7.5.6 Using DDT directly on the compute nodes

    If you intend to use DDT directly on the compute nodes instead of using the remote client, you will need to load the allinea module before compiling and linking your program, and before executing your program on the compute nodes:

    module load allinea
    

    The User Guide gives instructions on how to compile and execute your program, and the command

    ddt -help
    

    lists the options for the ddt command. Please contact the ARCHER helpdesk for assistance with using DDT directly on the compute nodes.

    7.5.7 Memory debugging of statically-linked programs

    When using memory debugging with statically-linked programs, the debugging version of the malloc library needs to be included when the program is linked.

    Load the allinea module

    module load allinea
    

    and add the following arguments to the command line when linking the program with the compiler wrapper

    -L $ALLINEA_TOOLS_DIR/lib/64 -Wl,--whole-archive -ldmallocthcxx -Wl,--no-whole-archive -Wl,--allow-multiple-definition
    

    The standard malloc library is usually linked by the compiler wrapper. This is replaced with the debugging version. Using whole-archive ensures that any libraries automatically loaded by the compiler wrapper use the debugging version. Using allow-multiple-definition ensures that the standard malloc library is ignored.
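Putting the pieces together, the final link step might look like the following sketch; the compiler wrapper (ftn or cc, depending on your language), object file and program names are placeholders:

```shell
module load allinea

# Link with the debugging malloc library; --whole-archive forces any
# libraries pulled in by the wrapper to use the debugging version, and
# --allow-multiple-definition ensures the standard malloc is ignored.
cc -O0 -g -o my_prog.x my_prog.o \
   -L $ALLINEA_TOOLS_DIR/lib/64 \
   -Wl,--whole-archive -ldmallocthcxx -Wl,--no-whole-archive \
   -Wl,--allow-multiple-definition
```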

    With DDT version 4.2.2_39977 (note: this is no longer the current default version on ARCHER), care needs to be taken to avoid preloading the dynamic version of the dmalloc library when debugging a statically-linked program: if DDT attempts to preload the library, the session will hang.

    In ~/.allinea/system.config make sure that the line

    preload =
    

    is just that, with nothing assigned to preload. Edit the file if required (or it is possible to change this using the DDT GUI, see below).

    Start your DDT session and tick the 'Memory Debugging' box on the run configuration screen.

    If you change any settings under 'Details...' in the 'Memory Debugging' entry, always untick the 'Preload the memory debugging library' box before pressing 'OK'.

    You can use this to reset the preload= line in ~/.allinea/system.config rather than editing the file.