Troubleshooting Guide
Commonly occurring errors
Type | Error | Notes | Solution |
BATCH SYSTEM | apsched: request exceeds max nodes, alloc | aprun is being used on a login node. | Use aprun within a PBS job script submitted to the compute nodes using qsub. |
BATCH SYSTEM | apsched: claim exceeds reservation's node-count | The number of nodes required for an aprun command within a PBS job script is larger than the number of nodes requested with the -l select option. | Change the number of nodes requested in the -l select option to match the number required for the aprun command. If the aprun option -n alone is used, the number of nodes required is the number of processes divided by 24, rounded up. If other options are used to change the number of processes per node (e.g., -N, -S, -j), the calculation is less straightforward; see the aprun section in the User Guide and the aprun man page. |
BATCH SYSTEM | qsub: Archer: Please use the select/place resource selection language | Encountered if the mppwidth/mppnppn combination is used in the job script instead of -l select=[nodes]. | Use -l select=[nodes] and remove the mppwidth/mppnppn combination, as described in the ARCHER User Guide. |
BATCH SYSTEM | qsub: request rejected as filter hook 'update_user_environment' encountered an exception. Please inform Admin. | Encountered if there is no select statement in the job script | Use -l select=[nodes] as described in the ARCHER User Guide |
BATCH SYSTEM | The command 'qstat -f [job id]' shows the information 'comment = Not Running: Insufficient amount of resource vntype (cray_compute != )' | Encountered when the system is full. There are not enough free nodes for the job to run. | Job will run when enough resources become free. |
BATCH SYSTEM | The command 'qstat -f [job id]' shows the information 'comment = Not Running: Insufficient amount of resource ncpus (R: 264 A: 234 T: 118320)' (the values for R, A and T will vary) | Encountered when the system is full. There are not enough free nodes for the job to run. | Job will run when enough resources become free. |
BATCH SYSTEM | Jobid 0000.sdb - will not start (comment = Not Running: Host set host=archer_3071 has too few free resources) | Encountered when the system is full. There are not enough free nodes for the job to run. | Job will run when enough resources become free. |
BATCH SYSTEM | Jobid 00000.sdb (user auser) - will not start (comment = Not Running: PBS Error: ARCHER: User auser is not in XXXXXX) | Error seen when account auser is not a member of project group XXXXXX but is trying to access that budget via the #PBS -A directive in the job submission script. | Check that your ARCHER account auser belongs to the correct project group on resource pool XC (ARCHER). Do this via SAFE. Note: project group membership is not automatically inherited from HECToR. |
BATCH SYSTEM | Jobid 00000.sdb (user auser) - will not start (comment = Not Running: PBS Error: budget XXXXX does not have enough resource) | Error seen when budget XXXXX has been used up since the job was submitted. | Ask the PI to add additional time resource to budget XXXXX then release job from held state using command qrls -h u <job ID> |
BATCH SYSTEM | [NID 00600] 2013-11-22 16:26:43 Exec my_app.x failed: chdir /home3/x01/x01/user/dir No such file or directory | Error seen when a job is run on the compute nodes from a directory on the /home filesystem. | Resubmit the job using the /work filesystem. (/home is not accessible on the compute nodes.) |
LOGINS AND PASSWORDS | Authentication token manipulation error. | Error seen when trying to change password manually on command line more than once per day | Request change via SAFE instead. |
COMPILING | Illegal Instruction | Error seen when running code on the postprocessing/serial nodes that has been compiled for the compute nodes. | Recompile for the postprocessing/serial nodes; see Compiling for Postprocessing/Serial Nodes. |
LOGIN NODES | Process terminates unexpectedly | Processes running on the login nodes that consume more than 10 minutes of CPU time may be killed to prevent overloading of the nodes. | Submit a job to the serial queue; see Postprocessing/Serial Jobs. |
LOGIN NODES | /usr/bin/xauth: error in locking authority file /home/... | The most likely cause is /home usage exceeding quota, so that files such as .Xauthority cannot be saved in your home space. | Check disk usage within the /home file system against quota with SAFE, then delete files or request an increased quota from the PI. |
COMPILING | forrtl: severe (168): Program Exception - illegal instruction | Error seen when running code that has been compiled with the Intel Fortran compiler with the -real-size 64 or -r8 options. The compiler produces AVX2 instructions, which are not provided by the processors in the compute nodes. | Recompile with the -xAVX option; see Useful compiler options. |
COMPILING | Please verify that both the operating system and the processor support Intel(R) F16C instructions. | Runtime error seen when running serial code in job scripts without using "aprun". The job launcher nodes have Sandy Bridge processors and so do not support some instructions produced for Ivy Bridge processors. | Run the executable on the compute nodes by prepending the command with "aprun". If this is a serial application then use "aprun -n 1"; otherwise specify the number of parallel tasks using the "-n" option. |
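
As an illustration of the node-count rule given for the "claim exceeds reservation's node-count" error, the following Python sketch computes the value to request with -l select when only the aprun -n option is used. The function name nodes_required and its cores_per_node parameter are hypothetical names chosen for this example; the default of 24 comes from the rule in the table, and the sketch does not cover the -N, -S or -j options.

```python
def nodes_required(n_processes: int, cores_per_node: int = 24) -> int:
    """Return the number of nodes needed for `aprun -n n_processes`.

    Assumes only the -n option is used, so each node is filled with
    cores_per_node processes (24 is the figure quoted in the table).
    """
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    if cores_per_node < 1:
        raise ValueError("cores_per_node must be at least 1")
    # Integer ceiling division: divide, then round any remainder up.
    return -(-n_processes // cores_per_node)


if __name__ == "__main__":
    # Print the select value that matches a few example process counts.
    for n in (24, 25, 264):
        print(f"aprun -n {n}: request -l select={nodes_required(n)}")
```

For example, 25 processes do not fit on one 24-core node, so the function returns 2 and the job script would need to request two nodes.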