Data Transfer Guide

This page gives an overview of the different mechanisms for transferring data to and from ARCHER, the UK-RDF and remote machines over JANET.

Overview

Data transfer speed may be limited by many different factors so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going.

  • Disk speed - The ARCHER /work file-systems and the RDF file-systems are highly parallel, consisting of a very large number of high-performance disk drives. This allows them to support very high data bandwidth. Unless the remote system has a similarly parallel file-system you may find your transfer speed limited by disk performance.
  • Meta-data performance - Meta-data operations, such as opening and closing files or listing the owner or size of a file, are much less parallel than read/write operations. If your data consists of a very large number of small files you may find your transfer speed is limited by meta-data operations. Meta-data operations performed by other users of the system will interact strongly with your own, so reducing the number of such operations may also reduce variability in your IO timings.
  • Network speed - Data transfer performance can be limited by network speed. In particular, it is limited by the slowest section of the network between source and destination.
  • Fire-wall speed - Most modern networks are protected by some form of fire-wall that filters out malicious traffic. This filtering has some overhead and can reduce data transfer performance. The needs of a general-purpose network that hosts email/web-servers and desktop machines are quite different from those of a research network that needs to support high-volume data transfers. If you are trying to transfer data to or from a host on a general-purpose network you may find that the fire-wall for that network limits the transfer rate you can achieve.

Using the RDF

The Research Data Facility (RDF) consists of 7.8 PB of disk, with an additional 19.5 PB of backup tape capacity. The RDF is external to the national services and is designed for long-term data storage. The RDF file-systems are directly mounted on the ARCHER login nodes and the nodes used to run serial batch jobs. These file-systems are not visible from the compute nodes. The RDF has 3 file-systems:

/general
/epsrc
/nerc

The file-system a user has access to depends on their funding body.

Archiving

If you have related data that consists of a large number of small files, it is strongly recommended to pack the files into a larger "archive" file for long-term storage. A single large file makes more efficient use of the file-system and is easier to move, copy, and transfer because significantly fewer meta-data operations are required. Archive files can be created using tools like tar, cpio and zip. When using these commands to prepare a file for the RDF, it is good practice to forgo compression, as compression slows the archiving process.

tar command

The tar command packs files into a "tape archive" format intended for backup purposes. The command has the general form:

tar [options] [file(s)]

Common options include -c "create a new archive", -v "verbosely list files processed", -W "verify the archive after writing", -l "confirm all file hard links are included in the archive", and -f "use an archive file" (for historical reasons, tar writes its output to stdout by default rather than a file). Putting these together:

tar -cvWlf mydata.tar mydata

will create and verify an archive ready for the RDF. Further information on the hard link check can be found in the tar manual.

To extract files from a tar file, the option -x is used. For example:

tar -xf mydata.tar

will recover the contents of "mydata.tar" to the current working directory.

To verify an existing tar file against a set of data, the -d "diff" option can be used. By default, no output is given if verification succeeds. An example of a failed verification follows:

$> tar -df mydata.tar mydata
mydata/damaged_file: Mod time differs
mydata/damaged_file: Size differs

Note that tar files do not store checksums with their data, requiring the original data to be present during verification.
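Since tar stores no checksums, one common approach is to record checksums in a separate file before archiving and pack it alongside the data; the data can then be verified after extraction without the original copy. A minimal sketch using sha256sum (the file names here are illustrative, not an ARCHER convention):

```shell
# Sample data standing in for real results.
mkdir -p mydata
echo "results" > mydata/run1.out

# Record a checksum for every file, then pack checksums and data together.
find mydata -type f -exec sha256sum {} + > mydata.sha256
tar -cf mydata.tar mydata mydata.sha256

# Later, after extracting elsewhere, verify against the stored checksums:
sha256sum -c mydata.sha256
```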

cpio command

The cpio utility is a common file archiver and is provided by most Linux distributions. The command has the form:

cpio [options] < in > out

Note that cpio uses stdin and stdout for its input and output. The utility does not provide a "recursive" flag like tar and zip, and is hence often used with the find command when working with directories.

Common options include -o "create an archive (copy-out mode)", -v "verbose mode", and -H "use the given archive format". The recommended format is crc as this provides checksum support at the cost of compatibility with older versions of cpio. Together:

find mydata/ | cpio -ovH crc > mydata.cpio

will create an archive ready for the RDF.

Extraction is performed via the -i "copy-in" flag usually paired with -d to ensure directories are created as needed. For example:

cpio -id < mydata.cpio

recovers the contents of the archive to the working directory.

Archive verification can be performed in -i mode with the --only-verify-crc flag set. As the name implies, this skips the file extraction and only verifies the checksum for each file in the archive. An example of this on a damaged archive follows:

$> cpio -i --only-verify-crc < mydata.cpio
cpio: mydata/file: checksum error (0x1cd3cee8, should be 0x1cd3cf8f)
204801 blocks

zip command

The zip file format is widely used for archiving files and is supported by most major operating systems. The utility to create zip files can be run from the command line as:

zip [options] mydata.zip [file(s)] 

Common options are -r, used to zip up a directory, and -#, where "#" is a digit from 0 to 9 specifying the compression level, 0 being the least and 9 the most. The default compression level is -6, but we recommend using -0 to speed up the archiving process. Together:

zip -0r mydata.zip mydata

will create an archive ready for the RDF. Note: Unlike tar and cpio, zip files do not preserve hard links. File data will be copied on archive creation, e.g. an uncompressed zip archive of a 100MB file and a hard link to that file will be approximately 200MB in size. This makes zip an unsuitable format if you wish to precisely reproduce the file system.

The corresponding unzip command is used to extract data from the archive. The simplest use case is:

unzip mydata.zip

which recovers the contents of the archive to the working directory.

Files in a zip archive are stored with a CRC checksum to help detect data loss. unzip provides options for verifying this checksum against the stored files. The relevant flag is -t and is used as follows:

$> unzip -t mydata.zip
Archive:  mydata.zip
    testing: mydata/                 OK
    testing: mydata/file             OK
No errors detected in compressed data of mydata.zip.

Local Copy from ARCHER

Because the RDF file-systems are directly mounted on the ARCHER login nodes, standard commands such as cp and rsync can be used to copy files across from the /home and /work file-systems. You should use these rather than network transfer tools, as direct local copies are usually faster.

cp command

The cp command creates a copy of a file, or if given the -r flag a directory, at the given destination. It can be run from the command line as follows:

cp [options] source destination

However, if you are transferring a large amount of data, you may wish to use the serial nodes on ARCHER. In this case you should use a submission script, for example:

#!/bin/bash --login
#
#PBS -l select=serial=true:ncpus=1
#PBS -l walltime=00:20:00
#PBS -A [budget]

cd $PBS_O_WORKDIR

cp [-r] source destination

In the above script 'source' should be the absolute path of the file/directory being copied or the script should be stored in and submitted from the directory containing the source file/directory.

If you want the batch job to run after another batch job has completed, for example to move the results generated by a parallel job, you can do this by specifying a dependency in the qsub flags:

$ qsub -W depend=afterok:previous-job-id copyscript.pbs

You should not use the mv command to move data between file-systems. Within a single file-system this command is very fast as it just renames the file or directory. When moving between file-systems it is equivalent to copies followed by deletes. There is therefore absolutely no speed advantage and it is much safer to perform the delete later once you are sure the data has been copied correctly.
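The safer copy-then-delete approach can be sketched as follows; the directory names are illustrative, and diff -r is used as a simple byte-for-byte verification of the copied tree:

```shell
# Sample data standing in for real results.
mkdir -p source_dir
echo "data" > source_dir/results.txt

# Copy to the destination file-system (an RDF path in practice).
cp -r source_dir dest_dir

# Verify the copy before touching the original.
if diff -r source_dir dest_dir > /dev/null; then
    rm -r source_dir        # delete only once the copy is confirmed
else
    echo "verification failed; original left in place" >&2
fi
```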

rsync command

The rsync command creates a copy of a file, or if given the -r flag a directory, at the given destination, as with the cp example above. The general form for a 'local' copy, to a directly mounted file-system, is:

rsync [options] source destination

Again for the transfer of a large amount of data, you may wish to use the serial nodes on ARCHER. In this case you should use a submission script, for example:

#!/bin/bash --login
#
#PBS -l select=serial=true:ncpus=1
#PBS -l walltime=00:20:00
#PBS -A budget

cd $PBS_O_WORKDIR

rsync [-r] source destination

In the above script 'source' should be the absolute path of the file/directory being copied or the script should be stored in and submitted from the directory containing the source file/directory.

Because rsync attempts to 'mirror' directories between the two machines, transferring directories containing large numbers of files will result in a large number of meta-data operations. This can significantly reduce the performance of data transfers. However, rsync can still be a good choice when re-synchronising a previously copied directory that contains very large files, as rsync will not move files that already exist (and have the correct size and date) at the destination. If your files are fairly small (less than a GB) then the extra meta-data operations needed might cost more than the time saved, so a simple cp that overwrites all the data might be faster.

Transfer Nodes

The RDF has its own data transfer nodes (dtn01.rdf.ac.uk, dtn02.rdf.ac.uk) that are specifically intended to support import and export of remote data. You should use these nodes when importing/exporting data to/from the RDF disks from remote machines. These are also the nodes where we support specialised data transfer software and additional network connections (such as the dedicated PRACE network). If you have specialised data transfer requirements, you may need to use these nodes.

Data Transfer via SSH

The easiest way of transferring data to or from ARCHER is to use one of the standard programs based on the SSH protocol such as scp, sftp or rsync. These all use the same underlying mechanism (ssh) as you normally use to log in to ARCHER. So, once the command has been executed via the command line, you will be prompted for your password for the specified account on the remote machine. To avoid having to type in your password multiple times you can set up an ssh-key as documented in the user-guide.

The ssh command encrypts all traffic it sends. This means that file transfer using ssh consumes a relatively large amount of CPU time at both ends of the transfer. The login nodes for ARCHER and the RDF have fairly fast processors that can sustain about 100 MB/s transfer, but you may have to consider alternative file transfer mechanisms if you want to support very high data rates. The encryption algorithm used is negotiated between the ssh-client and the ssh-server. There are command line flags that allow you to specify a preference for which encryption algorithm should be used. You may be able to improve transfer speeds by requesting a different algorithm than the default. The arcfour algorithm is usually quite fast if both hosts support it.

A single ssh-based transfer will usually not be able to saturate the available network bandwidth or the available disk bandwidth, so you may see an overall improvement by running several data transfer operations in parallel. To reduce meta-data interactions it is a good idea to overlap transfers of files from different directories.
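The pattern of overlapping transfers from different directories can be sketched as below; here cp stands in for the scp or rsync invocation you would actually run against a remote host, and the directory names are illustrative:

```shell
# Sample directories standing in for separate result sets.
mkdir -p set1 set2 dest
echo a > set1/a.dat
echo b > set2/b.dat

# Start one transfer per directory in the background; with a real
# remote copy each cp would be an scp or rsync command instead.
cp -r set1 dest/ &
cp -r set2 dest/ &

# Wait for all background transfers to finish before continuing.
wait
```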

scp command

The scp command creates a copy of a file, or if given the -r flag a directory, on a remote machine. Below is an example of the command to transfer files to ARCHER:

scp [options] source user@login.archer.ac.uk:[destination]

In the above example, the [destination] is optional; when left out, scp will simply copy the source into the user's home directory. Also, the 'source' should be the absolute path of the file/directory being copied, or the command should be executed in the directory containing the source file/directory.

If you want to request a different encryption algorithm add the -c algorithm-name flag to the scp options.

If you need to run scp from within a batch job, see the special instructions on how to use ssh-keys from batch jobs.

rsync command

The rsync command can also transfer data between hosts using an ssh connection. It creates a copy of a file, or if given the -r flag a directory, at the given destination, similar to scp above. However, given the -a option rsync can also make exact copies (including permissions); this is referred to as 'mirroring'. In this case the rsync command is executed with ssh to create the copy on a remote machine. To transfer files to ARCHER the command should have the form:

rsync [options] -e ssh source user@login.archer.ac.uk:[destination]

In the above example, the [destination] is optional; when left out, rsync will simply copy the source into the user's home directory. Also, the 'source' should be the absolute path of the file/directory being copied, or the command should be executed in the directory containing the source file/directory.

Additional flags can be specified for the underlying ssh command by using a quoted string as the argument of the -e flag. e.g.

rsync [options] -e "ssh -c arcfour" source user@login.archer.ac.uk:[destination]

Other Data Transfer protocols

For very large data transfers it may be necessary to use more specialised tools. For performance reasons these use (multiple) non-encrypted socket connections. As a result, it is usually necessary to have a range of TCP/IP ports open in the fire-walls before these tools can be used. On the RDF data-transfer nodes we support the port range 50000-52000.

Globus online

Globus online is a web-based file transfer portal provided by the globus project:

You will need to register with the portal and create an account before you can use it. Internally Globus-online uses the Grid-FTP file transfer mechanism, but the web-portal provides a simple user interface. It also handles all the scripting of file transfers: Globus-online will retry failed transfers and send notifications when transfers complete, so there is no need to stay logged into the web-site while waiting for transfers to complete. To transfer data between sites both ends of the transfer need to have a globus-online end-point installed. Globus also provide special software you can install on your laptop or desktop that can act as an end-point. You have to activate an end-point before use, either by enabling the connector software on your local machine or by providing login details in your browser for a server end-point. End-points remain active for a couple of days, allowing transfers to complete. The Globus-online end-point on the RDF is called Archer RDF or archer#rdf. When activating this end-point use the same username and password you use to log in to the RDF.

Grid-FTP

The RDF data-transfer nodes support Grid-FTP. If you have a personal grid-certificate you can register the certificate DN via the SAFE and then access the Grid-FTP servers using globus-url-copy.

-bash-4.1$ grid-proxy-init
Your identity: /C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=stephen booth
Enter GRID pass phrase for this identity:
Creating proxy ............................................ Done
Your proxy is valid until: Sat Feb  7 01:43:08 2015

-bash-4.1$ globus-url-copy -vb file:///general/z01/z01/spb/random_4G.dat gsiftp://dtn02.rdf.ac.uk/general/z01/z01/spb/copy.dat
Source: file:///general/z01/z01/spb/
Dest:   gsiftp://dtn02.rdf.ac.uk/general/z01/z01/spb/
  random_4G.dat  ->  copy.dat

   3129999360 bytes       687.05 MB/sec avg       789.00 MB/sec inst

In the above example the gsiftp protocol tells globus-url-copy to connect to the grid-ftp daemon running on dtn02. You can also use the globus-online web-based portal at https://www.globus.org to manage grid-ftp transfers.

If you do not have a personal certificate the data-transfer nodes also support grid-ftp initiated via ssh.

[spbooth@jasmin-xfer1 ~]$ globus-url-copy -vb sshftp://spb@dtn01.rdf.ac.uk/general/z01/z01/spb/random_4G.dat file:///home/users/spbooth/random_4G.dat
Source: sshftp://spb@dtn01.rdf.ac.uk/general/z01/z01/spb/
Dest:   file:///home/users/spbooth/
  random_4G.dat

   3157262336 bytes        30.72 MB/sec avg        13.50 MB/sec inst

This uses your normal ssh credentials to authenticate the connection, but the data is sent over separate sockets and so is not encrypted.

The -p flag to globus-url-copy controls how many parallel sockets are used to transfer data. The best value depends on network conditions, but 4-8 streams is usually a good choice.

bbcp

The bbcp tool allows you to transfer large amounts of data using parallel unencrypted streams, with authentication provided by ssh credentials.

Note: bbcp needs to be installed on both the source and destination hosts.

bbcp downloads and full documentation can be found at:

To use bbcp on the RDF you must first load the bbcp module:

module load bbcp

When copying data from the RDF DTNs you can use the following syntax:

module load bbcp
bbcp -z -s 2 -T 'ssh user@login.archer.ac.uk /usr/local/packages/cse/bbcp/13.05.03/bbcp' my_data.tar.gz user@login.archer.ac.uk:from_rdf.tar.gz

This copies data from the RDF to ARCHER (note that the RDF filesystems are mounted on ARCHER so this is just for illustration purposes, you would not use this mechanism to move data from the RDF to ARCHER).

  • The -s 2 option specifies that two parallel transfer streams should be used
  • The -T option specifies the command to launch bbcp on the remote site
  • The -z option tells bbcp to use the reverse connection protocol (useful for avoiding firewall problems)

If you wish to transfer data to the RDF from a remote host, the command on the remote host would look something like:

bbcp -s 4 -T "ssh user@dtn02.rdf.ac.uk module load bbcp; bbcp" my_data.tar.gz user@dtn02.rdf.ac.uk:to_rdf.tar.gz

If you have any questions about copying your data to/from ARCHER or the RDF, please contact the ARCHER helpdesk via support@archer.ac.uk.