Woodcrest Cluster

Photograph of the RRZE Woodcrest Cluster

RRZE's Woodcrest cluster ("Woody", Bechtle/HP) is a high-performance compute resource with a high-speed interconnect. It is intended for distributed-memory (MPI) or hybrid parallel programs with medium to high communication requirements.

The system entered the Top500 list of November 2006 at rank 124 and was ranked number 329 in the November 2007 list.

  • 0 compute nodes (w0xx nodes), each with two Xeon 5160 "Woodcrest" chips (4 cores per node) running at 3.0 GHz, 4 MB of shared Level 2 cache per dual-core chip, 8 GB of RAM and 160 GB of local scratch disk. This segment originally comprised 212 nodes, which have been switched off step by step.

  • InfiniBand interconnect fabric with 10 GBit/s bandwidth per link and direction

  • 2 frontend systems with the same features as the compute nodes but 320 GB of local scratch disk

  • 1 NFS file server with a capacity of about 50 TB

  • Overall peak performance of 10.4 TFlop/s (6.62 TFlop/s LINPACK)

In 2012, 40 single-socket compute nodes with Intel Xeon E3-1280 processors (4-core "Sandy Bridge", 3.5 GHz, 8 GB RAM and 400 GB of local scratch disk) were added (w10xx nodes). These nodes are only connected by Gbit Ethernet; therefore, only single-node (or single-core) jobs are allowed in this segment.

In 2013, 72 single-socket compute nodes with Intel Xeon E3-1240 v3 processors (4-core "Haswell", 3.4 GHz, 8 GB RAM and 900 GB of local scratch disk) were added (w11xx nodes). These nodes are also only connected by Gbit Ethernet, so only single-node jobs are allowed in this segment. They replaced three racks of old w0xxx nodes, providing significantly more compute power at a fraction of the power consumption.

Although Woody was originally designed for parallel programs that span significantly more than one node, its communication network is weak compared to our other clusters and to today's standards. It is therefore now mostly intended for single-node jobs. Note, however, that the rule "jobs with less than one node are not supported on the w0xx nodes and are subject to be killed without notice" still applies. In other words, you cannot reserve single CPUs; the minimum allocation is one node. As an exception, single cores can also be requested in the w10xx segment.

This page covers the following topics:

Access, User Environment, and File Systems

Access to the machine

Access to the system is granted via a number of frontend nodes (currently two) via ssh. Please connect to

woody.rrze.uni-erlangen.de

and you will be randomly routed to one of the frontends. All systems in the cluster, including the frontends, have private IP addresses in the 10.188.82.0/23 range, so they can only be accessed directly from within the FAU networks. If you need access from outside FAU, first connect to the dialog server cshpc.rrze.uni-erlangen.de and then ssh to Woody from there. While it is possible to ssh directly to a compute node, this is only allowed while you have a batch job running on that node. When all of your batch jobs on a node have ended, all of your shells on that node are killed automatically.
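
For illustration, a typical login sequence might look like this (a minimal sketch; "user" stands for your HPC account name):

ssh user@woody.rrze.uni-erlangen.de       # from within the FAU networks
ssh user@cshpc.rrze.uni-erlangen.de       # from outside FAU: hop via the dialog server ...
ssh woody.rrze.uni-erlangen.de            # ... and continue to Woody from there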

The login and compute nodes run 64-bit SuSE Linux Enterprise Server. As on most other RRZE HPC systems, a modules environment is provided to facilitate access to software packages. Type "module avail" to get a list of available packages.
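
A minimal sketch of typical module usage on the frontends (version names may differ):

module avail            # list all available packages
module load intel64     # load the Intel compilers (also pulls in the matching Intel MPI module)
module list             # show the currently loaded modules
module unload intel64   # remove the package from the environment again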

File Systems

The following table summarizes the available file systems and their features. Also check the main file system table in the HPC environment description.

File system overview for the Woody cluster
Mount point | Access via | Purpose | Technology, size | Backup | Data lifetime | Quota
/home/hpc | $HOME | Storage of source, input and important results | central servers, 5 TB | YES + snapshots | account lifetime | YES (very restrictive)
/home/vault | - | Mid- to long-term storage | central servers, HSM | YES + snapshots | account lifetime | YES
/home/woody | $WOODYHOME | Cluster-local large-volume storage | NFS, 35 TB | NO | account lifetime | YES
/tmp | $TMPDIR | Temporary job data directory | node-local RAID0 array, 130 GB | NO | job runtime | NO

NFS file systems $HOME and $WOODYHOME

When connecting to one of the frontend nodes, you'll find yourself in your regular RRZE $HOME directory (/home/rrze/...). The cluster also has its own home directory tree, which can be accessed much faster. This local tree is mounted at /home/woody/$GROUP/$USER/ and is available via the shell variable $WOODYHOME. Please note that RRZE does not back up any data stored in this local tree!

Quotas are active on $WOODYHOME. New users get a standard quota of 25 GBytes; more space is available on request. All users should regard disk space as a valuable resource and not use it as a long-term archive.

Parallel file system $FASTTMP

The parallel file system ($FASTTMP = /wsfs/...) was retired in summer 2012.

Node-local storage $TMPDIR

Each node has 130 GB of local hard drive capacity for temporary files, available under /tmp/ (also accessible via /scratch/). All files in these directories that are older than a certain number of days (currently 12) are deleted automatically without notification.

If possible, compute jobs should use the local disk for scratch space as this reduces the load on the central servers. Important data to be kept can be copied to a cluster-wide volume at the end of the job, even if the job is cancelled by a time limit. See the section on batch processing for details.

In batch scripts the shell variable $TMPDIR points to a node-local, job-exclusive directory whose lifetime is limited to the duration of the batch job. This directory exists on each node of a parallel job separately (it is not shared between the nodes). It will be deleted automatically when the job ends. Please see the section on batch processing for examples on how to use $TMPDIR.
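
As a minimal sketch (input and output file names are placeholders), a batch job could use $TMPDIR like this:

# inside a batch script: work in the node-local, job-exclusive scratch directory
cd $TMPDIR
cp ${PBS_O_WORKDIR}/input.dat .
${WOODYHOME}/bin/a.out -i input.dat -o result.dat
# copy important results to a cluster-wide volume before the job ends
cp result.dat ${WOODYHOME}/results/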

Software Development

You will find a wide variety of software packages in different versions installed on the cluster frontends. The module concept is used to simplify the selection and switching between different software packages and versions. Please see the section on batch processing for a description of how to use modules in batch scripts.

Compilers

Intel

Intel compilers are the recommended choice for software development on Woody. A current version of the Fortran90, C and C++ compilers (called ifort, icc and icpc, respectively) can be selected by loading the intel64 module. For use in scripts and makefiles, the module sets the shell variables $INTEL_F_HOME and $INTEL_C_HOME to the base directories of the compiler packages.

As a starting point, try the option combination -O3 -xP when building objects. All Intel compilers have a -help switch that gives an overview of all available compiler options. For in-depth information please consult the local documentation in $INTEL_[F,C]_HOME/doc/ and Intel's online documentation for the C/C++ and Fortran compilers.
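
As a concrete sketch (source file names are placeholders), building and linking could look like this:

module load intel64
ifort -O3 -xP -c solver.f90
ifort -O3 -xP solver.o -o solver.exe
icc   -O3 -xP -c main.c        # C/C++ analogously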

These compilers generate 64-bit objects. Production and use of 32-bit objects is not supported by RRZE, although you might be able to successfully run pre-built 32-bit binaries.

Endianness

All x86-based processors use the little-endian storage format, which means that the least-significant byte of multi-byte data is stored at the lowest memory address. The same format is used in unformatted Fortran data files. To simplify the handling of big-endian files (e.g. data you have produced on IBM Power, Sun Ultra, or NEC SX systems), the Intel Fortran compiler can convert the endianness on the fly during read or write operations. This can be configured separately for different Fortran units: just set the environment variable F_UFMTENDIAN at run time.

Examples:

Effect of the environment variable F_UFMTENDIAN
F_UFMTENDIAN value | Effect
big | everything treated as big-endian
little | everything treated as little-endian (default)
big:10,20 | everything treated as little-endian, except units 10 and 20 (big-endian)
"big;little:8" | everything treated as big-endian, except unit 8 (little-endian)

GNU

The GNU compiler collection (GCC) is available directly without having to load any module. As the cluster is running an enterprise version of SuSE Linux, do not expect to find the latest GCC version here. Be aware that the default Intel MPI module assumes the Intel compiler and does not work with the GCC. For details see the section on parallel computing.

DDT Parallel Debugger

DDT (Distributed Debugging Tool), sold by Allinea, is a parallel GUI-based source-level debugger, similar to the Totalview debugger installed on the Transtec cluster. With DDT you can debug serial and MPI-parallel programs, i.e. single-step, set breakpoints, inspect variables, etc. To use DDT, perform the following steps:

  1. Connect to a Woody frontend with X forwarding enabled (-X option to ssh) and start an interactive batch job with the -X switch to qsub.
  2. Load the ddt module and execute the ddt command. Do not put it into the background!
  3. If this is the first time you use DDT, you are prompted to create a new configuration file. Choose "intel-mpi" as the MPI implementation and check "Do not configure DDT for attaching this time" on the next screen. Finally, accept the location of the config file that is suggested to you.
  4. In the "session control" window, click on "advanced". Enter the path to your executable in the top box ("application") and specify any command line arguments below. At the bottom of the window, select the number of processes you wish to start.
  5. Again in the session control window, click on "Change" and select "Submit job through batch or configure own mpirun command". In the "Submit command" input box type "mpirun -n NUM_PROCS_TAG -ddt PROGRAM_ARGUMENTS_TAG". The placeholders NUM_PROCS_TAG and PROGRAM_ARGUMENTS_TAG will get substituted automatically when the command is run.
  6. Click on "Submit" to start your application.
  7. For serial applications, select "none" as the MPI implementation.
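
Steps 1 and 2 could look like this on the command line (a sketch; node count and walltime are placeholders):

ssh -X woody.rrze.uni-erlangen.de
qsub -I -X -l nodes=2:ppn=4,walltime=00:45:00
# ... once the interactive job has started, on the assigned node:
module load ddt
ddt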

There are many more options for debugging with DDT. Full documentation is accessible in the GUI via the Help menu or in ${DDT}/doc/. Check in particular the userguide.pdf and the quickstart-*.pdf documents. If you think you have encountered a bug, please contact hpc@rrze.

DDT is a commercial application with considerable license fees. The number of concurrent processes that can be run under DDT's control is limited. Please exit DDT at the end of your debugging session to free resources for other users.

MPI Profiling with Intel Trace Collector/Analyzer

Intel Trace Collector/Analyzer are powerful tools that acquire and display information on the communication behaviour of an MPI program. Performance problems related to MPI can be identified by looking at timelines and statistical data. Appropriate filters can reduce the amount of information displayed to a manageable level.

In order to use Trace Collector/Analyzer you have to load the itac module. This section describes only the most basic usage patterns. Complete documentation can be found in ${VT_ROOT}/doc/, on Intel's ITAC website, or in the Trace Analyzer Help menu.

Trace Collector (ITC)

ITC is a tool for producing trace files from a running MPI application. These traces contain information about all MPI calls and messages and, optionally, about functions in the user code. To use ITC in the standard way you only have to re-link your application. If you want to add user function information to the trace, the code must be instrumented manually using the ITC API and recompiled. Please note that we currently support Intel MPI only.

Shell variables for compiling and linking an MPI application with ITC
Variable | Use | Example | Comments
$ITC_LIB | Link against the ITC libraries | mpif90 *.o -o a.out $ITC_LIB | Place after the object files (but before any MPI library) on the linker command line! Trace files are not written if the MPI code does not finish correctly.
$ITC_LIBFS | Link against the "failsafe" ITC libraries | mpif90 *.o -o a.out $ITC_LIBFS | Place after the object files (but before any MPI library) on the linker command line! Use this variant for MPI codes that do not finish correctly; more intrusive than $ITC_LIB.
$ITC_INC | Include directory with the ITC API headers | mpicc $ITC_INC -c hello.c | -

After an MPI application that has been compiled or linked with ITC has terminated, a collection of trace files is written to the current directory. They follow the naming scheme <binary-name>.stf* and serve as input for the Trace Analyzer tool.
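
Putting these pieces together, producing a trace might look like this (a sketch; file names are placeholders):

module load itac
mpif90 -c mhd.f90
mpif90 mhd.o -o mhd.exe $ITC_LIB     # ITC libraries after the object files
mpirun -np 8 ./mhd.exe               # inside a batch job; writes mhd.exe.stf* to the current directory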

Trace Analyzer (ITA)

The <binary-name>.stf file produced after running the instrumented MPI application should be used as an argument to the traceanalyzer command:

traceanalyzer <binary-name>.stf

The Trace Analyzer processes the trace files written by the application and lets you browse through the data. Click on "Charts - Event Timeline" to see the messages transferred between all MPI processes and the time each process spends in MPI and application code, respectively. Click and drag to zoom into the timeline data (zoom out with the "o" key). "Charts - Message Profile" shows statistics about the communication requirements of each pair of MPI processes. The statistics displays change their content according to the data currently shown in the timeline window. Please consult the Help menu or the documentation in ${VT_ROOT}/doc/ for more information. Additionally, the HPC group of RRZE will be happy to work with you on gaining insight into the performance characteristics of your MPI applications.

Parallel Computing

The intended parallelization paradigm on Woody is message passing using the Message Passing Interface (MPI). The Intel compilers also support shared-memory programming within a node with OpenMP.

OpenMP

The installed Intel compilers support the OpenMP standard in version 2.5. The compiler recognizes OpenMP directives if you supply the command line option -openmp; this option is also required for the link step.
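
For example (a sketch; file names are placeholders):

module load intel64
ifort -openmp -O3 -c stream.f90
ifort -openmp stream.o -o stream.exe
export OMP_NUM_THREADS=4     # one thread per core of a node
./stream.exe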

Intel has kindly provided a temporary license for their Cluster OpenMP product, which makes it possible to run OpenMP programs across the cluster interconnect. If you are interested in using Cluster OpenMP, please contact hpc@rrze.

MPI

Although the cluster is basically able to support many different MPI versions, we maintain and recommend Intel MPI. Intel MPI supports different compilers (GCC, Intel). If you use the Intel compilers, the appropriate intelmpi module is loaded automatically upon loading the intel64 compiler module. The standard MPI wrapper scripts mpif77, mpif90, mpicc and mpicxx are then available. By loading an intelmpi/3.XXX-gnu module instead of the default intelmpi, those scripts will use GCC instead.

There are no special prerequisites for running MPI programs. Just use

mpirun [<options>] your-binary your-arguments

By default, one process is started on each allocated CPU (4 per node) in a blockwise fashion, i.e. the first node is filled completely, followed by the second node, etc. If you want to start n < 4 processes per node (e.g. because of large memory requirements) you can specify the -npernode n option to mpirun (-pernode is equivalent to -npernode 1). Finally, if you want to start fewer processes than the number of CPUs available, you can add the -np N option, which will start only N processes.

Examples: We assume that the batch system has allocated 8 nodes (32 processors) for the job.

mpirun a.out

will start 32 processes. With the blockwise placement described above, MPI rank r runs on node number int(r/4).

mpirun -npernode 2 a.out

will start 16 processes, and rank r will run on node number int(r/2).

mpirun -pernode -np 4 a.out

will start 4 processes, each on its own node, i.e. 4 of the 8 allocated nodes stay empty. Note that it is currently not possible to start more processes than the number of processors allocated.

We do not support running MPI programs interactively on the frontends. For interactive testing, please start an interactive batch job on some compute nodes. During working hours, a number of nodes is reserved for short (< 1 hour) test jobs.

The MPI start mechanism passes all environment variables that are set in the shell where mpirun is executed on to all MPI processes. Thus it is not required to change your login scripts in order to export settings like OMP_NUM_THREADS, LD_LIBRARY_PATH, etc.

Libraries

Mathematical Libraries

Intel [Cluster] Math Kernel Library ([C]MKL)

The Math Kernel Library provides threaded BLAS, LAPACK, and FFT routines and some supplementary functions (e.g., random number generators). For distributed-memory parallelization there are also ScaLAPACK and CDFT (cluster DFT), together with some sparse solver subroutines. It is highly recommended to use MKL for any kind of linear algebra if possible.

After loading the mkl module, several shell variables are available that help with compiling and linking programs that use MKL:

Environment variables for compiling and linking with MKL
Variable | Use | Example
$MKL_INC | Compiler option(s) for the MKL include search path | icc -O3 $MKL_INC -c code.c
$MKL_SHLIB | Linker options for dynamic linking of LAPACK, BLAS, FFT | ifort *.o -o prog.exe $MKL_SHLIB
$MKL_LIB | Linker options for static linking of LAPACK, BLAS, FFT | ifort *.o -o prog.exe $MKL_LIB
$MKL_SCALAPACK | Linker options for ScaLAPACK (includes LAPACK, BLAS, FFT) | mpicc *.o -o parsolve.exe $MKL_SCALAPACK
$MKL_CDFT | Linker options for cluster DFT functions (includes BLAS, FFT) | mpif90 *.o -o parfft.exe $MKL_CDFT

Many MKL routines are threaded and can run in parallel if you set the OMP_NUM_THREADS shell variable to the desired number of threads. If you do not set OMP_NUM_THREADS, the default number of threads is one. Using OpenMP together with threaded MKL is possible, but the OMP_NUM_THREADS setting will apply to both your code and the MKL routines. If you don't want this, you can force MKL into serial mode by setting the MKL_SERIAL environment variable to YES.
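
A short sketch of building against MKL and controlling its threading (file names are placeholders):

module load mkl
icc -O3 $MKL_INC -c solver.c
icc solver.o -o solver.exe $MKL_SHLIB
export OMP_NUM_THREADS=4     # let threaded MKL use all four cores of a node
# or force MKL into serial mode instead:
# export MKL_SERIAL=YES
./solver.exe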

For more in-depth information, please refer to Intel's online documentation on MKL.

FFTW

FFTW is a high-performance, free library for fast Fourier transforms. It is used by many software packages. We provide a current version of FFTW that is compatible with the Intel compilers via the fftw module.

Environment variables for compiling and linking with FFTW
Variable | Use | Example
$FFTW_INC | Compiler option(s) for the FFTW include search path | icc -O3 $FFTW_INC -c code.c
$FFTW_LIB | Linker options for (static) linking of FFTW | ifort *.o -o prog.exe $FFTW_LIB
$FFTW_BASE | Base directory of the FFTW installation | -

The fftw-wisdom and fftw-wisdom-to-conf tools and their manual pages are also provided in the respective search paths.

Batch Processing

All user jobs except short serial test runs must be submitted to the cluster by means of the Torque Resource Manager. Submitted jobs are routed into a number of queues (depending on the required resources, e.g. runtime) and sorted according to a priority scheme. It is normally not necessary to explicitly specify a queue when submitting a job; sorting into the proper queue happens automatically. The queue configuration is as follows:

Queues on the Woody cluster
Queue | Min-max walltime | Min-max nodes | Availability | Comments
route | N/A | N/A | all users | Default router queue; sorts jobs into the execution queues
devel | 0 - 01:00:00 | 1 - 16 | all users | Some nodes are reserved for this queue during working hours
work | 01:00:01 - 24:00:00 | 1 - 32 | all users | "Workhorse" queue
onenode | 01:00:01 - 48:00:00 | 1 - 1 | all users | Only very few jobs from this queue are allowed to run at the same time
special | 0 - infinity | 1 - all | special users | Direct job submission with -q special

If you submit jobs that request only one node, by default you can get any type of node: one of the very old Core2-based w0xxx nodes or one of the newer Sandy Bridge or Haswell based w1xxx nodes. They all have the same number of cores (4) and memory (8 GB) per node, but the speed of the CPUs differs vastly, which means that job runtimes will vary significantly. You therefore have to choose the walltime you request from the batch system so that your jobs can finish even on the slowest nodes.

It is possible to request certain kinds of nodes from the batch system. Besides the obvious use for benchmarking, this has two major use cases: jobs that use less than a full node are currently only allowed on the Sandy Bridge nodes, so you need to request those explicitly; and some applications benefit greatly from AVX (using AVX can be up to twice as fast as not using it), but the old Core2-based nodes do not support it, so you need to restrict such jobs to the newer nodes. You request a node property by adding it to your -l nodes=... request string, e.g.: qsub -l nodes=1:ppn=4:sb. In general, the following node properties are available:

Available node properties on the Woody cluster
Property | Matching nodes (#) | Comments
:any | wxxxx (248) | Can run on any node in the cluster. This is the default for jobs requesting only one node.
:c2 | w0xxx (136) | Runs only on the old Core2-based nodes (which have an InfiniBand network). This is the default for jobs requesting more than one node.
:avx | w1xxx (112) | Runs on any node that supports AVX, i.e. both the Sandy Bridge and the Haswell nodes.
:sb | w10xx (40) | Runs on the Sandy Bridge nodes only. Required for jobs with ppn other than 4.
:hw | w11xx (72) | Runs on the Haswell nodes only.

A job will run when the required resources become available. For short test runs with less than one hour of runtime, a number of nodes is reserved during working hours. These nodes are dedicated to the devel queue. Do not use the devel queue for production runs. Since we do not allow MPI-parallel applications on the frontends, short parallel test runs must be performed using batch jobs.

It is also possible to submit interactive jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive (including X11) programs there.

The command to submit jobs is called qsub. To submit a batch job use

qsub <further options> [<job script>]

The job script may be omitted for interactive jobs (see below). After submission, qsub will output the Job ID of your job. It can later be used for identification purposes and is also available as $PBS_JOBID in job scripts (see below). These are the most important options for the qsub command:

Important options for qsub and their meaning
Option | Meaning
-N <job name> | Specifies the name shown by qstat. If omitted, the name of the batch script file is used.
-o <standard output file> | File name for the standard output stream. If omitted, a name is built from the job name (see -N) and the job ID.
-e <error output file> | File name for the standard error stream. If omitted, a name is built from the job name (see -N) and the job ID.
-l nodes=<# of nodes>:ppn=4[:c2] | Specifies the requested nodes. In the default segment of Woody you must always specify ppn=4.
-l nodes=1:ppn=4:any | Single-node job that can run in the w0xx or w10xx segment. Performance may vary significantly depending on the segment.
-l nodes=1:ppn=<1|2|3|4>:sb | Single-core job that only runs in the w10xx segment. If ppn is less than 4, Torque considers the remaining CPU(s) available and may assign them to other jobs. Make sure you only use the fraction of main memory corresponding to the ppn value.
-l walltime=HH:MM:SS | Specifies the required wall clock time (runtime). When the job reaches this walltime it is sent a TERM signal; if it has not ended 60 seconds later, it is sent KILL. See the section on staging out results below for hints on how to use this delay to save important data. If you omit the walltime option, a short default is used. Please specify a reasonable runtime, since the scheduler also bases its decisions on this value (short jobs are preferred).
-M x@y -m abe | You will get e-mail to x@y when the job is aborted (a), begins (b), and ends (e). You can choose any subset of abe for the -m option.
-W depend=<dependency list> | Makes the job depend on certain conditions. E.g., with -W depend=afterok:12345 the job will only run after job 12345 has ended successfully, i.e. with an exit code of zero. Please consult the qsub man page for more information.
-r [y|n] | Specifies whether the job is rerunnable (y, default) or not (n). Under some (error) conditions Torque may re-queue jobs that had already been running before the error occurred. If a job is not suited to this, use -r n.
-I | Interactive job. A job script may still be specified, but it is ignored except for the PBS options; no code is executed. Instead the user gets an interactive shell on one of the allocated nodes and can execute any command there, in particular mpirun to start a parallel program.
-X | Enable X11 forwarding. If the $DISPLAY environment variable is set when submitting the job, an X program running on the compute node(s) will be displayed on the user's screen. This only makes sense for interactive jobs (see -I).
-q <queue> | Specifies the Torque queue (see above); the default queue is route. Usually this parameter is not required since the route queue automatically forwards the job to an appropriate execution queue.

Jobs are always required to request all CPUs in a node (ppn=4). Using less than 4 CPUs per node is not supported and may result in your jobs being killed without further notice.

There are several Torque commands for job inspection and control. The following table gives a short summary:

Useful Torque user commands
Command | Purpose | Options
qstat [<options>] [<JobID>|<queue>] | Displays information on jobs. Only the user's own jobs are shown; for information on the overall queue status see the section on job priorities. | -a display "all" jobs in a user-friendly format; -f extended job info; -r display only running jobs
qdel <JobID> ... | Removes a job from the queue | -
qalter <qsub-options> | Changes job parameters previously set by qsub. Only certain parameters may be changed after the job has started. | see qsub and the qalter manual page
qcat [<options>] <JobID> | Displays stdout/stderr from a running job | -o display stdout (default); -e display stderr; -f output appended data as the job runs (like tail -f)

Batch Scripts

To submit a batch job you have to write a shell script that contains all the commands to be executed. Job parameters like estimated runtime and required number of nodes/CPUs can also be specified there:

Example of a batch script
#!/bin/bash -l
#
# allocate 16 nodes (64 CPUs) for 6 hours
#PBS -l nodes=16:ppn=4,walltime=06:00:00
#
# job name 
#PBS -N Sparsejob_33
#
# stdout and stderr files
#PBS -o job33.out -e job33.err
#
# first non-empty non-comment line ends PBS options

# jobs always start in $HOME -
# change to a temporary job directory on $FASTTMP
mkdir ${FASTTMP}/$PBS_JOBID
cd ${FASTTMP}/$PBS_JOBID
# copy input file from location where job was submitted
cp ${PBS_O_WORKDIR}/inputfile .

# run
mpirun ${WOODYHOME}/bin/a.out -i inputfile -o outputfile

# save output on parallel file system
mkdir -p ${FASTTMP}/output/$PBS_JOBID
cp outputfile ${FASTTMP}/output/$PBS_JOBID
cd 
# get rid of the temporary job dir
rm -rf ${FASTTMP}/$PBS_JOBID

The comment lines starting with #PBS are ignored by the shell but interpreted by Torque as options for job submission (see above for an options summary). These options can all be given on the qsub command line as well. The example also shows the use of the $FASTTMP and $WOODYHOME variables. $PBS_O_WORKDIR contains the directory where the job was submitted. All batch scripts start executing in the user's $HOME so some sort of directory change is always in order.

If you have to load modules from inside a batch script, you can do so. The only requirement is that you use either a csh-based shell or bash with the -l switch, as in the example above.
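
A minimal sketch of the relevant lines at the top of such a batch script:

#!/bin/bash -l
#PBS -l nodes=4:ppn=4,walltime=02:00:00
module load intel64          # loading modules works because of the -l switch above
mpirun ./a.out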

Interactive Jobs

The resources of the Woody cluster are mainly available in batch mode. However, for testing purposes or when running applications that require some manual intervention (like GUIs), Torque offers interactive access to the compute nodes that have been assigned to a job. To do this, specify the -I option to the qsub command and omit the batch script. When the job is scheduled, you will get a shell on the master node (the first in the assigned job node list). It is possible to use any command, including mpirun, there. If you need X forwarding, use the -X option in addition to -I.
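
For example, the following request (node count and walltime are placeholders) yields an interactive shell with X forwarding on one compute node:

qsub -I -X -l nodes=1:ppn=4,walltime=00:45:00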

Note that the starting time of an interactive batch job cannot be determined reliably; you have to wait for it to be scheduled. We therefore recommend running such jobs with wall clock time limits of less than one hour, so that the job is routed to the devel queue, for which a number of nodes is reserved during working hours.

Interactive batch jobs do not produce stdout and stderr files. If you want a record of your session, use e.g. the UNIX script command.

Staging Out Results

Warning! This does not work with the current version of the batch system due to a software bug!

When a job reaches its walltime limit, it is killed by the batch system. The job's node-local data will either be deleted (if you use $TMPDIR) or become inaccessible, because login to a node is disallowed once you no longer have a job running there. In order to prevent data loss, Torque waits 60 seconds after the TERM signal before sending the final KILL. If the batch script catches TERM with a signal handler, those 60 seconds can be used to copy node-local data to a global file system:

Example: How to use a shell signal handler to stage out data
#!/bin/bash

# signal handler: catch SIGTERM, save scratch data
trap "sleep 5 ; cd $TMPDIR ; tar cf - * | tar xf - -C ${WOODYHOME}/$PBS_JOBID ; exit" 15

# make job data save directory
mkdir ${WOODYHOME}/$PBS_JOBID

cd $PBS_O_WORKDIR

# assuming a.out stores temp data in $TMPDIR
mpirun ./a.out

The sleep command at the start of the signal handler gives your application some time to shut down before the data is saved. Please note that it is required to use a Bourne or Korn shell variant for catching the TERM signal since csh has only limited facilities for signal handling.

Job Priorities and Reservations

The scheduler of the batch system assigns a priority to each waiting job. This priority value depends on certain parameters (like waiting time, queue, user group, and recently used CPU time (a.k.a. fairshare)). The ordering of waiting jobs listed by qstat does not reflect the priority of jobs. All waiting jobs with their assigned priority are listed anonymously on the HPC user web pages (those pages are password protected; execute the docpw command to get the username and password). There you also get a list of all running jobs, any node reservations, and all jobs which cannot be scheduled for some reason. Some of this information is also available in text form: The text file /home/woody/STATUS/joblist contains a list of all waiting jobs; the text file /home/woody/STATUS/nodelist contains information about node and queue activities.

Further Information

Intel Xeon 5160 "Woodcrest" Processor

The Xeon 5160 processor implements Intel's Core microarchitecture. It is a dual-core chip running at 3.0 GHz and significantly outperforms previous Xeons with the older Netburst architecture (as used, e.g., in our Transtec cluster). It features many architectural enhancements, e.g.:

  • 32kB L1 data cache per core (8-way set-associative, 64B cache line, 2-3 cycles latency, writeback).
  • Two cores sharing a common 4MB L2 cache (16-way set-associative, 64B cache line, 14 cycles latency, writeback).
  • Much shorter pipelines than Netburst.
  • Very short cache latencies compared to Netburst.
  • 4 FLOPs per cycle double precision floating-point throughput with SSE2.
  • Each core can sustain up to one 128-bit load and one 128-bit store operation per cycle.
  • Theoretical memory bandwidth of 10.6 GB/s; less than half of this value is typically seen in applications.
  • Four different hardware prefetchers that try to hide memory latency by loading data and instruction cache lines in advance. In particular, "adjacent cache line prefetch" loads the current and the next cache line on a cache miss automatically, effectively doubling L2 line size to 128B for reads and RFOs.

The Intel® 64 and IA-32 Architectures Optimization Reference Manual contains in-depth information about the microarchitecture and specific optimization techniques.

HP DL140G3 Compute Node

Block diagram of a single Woody node

A compute node comprises two sockets, each housing a Xeon 5160 dual-core chip. Compared to older dual-socket systems, the two frontside buses (FSBs) are not directly connected to each other but to the chipset, which is theoretically able to saturate the bandwidth requirements of both chips concurrently (21.3 GB/s). Due to deficiencies in the bus protocols and other factors, the maximum achievable bandwidth per node is roughly 8 GB/s.

Although the node has UMA memory access characteristics, the peculiar structure of two dual-core chips with separate FSBs and partly shared L2 caches offers diverse possibilities for parallel programming. If, for some reason, only two of the four cores are actually used, it depends on the code's bandwidth and communication requirements whether the processes (threads) should be placed on a single socket or on separate sockets. It is generally a good idea to "pin" threads/processes to cores in order to get reproducible performance results. Please consult hpc@rrze for further advice.
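
As an illustration only (the core numbering below is an assumption and may differ on the actual nodes), processes can be pinned with the standard Linux taskset utility:

taskset -c 0 ./a.out                           # pin a serial run to core 0
OMP_NUM_THREADS=2 taskset -c 0,2 ./a.out       # pin a two-thread run to two specific cores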

The Intel 5000X chipset ("Greencreek") features a "snoop filter" that can to some extent lessen the performance impact of the snoop-based cache coherence protocol. It has an on-chip memory that keeps track of modified cache lines in all the cores' caches.

InfiniBand Interconnect Fabric

The InfiniBand (IB) network features a non-blocking switch (fat tree) with static routing. Each node can send and receive data at a rate of 10 GBit/s per direction (full duplex), with an MPI latency of less than 5 µs. The IB network is used for MPI communication and the parallel file system. NFS traffic uses a separate Gbit Ethernet network.

Last modified: 29 September 2014
