TinyGPU Cluster
The RRZE's TinyGPU cluster is an experimental cluster for developing and benchmarking applications using GPUs as accelerators.
8 compute nodes, each with two Xeon 5550 "Nehalem" chips (8 cores + SMT) running at 2.66 GHz with 8 MB Shared Cache per chip, 24 GB of RAM (DDR3-1333) and 200 GB of local scratch disk; Two NVIDIA Tesla M1060 GPU Boards in every node
1 compute node with two Xeon 5650 "Westmere" chips (12 cores + SMT) running at 2.66 GHz with 12 MB Shared Cache per chip, 48 GB of RAM (DDR3-1333) and 500 GB of local scratch disk; Two NVIDIA Tesla C2070 GPU Boards (plus two varying other GPUs)
InfiniBand interconnect fabric with 20 GBit/s bandwidth per link and direction
Jobs that use less than one full node are currently not supported by RRZE and
may be killed without notice. Thus, always use ppn=16 in
the node specification for qsub.
Access, User Environment, and File Systems
Access to the machine
Access to TinyGPU is through the Woody Frontends. So, connect to
woody.rrze.uni-erlangen.de
and you will be randomly routed to one of the frontends for Woody,
as there are no extra frontends for TinyGPU.
See the documentation for the Woodcrest
cluster for information about these frontends. Although the TinyGPU
compute nodes actually run Ubuntu LTS, the environment is compatible.
Programs compiled for Woody will just run on TinyGPU as
well. In most cases, you can even compile CUDA programs on the
Woody frontends (after loading the cuda
module), although no GPU hardware is available there.
In case of problems, try to compile your GPU programs on one of the
TinyGPU compute nodes (e.g. within an interactive job).
For submitting jobs, you have to use the command qsub.tinygpu
instead of the normal qsub.
In general, the documentation for Woody applies. This page will only list the differences to Woody.
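As a rough illustration of the submission workflow, the following sketch writes a minimal job script and submits it with qsub.tinygpu. The resource request nodes=1:ppn=16 follows the rule stated on this page; the walltime, the job name, and the program name my_cuda_app are hypothetical placeholders, and the script assumes the usual PBS directives work here as they do on Woody.

```shell
# Write a minimal job script (walltime, job name, and program name are placeholders).
cat > tinygpu_job.sh <<'EOF'
#!/bin/bash -l
#PBS -l nodes=1:ppn=16
#PBS -l walltime=01:00:00
#PBS -N tinygpu-test

# Start in the directory the job was submitted from.
cd "$PBS_O_WORKDIR"
module load cuda
./my_cuda_app
EOF

# Submit only where the TinyGPU submit command exists (i.e. on the Woody frontends).
if command -v qsub.tinygpu >/dev/null 2>&1; then
    qsub.tinygpu tinygpu_job.sh
else
    echo "qsub.tinygpu not found; run this on a Woody frontend" >&2
fi
```

An interactive session for compiling on a compute node could presumably be requested in the same way by passing -I and the same -l options on the qsub.tinygpu command line instead of a script.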
File Systems
Parallel file system $FASTTMP
The parallel filesystem $FASTTMP in /wsfs
is currently not available on TinyGPU.
Node-local storage $TMPDIR
Each node has at least 200 GB of local hard drive capacity for temporary files
(instead of the 130 GB Woody has),
available under /tmp/ (also accessible via /scratch/).
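One possible pattern for using the node-local disk inside a job is sketched below: copy the input to a private scratch directory, run there, and copy the results back. The helper name run_in_scratch and the convention that results end in .out are assumptions made for this example only; the sketch falls back to /tmp when $TMPDIR is not set.

```shell
# Run a command in a private scratch directory on the node-local disk.
# Usage: run_in_scratch INPUT_FILE COMMAND [ARGS...]
# The function body is a subshell, so the cd and the variables stay local to it.
run_in_scratch() (
    input=$1; shift
    origin=$PWD
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || exit 1
    # Clean up the scratch directory whenever the subshell exits.
    trap 'rm -rf "$scratch"' EXIT
    cp "$input" "$scratch/" || exit 1
    cd "$scratch" || exit 1
    "$@"
    status=$?
    # Copy result files (assumed to end in .out) back to the starting directory.
    cp ./*.out "$origin"/ 2>/dev/null
    exit "$status"
)
```

Inside a job script this could be called as, e.g., run_in_scratch input.dat ./my_cuda_app, where both names are placeholders.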
Compiling and running CUDA codes
Unfortunately, due to the experimental nature of this cluster,
the proper way of doing this is still in flux. Please
contact hpc-support if you need assistance. However, in
many cases you will find most of the required information by
looking at the (default) cuda module (e.g.
module show cuda).
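As a starting point for compiling by hand, the sketch below picks an nvcc architecture flag from the node name and prints the resulting compile command. The values correspond to the compute capabilities of the GPUs listed above (1.3 for the Tesla M1060, 2.0 for the Tesla C2070). The helper name gpu_arch_for_node, the file names, and the assumption that every node other than tg010 carries M1060 boards are illustrative only; which -arch values are accepted depends on the toolkit version provided by the cuda module.

```shell
# Map a node name to an nvcc -arch value.
# Assumption: tg010 is the only node with Fermi cards; all others have M1060 boards.
gpu_arch_for_node() {
    case "$1" in
        tg010) echo "sm_20" ;;   # Tesla C2070 (Fermi), compute capability 2.0
        *)     echo "sm_13" ;;   # Tesla M1060, compute capability 1.3
    esac
}

# Short host name of the current machine (domain part stripped).
node=$(uname -n)
node=${node%%.*}
arch=$(gpu_arch_for_node "$node")

# The command is only printed here; my_cuda_app.cu is a placeholder file name.
echo "module load cuda"
echo "nvcc -arch=$arch -o my_cuda_app my_cuda_app.cu"
```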
Batch Processing
The batch system works just like on Woody; the few notable differences are:
- The command for job submission is qsub.tinygpu instead of just qsub.
- The compute nodes do not have 4 cores like Woody, but 8 physical cores plus 8 SMT cores. This means that the operating system will see 16 cores. At the moment, you generally have to request ppn=16 (or ppn=24 for the fermi queue) even if you need fewer cores, and independent of the number of GPUs used per node. A different mechanism may be established in the future, so check this documentation regularly for updates.
- If you want to get the node with the C2070 GPUs (tg010), you have to submit your job to the queue "fermi", i.e. use qsub -q fermi ....
- With the Nehalem, Intel has reintroduced the concept of Hyper-Threading, although they now call it simultaneous multithreading (SMT), and this time it actually is useful for some applications. You should test if your application runs better or worse with SMT. To run a job without using SMT, you still have to request all 16 cores of a node (see above!), and then restrict your program to only the 8 "real" ones. The "real" cores on TinyGPU are the ones numbered 0-7. Core numbers 0-3 are the first physical socket, 4-7 the second; 8-15 are the corresponding virtual cores created by SMT. If you use mpirun, you can just use the parameters -npernode 8 -pin "0 1 2 3 4 5 6 7" to restrict your program to the right cores.
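The core numbering described above can be turned into a pin list with a few lines of shell. The sketch below builds the list of the 8 physical cores and prints two possible command lines: the mpirun call with the parameters quoted above, and a taskset call for non-MPI programs. The helper name physical_cores and the program names are hypothetical, and using taskset (from util-linux) is an assumption about what is installed on the nodes, not something this page documents.

```shell
# Print the IDs of the first n cores as a space-separated list.
# With n=8 this gives the physical cores of a Nehalem node; 8-15 (SMT) are left out.
physical_cores() {
    n=$1
    i=0
    list=""
    while [ "$i" -lt "$n" ]; do
        list="${list:+$list }$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

cores=$(physical_cores 8)

# MPI program: the mpirun parameters quoted above (the command is only printed here).
echo "mpirun -npernode 8 -pin \"$cores\" ./my_mpi_app"

# Non-MPI program: taskset takes a comma-separated CPU list.
echo "taskset -c $(printf '%s' "$cores" | tr ' ' ',') ./my_app"
```

Either way, the job itself still has to request ppn=16, since the restriction to the physical cores happens only inside the job.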
Further Information
Intel Xeon 5550 "Nehalem" Processor
The
Xeon 5550 processor
implements Intel's Nehalem microarchitecture and is a quad-core chip running at 2.66 GHz.
The most significant improvements compared to the Core 2 based chips
(as used, e.g., in our Woodcrest cluster)
have been made to the memory interface. In addition, the chips can dynamically overclock
themselves as long as they stay within their thermal envelope.
The memory controllers are no longer in the chipset, but integrated into the CPU, a concept that is familiar from the Opteron CPUs of Intel's competitor AMD. Intel has, however, decided to go the whole hog: each CPU has no fewer than three independent memory channels, which leads to a vastly improved memory bandwidth compared to Core 2 based CPUs like the Woodcrest. Please note that this improvement really only applies to the memory interface. Applications that run mostly from the cache do not run better on Nehalem than on Woodcrest.
The physical CPU sockets are coupled via the QuickPath Interconnect (QPI). As the memory is now attached directly to the CPUs, accesses to the memory of the other socket have to go through QPI and the other processor, so they are more expensive and slower. In other words, the Nehalems are ccNUMA machines.
InfiniBand Interconnect Fabric
The InfiniBand network on TinyGPU is a double data rate (DDR) network, i.e. the links run at 20 GBit/s in each direction. All 8 nodes are connected to a small DDR switch and can thus communicate with each other in a fully non-blocking fashion.




