Upcoming HPC changes and service limitations

*** This article will regularly be updated; last update: Sept. 23, 2013 ***

The coming weeks and months will be exciting not only for the staff of RRZE’s HPC group but also for all users of RRZE’s HPC systems, and will in the end bring clear improvements:

  • DONE: the main memory of "cshpc" has already been doubled during RRZE’s network maintenance on June 20th.
  • DONE: the size of the "general purpose" storage /home/woody has been increased on July 8th
    => more storage for those who really need it; increased quota will be charged at 2€ per 100 GB per month
  • DONE: operating system upgrade of LiMa (CentOS5 -> CentOS6)
    => newer versions of many packages coming from the base OS
  • Operating system upgrade is planned for Tiny* (Ubuntu 10.04 LTS -> Ubuntu 12.04 LTS)
    => newer versions of many packages coming from the base OS
  • a new 500+ node (10k+ core) cluster "Emmy" will be installed for highly parallel applications (and will hopefully go into early operation in late autumn)
    => significant increase of the installed compute capacity
    The new cluster will have more cores per node than any other production system at RRZE, but also considerably lower clock frequencies, and will require the use of AVX instructions to reach maximum performance (see the short compile sketch right after this list).
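
To make the AVX remark above a bit more concrete, here is a minimal C sketch. The compile lines in the comment are assumptions (GCC’s -mavx and the Intel compiler’s -xAVX switch); the actual compiler and module setup on "Emmy" has not been announced yet.

    /* Minimal sketch: a simple streaming loop that compilers can auto-vectorize
     * with 256-bit AVX instructions when the right switches are given, e.g.
     *   gcc -O3 -mavx avx_demo.c -o avx_demo        (GCC)
     *   icc -O3 -xAVX avx_demo.c -o avx_demo        (Intel compiler)
     * Without such switches the compiler typically falls back to 128-bit SSE
     * and uses only half of the SIMD width of the new processors.
     */
    #include <stdio.h>

    #define N 1000000

    static double a[N], b[N], c[N];

    int main(void)
    {
        int i;

        for (i = 0; i < N; i++) {   /* initialize the input vectors */
            a[i] = 1.0 * i;
            b[i] = 2.0 * i;
        }

        for (i = 0; i < N; i++)     /* this loop benefits from AVX */
            c[i] = a[i] + 1.5 * b[i];

        printf("c[42] = %f\n", c[42]);
        return 0;
    }

Whether a loop was actually vectorized can usually be checked with the compiler’s vectorization report (e.g. -ftree-vectorizer-verbose with GCC 4.x or -vec-report with the Intel compiler).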

However, this work will temporarily also bring (serious) limitations (and opportunities), as detailed in the following:

  • To integrate the new disk enclosures into the /home/woody filesystem, a hopefully short but complete downtime of all Linux-based HPC systems will be required. Unfortunately, the enclosures did not arrive in time for the integration to be done during yesterday’s network downtime.
    DONE: on July 8th
  • To get floor space for the new system, TinyBlue and the Windows HPC cluster have to be relocated in the coming weeks. Please expect a downtime of several days for these two systems. The exact schedule will be announced shortly. As with old trees, there is no guarantee that the systems will come up again after the move, and both systems are out of warranty.
    DONE: The Windows HPC cluster has successfully been relocated on July 3rd/4th. It’s up and running again.
    DONE: TinyBlue has been removed from its current place on July 8th. A system reservation is in place. Expect quite some time (i.e. days or weeks) until TinyBlue is available again as major work on the electric infrastructure has to be done in the background!
    TinyBlue has been back in operation since the late afternoon of July 11th.
  • Two new electric control cabinets have to be installed to replace two smaller ones in order to have enough capacity for the existing HPC systems and the new one. This requires switching off Woody, TinyGPU and TinyBlue for some time. For Woody and TinyGPU, the interruption should be less than a day, as cables mainly have to be moved from one control cabinet to another. For TinyBlue, there is more work, as new power lines have to be installed, which can only be done once the old control cabinets are removed.
    DONE: The work mainly took place in week 28 and at the beginning of week 29.
    Have a look at a photo of one of the new power lines (posted by Michael Meier on Google+)!
  • The operating system upgrades may require many applications to be recompiled (a small sketch for checking the toolchain a binary was built against follows after this list). We will also take the upgrades as an opportunity to clean up the list of installed software packages and modules.
    DONE: LiMa went online with CentOS6 on Sept. 20th; some cleanup work is still going on. To make the software environment and the provided software versions more similar across the different HPC systems, some module updates/upgrades also took place on Woody.
  • To increase the throughput of single-node (and single-core) jobs on Woody, the maximum job size will (probably in July) be further decreased from the current 32 nodes to at most 16 nodes. Users of larger parallel jobs are encouraged to move to e.g. LiMa as soon as possible.
    DONE: The maximum job size has been reduced to 16 nodes in the work queue and to 8 nodes in the devel queue on July 4th. A further decrease to 8 (or 12) nodes in the work queue is scheduled for the beginning of December.
  • Single-node jobs on Woody will (probably also from July on) by default be allowed to run on any node type; currently, the newer w10xx nodes are only used if single-node jobs request them explicitly by using ":sb".
    DONE: The new routing has been in effect since Sept. 16th (together with the installation of 36 "Haswell" nodes w11xx with the property ":hw").
  • The installation of the new cluster will push the (cooling) infrastructure in the old computer science tower to or beyond its limits. Some short downtimes may be required in the coming weeks for work on the electrical infrastructure. Division G of the central university administration and the Staatliche Bauamt, together with an external engineering company, have been working for more than half a year on improving the cold water distribution in the building to ensure that all consumers (and not only HPC) get their required capacity. While the new pipes are being welded, several outages of the cold water supply and the cooling have to be expected, i.e. announced shutdowns of the HPC systems. Moreover, as this work is significantly delayed, there will be a shortage of cooling capacity during summer and autumn. To guarantee overall stable operation of RRZE and of the several small computer rooms of the computer science chairs, it may become necessary to shut down certain HPC systems partially or completely at short notice, or even without notice, to prevent further damage or outages of central services like e-mail. We currently do not know when this will happen or which HPC systems will be affected, as the cooling infrastructure is very complex and "fragile". Be prepared that any HPC system can suddenly become unavailable for some time!
  • For the first few months of (early) operation of the new 500+ node cluster "Emmy" in late summer / early autumn, we are looking for a few interested users with high computational demands (short-term projects with >1 million core hours; preferably highly parallel, but throughput may also be o.k.) who are also willing to optimize their code performance together with us. Please contact hpc@rrze for further details. Only one code per group, please.
    DONE: The call for early users has been closed in the meantime.
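
As referenced in the item on the operating system upgrades above, many applications may need to be recompiled. The following minimal C sketch, intended only as an illustration, prints the compiler and glibc versions a program was built with and runs against; glibc 2.5 corresponds to CentOS5 and glibc 2.12 to CentOS6, so a mismatch is a hint that a binary still dates from the old installation. gnu_get_libc_version() is a glibc-specific extension, i.e. this applies to the Linux clusters only.

    /* Minimal illustrative sketch: print the compiler and glibc versions seen
     * at build time and at run time. A binary built on CentOS5 reports glibc
     * 2.5 at build time, while CentOS6 provides glibc 2.12 at run time.
     */
    #include <stdio.h>
    #include <gnu/libc-version.h>   /* gnu_get_libc_version(), a glibc extension */

    int main(void)
    {
    #ifdef __VERSION__
        printf("compiler used at build time: %s\n", __VERSION__);
    #endif
        printf("glibc at build time: %d.%d\n", __GLIBC__, __GLIBC_MINOR__);
        printf("glibc at run time:   %s\n", gnu_get_libc_version());
        return 0;
    }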

Last but not least: be reminded that there are also national compute centers in Munich/Garching, Stuttgart and Juelich which offer compute cycles (usually after a peer-review process). The national offers are complemented by European ones like PRACE and DECI.

To a certain extent, the Linux cluster at LRZ in Munich/Garching (but not SuperMUC) can be used by scientists from Erlangen at short notice and without much bureaucracy. Contact hpc@rrze if interested.