Emergency shutdown of HPC clusters due to overheating (solved)

At around 10:40, the chilled water supply that feeds all air conditioning failed. As a result, the Woodcrest cluster overheated and went through a hard shutdown at 11:30.

All jobs running on that cluster at the time were terminated abruptly and will need to be resubmitted.

Parts of the other clusters are also down (powered off), but no jobs have been aborted there.

It is currently not known when normal operation will be resumed.

Update 14:30: The problem has been solved; batch processing will be resumed.

Update 18:30: Batch processing has been resumed on all clusters.
