There will be a scheduled downtime of the LiMa cluster at RRZE starting on Monday, August 1st, at 08:30.
The reason is an update of the operating system from CentOS 6 to CentOS 7, as on Emmy two weeks ago. As usual, jobs that would collide with the downtime will be postponed. However, this time we cannot guarantee that queued jobs will survive the update. You might have to requeue your jobs after the downtime.
The LiMa frontends will NOT be available most of the time, as they will be upgraded too.
The upgrade will result in rather big changes in the software environment of LiMa. Make sure to check that your applications still produce correct results and don’t suddenly run at 1/10th of the speed after the upgrade. After the update, the software environment on LiMa and Emmy will be almost identical once again – which has not been the case since Emmy was upgraded on July 12.
If everything goes well, we anticipate that LiMa can resume batch processing on Tuesday, August 2nd in the evening. Check the MOTD on the cluster for updates.
UPDATE 2016-08-03, noon:
We observed severe stability issues with the parallel file system of LiMa: some not yet identified user workload crashes the metadata servers of the parallel file system (/lxfs=$FASTTMP), making it unresponsive for many minutes and stalling the whole cluster.
- Batch processing on LiMa has now been resumed.
- However, /lxfs=$FASTTMP is not mounted – neither on the login nodes nor on the compute nodes for the time being!
- Jobs where we identified usage of /lxfs=$FASTTMP have been moved to the “big” queue and won’t start.
- A special interactive node (“ltest01”) with access to /lxfs=$FASTTMP has been made available to allow moving data from /lxfs=$FASTTMP; software from /apps (i.e. provided via modules) is not available on this node.
We hope that /lxfs=$FASTTMP can be made available again in the coming weeks; however, this requires a major update of the software on the servers of the parallel file system.
UPDATE 2016-08-09, 17:00:
LiMa is fully available again. Thanks to the generous help of NEC, the software on the servers of the parallel file system has been updated. Thus, /lxfs=$FASTTMP is available again on the login nodes and the compute nodes. Regular batch processing has been resumed and all jobs in administrative hold have been started.
Some compute nodes (in rack 7) had to be removed in preparation for the next cluster.