GPU cluster filesystem outage

Updates

Resolved
18 January, 2024 at 13:15
This incident has been resolved. Affected VMs have either been started again (CPU VMs and shared servers) or shut down (GPU VMs). GPU VMs can be started again as needed from XO (https://xo.cl.cam.ac.uk).

The disruption occurred because the server that stores VM disks and user data failed to start with the correct network configuration after Sunday's planned shutdown. We suspect a bug caused it to make some incorrect changes to its configuration during boot. An attempt to rectify the problem on the live system caused about a minute's disruption, unfortunately just long enough for Linux VMs to time out their I/O operations and shut down their filesystems.
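For users who want to check a Linux VM themselves, the sketch below lists block-device filesystems that are currently mounted read-only, which is one common symptom after disk I/O errors or timeouts. It is only an illustration under assumptions: the function name read_only_mounts is hypothetical, the input is assumed to be in /proc/mounts format, and whether a given filesystem remounts read-only or shuts down in some other way depends on its type and mount options.

```python
from typing import List


def read_only_mounts(mounts_text: str) -> List[str]:
    """Return mount points of /dev/* filesystems whose options include 'ro'.

    mounts_text is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fs_type, options = fields[:4]
        if not device.startswith("/dev/"):
            continue  # skip pseudo-filesystems such as proc and sysfs
        if "ro" in options.split(","):
            result.append(mount_point)
    return result
```

On a VM this could be called as read_only_mounts(open("/proc/mounts").read()). Some mounts are read-only by design (for example squashfs images on loop devices), so a non-empty result is a prompt to look closer, not proof of a fault.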
Update
18 January, 2024 at 12:40
The storage system has been stabilised. Some VMs need a filesystem repair because their disks were disconnected ungracefully. Each VM that was running during this incident will be started, its filesystems checked and repaired if necessary, and then (for GPU VMs) shut down again.
Identified
18 January, 2024 at 12:16
Running VMs will have failed because their disks briefly became inaccessible. Work is under way to stabilise the server, which experienced an unexpectedly disruptive problem with its network configuration. Once the server is stable, affected VMs will be shut down and can be started again from XO.
Investigating
18 January, 2024 at 12:01
GPU VMs (and other VMs running on the same cluster) may have experienced a filesystem fault due to a storage server problem that is currently under investigation.

University of Cambridge Computer Laboratory - GPU cluster filesystem outage – Incident details
