University of Cambridge Computer Laboratory - GPU cluster filesystem outage – Incident details

GPU cluster filesystem outage

Resolved
Partial outage
Started 6 months ago
Lasted about 1 hour

Affected

Virtual Machine Hosting

Partial outage from 12:01 PM to 1:15 PM

GPUs

Partial outage from 12:01 PM to 1:15 PM

Updates
  • Resolved
    Resolved

    This incident has been resolved. Affected VMs have either been started again (in the case of CPU VMs and shared servers) or shut down (in the case of GPU VMs). The latter can be started again as needed from XO (https://xo.cl.cam.ac.uk).

    The cause of this disruption was that the server on which VM disks and user data are stored failed to start with the correct network configuration after Sunday's planned shutdown. We suspect that a bug caused it to make some incorrect changes to the configuration during boot. An attempt to rectify this problem on the live server interrupted storage access for about a minute, which was unfortunately just long enough for Linux VMs to time out I/O operations and shut down their filesystems.

  • Identified
    Update

    The storage system has been stabilised. Some VMs need a filesystem repair due to the ungraceful disconnection of their disks. Each VM that was running during this incident will be started to ensure that its filesystems are clean, repaired if necessary, then (for GPU VMs) shut down again.

  • Identified
    Identified

    Running VMs will have failed because their disks briefly became inaccessible. Work is ongoing to stabilise the server, which experienced an unexpectedly disruptive problem with its network configuration. Once the server has been stabilised, affected VMs will be shut down and can be started again from XO.

  • Investigating
    Investigating

    GPU VMs (and other VMs running on the same cluster) may have experienced a filesystem fault due to a problem with the storage server. The problem is currently under investigation.
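
For users who want to check whether a Linux VM was affected by the I/O timeouts described in the Resolved update, the following is a minimal sketch rather than an official procedure. It assumes the guest remounts a filesystem read-only after I/O errors (a common ext4 setup with errors=remount-ro). Other filesystems may shut down without showing "ro", so an empty result does not prove a VM was unaffected. The function name read_only_mounts is hypothetical and not part of any Computer Laboratory tooling.

```python
def read_only_mounts(mounts_text):
    """Return mount points of /dev-backed filesystems mounted read-only.

    mounts_text is text in the format of /proc/mounts:
    device, mount point, filesystem type, options, dump, pass.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fs_type, options = fields[:4]
        # Ignore pseudo-filesystems (tmpfs, proc, ...) that are often
        # read-only by design; only look at block-device-backed mounts.
        if not device.startswith("/dev/"):
            continue
        if "ro" in options.split(","):
            found.append(mount_point)
    return found

# Usage on a Linux VM (assumption: /proc/mounts is available):
#   with open("/proc/mounts") as f:
#       print(read_only_mounts(f.read()))
```

A non-empty result only indicates that a disk-backed filesystem is currently read-only. Some are mounted that way deliberately, so it is a prompt to look further, not a diagnosis.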