University of Cambridge Computer Laboratory - VM storage fault (xene-pool1) – Incident details

VM storage fault (xene-pool1)

Resolved
Major outage
Started 11 days ago
Lasted about 11 hours

Affected

Internal Services

Major outage from 1:10 AM to 10:53 AM, Partial outage from 1:10 AM to 10:53 AM, Operational from 10:53 AM to 11:57 AM

Request Tracker

Major outage from 1:10 AM to 10:53 AM, Operational from 10:53 AM to 11:57 AM

Other Internal Services

Partial outage from 1:10 AM to 10:53 AM, Operational from 10:53 AM to 11:57 AM

Virtual Machine Hosting

Partial outage from 1:10 AM to 10:53 AM, Operational from 10:53 AM to 11:57 AM

Main VM Pool (WCDC)

Partial outage from 1:10 AM to 10:53 AM, Operational from 10:53 AM to 11:57 AM

Updates
  • Resolved
    This incident has been resolved. However, the same VMs will need to be shut down again when a replacement part arrives. That shutdown will be announced separately.
  • Monitoring
    The fault has been mitigated and the affected VMs are now back online. The VMs will have to be shut down again within a few days so that a failed hardware component can be replaced. Some users connected to VPN2 may be disconnected shortly, because one of the VPN gateway servers needs rebooting even though it is still partially working. If any other problems remain, please contact [service-desk@cst.cam.ac.uk](mailto:service-desk@cst.cam.ac.uk).
  • Investigating
    Overnight, a hardware fault took down the storage server that backs one of our main departmental VM pools (xene-pool1). All VMs running on that pool failed, including the departmental database server, dbwebserver, Request Tracker, cl-student-ssh, part of the MSA service and the Windows Remote Desktop service.