University of Cambridge Computer Laboratory - Some GPU cluster VMs crashed – Incident details

Some GPU cluster VMs crashed

Resolved
Partial outage
Started 8 months ago
Lasted about 1 hour

Affected

Virtual Machine Hosting

Partial outage from 2:17 AM to 3:16 AM

GPUs

Partial outage from 2:17 AM to 3:16 AM

Updates
  • Resolved

    This incident has been resolved.

  • Identified

    Due to a problem during network maintenance, VMs on the departmental GPU cluster briefly lost access to their disks. This caused some VMs to crash. Affected VMs will be rebooted (if CPU VMs) or shut down (if GPU VMs); the latter can be started again from XO.