University of Cambridge Computer Laboratory - gxp04 GPU VM host fault – Incident details

gxp04 GPU VM host fault

Resolved
Partial outage
Started over 1 year ago
Lasted about 2 hours

Affected

Virtual Machine Hosting

Partial outage from 8:39 PM to 10:09 PM

GPUs

Partial outage from 8:39 PM to 10:09 PM

Updates
  • Resolved

    The research GPU VM pool is back in normal service.

  • Monitoring

    The affected VMs were shut down and the faulty host was rebooted. VMs can now be started again via Xen Orchestra if needed.

  • Identified

    We need to shut down the affected VMs listed in the previous update and reboot the affected host.

  • Investigating

    We are investigating a fault on one of the research GPU VM hosts. The following VMs are affected:

    - dev-gpu-cp614
    - dev-gpu-ns779
    - dev-gpu-hj359
    - dev-gpu-rk627
    - dev-gpu-ydk21
    - dev-gpu-tw554
    - dev-gpu-cch57
    - dev-gpu-flo23
    - dev-gpu-aj557

    Management of these VMs via Xen Orchestra is intermittently unavailable at present.