University of Cambridge Computer Laboratory - GN09 cooling failure – Incident details

GN09 cooling failure

Resolved
Major outage
Started about 9 hours ago
Lasted about 9 hours

Affected

Datacentres

Major outage from 10:05 AM to 6:00 PM, Partial outage from 6:00 PM to 6:36 PM

GN09

Major outage from 10:05 AM to 6:00 PM, Partial outage from 6:00 PM to 6:36 PM

Virtual Machine Hosting

Major outage from 10:05 AM to 6:36 PM, Operational from 6:00 PM to 6:36 PM

GPUs

Major outage from 10:05 AM to 6:36 PM

Secondary VM Hosts

Major outage from 10:05 AM to 6:00 PM, Operational from 6:00 PM to 6:36 PM

Data Storage

Major outage from 10:05 AM to 6:00 PM, Operational from 6:00 PM to 6:36 PM

Updates
  • Resolved
    This incident has been resolved. If your server did not boot back up or is not working and you cannot resolve this yourself, please contact service-desk@cst.cam.ac.uk.

  • Monitoring
    The chiller fault has been repaired. Services will now be brought back up, which could take a while. If you have servers in GN09 and access to Caelum Console, you are free to turn them on now.

  • Update
    We are continuing to work on a fix for this incident. A failed part on the chiller is about to be replaced.

  • Identified
    Most servers in GN09 have been shut down to protect against further hardware damage. Technicians are on site and investigating a suspected chiller fault.

  • Investigating
    Cooling of GN09 failed overnight. Temperatures are already very high and some systems have failed or powered down. It is likely that more servers will have to be shut down before this is resolved.