University of Cambridge Computer Laboratory - Chiller fault – Incident details

Chiller fault

Resolved
Major outage
Started about 2 months ago
Lasted about 6 hours

Affected

Datacentres

Major outage from 3:07 PM to 9:00 PM

GN09

Major outage from 3:07 PM to 9:00 PM

Updates
  • Resolved

    This incident has been resolved. GN09 is fully operational. Most servers that were previously running have been restarted.

    If you have a physical server that is not running, you may be able to start it yourself via https://console.caelum.cl.cam.ac.uk as usual, or contact service-desk@cst.cam.ac.uk.

    VMs that were not set to start automatically have not been restarted. You can start VMs when you need them via https://xo.cl.cam.ac.uk as usual.

    Contact service-desk@cst.cam.ac.uk if there are any remaining issues.

  • Update

    Cooling has been restored and is expected to remain stable. The chiller shut down because the chilled water circulation pumps stopped. Why the pumps stopped is not yet known; this will be investigated next week, but we expect it to have been an isolated incident. One alarm is still present on the chiller; it does not prevent operation and is still being investigated.

    While GN09 is shut down, we are taking the opportunity to perform some routine firmware and software updates on network hardware and storage systems. We will therefore not start turning servers back on quite yet, but expect to be able to do so shortly.

  • Update

    Progress has been made: the chiller is running again, but one problem is still under investigation. We are hopeful that servers can be turned back on today, but will await the all-clear from the chiller technician.

  • Update

    Most servers in GN09 are now off, and must remain off until further notice. The emergency technician has arrived and is investigating.

  • Identified

    The William Gates Building's chiller has a fault and has stopped running. Temperatures in our on-site data centre GN09 are rising rapidly. Engineers have been called out but it is likely that we will have to start shutting down servers in order to protect them.