University of Cambridge Computer Laboratory - Major outage: West Cambridge Data Centre – Incident details

Major outage: West Cambridge Data Centre

Resolved
Major outage
Started 9 months ago
Lasted 8 days

Affected

Internal Services

Major outage from 3:00 PM to 2:57 PM

Other Internal Services

Major outage from 3:00 PM to 2:57 PM

External Services

Major outage from 3:00 PM to 2:57 PM

Network

Major outage from 3:00 PM to 2:57 PM

Datacentres

Major outage from 3:00 PM to 2:57 PM

WCDC

Major outage from 3:00 PM to 2:57 PM

Updates
  • Resolved
    Resolved

    Power to our racks in the West Cambridge Data Centre has been stable since the major electrical incident on 18th October, and following replacement of the Automatic Transfer Switch in our core infrastructure rack, our resilience to future partial power outages is improved.

    The data centre as a whole is, however, running with reduced power capacity, so the central HPC systems are mostly unavailable. The HPC team currently estimates a return to full service no later than Wednesday 1st November. All HPC users should already be receiving updates from the team by email.

    UIS's remedial works to restore the data centre's power capacity are ongoing. We have had no specific information about these other than that the next such works are tentatively scheduled for next week.

    We are closing this incident now as we have no information to suggest that we will experience further disruption. However, due to ongoing repairs to the electrical distribution infrastructure, things may of course change.

  • Monitoring
    Update

    UIS has announced that the main circuit breaker on power supply B will be replaced next week, provisionally on Tuesday at 10am, with the work lasting no more than an hour. All our systems should automatically switch over to the alternate power supply (A) and be able to operate from only that supply during this work. However, we know that a few older systems may reboot during the transition, or may experience a few minutes' network outage as their local switch reboots.

    We will add a separate scheduled-maintenance incident with more information once UIS has confirmed the timing.

  • Monitoring
    Update

    There are indications from a third party that UIS is planning remedial work for Monday, which may again be disruptive. We will update this incident page as more information becomes available.

  • Monitoring
    Update

    UIS has said that all University services have been restored (with the exception of Research Computing Services / HPC which is expected to be online again this afternoon). We likewise believe that all departmental services are online and stable.

    Please contact sys-admin@cl.cam.ac.uk if anything is not as it should be.

    UIS are planning further remedial work in the data centre next week, which may impact services.

    Thank you for your patience during this incident.

  • Monitoring
    Monitoring

    All services have been brought back up, with the exception (as before) of some personal or group virtual servers where we're not sure which are supposed to be running. If anything is down that should not be (or vice versa) please contact sys-admin.

    Services should still be considered at risk; we will continue to monitor.

  • Identified
    Update

    We have had no further information from UIS to the University IT community, but we have learned via an external party (JISC) that the second outage at 19:32 was a deliberate controlled power-down and that as of 20:48 no further outage is expected. We are therefore starting to restore services now.

  • Identified
    Update

    We observe that the power has come back on in our WCDC racks, but with no announcement from UIS we are holding off on starting to restore services for a little while as we believe the power to still be unreliable.

    Those machines that automatically powered themselves back on already may experience network outages, as we are taking this opportunity to do a little bit of maintenance that would ordinarily be service-affecting.

  • Identified
    Identified

    The data centre lost power again.

  • Monitoring
    Update

    UIS has noted that although power was restored, there are ongoing electrical problems in the data centre and services should still be considered at risk of further disruption. They have an engineer en route to investigate.

  • Monitoring
    Monitoring

    All services are now believed to be back up, with the possible exception of a few individuals' or research groups' virtual servers. There is a chance that a few things did not start properly if the servers started before the network and storage infrastructure were ready for them. We are continuing to check services, but please now contact sys-admin@cl.cam.ac.uk if you are experiencing any problems.

  • Identified
    Identified

    The UIS West Cambridge Data Centre lost power at around 16:00. Power was restored at around 16:45. Much of our infrastructure was affected and is now restarting. We hope that most services will be back online very soon.