Internal Services

95.20% uptime

Caelum Console (server management)

99.30% uptime
Sep 2023: 100.0% uptime
Oct 2023: 100.0% uptime
Nov 2023: 97.90% uptime

Other Internal Services

91.11% uptime
Sep 2023: 100.0% uptime
Oct 2023: 74.20% uptime
Nov 2023: 100.0% uptime

External Services

91.11% uptime
Sep 2023: 100.0% uptime
Oct 2023: 74.20% uptime
Nov 2023: 100.0% uptime

Network

90.84% uptime
Sep 2023: 100.0% uptime
Oct 2023: 73.39% uptime
Nov 2023: 100.0% uptime
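The monthly percentages above can be turned into absolute downtime. A minimal sketch, assuming uptime is measured over the full calendar month (`downtime_hours` is an illustrative helper, not part of the monitoring system):

```python
# Convert a monthly uptime percentage into hours of downtime.
# Sketch only: assumes uptime is measured across the whole calendar month.
def downtime_hours(uptime_pct: float, days_in_month: int) -> float:
    total_hours = days_in_month * 24
    return (1 - uptime_pct / 100) * total_hours

# Network, October 2023: 73.39% uptime over 31 days
print(round(downtime_hours(73.39, 31), 1))  # ~198 hours of downtime
```

On these figures, October's 73.39% network uptime corresponds to roughly eight days of accumulated downtime, consistent with the WCDC power incident reported below.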

Datacentres

95.56% uptime

GN09

100.0% uptime
Sep 2023: 100.0% uptime
Oct 2023: 100.0% uptime
Nov 2023: 100.0% uptime

WCDC

91.11% uptime
Sep 2023: 100.0% uptime
Oct 2023: 74.18% uptime
Nov 2023: 100.0% uptime
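The category figures on this page appear to be simple means of their component uptimes; that equal-weighting is an assumption on our part (the monitoring service may weight by checks or by time), but it reproduces the published numbers:

```python
# Check that a category's uptime matches the mean of its components.
# The equal-weight assumption is ours, not documented by the status page.
def category_uptime(components):
    return sum(components) / len(components)

# Datacentres: GN09 (100.0%) and WCDC (91.11%); page shows 95.56%
print(category_uptime([100.0, 91.11]))                # ~95.56
# Data Storage components (100.0, 91.04, 100.0, 91.11); page shows 95.54%
print(category_uptime([100.0, 91.04, 100.0, 91.11]))  # ~95.54
```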

Virtual Machine Hosting

95.56% uptime

Main VM Pool (WCDC)

91.11% uptime
Sep 2023: 100.0% uptime
Oct 2023: 74.20% uptime
Nov 2023: 100.0% uptime

GPUs

100.0% uptime
Sep 2023: 100.0% uptime
Oct 2023: 100.0% uptime
Nov 2023: 100.0% uptime

Secondary VM Hosts

100.0% uptime
Sep 2023: 100.0% uptime
Oct 2023: 100.0% uptime
Nov 2023: 100.0% uptime

Xen Orchestra

91.11% uptime
Sep 2023: 100.0% uptime
Oct 2023: 74.20% uptime
Nov 2023: 100.0% uptime

Data Storage

95.54% uptime

Filer

100.0% uptime
Sep 2023: 100.0% uptime
Oct 2023: 100.0% uptime
Nov 2023: 100.0% uptime

Archive Server

91.04% uptime
Sep 2023: 100.0% uptime
Oct 2023: 73.99% uptime
Nov 2023: 100.0% uptime

Data Replication

100.0% uptime
Sep 2023: 100.0% uptime
Oct 2023: 100.0% uptime
Nov 2023: 100.0% uptime

Other Secondary Storage Systems

91.11% uptime
Sep 2023: 100.0% uptime
Oct 2023: 74.20% uptime
Nov 2023: 100.0% uptime

Fastmail email

100.0% uptime

Third Party: Fastmail → General Availability

Operational

Third Party: Fastmail → Mail delivery

Operational

Third Party: Fastmail → Web client and mobile app

Operational

Third Party: Fastmail → Mail access (IMAP/POP)

Operational

Third Party: Fastmail → Login & sessions

Operational

Third Party: Fastmail → Contacts (CardDAV)

Operational

Notice history

Nov 2023

William Gates Building planned power outage

Oct 2023

WCDC remedial works: ATS replacement
  • Completed
    October 24, 2023 at 2:07 PM

    Maintenance has completed successfully.

  • In progress
    October 24, 2023 at 1:30 PM

    Maintenance is now in progress.

  • Planned
    October 24, 2023 at 1:30 PM

    We are planning to replace an Automatic Transfer Switch in one of our racks in the West Cambridge Data Centre, to reduce the likely impact of future partial power outages.

    This device automatically switches other devices (servers, network and management infrastructure) between the data centre's two resilient electrical supplies to allow these to remain powered during an outage of one of the supplies. Though this would not have helped with last week's power outages (which affected both supplies simultaneously), in general the power systems in WCDC are designed to limit outages to a single supply at once. We know that the old ATS that we are currently using is not as reliable as it should be, and have a replacement ready.

    The replacement will result in a loss of power to a few servers (those without their own dual-input power supplies), all of which are part of a resilient service so no user-visible outage is expected:

    • adsrv07 (one of three Active Directory servers for DC.CL.CAM.AC.UK)
    • adsrv03 (one of three Active Directory servers for AD.CL.CAM.AC.UK)
    • sxp12 (one of two DHCP servers)

    It will also result in loss of networking to a few things for about 10 minutes, as the 1Gbps switches will be power-cycled:

    • verex01
    • cctv01
    • tfc-app{1,2,4,5}
    • management of servers in WCDC (BMCs etc.)

    Besides that, no user-visible outage is expected.

Major outage: West Cambridge Data Centre
  • Resolved

    Power to our racks in the West Cambridge Data Centre has been stable since the major electrical incident on 18th October, and following replacement of the Automatic Transfer Switch in our core infrastructure rack, our resilience to future partial power outages is improved.

    The data centre as a whole is however running with reduced power capacity, so the central HPC systems are mostly unavailable. The HPC team currently estimates a return to full service no later than Wednesday 1st November. All HPC users should already be receiving updates from the team by email.

    UIS's remedial works to restore the data centre's power capacity are ongoing. We have had no specific information about these other than that the next such works are tentatively scheduled for next week.

    We are closing this incident now as we have no information to suggest that we will experience further disruption; however, due to ongoing repairs to the electrical distribution infrastructure, things may of course change.

  • Monitoring (update)

    UIS have announced that the main circuit breaker on power supply B will be replaced next week, provisionally on Tuesday at 10am lasting no more than an hour. All our systems should automatically switch over to the alternate power supply (A) and be able to operate from only that supply during this work. However we know that a few older systems may reboot during the transition, or may experience a few minutes' network outage as their local switch reboots.

    We will add a separate scheduled-maintenance incident with more information once UIS has confirmed the timing.

  • Monitoring (update)

    There are indications from a third party that UIS is planning remedial work for Monday, which may again be disruptive. We will update this incident page as more information becomes available.

  • Monitoring (update)

    UIS has said that all University services have been restored (with the exception of Research Computing Services / HPC, which is expected to be online again this afternoon). We likewise believe that all departmental services are online and stable.

    Please contact sys-admin@cl.cam.ac.uk if anything is not as it should be.

    UIS are planning further remedial work in the data centre next week, which may impact services.

    Thank you for your patience during this incident.

  • Monitoring

    All services have been brought back up, with the exception (as before) of some personal or group virtual servers, where we are not sure which are supposed to be running. If anything is down that should not be (or vice versa), please contact sys-admin.

    Services should still be considered at risk; we will continue to monitor.

  • Identified (update)

    UIS has provided no further information to the University IT community, but we have learned via an external party (JISC) that the second outage at 19:32 was a deliberate controlled power-down and that as of 20:48 no further outage is expected. We are therefore starting to restore services now.

  • Identified (update)

    We observe that the power has come back on in our WCDC racks, but with no announcement from UIS we are holding off on restoring services for a little while, as we believe the power may still be unreliable.

    Those machines that automatically powered themselves back on already may experience network outages, as we are taking this opportunity to do a little bit of maintenance that would ordinarily be service-affecting.

  • Identified

    The data centre lost power again.

  • Monitoring (update)

    UIS has noted that although power was restored, there are ongoing electrical problems in the data centre and services should still be considered at risk of further disruption. They have an engineer en route to investigate.

  • Monitoring

    All services are now believed to be back up, with the possible exception of a few individuals' or research groups' virtual servers. There is also a chance that some things did not start properly if their servers came up before the network and storage infrastructure was ready for them. We are continuing to check services, but please contact sys-admin@cl.cam.ac.uk if you are experiencing any problems.

  • Identified

    The UIS West Cambridge Data Centre lost power at around 16:00. Power was restored at around 16:45. Much of our infrastructure was affected and is now restarting. We hope that most services will be back online very soon.

Disruption to Morello cluster network
  • Resolved

    This incident has been resolved (though the entire cluster was power-cycled after the earlier maintenance due to a datacentre-wide power outage).

    A further incident will be opened for the planned replacement of the temporary switch.

  • Monitoring

    We believe the temporary network setup is working. The faulty switch has been powered down. Please contact sys-admin if anything is not right.

  • Identified (update)

    Repatching of servers is complete; all Morello systems in rack D9 are now connected to the same port on temporary switch wcdc-d9-sw2 that they were previously connected to on wcdc-d9-sw1.

    There will now be a brief outage to routing as we shut off the layer-3 functionality of wcdc-d9-sw1 and allow wcdc-d10-sw1 to take over.

    (The temporary setup involves D9 being daisy-chained from rack D10. The temporary switch is layer-2 only, so all layer-3 functionality (routing, DHCP) will be handled by wcdc-d10-sw1.)

  • Identified (update)

    Repatching of Morello servers over to the new temporary switch is about to commence.

  • Identified (update)

    Replacement of the faulty switch is likely to begin at 12:45. We hope to set up the new temporary switch in parallel with the faulty one to minimise disruption. Once the new switch is configured, each system will be replugged from the old switch to the new, hopefully with only a few seconds of disruption to each.

    A permanent replacement is being obtained (under warranty) and after that arrives we will plan another short outage to install it.

  • Identified (update)

    Maintenance on the rack D10 switch is believed to have been successful, though verification work is ongoing.

    The replacement of the rack D9 switch will probably take place tomorrow (18 Oct) late morning / early afternoon.

    For the time being, we believe there is no service-affecting outage.

  • Identified

    The Morello cluster network requires urgent disruptive maintenance; it will imminently experience an outage affecting roughly half of the machines.

    The other machines will experience a longer outage tomorrow, as the switch serving those machines has a fault and needs replacement.

    This only affects the Morello cluster; if you don't know what that is, this incident does not affect you.

Sep 2023

No notices reported this month

Sep 2023 to Nov 2023