University of Cambridge Computer Laboratory - Notice history

97% - uptime

Caelum Console (server management) - Operational

99% - uptime
Oct 2023 · 100.0%, Nov 2023 · 97.90%, Dec 2023 · 100.0%

Request Tracker - Operational

100% - uptime
Oct 2023 · 100.0%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

Other Internal Services - Operational

91% - uptime
Oct 2023 · 74.20%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

External Services - Operational

91% - uptime
Oct 2023 · 74.20%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

Network - Operational

91% - uptime
Oct 2023 · 73.39%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

GN09 - Operational

100% - uptime
Oct 2023 · 100.0%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

WCDC - Operational

91% - uptime
Oct 2023 · 74.18%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

Main VM Pool (WCDC) - Operational

91% - uptime
Oct 2023 · 74.20%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

GPUs - Operational

100% - uptime
Oct 2023 · 100.0%, Nov 2023 · 100.0%, Dec 2023 · 99.96%

Secondary VM Hosts - Operational

100% - uptime
Oct 2023 · 100.0%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

Xen Orchestra - Operational

91% - uptime
Oct 2023 · 74.20%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

Filer - Operational

100% - uptime
Oct 2023 · 100.0%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

Archive Server - Operational

91% - uptime
Oct 2023 · 73.99%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

Data Replication - Operational

100% - uptime
Oct 2023 · 100.0%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

Other Secondary Storage Systems - Operational

91% - uptime
Oct 2023 · 74.20%, Nov 2023 · 100.0%, Dec 2023 · 100.0%

Third Party: Fastmail → General Availability - Operational

Third Party: Fastmail → Mail delivery - Operational

Third Party: Fastmail → Web client and mobile app - Operational

Third Party: Fastmail → Mail access (IMAP/POP) - Operational

Third Party: Fastmail → Login & sessions - Operational

Third Party: Fastmail → Contacts (CardDAV) - Operational
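
For reference, the quarterly "uptime" figures above appear to be roughly the day-weighted mean of the three monthly percentages, rounded to the nearest percent; this is an inference from the numbers shown, not a documented calculation. A minimal sketch of that assumption:

    # Assumed derivation of the quarterly uptime figures (not an official formula):
    # day-weighted mean of the monthly percentages, rounded to the nearest percent.
    days = {"Oct 2023": 31, "Nov 2023": 30, "Dec 2023": 31}

    def quarterly_uptime(monthly):
        """Day-weighted mean of monthly uptime percentages."""
        total_days = sum(days[month] for month in monthly)
        return sum(days[month] * pct for month, pct in monthly.items()) / total_days

    # "Other Internal Services": 74.20%, 100.0%, 100.0% -> about 91.3%, shown as 91%
    print(round(quarterly_uptime({"Oct 2023": 74.20, "Nov 2023": 100.0, "Dec 2023": 100.0})))  # 91
    # "Caelum Console": 100.0%, 97.90%, 100.0% -> about 99.3%, shown as 99%
    print(round(quarterly_uptime({"Oct 2023": 100.0, "Nov 2023": 97.90, "Dec 2023": 100.0})))  # 99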

Notice history

Dec 2023

Nov 2023

William Gates Building planned power outage
  • Completed
    November 25, 2023 at 3:35 PM

    This maintenance has been completed successfully.

    If any IT systems are still not working, please contact sys-admin. If any electrical circuits remain down, please contact building-services.

  • Update (In progress)
    November 25, 2023 at 12:46 PM

    Power to the building is gradually being restored. GN09 servers will be powered back up in stages once the cooling has brought the room back to its normal temperature.

    Office power and networking in parts of the building will remain down as the maintenance work in wiring cupboards is ongoing.

  • In progress
    November 25, 2023 at 10:03 AM

    The start of the electrical work was delayed by two hours due to a generator fault. The work is now under way but is likely to run past 12:00.

  • Planned
    November 25, 2023 at 8:00 AM

    The William Gates Building will be without power for the morning of 25th November, due to planned work on our electrical switchgear to facilitate the upcoming commissioning of a substantial amount of solar power generation.

    Electrical circuits in the William Gates Building datacentre, GN09, which are connected via the UPS should remain powered throughout the maintenance, running from a backup generator.

    Other electrical circuits will go down; in particular, a few research servers are connected only to non-UPS circuits. A list of these will be circulated.

    GN09 will also be without active cooling for the duration of this work; we will have temporary air blowers in place to reduce the buildup of hot air but we may need to shut down high-powered servers (for example GPU and FPGA servers) depending on the weather and temperature.

    The following list of machines in GN09 will lose power during this maintenance. If possible, please shut them down before the maintenance starts (otherwise we will try to shut them down by pressing the power button). Once we have announced that the maintenance is complete, you can start them again from the Caelum Console; please wait for that announcement before attempting to do so.

    Some other servers may be shut down as well, in particular GPU and FPGA servers, to reduce the electrical and/or thermal load in GN09.

    • virtual machines on the GPU cluster (dev-gpu-…, dev-cpu-… and others as notified separately)
    • grumpy
    • gxp06
    • tarawera
    • ngongotaha
    • all quorum servers
    • stix
    • story
    • L51 Raspberry Pi cluster
    • godzilla
    • tiger
    • baume
    • ctsrd-slave2
    • rama
    • cat
    • chericloud-switch
    • rado
    • wenger
    • wolf0/1/2
    • edale
    • glencoe
    • sakura
    • ran
    • nana
    • momo
    • gilling
    • sigyn
    • idun
    • heimdall
    • nikola01/02/03/04
    • acritarch
    • morello101-dev/102-dev/103-dev
    • sleepy
    • doc
    • sherwood
    • behemoth
    • leviathan
    • excalibur
    • bam
    • kinabalu
    • daintree
    • marpe
    • iphito
    • doris
    • asteria
    • all POETS servers
    • mauao
    • any other GPU or FPGA server observed to be drawing a lot of power on Friday evening or Saturday morning
  • Update (In progress)
    November 25, 2023 at 7:08 AM

    Shutdown of listed servers is now beginning.

  • In progress
    November 24, 2023 at 10:14 PM

    VMs on the GPU cluster are now shutting down. They can be started again from Xen Orchestra once the electrical work is finished, tomorrow afternoon.

  • Planned
    November 24, 2023 at 6:12 PM

    Shutdown of GPU VMs will begin at 22:00 today (Friday).

    Shutdown of affected physical servers will begin at 07:00 on Saturday morning.

Oct 2023

WCDC remedial works: ATS replacement
  • Completed
    October 24, 2023 at 2:07 PM

    Maintenance has completed successfully.

  • In progress
    October 24, 2023 at 1:30 PM

    Maintenance is now in progress.

  • Planned
    October 24, 2023 at 1:30 PM

    We are planning to replace an Automatic Transfer Switch in one of our racks in the West Cambridge Data Centre, to reduce the likely impact of future partial power outages.

    This device automatically switches other devices (servers, network and management infrastructure) between the data centre's two resilient electrical supplies to allow these to remain powered during an outage of one of the supplies. Though this would not have helped with last week's power outages (which affected both supplies simultaneously), in general the power systems in WCDC are designed to limit outages to a single supply at once. We know that the old ATS that we are currently using is not as reliable as it should be, and have a replacement ready. (An illustrative sketch of this switching behaviour appears at the end of this update.)

    The replacement will result in a loss of power to a few servers (those without their own dual-input power supplies), all of which are part of a resilient service so no user-visible outage is expected:

    • adsrv07 (one of three Active Directory servers for DC.CL.CAM.AC.UK)
    • adsrv03 (one of three Active Directory servers for AD.CL.CAM.AC.UK)
    • sxp12 (one of two DHCP servers)

    It will also result in loss of networking to a few things for about 10 minutes, as the 1Gbps switches will be power-cycled:

    • verex01
    • cctv01
    • tfc-app{1,2,4,5}
    • management of servers in WCDC (BMCs etc.)

    Besides that, no user-visible outage is expected.
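
    For illustration only, here is a minimal sketch of the kind of transfer logic an ATS implements, as described above. The voltages, thresholds and behaviour shown are assumptions for the example, not details of the actual device:

        # Illustrative-only sketch of automatic transfer switch (ATS) behaviour:
        # feed the load from supply A, transfer to supply B when A goes out of
        # tolerance, and transfer back once A recovers. All values are made up.
        NOMINAL_V = 230.0
        TOLERANCE_V = 23.0  # assume a +/-10% window counts as "healthy"

        def healthy(voltage):
            return abs(voltage - NOMINAL_V) <= TOLERANCE_V

        def select_feed(supply_a_v, supply_b_v, current):
            """Return which supply ('A' or 'B') should feed the load."""
            if current == "A" and not healthy(supply_a_v) and healthy(supply_b_v):
                return "B"   # supply A failed and B is healthy: transfer to B
            if current == "B" and healthy(supply_a_v):
                return "A"   # preferred supply A is back: transfer back
            return current   # otherwise stay on the current supply

        print(select_feed(0.0, 231.0, "A"))    # A has failed -> 'B'
        print(select_feed(229.5, 231.0, "B"))  # A restored   -> 'A'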

Major outage: West Cambridge Data Centre
  • Resolved

    Power to our racks in the West Cambridge Data Centre has been stable since the major electrical incident on 18th October, and following replacement of the Automatic Transfer Switch in our core infrastructure rack, our resilience to future partial power outages is improved.

    The data centre as a whole is however running with reduced power capacity, so the central HPC systems are mostly unavailable. The HPC team currently estimates a return to full service no later than Wednesday 1st November. All HPC users should already be receiving updates from the team by email.

    UIS's remedial works to restore the data centre's power capacity are ongoing. We have had no specific information about these other than that the next such works are tentatively scheduled for next week.

    We are closing this incident now as we have no information to suggest that we will experience further disruption; however, due to ongoing repairs to the electrical distribution infrastructure, things may of course change.

  • Monitoring
    Update

    UIS have announced that the main circuit breaker on power supply B will be replaced next week, provisionally on Tuesday at 10am and lasting no more than an hour. All our systems should automatically switch over to the alternate power supply (A) and be able to operate from only that supply during this work. However, we know that a few older systems may reboot during the transition, or may experience a few minutes' network outage as their local switch reboots.

    We will add a separate scheduled-maintenance incident with more information once UIS has confirmed the timing.

  • Monitoring
    Update

    There are indications from a third party that UIS is planning remedial work for Monday, which may again be disruptive. We will update this incident page as more information becomes available.

  • Monitoring
    Update

    UIS has said that all University services have been restored (with the exception of Research Computing Services / HPC which is expected to be online again this afternoon). We likewise believe that all departmental services are online and stable.

    Please contact sys-admin@cl.cam.ac.uk if anything is not as it should be.

    UIS are planning further remedial work in the data centre next week, which may impact services.

    Thank you for your patience during this incident.

  • Monitoring

    All services have been brought back up, with the exception (as before) of some personal or group virtual servers where we're not sure which are supposed to be running. If anything is down that should not be (or vice versa) please contact sys-admin.

    Services should still be considered at risk; we will continue to monitor.

  • Identified
    Update

    UIS have sent no further information to the University IT community, but we have learned via an external party (JISC) that the second outage at 19:32 was a deliberate, controlled power-down and that, as of 20:48, no further outage is expected. We are therefore starting to restore services now.

  • Identified
    Update

    We observe that the power has come back on in our WCDC racks, but with no announcement from UIS we are holding off on restoring services for a little while, as we believe the power may still be unreliable.

    Those machines that automatically powered themselves back on already may experience network outages, as we are taking this opportunity to do a little bit of maintenance that would ordinarily be service-affecting.

  • Identified

    The data centre lost power again.

  • Monitoring
    Update

    UIS has noted that although power was restored, there are ongoing electrical problems in the data centre and services should still be considered at risk of further disruption. They have an engineer en route to investigate.

  • Monitoring

    All services are now believed to be back up, with the possible exception of a few individuals' / research groups' virtual servers. There is also a chance that some services did not start properly if their servers came up before the network and storage infrastructure was ready for them. We are continuing to check services, but please contact sys-admin@cl.cam.ac.uk if you are experiencing any problems.

  • Identified

    The UIS West Cambridge Data Centre lost power at around 16:00. Power was restored at around 16:45. Much of our infrastructure was affected and is now restarting. We hope that most services will be back online very soon.

Disruption to Morello cluster network
  • Resolved

    This incident has been resolved (though the entire cluster was power-cycled after the earlier maintenance due to a datacentre-wide power outage).

    A further incident will be opened for the planned replacement of the temporary switch.

  • Monitoring

    We believe the temporary network setup is working. The faulty switch has been powered down. Please contact sys-admin if anything is not right.

  • Identified
    Update

    Repatching of servers is complete; all Morello systems in rack D9 are now connected to the same port on temporary switch wcdc-d9-sw2 that they were previously connected to on wcdc-d9-sw1.

    There will now be a brief outage to routing as we shut off the layer-3 functionality of wcdc-d9-sw1 and allow wcdc-d10-sw1 to take over.

    (The temporary setup involves D9 being daisy-chained from rack D10. The temporary switch is layer-2 only, so all layer-3 functionality - routing, DHCP - will be handled by wcdc-d10-sw1.)

  • Identified
    Update

    Repatching of Morello servers over to the new temporary switch is about to commence.

  • Identified
    Update

    Replacement of the faulty switch is likely to begin at 12:45. I hope to be able to set up the new temporary switch in parallel with the faulty switch to minimise disruption. Once the new switch is configured, each system will be replugged from the old switch to the new, with hopefully only a few seconds of disruption to each.

    A permanent replacement is being obtained (under warranty) and after that arrives we will plan another short outage to install it.

  • Identified
    Update

    The maintenance on the rack D10 switch is believed to have been successful, though work is ongoing to verify this.

    The replacement of the rack D9 switch will probably take place tomorrow (18 Oct) late morning / early afternoon.

    For the time being, we believe there is no service-affecting outage.

  • Identified

    The Morello cluster network requires urgent disruptive maintenance; it will imminently experience an outage affecting roughly half of the machines.

    The other machines will experience a longer outage tomorrow, as the switch serving those machines has a fault and needs replacement.

    This only affects the Morello cluster; if you don't know what that is, this incident does not affect you.
