University of Cambridge Computer Laboratory - Notice history


Caelum Console (server management) - Operational

99% - uptime
Nov 2023 · 97.90% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

Request Tracker - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

Other Internal Services - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

External Services - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

Network - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

GN09 - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

WCDC - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

Main VM Pool (WCDC) - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

GPUs - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 99.96% · Jan 2024 · 99.95%
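For context, a monthly uptime percentage maps directly onto minutes of downtime. A minimal sketch (the 31-day month length is an assumption for illustration):

```python
def downtime_minutes(uptime_pct: float, days: int = 31) -> float:
    """Convert a monthly uptime percentage to minutes of downtime."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_pct / 100)

# 99.95% uptime over a 31-day month is roughly 22 minutes of downtime
print(round(downtime_minutes(99.95)))  # → 22
```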

Secondary VM Hosts - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

Xen Orchestra - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

Filer - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

Archive Server - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

Data Replication - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

Other Secondary Storage Systems - Operational

100% - uptime
Nov 2023 · 100.0% · Dec 2023 · 100.0% · Jan 2024 · 100.0%

Third Party: Fastmail → General Availability - Operational

Third Party: Fastmail → Mail delivery - Operational

Third Party: Fastmail → Web client and mobile app - Operational

Third Party: Fastmail → Mail access (IMAP/POP) - Operational

Third Party: Fastmail → Login & sessions - Operational

Third Party: Fastmail → Contacts (CardDAV) - Operational

Notice history

Jan 2024

GPU cluster filesystem outage
  • Resolved

    This incident has been resolved. Affected VMs have either been started again (in the case of CPU VMs and shared servers) or shut down (in the case of GPU VMs). The latter can be started again as needed from https://xo.cl.cam.ac.uk.

    The cause of this disruption was that the server holding VM disks and user data failed to start with the correct network configuration after Sunday's planned shutdown; we suspect a bug caused it to make incorrect changes to the configuration during boot. An attempt to rectify the problem 'live' caused about a minute's disruption, unfortunately just long enough for Linux VMs to time out I/O operations and shut down their filesystems.

  • Identified · Update

    The storage system has been stabilised. Some VMs need a filesystem repair due to the ungraceful disconnection of their disks. Each VM that was running during this incident will be started to ensure that its filesystems are clean, repaired if necessary, then (for GPU VMs) shut down again.

  • Identified

    Running VMs will have failed due to their disks briefly becoming inaccessible. Work is ongoing to stabilise the server, which experienced an unexpectedly disruptive problem with its network configuration; once it is stable, affected VMs will be shut down and can be started again from XO.

  • Investigating

    GPU VMs (and other VMs running on the same cluster) may have experienced a filesystem fault due to a problem with the storage server currently under investigation.
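When a VM's disks vanish mid-write, Linux typically aborts the filesystem or remounts it read-only, which is why each affected VM needed a clean start and possibly a repair. As a minimal illustration (the sample data below is hypothetical), read-only filesystems can be spotted by parsing /proc/mounts-style output:

```python
def readonly_mounts(mounts_text: str) -> list[str]:
    """Return mount points whose mount options include 'ro' (read-only)."""
    hits = []
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) >= 4:
            mountpoint, options = parts[1], parts[3].split(",")
            if "ro" in options:
                hits.append(mountpoint)
    return hits

# Hypothetical sample: root filesystem remounted read-only after an I/O error
sample = ("/dev/xvda1 / ext4 ro,relatime 0 0\n"
          "/dev/xvdb1 /data ext4 rw,relatime 0 0")
print(readonly_mounts(sample))  # → ['/']
```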

William Gates Building planned power outage
  • Completed · January 15, 2024 at 1:44 PM

    We think that (except where we're already in communication with the affected users about a specific issue) everything is back to normal after the planned electrical shutdown. Please contact sys-admin if you notice any issues.

  • In progress · January 14, 2024 at 8:03 PM

    Datacentre infrastructure has been restored. Owners of servers can now start them via the Caelum console (if access is set up); owners of GPU/CPU development VMs can start them via Xen Orchestra as usual. Contact sys-admin if any needed system is down or misbehaving.

  • In progress · January 14, 2024 at 5:15 PM

    Power has been restored to the building. It will now take some time, perhaps hours, to restore all systems starting with core infrastructure. Please be patient if your system remains unavailable.

  • In progress · January 14, 2024 at 3:54 PM

    Revised estimate on the restoration of power to the building: 17:00-17:30.

  • In progress · January 14, 2024 at 8:00 AM

    The electrical work is in progress. Systems in GN09 including GPU VMs will remain off until the work is complete, tentatively estimated for 16:00. After that it will take some hours to fully restore all systems.

  • Planned · January 13, 2024 at 5:00 PM

    The William Gates Building will be without power all day on Sunday 14th January 2024, due to planned work on our electrical switch gear to connect our new solar panels. This is the second and final shutdown planned as part of the solar panel installation.

    Nearly all IT services in the William Gates Building will be unavailable for roughly 24 hours, perhaps longer. We will start shutting systems down on the evening of Saturday 13th January ready for the power to be turned off the following morning; we expect the power to come back on during the evening of Sunday 14th January but it will then take some time to bring all systems back into operation. We expect most services to be available by Monday morning, but there is a small chance that a few things won't initially be working properly on Monday.

    Telephones, office networking and wifi will be unavailable all day on Sunday (but the building is likely to be closed in any case). Please make sure that all computers in offices are shut down (not just asleep) before Saturday evening.

    Due to the longer outage this time, we will unfortunately need to shut down all servers in GN09 except for a very small number of critical services such as filer, as the cooling system will be offline all day and temperatures would otherwise climb to unsafe levels.

    This includes nearly all research servers and all GPU servers (including GPU VMs). GN09 holds almost all of our server hardware; if you are unsure where your server is located, it is probably in GN09 and will probably be affected. (A very small number of research systems are in the West Cambridge Data Centre, and will not be affected.)

    The outage is not expected to affect core infrastructure, administrative systems or small VMs as these are hosted in the West Cambridge Data Centre. However there is a risk that access to filer from these systems will be disrupted; we don't plan to turn filer off, but it is in GN09 and we may have to act if it gets too hot. Where a service is replicated between multiple sites, only one instance of the service may be available (this affects most core services such as LDAP, Active Directory and VPN2).

    VMs hosted by the department will stay running unless they are on the GPU VM clusters (this applies both to VMs with GPUs, and VMs with a lot of CPU cores - generally with names that contain "gpu" or "cpu").

    Services hosted externally to the department, for example by UIS, will not be affected - for example Moodle, CamSIS, HPC, Exchange email, Fastmail email and the main departmental (CST) website.

Dec 2023 (no notices)

Nov 2023

William Gates Building planned power outage
  • Completed · November 25, 2023 at 3:35 PM

    This maintenance has been completed successfully.

    If any IT systems are still not working, please contact sys-admin. If any electrical circuits remain down, please contact building-services.

  • In progress · November 25, 2023 at 12:46 PM

    Power to the building is gradually being restored. GN09 servers will be powered back up once the cooling system has returned to its normal operating temperature.

    Office power and networking in parts of the building will remain down as the maintenance work in wiring cupboards is ongoing.

  • In progress · November 25, 2023 at 10:03 AM

    The start of the electrical work was delayed by two hours due to a generator fault. The work is now under way but is likely to run past 12:00.

  • Planned · November 25, 2023 at 8:00 AM

    The William Gates Building will be without power for the morning of 25th November, due to planned work on our electrical switch gear to facilitate the upcoming commissioning of a substantial amount of solar power generation.

    Electrical circuits in the William Gates Building datacentre, GN09, which are connected via the UPS should remain powered throughout the maintenance, running from a backup generator.

    Other electrical circuits will go down; in particular, a few research servers are connected only to non-UPS circuits. A list of these will be circulated.

    GN09 will also be without active cooling for the duration of this work; we will have temporary air blowers in place to reduce the buildup of hot air but we may need to shut down high-powered servers (for example GPU and FPGA servers) depending on the weather and temperature.

    The following machines in GN09 will lose power during this maintenance. If possible, please shut them down before the maintenance starts (otherwise we will try to shut them down by pressing the power button). Please wait for our announcement that the maintenance is complete before starting them again from the Caelum Console.

    Some other servers may be shut down as well, in particular GPU and FPGA servers, to reduce the electrical and/or thermal load in GN09.

    • virtual machines on the GPU cluster (dev-gpu-…, dev-cpu-… and others as notified separately)
    • grumpy
    • gxp06
    • tarawera
    • ngongotaha
    • all quorum servers
    • stix
    • story
    • L51 Raspberry Pi cluster
    • godzilla
    • tiger
    • baume
    • ctsrd-slave2
    • rama
    • cat
    • chericloud-switch
    • rado
    • wenger
    • wolf0/1/2
    • edale
    • glencoe
    • sakura
    • ran
    • nana
    • momo
    • gilling
    • sigyn
    • idun
    • heimdall
    • nikola01/02/03/04
    • acritarch
    • morello101-dev/102-dev/103-dev
    • sleepy
    • doc
    • sherwood
    • behemoth
    • leviathan
    • excalibur
    • bam
    • kinabalu
    • daintree
    • marpe
    • iphito
    • doris
    • asteria
    • all POETS servers
    • mauao
    • any other GPU or FPGA server observed to be drawing a lot of power on Friday evening or Saturday morning
  • In progress · November 25, 2023 at 7:08 AM

    Shutdown of listed servers is now beginning.

  • In progress · November 24, 2023 at 10:14 PM

    VMs on the GPU cluster are now shutting down. They can be started again from Xen Orchestra once the electrical work is finished, tomorrow afternoon.

  • Planned · November 24, 2023 at 6:12 PM

    Shutdown of GPU VMs will begin at 22:00 today (Friday).

    Shutdown of affected physical servers will begin at 07:00 on Saturday morning.

Nov 2023 to Jan 2024
