Internal Services

100.0% uptime

Caelum Console (server management)

100.0% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 100.0% uptime

Other Internal Services

100.0% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 100.0% uptime

External Services

100.0% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 100.0% uptime

Network

99.74% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 99.18% uptime
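The quarter-level figure above is consistent with a day-weighted average of the monthly figures (Dec and Jan have 31 days; Feb 2024 has 29). A quick sketch of that arithmetic, assuming simple day weighting rather than the status page's exact minute-level accounting:

```python
# Day-weighted average of monthly uptime, plus a rough downtime conversion.
# Assumption: the page averages by days; the real tool likely counts minutes.

def period_uptime(monthly: dict[str, tuple[float, int]]) -> float:
    """monthly maps month -> (uptime %, days in month)."""
    total_days = sum(days for _, days in monthly.values())
    weighted = sum(pct * days for pct, days in monthly.values())
    return weighted / total_days

network = {"Dec 2023": (100.0, 31), "Jan 2024": (100.0, 31), "Feb 2024": (99.18, 29)}
print(round(period_uptime(network), 2))  # 99.74, matching the figure shown

# 99.18% over February 2024 means roughly 0.82% of 29*24*60 minutes of downtime:
print(round((100 - 99.18) / 100 * 29 * 24 * 60))  # ~342 minutes (about 5.7 hours)
```

The same weighting reproduces the other quarter figures, e.g. 99.96/99.95/100.0 for the GPUs gives 99.97.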

Datacentres

100.0% uptime

GN09

100.0% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 100.0% uptime

WCDC

100.0% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 100.0% uptime

Virtual Machine Hosting

99.99% uptime

Main VM Pool (WCDC)

100.0% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 100.0% uptime

GPUs

99.97% uptime
Dec 2023: 99.96% uptime
Jan 2024: 99.95% uptime
Feb 2024: 100.0% uptime

Secondary VM Hosts

100.0% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 100.0% uptime

Xen Orchestra

100.0% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 100.0% uptime

Data Storage

99.99% uptime

Filer

99.96% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 99.87% uptime

Archive Server

100.0% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 100.0% uptime

Data Replication

100.0% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 100.0% uptime

Other Secondary Storage Systems

100.0% uptime
Dec 2023: 100.0% uptime
Jan 2024: 100.0% uptime
Feb 2024: 100.0% uptime

Fastmail email

100.0% uptime

Third Party: Fastmail → General Availability

Operational

Third Party: Fastmail → Mail delivery

Operational

Third Party: Fastmail → Web client and mobile app

Operational

Third Party: Fastmail → Mail access (IMAP/POP)

Operational

Third Party: Fastmail → Login & sessions

Operational

Third Party: Fastmail → Contacts (CardDAV)

Operational

Notice history

Feb 2024 · No notices

Jan 2024

GPU cluster filesystem outage
  • Resolved

    This incident has been resolved. Affected VMs have either been started again (in the case of CPU VMs and shared servers) or shut down (in the case of GPU VMs). The latter can be started again as needed from https://xo.cl.cam.ac.uk.

    The cause of this disruption was that the server on which VM disks and user data are stored failed to start with the correct network configuration after Sunday's planned shutdown (we suspect due to a bug causing it to make some incorrect changes to the configuration during boot). An attempt to rectify this problem 'live' caused about a minute's disruption, unfortunately just long enough to cause Linux VMs to time out I/O operations and shut down their filesystems.

  • Identified

    The storage system has been stabilised. Some VMs need a filesystem repair due to the ungraceful disconnection of their disks. Each VM that was running during this incident will be started to ensure that its filesystems are clean, repaired if necessary, then (for GPU VMs) shut down again.

  • Identified

    Running VMs will have failed due to their disks briefly becoming inaccessible. Once the server has been stabilised, affected VMs will be shut down and can be started again from XO. However, work is ongoing first to stabilise the server, which experienced an unexpectedly disruptive problem with its network configuration.

  • Investigating

    GPU VMs (and other VMs running on the same cluster) may have experienced a filesystem fault due to a problem with the storage server currently under investigation.
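One detail worth making explicit from the resolution note above: many Linux systems give up on a block I/O request after roughly 30 seconds (a common default for the SCSI-layer command timeout exposed at /sys/block/<dev>/device/timeout), so "about a minute" of storage inaccessibility is comfortably long enough for errors to reach the filesystem, which may then abort its journal and remount read-only. A minimal sketch of that arithmetic, assuming the common 30 s default (verify the value on your own systems):

```python
# Rough arithmetic behind the failure mode described in the resolution note.
# Assumption: a 30 s I/O timeout, a common Linux SCSI-layer default
# (/sys/block/<dev>/device/timeout); your distribution may differ.
IO_TIMEOUT_S = 30
disruption_s = 60  # "about a minute" of storage inaccessibility

# Once the disruption outlasts the timeout, I/O errors surface to the
# filesystem, which may abort its journal and remount read-only.
print(disruption_s > IO_TIMEOUT_S)  # True
```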

William Gates Building planned power outage
  • Completed · January 15, 2024 at 1:44 PM

    We think that (except where we're already in communication with the affected users about a specific issue) everything is back to normal after the planned electrical shutdown. Please contact sys-admin if you notice any issues.

  • In progress · January 14, 2024 at 8:03 PM

    Datacentre infrastructure has been restored. Owners of servers can now start them via the Caelum console (if access is set up); owners of GPU/CPU development VMs can start them via Xen Orchestra as usual. Contact sys-admin if any needed system is down or misbehaving.

  • In progress · January 14, 2024 at 5:15 PM

    Power has been restored to the building. It will now take some time, perhaps hours, to restore all systems starting with core infrastructure. Please be patient if your system remains unavailable.

  • In progress · January 14, 2024 at 3:54 PM

    Revised estimate on the restoration of power to the building: 17:00-17:30.

  • In progress · January 14, 2024 at 8:00 AM

    The electrical work is in progress. Systems in GN09 including GPU VMs will remain off until the work is complete, tentatively estimated for 16:00. After that it will take some hours to fully restore all systems.

  • Planned · January 13, 2024 at 5:00 PM

    The William Gates Building will be without power all day on Sunday 14th January 2024, due to planned work on our electrical switch gear to connect our new solar panels. This is the second and final shutdown planned as part of the solar panel installation.

    Nearly all IT services in the William Gates Building will be unavailable for roughly 24 hours, perhaps longer. We will start shutting systems down on the evening of Saturday 13th January ready for the power to be turned off the following morning; we expect the power to come back on during the evening of Sunday 14th January but it will then take some time to bring all systems back into operation. We expect most services to be available by Monday morning, but there is a small chance that a few things won't initially be working properly on Monday.

    Telephones, office networking and wifi will be unavailable all day on Sunday (but the building is likely to be closed in any case). Please make sure that all computers in offices are shut down (not just asleep) before Saturday evening.

    Due to the longer outage this time, we will unfortunately need to shut down all servers in GN09 except for a very small number of critical services such as filer, as the cooling system will be offline all day and temperatures would otherwise climb to unsafe levels.

    This includes nearly all research servers and all GPU servers (including GPU VMs). GN09 holds almost all of our server hardware; if you are unsure where your server is located, it is probably in GN09 and will probably be affected. (A very small number of research systems are in the West Cambridge Data Centre, and will not be affected.)

    The outage is not expected to affect core infrastructure, administrative systems or small VMs, as these are hosted in the West Cambridge Data Centre. However, there is a risk that access to filer from these systems will be disrupted; we don't plan to turn filer off, but it is in GN09 and we may have to act if it gets too hot. Where a service is replicated between multiple sites, only one instance of the service may be available (this affects most core services such as LDAP, Active Directory and VPN2).

    VMs hosted by the department will stay running unless they are on the GPU VM clusters (this applies both to VMs with GPUs and to VMs with many CPU cores, generally those with names containing "gpu" or "cpu").

    Services hosted externally to the department, for example by UIS, will not be affected; this includes Moodle, CamSIS, HPC, Exchange email, Fastmail email and the main departmental (CST) website.
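The naming rule above ("gpu" or "cpu" in the VM name) can be sketched as a simple filter. The VM names here are made-up examples, not real department hosts:

```python
# Hypothetical helper: which department-hosted VMs would be shut down?
# Per the notice, VMs on the GPU VM clusters generally have "gpu" or "cpu"
# in their names; these example names are illustrative only.
def affected_vms(names: list[str]) -> list[str]:
    return [n for n in names if "gpu" in n.lower() or "cpu" in n.lower()]

print(affected_vms(["web-vm", "gpu-train-01", "cpu-batch-7", "filer"]))
# ['gpu-train-01', 'cpu-batch-7']
```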

Dec 2023 · No notices

Dec 2023 to Feb 2024