University of Cambridge Computer Laboratory - Notice history

100% - uptime

Caelum Console (server management) - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

Request Tracker - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

Other Internal Services - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

External Services - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

Network - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

GN09 - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

WCDC - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

Main VM Pool (WCDC) - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

GPUs - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

Secondary VM Hosts - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

Xen Orchestra - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

Filer - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

Archive Server - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

Data Replication - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

Other Secondary Storage Systems - Operational

100% - uptime
Jan 2023: 100.0% · Feb 2023: 100.0% · Mar 2023: 100.0%

Third Party: Fastmail → General Availability - Operational

Third Party: Fastmail → Mail delivery - Operational

Third Party: Fastmail → Web client and mobile app - Operational

Third Party: Fastmail → Mail access (IMAP/POP) - Operational

Third Party: Fastmail → Login & sessions - Operational

Third Party: Fastmail → Contacts (CardDAV) - Operational

Notice history

Feb 2023

Archive server: temporarily reduced performance following electrical fault
  • Resolved

    We believe the performance impact of the ongoing data integrity checks is minimal, so we are closing the incident.

  • Update

    The first, and smallest, of the archive server's three storage pools (CIFS data) has finished its integrity check ("scrub") without encountering any problems. Scrubbing that 32TB took 12.5 hours.

    Yet to be checked are the NFS data pool (156TB) and then the NFS backup pool (186TB). Extrapolating, it will take weeks to finish checking the data integrity (a rough estimate is sketched after this timeline).

    We believe that the performance impact is not excessive. We will continue to monitor the impact whilst we start to scan the busier NFS data pool, and if performance remains acceptable we will probably then close this incident whilst the remainder of the data is checked.

  • Monitoring

    NFS shares should now have been restored.

    Slightly impaired performance will continue, potentially for a couple of days, due to ongoing data integrity checks.

  • Identified

    Following the restart of the archive server, some configuration has been lost -- in particular, some filesystems are no longer shared via NFS. This problem is being investigated.

  • Monitoring

    The archive server's filesystems are now available again; however, performance will be degraded for a while on some filesystems whilst the server verifies data integrity.

  • Identified

    At 16:39 on 2023-02-08, the West Cambridge Data Centre was hit by a surge on the power grid which affected many systems across the University. In our case, the power fault caused two disk enclosures forming part of the archive server (archive.cl.cam.ac.uk) to briefly disconnect from the server. All volumes on this server are currently unavailable. We are working on restoring service.
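
  A rough back-of-the-envelope sketch of the scrub-time extrapolation mentioned in the update above. It assumes, purely for illustration, that the roughly 2.6TB/hour rate observed on the idle CIFS pool (32TB in 12.5 hours) carries over to the remaining pools; in practice, scrubs of the larger, busier NFS pools run considerably more slowly under live load, so the figures below are only a lower bound and the multi-week estimate above remains the operative one.

    # Hypothetical estimate only: rate and pool sizes are taken from the update
    # above, and real scrubs slow down considerably under concurrent I/O.
    observed_tb = 32.0        # TB scrubbed on the CIFS data pool
    observed_hours = 12.5     # time that scrub took

    rate_tb_per_hour = observed_tb / observed_hours   # ~2.56 TB/hour

    remaining_pools = {"NFS data": 156.0, "NFS backup": 186.0}  # TB, scrubbed in turn

    total_hours = 0.0
    for name, size_tb in remaining_pools.items():
        hours = size_tb / rate_tb_per_hour
        total_hours += hours
        print(f"{name}: at least {hours:.0f} hours ({hours / 24:.1f} days)")

    print(f"Total: at least {total_hours:.0f} hours ({total_hours / 24:.1f} days) of scrub time")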

Jan 2023

GN09 datacentre electrical upgrade
  • Completed
    13 January, 2023 at 18:14

    Maintenance has been completed successfully. All affected servers and network switches are back up (most as of several hours ago).

  • In progress
    13 January, 2023 at 13:52

    The UPS electrical maintenance work has finished; we will now start re-energising circuits within the GN09 datacentre. This will be a gradual process, so expect the disruption to continue for the time being. An update will be posted when all rack power feeds are live.

  • Planned
    13 January, 2023 at 09:00

    Work to upgrade power resilience in GN09 will take place from 10th until 13th January 2023, requiring a shutdown of all circuits supplied by our primary UPS for most of 13th January.

    Some servers used for research are supplied solely from those circuits, and will need to be turned off for the day unless alternative arrangements are made. We have some limited capacity to provide alternative power feeds for a subset of servers. Owners of affected machines have been contacted, and must get in touch as instructed if they require their machine to have a temporary power feed on 13th January. Otherwise, such machines will be powered off for the duration.

    A larger set of servers will lose networking for the duration unless an alternative power feed is set up for their rack switch. Owners of such affected machines have also been contacted and must get in touch if the network outage would be too disruptive.

    Servers' out-of-band management (Caelum Console) will be partially unavailable.

    Besides the above, most other machines in GN09 (except the core network and filer which have a secondary UPS) will be without power resilience for the day. Servers with redundant PSUs will temporarily be vulnerable to a single PSU failure. Immediate widespread disruption is possible in the event of a problem with the mains power.

  • In progress
    13 January, 2023 at 08:47

    Shutdown of affected servers will begin shortly.

Reduced capacity for GPU VMs this week
  • Completed
    13 January, 2023 at 13:28

    We have been able to provide sufficient GPU capacity again.

  • In progress
    11 January, 2023 at 00:00

    In order to reduce the electrical load in preparation for Friday's electrical works, we have had to turn off some of our GPU VM hosts until Friday evening. If you need to use your GPU VM and are unable to start it, contact sys-admin as we may be able to find capacity for you.
