University of Cambridge Computer Laboratory - Notice history

100% - uptime

Caelum Console (server management) - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%

Request Tracker - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%

Other Internal Services - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%

External Services - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%

Network - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%
100% - uptime

GN09 - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%

WCDC - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%
100% - uptime

Main VM Pool (WCDC) - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%

GPUs - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%

Secondary VM Hosts - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%

Xen Orchestra - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%
100% - uptime

Filer - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%

Archive Server - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%

Data Replication - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%

Other Secondary Storage Systems - Operational

100% - uptime
Jan 2023 · 100.0% | Feb 2023 · 100.0% | Mar 2023 · 100.0%
100% - uptime

Third Party: Fastmail → General Availability - Operational

Third Party: Fastmail → Mail delivery - Operational

Third Party: Fastmail → Web client and mobile app - Operational

Third Party: Fastmail → Mail access (IMAP/POP) - Operational

Third Party: Fastmail → Login & sessions - Operational

Third Party: Fastmail → Contacts (CardDAV) - Operational

Notice history

Feb 2023

Archive server: temporarily reduced performance following electrical fault
  • Resolved

    We believe the performance impact of the ongoing data integrity checks is minimal, so we are closing the incident.

  • Monitoring
    Update

    The first, and smallest, of archive's three storage pools (CIFS data) has finished its integrity check ("scrub") without encountering any problems. Scrubbing its 32 TB took 12.5 hours.

    Yet to be checked are the NFS data pool (156TB) and then the NFS backup pool (186TB). Extrapolating, it will take weeks to finish checking the data integrity.

    We believe that the performance impact is not excessive. We will continue to monitor the impact whilst we start to scan the busier NFS data pool, and if the performance remains acceptable we will probably then close this incident whilst the remainder of the data is checked.

  • Monitoring

    NFS shares should now have been restored.

    Slightly impaired performance will continue, potentially for a couple of days, due to ongoing data integrity checks.

  • Identified

    Following the restart of the archive server, some configuration has been lost; in particular, some filesystems are no longer shared via NFS. This problem is being investigated.

  • Monitoring

    The archive server's filesystems are now available again; however, performance will be degraded for a while on some filesystems whilst the server verifies data integrity.

  • Identified

    At 16:39 on 2023-02-08, the West Cambridge Data Centre was hit by a surge on the power grid that affected many systems across the University. In our case, the power fault caused two disk enclosures forming part of the archive server (archive.cl.cam.ac.uk) to briefly disconnect from the server. All volumes on this server are currently unavailable. We are working on restoring service.
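
As a rough sanity check on the scrub figures quoted in the archive server update above, the sketch below extrapolates linearly from the rate measured on the CIFS pool. It assumes that the two NFS pools would scrub at the same rate, which the notice does not state. The result is therefore only an optimistic lower bound: pools that are busier or laid out differently can be expected to take considerably longer, which fits the notice's more cautious estimate. All names in the code are illustrative.

```python
# Figures quoted in the notice: the CIFS pool (32 TB) took 12.5 hours to scrub.
CIFS_POOL_TB = 32
CIFS_SCRUB_HOURS = 12.5

# Pools still to be checked, with sizes in TB as quoted in the notice.
REMAINING_POOLS_TB = {"NFS data": 156, "NFS backup": 186}


def estimate_scrub_hours(size_tb: float, rate_tb_per_hour: float) -> float:
    """Linear estimate: time = size / rate."""
    return size_tb / rate_tb_per_hour


# ASSUMPTION: the remaining pools scrub at the rate observed on the CIFS pool.
rate = CIFS_POOL_TB / CIFS_SCRUB_HOURS  # 2.56 TB per hour

estimates = {
    name: estimate_scrub_hours(size_tb, rate)
    for name, size_tb in REMAINING_POOLS_TB.items()
}
total_hours = sum(estimates.values())

for name, hours in estimates.items():
    print(f"{name}: about {hours:.1f} hours ({hours / 24:.1f} days)")
print(f"Total (lower bound): about {total_hours / 24:.1f} days")
```

Since the pools are scrubbed one after another, the per-pool estimates simply add. Any slowdown caused by user I/O on the busier pools would lengthen the total accordingly.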

Jan 2023

GN09 datacentre electrical upgrade
  • Completed
    January 13, 2023 at 6:14 PM

    Maintenance has been completed successfully. All affected servers and network switches are back up (most as of several hours ago).

  • In progress
    January 13, 2023 at 1:52 PM

    The UPS electrical maintenance work has finished; we will now start re-energising circuits within the datacentre GN09. This will be a gradual process so expect the disruption to continue for the time being. An update will be posted when all rack power feeds are live.

  • Planned
    January 13, 2023 at 9:00 AM

    Work to upgrade power resilience in GN09 will take place from 10th until 13th January 2023, requiring a shutdown of all circuits supplied by our primary UPS for most of 13th January.

    Some servers used for research are supplied solely from those circuits, and will need to be turned off for the day unless alternative arrangements are made. We have some limited capacity to provide alternative power feeds for a subset of servers. Owners of affected machines have been contacted, and must get in touch as instructed if they require their machine to have a temporary power feed on 13th January. Otherwise, such machines will be powered off for the duration.

    A larger set of servers will lose networking for the duration unless an alternative power feed is set up for their rack switch. Owners of such affected machines have also been contacted and must get in touch if the network outage would be too disruptive.

    Servers' out-of-band management (Caelum Console) will be partially unavailable.

    Besides the above, most other machines in GN09 (except the core network and filer which have a secondary UPS) will be without power resilience for the day. Servers with redundant PSUs will temporarily be vulnerable to a single PSU failure. Immediate widespread disruption is possible in the event of a problem with the mains power.

  • Update
    In progress
    January 13, 2023 at 9:00 AM

  • In progress
    January 13, 2023 at 8:47 AM

    Shutdown of affected servers will begin shortly.

Reduced capacity for GPU VMs this week
  • Completed
    January 13, 2023 at 1:28 PM

    We have been able to provide sufficient GPU capacity again.

  • In progress
    January 11, 2023 at 12:00 AM

    In order to reduce the electrical load in preparation for Friday's electrical works, we have had to turn off some of our GPU VM hosts until Friday evening. If you need to use your GPU VM and are unable to start it, contact sys-admin as we may be able to find capacity for you.

Jan 2023 to Mar 2023