University of Cambridge Computer Laboratory - Notice history

100% - uptime

Caelum Console (server management) - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Request Tracker - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Other Internal Services - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

External Services - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Network - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

GN09 - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

WCDC - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Main VM Pool (WCDC) - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

GPUs - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Secondary VM Hosts - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Xen Orchestra - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Filer - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Archive Server - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Data Replication - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Other Secondary Storage Systems - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Third Party: Fastmail → General Availability - Operational

Third Party: Fastmail → Mail delivery - Operational

Third Party: Fastmail → Web client and mobile app - Operational

Third Party: Fastmail → Mail access (IMAP/POP) - Operational

Third Party: Fastmail → Login & sessions - Operational

Third Party: Fastmail → Contacts (CardDAV) - Operational

Notice history

Sep 2024

No notices reported this month

Aug 2024

William Gates Building planned power outage
  • Update
    August 19, 2024 at 4:50 PM
    Completed

    The issue affecting power control of servers in GN09 rack 6 has been rectified.

  • Completed
    August 17, 2024 at 1:34 PM

    All infrastructure is believed to be operational again after this morning's electrical work, with the exception of remote power control for a small number of research servers in GN09 rack 6, where a PDU has developed a hardware fault. Power is still being supplied, but cannot be switched off or on remotely at the PDU. Servers can still be turned off or on via their BMCs, but if you need a server power-cycled, please contact service-desk@cl.cam.ac.uk.

  • Update
    August 17, 2024 at 11:54 AM
    In progress

    The electrical supply has been restored. We are in the process of restoring infrastructure and then GN09 servers. This is likely to take an hour or two.

  • In progress
    August 17, 2024 at 7:00 AM
    Maintenance is now in progress
  • Update
    August 17, 2024 at 7:00 AM
    Planned

    The William Gates Building will be without power for part of Saturday 17th August 2024, due to further planned work on the electrical switchgear that connects the building's new solar panels. This additional shutdown is needed to rectify a problem with one of the components installed during the January shutdown.

    Nearly all IT services in the William Gates Building will be unavailable for most of the day.

    Telephones, office networking and wifi will be unavailable all day (but the building is likely to be closed in any case). Please make sure that all computers in offices are shut down - not just asleep - when you leave on Friday.

    We will start shutting down servers at 8am ready for the power to be turned off at around 10am. We expect the power to come back on at approximately 1pm but it will then take some time to bring all systems back into operation.

    We will unfortunately need to shut down all servers in GN09 except for a very small number of critical services such as filer and network infrastructure (which will be powered from a temporary generator), as the cooling system will be offline for several hours and temperatures would otherwise climb to unsafe levels. This includes nearly all research servers and all GPU servers (including GPU VMs). GN09 holds almost all of our server hardware; if you are unsure where your server is located, it is probably in GN09 and will probably be affected. (A very small number of research systems are in the West Cambridge Data Centre, and will not be affected.)

    The outage is not expected to affect core infrastructure, administrative systems or small VMs, as these are hosted in the West Cambridge Data Centre. However, there is a risk that access to filer from these systems will be disrupted: we don't plan to turn filer off, but it is in GN09, its temporary electrical supply is at risk, and we may have to turn it off if it gets too hot. Where a service is replicated between multiple sites, only one instance of the service may be available (this affects most core services, such as LDAP, Active Directory and VPN2).

    VMs hosted by the department will stay running unless they are on the GPU VM clusters (this applies both to VMs with GPUs, and VMs with a lot of CPU cores - generally with names that contain "gpu", "cpu" or "dev").

    Services hosted externally to the department, for example by UIS, will not be affected - for example Moodle, CamSIS, HPC, Exchange email, Fastmail email and the main departmental (CST) website.

  • Planned
    August 16, 2024 at 11:09 AM

    Reminder that the William Gates Building's electrical supply will be turned off tomorrow.

    Please fully shut down office PCs before you leave today.

    Research and teaching servers in GN09 will be turned off tomorrow morning, except where already agreed.

Jul 2024

UIS firewall maintenance
  • Completed
    July 29, 2024 at 7:30 AM
    Maintenance has completed successfully
  • In progress
    July 29, 2024 at 5:00 AM
    Maintenance is now in progress
  • Planned
    July 29, 2024 at 5:00 AM

    UIS will be carrying out network maintenance on Monday 29 July from 6am to 8:30am (to physically reconnect a data centre firewall to a new network).

    The central IT services listed below will be unavailable for 10–30 minutes during this period:

    • CamSIS

    • CHRIS

    • CUFS

    • Research dashboard

    • X5

    • University DNS Service

    We recommend waiting until after 8:30am before logging in to the services listed above. They may come back online earlier than that, so you can try to log in if you have urgent work, but please be aware that you may experience connectivity issues.

    If you experience issues accessing the services after the maintenance period, try logging out and back in. If problems persist, please contact the UIS Service Desk.

Partial wifi outage in GS corridor west
  • Resolved

    The replacement wireless access point for GS corridor west has now been installed. The signal may be variable over the next day or two whilst the system automatically calibrates the new hardware. After that, please report any wifi signal issues to service-desk@cl.cam.ac.uk.

    We have also separately been planning a major upgrade to the wireless network in the building, expected to take place in a few months' time. That should improve the wifi signal and wireless network performance throughout the building.

  • Update

    We've taken delivery of a replacement wireless access point for the western end of the GS corridor; however, some facilities work is needed before it can be installed on the ceiling.

  • Update

    A replacement wireless access point will be delivered and installed tomorrow.

  • Identified

    The wireless access point in GS corridor west has apparently suffered a catastrophic hardware failure, and we are arranging a replacement.

  • Investigating

    We are aware that the wifi access point covering the western end of GS corridor stopped working last night, and are investigating.

Some mail into Fastmail is bouncing
  • Resolved

    We have not seen any recurrence of this issue since 2024-07-22 at 14:20, which coincides with when Fastmail fixed another problem, so we hypothesise that they fixed this one too by accident. However, they are continuing to investigate.

  • Update

    Fastmail has clarified that they have fixed only one of the two issues that we reported, so mail may continue to bounce with "bare <LF>" errors. We are in communication with Fastmail engineers to investigate the remaining problem, which appears to be a complex software bug. Nevertheless, no mail has bounced for the past 21 hours, so the problem may have been fixed by accident.

  • Monitoring

    Fastmail has now confirmed that they have implemented a fix for the problem.

  • Update

    We have still not heard back from Fastmail, but mail seems to have stopped bouncing around 14:20 yesterday. We will continue to press Fastmail for an update and will continue to monitor the situation ourselves.

    If you are a Fastmail user and would like us to check our logs for any mail to you that bounced, please contact service-desk@cl.cam.ac.uk.

  • Investigating

    We are aware that Fastmail is bouncing some mail to users of fm.cl.cam.ac.uk with an error message such as "Error: bare <LF> received". We will ask Fastmail to investigate.
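
    For readers unfamiliar with the error text: SMTP (RFC 5321) expects each line of a message to end with CR LF, and a "bare <LF>" is a line feed that is not preceded by a carriage return. The sketch below only illustrates what the error message refers to; it is not a diagnosis of the Fastmail problem, and the function names `find_bare_lf` and `normalise_line_endings` are hypothetical helpers invented for this example.

```python
import re

# A line feed that is NOT immediately preceded by a carriage return.
BARE_LF = re.compile(rb"(?<!\r)\n")


def find_bare_lf(data: bytes) -> list[int]:
    """Return the byte offsets of every bare LF in a raw message."""
    return [match.start() for match in BARE_LF.finditer(data)]


def normalise_line_endings(data: bytes) -> bytes:
    """Rewrite bare LFs as CRLF, leaving existing CRLF pairs untouched."""
    return BARE_LF.sub(b"\r\n", data)


if __name__ == "__main__":
    # Hypothetical message body: the second line ends with a bare LF.
    raw = b"Subject: test\r\nfirst line\nsecond line\r\n"
    print(find_bare_lf(raw))
    print(normalise_line_endings(raw))
```

    A check like this can be run over a saved copy of a bounced message to see whether it contains bare line feeds at all; an empty result would suggest the line endings were altered somewhere in transit rather than by the sender.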

GPU cluster storage maintenance
  • Completed
    July 16, 2024 at 6:02 PM

    This maintenance has been completed. Personal VMs can be started via Xen Orchestra (https://xo.cl.cam.ac.uk/) as needed.

  • Update
    July 16, 2024 at 5:36 PM
    In progress

    The outage has overrun due to a problem encountered during the storage server's RAM upgrade. Progress is being made; we can still upgrade the RAM and restore service, just not in the way we expected to. Current estimate for restoration of service: 19:00-19:15.

  • Update
    July 16, 2024 at 4:54 PM
    In progress

    This work is ongoing and is likely to overrun due to an unexpected hardware problem.

  • In progress
    July 16, 2024 at 4:00 PM
    Maintenance is now in progress
  • Update
    July 16, 2024 at 4:00 PM
    Planned

    The server that hosts storage for the departmental GPU cluster needs an urgent security update and a reboot.

    This will necessitate shutting down all GPU and CPU development VMs (dev-gpu-* and dev-cpu-*), including the shared servers dev-gpu-1 and dev-cpu-1. These VMs' disks, as well as the associated data directories (GPU home directories and shared "gpuscratch" space), will be unavailable for about half an hour. As it will take time to shut down and restart the VM infrastructure, the VMs themselves will be unavailable for longer: approximately an hour.

    We will take the opportunity to add RAM to the storage server too, to improve performance.

  • Planned
    July 16, 2024 at 11:00 AM

    Reminder: this maintenance is taking place at 17:00 today and will require all dev-gpu-* and dev-cpu-* VMs to be shut down.
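
    The reminder identifies the affected VMs by the shell-style patterns dev-gpu-* and dev-cpu-*. As a minimal sketch, assuming you keep your own list of VM names, the check below shows how those patterns could be matched; `is_affected` and the example names other than dev-gpu-1 and dev-cpu-1 are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns quoted in the maintenance reminder.
AFFECTED_PATTERNS = ("dev-gpu-*", "dev-cpu-*")


def is_affected(vm_name: str) -> bool:
    """Return True if the VM name matches one of the announced patterns."""
    return any(fnmatchcase(vm_name, pattern) for pattern in AFFECTED_PATTERNS)


if __name__ == "__main__":
    # "web-1" is an invented name used only to show a non-matching case.
    for name in ("dev-gpu-1", "dev-cpu-1", "web-1"):
        print(name, is_affected(name))
```

    `fnmatchcase` is used rather than `fnmatch` so that matching is case-sensitive on every platform.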

Jul 2024 to Sep 2024
