University of Cambridge Computer Laboratory - Urgent storage server maintenance – Maintenance details

Urgent storage server maintenance

Completed
Scheduled for November 29, 2024 at 2:00 PM – 3:18 PM

Affects

Virtual Machine Hosting

Under maintenance from 2:00 PM to 3:18 PM

GPUs

Under maintenance from 2:00 PM to 3:18 PM

Updates
  • Completed
    November 29, 2024 at 3:18 PM

    This maintenance has been completed. Some dev-gpu/dev-cpu VMs may need rebooting if they lost write access to their filesystems. Please contact service-desk@cst.cam.ac.uk if you encounter any problems.

  • Update
    November 29, 2024 at 3:04 PM
    In progress

    We now think that a storage server reboot will be required. VMs will be paused for a short while. Home directory and gpuscratch access from shared servers will be interrupted for a few minutes.

  • In progress
    November 29, 2024 at 2:00 PM
    Maintenance is now in progress.
  • Planned
    November 29, 2024 at 2:00 PM

    Following the GPU cluster storage incident on 17 November, one of the servers that we are using as a temporary host for GPU cluster data needs an urgent power supply upgrade. During the data recovery we added SSDs to this server, and the system unexpectedly began reporting that its power usage could now theoretically exceed the capacity of its power supplies (in practice it does not).

    We will replace the power supplies whilst the server is in use, as we believe this will not be disruptive. There is, however, a chance of a server reboot, which would cause some temporary disruption to the GPU cluster.