University of Cambridge Computer Laboratory - Urgent storage server maintenance – Maintenance details

Urgent storage server maintenance

Completed
Scheduled for 29 November, 2024 at 14:00 – 15:18

Affects

Virtual Machine Hosting

Under maintenance from 14:00 to 15:18

GPUs

Under maintenance from 14:00 to 15:18

Updates
  • Completed
    29 November, 2024 at 15:18

    This maintenance has been completed. Some dev-gpu/dev-cpu VMs may need rebooting if they lost write access to their filesystems. Please contact service-desk@cst.cam.ac.uk if you encounter any problems.

  • Update
    29 November, 2024 at 15:04
    In progress

    We now think that a storage server reboot will be required. VMs will be paused for a short while. Home directory and gpuscratch access from shared servers will be interrupted for a few minutes.

  • In progress
    29 November, 2024 at 14:00
    Maintenance is now in progress.
  • Planned
    29 November, 2024 at 14:00

    Following the GPU cluster storage incident on 17 November, one of the servers that we are using as a temporary host for GPU cluster data needs an urgent power supply upgrade. During the data recovery we added SSDs to this server, which unexpectedly caused the system to report that its power usage could now theoretically exceed the capacity of its power supplies (even though in practice it does not).

    We will replace the power supplies whilst the server is in use, as we believe this will not be disruptive. There is, however, a chance of a server reboot, which would cause some temporary disruption to the GPU cluster.
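
The completion update notes that some dev-gpu/dev-cpu VMs may need rebooting if they lost write access to their filesystems. The following is a minimal sketch of one way a user might check this on a Linux VM, assuming that an affected filesystem shows up as read-only (`ro`) in `/proc/mounts` or rejects new files. The function names `read_only_mounts` and `can_write` are hypothetical helpers written for illustration and are not part of any service tooling.

```python
import os
import tempfile


def read_only_mounts(mounts_text):
    """Return mount points whose options include 'ro'.

    Expects text in the /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


def can_write(directory):
    """Try to create (and automatically remove) a temporary file."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        # Covers read-only filesystems, missing directories and
        # permission errors alike.
        return False


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print("Read-only mounts:", read_only_mounts(f.read()))
    home = os.path.expanduser("~")
    print("Can write to", home, ":", can_write(home))
```

Some mounts are read-only by design, so a non-empty list is not in itself a sign of a problem. A failed write test on a directory that is normally writable is the stronger hint that the VM needs a reboot.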