University of Cambridge Computer Laboratory - Urgent storage server maintenance – Maintenance details
Urgent storage server maintenance
Completed
Scheduled for November 29, 2024 at 2:00 PM – 3:18 PM
Affects
Virtual Machine Hosting
Under maintenance from 2:00 PM to 3:18 PM
GPUs
Under maintenance from 2:00 PM to 3:18 PM
Updates
Completed
November 29, 2024 at 3:18 PM
This maintenance has been completed. Some dev-gpu/dev-cpu VMs may need rebooting if they lost write access to their filesystems. Please contact service-desk@cst.cam.ac.uk if you encounter any problems.
Update
November 29, 2024 at 3:04 PM
In progress
We now think that a storage server reboot will be required. VMs will be paused for a short while. Home directory and gpuscratch access from shared servers will be interrupted for a few minutes.
In progress
November 29, 2024 at 2:00 PM
Maintenance is now in progress.
Planned
November 29, 2024 at 2:00 PM
In the aftermath of the GPU cluster storage incident on 17 November, one of the servers that we are using as a temporary host for GPU cluster data needs an urgent power supply upgrade. During the data recovery we added SSDs to this server, which unexpectedly caused the system to report that its power usage could now theoretically exceed the capacity of its power supplies (even though in practice it does not).
We will be replacing the power supplies whilst the server is in use, as we believe this will not be disruptive. There is, however, a chance of a server reboot, which would cause some temporary disruption to the GPU cluster.