University of Cambridge Computer Laboratory - GPU cluster storage maintenance – Maintenance details
GPU cluster storage maintenance
Completed
Scheduled for July 16, 2024 at 4:00 PM – 6:02 PM
Affects
Virtual Machine Hosting
Under maintenance from 4:00 PM to 6:02 PM
GPUs
Under maintenance from 4:00 PM to 6:02 PM
Updates
Completed
July 16, 2024 at 6:02 PM
This maintenance has been completed. Personal VMs can be started via Xen Orchestra (https://xo.cl.cam.ac.uk/) as needed.
Update
July 16, 2024 at 5:36 PM
In progress
The outage has overrun due to a problem encountered during the storage server's RAM upgrade. Progress is being made; we can still upgrade the RAM and restore service, just not in the way we expected. Current estimate for restoration of service: 19:00-19:15.
Update
July 16, 2024 at 4:54 PM
In progress
This work is ongoing and is likely to overrun due to an unexpected hardware problem.
In progress
July 16, 2024 at 4:00 PM
Maintenance is now in progress.
Update
July 16, 2024 at 4:00 PM
Planned
The server that hosts storage for the departmental GPU cluster needs an urgent security update and a reboot.
This will necessitate shutting down all GPU and CPU development VMs (dev-gpu-* and dev-cpu-*), including the shared servers dev-gpu-1 and dev-cpu-1. These VMs' disks, as well as the associated data directories (GPU home directories and the shared "gpuscratch" space), will be unavailable for about half an hour. Because shutting down and restarting the VM infrastructure takes time, the VMs themselves will be unavailable for longer: approximately an hour.
We will take the opportunity to add RAM to the storage server too, to improve performance.
Planned
July 16, 2024 at 11:00 AM
Reminder: this maintenance is taking place at 17:00 today and will require all dev-gpu-* and dev-cpu-* VMs to be shut down.