University of Cambridge Computer Laboratory - GPU cluster storage maintenance – Maintenance details

GPU cluster storage maintenance

Completed
Scheduled for 16 July, 2024 at 16:00 – 18:02

Affects

Virtual Machine Hosting

Under maintenance from 16:00 to 18:02

GPUs

Under maintenance from 16:00 to 18:02

Updates
  • Completed
    16 July, 2024 at 18:02

    This maintenance has been completed. Personal VMs can be started via Xen Orchestra (https://xo.cl.cam.ac.uk/) as needed.

  • Update
    16 July, 2024 at 17:36
    In progress

    The outage has overrun due to a problem encountered during the storage server's RAM upgrade. Progress is being made; we can still upgrade the RAM and restore service, just not in the way we expected to. Current estimate for restoration of service: 19:00-19:15.

  • Update
    16 July, 2024 at 16:54
    In progress

    This work is ongoing and is likely to overrun due to an unexpected hardware problem.

  • In progress
    16 July, 2024 at 16:00

    Maintenance is now in progress.

  • Update
    16 July, 2024 at 16:00
    Planned

    The server that hosts storage for the departmental GPU cluster needs an urgent security update and a reboot.

    This requires shutting down all GPU and CPU development VMs (dev-gpu-* and dev-cpu-*), including the shared servers dev-gpu-1 and dev-cpu-1. The VMs' disks and the associated data directories (GPU home directories and the shared "gpuscratch" space) will be unavailable for about half an hour. Because shutting down and restarting the VM infrastructure takes additional time, the VMs themselves will be unavailable for longer: approximately an hour.

    We will also take the opportunity to add RAM to the storage server to improve performance.

  • Planned
    16 July, 2024 at 11:00

    Reminder: this maintenance is taking place at 17:00 today and will require all dev-gpu-* and dev-cpu-* VMs to be shut down.