University of Cambridge Computer Laboratory - GPU cluster storage maintenance – Maintenance details

GPUs experiencing partial outage

GPU cluster storage maintenance

Completed
Scheduled for July 16, 2024 at 4:00 PM – 6:02 PM

Affects

Virtual Machine Hosting

Under maintenance from 4:00 PM to 6:02 PM

GPUs

Under maintenance from 4:00 PM to 6:02 PM

Updates
  • Completed
    July 16, 2024 at 6:02 PM

    This maintenance has been completed. Personal VMs can be started via Xen Orchestra (https://xo.cl.cam.ac.uk/) as needed.

  • Update
    July 16, 2024 at 5:36 PM
    In progress

    The outage has overrun due to a problem encountered during the storage server's RAM upgrade. Progress is being made; we can still upgrade the RAM and restore service, just not in the way we expected to. Current estimate for restoration of service: 19:00-19:15.

  • Update
    July 16, 2024 at 4:54 PM
    In progress

    This work is ongoing and is likely to overrun due to an unexpected hardware problem.

  • In progress
    July 16, 2024 at 4:00 PM

    Maintenance is now in progress.

  • Update
    July 16, 2024 at 4:00 PM
    Planned

    The server that hosts storage for the departmental GPU cluster needs an urgent security update and a reboot.

    This will necessitate shutting down all GPU and CPU development VMs (dev-gpu-* and dev-cpu-*), including the shared servers dev-gpu-1 and dev-cpu-1. These VMs' disks and the associated data directories (GPU home directories and the shared "gpuscratch" space) will be unavailable for about half an hour. Because shutting down and restarting the VM infrastructure takes additional time, the VMs themselves will be unavailable for longer: approximately an hour.

    We will take the opportunity to add RAM to the storage server too, to improve performance.

  • Planned
    July 16, 2024 at 11:00 AM

    Reminder: this maintenance is taking place at 17:00 today and will require all dev-gpu-* and dev-cpu-* VMs to be shut down.