University of Cambridge Computer Laboratory – GPU cluster storage fault – Incident details

GPU cluster storage fault

Resolved
Major outage
Started 26 days ago · Lasted about 10 hours

Affected

Virtual Machine Hosting

Partial outage from 4:51 PM to 5:15 PM, Major outage from 5:15 PM to 11:27 PM, Partial outage from 11:27 PM to 2:34 AM

GPUs

Partial outage from 4:51 PM to 5:15 PM, Major outage from 5:15 PM to 11:27 PM, Partial outage from 11:27 PM to 2:34 AM

Data Storage

Partial outage from 4:51 PM to 6:37 PM, Operational from 6:37 PM to 2:34 AM

Other Secondary Storage Systems

Partial outage from 4:51 PM to 6:37 PM, Operational from 6:37 PM to 2:34 AM

Updates
  • Resolved
    This incident has been resolved.
  • Update

    Personal dev-gpu / dev-cpu VMs can now be started via Xen Orchestra.

    Some VMs may need maintenance before they will start:

    • VMs that were running during the incident may have unclean filesystems that need repair. Typically the boot process will end with "(initramfs)" on the console. Contact service-desk@cst.cam.ac.uk for help.

    • VMs that have not been booted for a long time may need a manual update to /etc/fstab. If your VM appears to start but your home directory is missing or read-only, either run "sudo cl-update-system" and then reboot, or contact service-desk@cst.cam.ac.uk for help.
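
    For illustration only, the kind of /etc/fstab change described above might look like the sketch below. The server names, export paths, and mount options here are hypothetical, not the Lab's actual values; "sudo cl-update-system" applies the correct entry automatically, so prefer that over hand-editing.

    ```
    # HYPOTHETICAL example only - real server names, exports and options may differ.
    # Stale entry pointing at the retired temporary storage server (commented out):
    # old-storage.example.cam.ac.uk:/gpucluster/home  /home  nfs  defaults  0  0
    # Replacement entry pointing at the new storage server:
    new-storage.example.cam.ac.uk:/gpucluster/home  /home  nfs  defaults,_netdev  0  0
    ```

    After editing, "sudo mount -a" (or a reboot) picks up the new entry; a wrong entry here is one way a VM can come up with a missing or read-only home directory.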

    The shared servers dev-gpu-1 and dev-cpu-1 will remain unavailable for a while longer.

  • Update

    Access to GPU cluster home directories and scratch space has been restored using the new storage server; these are accessible from Lab-managed Linux systems outside the GPU VM cluster via /anfs/gpucluster/$USER and /anfs/gpuscratch respectively. You can access this data via SSH to slogin.cl.cam.ac.uk.

    dev-gpu-acs will be available shortly, for ACS students' use only.

    GPU/CPU development VMs and the shared servers dev-gpu-1 and dev-cpu-1 remain unavailable, as copying their VM disks will take longer. They should be restored to service later this evening.

  • Update

    Please do not attempt to start or stop any dev-gpu or dev-cpu VM at this time. The attempt will fail, and may leave your VM in a more broken state.

  • Identified

    Since the GPU cluster is currently unusable due to the fault on the temporary storage server, and a replacement storage server is ready to go into service, we will take this opportunity to migrate the data to the new server. This may take a few hours.

    We believe that no data has been lost. The temporary storage is functioning, but the NFS service is not.

  • Update

    This issue is now also affecting clients that already have the filesystem mounted; they may see a permission error. Most dev-gpu/dev-cpu VMs have probably frozen because they can no longer access their disks.

  • Investigating

    We are investigating a problem in which dev-gpu/dev-cpu home directories are failing to mount. The likely symptom is that GPU VMs hang during boot; VMs that are already running will keep working. Access to 'gpuscratch' paths may also cause the client system to lock up.

    This is due to a suspected Linux kernel bug on a storage server.

    Some disruption is possible while we work on a fix.