- Resolved
We consider this incident resolved (albeit with temporary solutions); please contact service-desk@cst.cam.ac.uk if you are aware of any ongoing problems. Thanks again for your patience during this highly disruptive outage.
- Update
All copies of libcusparse, the one remaining corrupted file, have been restored from matching published versions (PyPI, NVIDIA, Conda). We therefore believe no data has been lost in this filesystem corruption incident. If you think you may be missing any data, or are still unable to use your VM or access your home directory, please contact service-desk@cst.cam.ac.uk as soon as possible.
Further work will be needed to reinforce the resilience of the temporary storage servers, and then to migrate storage from them to a new permanent storage server once it arrives. We will be in contact about that in due course.
- Monitoring
Home directory storage has been migrated to a new server and should be writable again. dev-gpu-1 and dev-cpu-1 are connected to the new storage. If your VM is currently running, it may still be using the old server (you may still see a read-only home directory or other problems), but it can be reconnected to the new storage.
If your VM's home directory is still read-only: run "sudo cl-update-system" on your VM, wait for it to complete, then reboot the VM.
If your VM's home directory appears empty: don't panic -- your data still exists, but your VM has not mounted the right storage. Run "sudo cl-update-system" on your VM, wait for it to complete, then reboot the VM. If the home directory still appears empty, email service-desk@cst.cam.ac.uk: your VM's configuration probably just needs updating to point at the new storage, and our scripts to do this automatically did not work on your VM for some reason, but we can fix it for you.
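The two recovery steps above can be sketched as a short script (a hedged sketch: it assumes the cl-update-system tool referred to above is installed on your VM):

```shell
# Sketch of the recovery steps above; run on the affected VM.
# Assumes the cl-update-system tool mentioned in this update is installed.
if command -v cl-update-system >/dev/null 2>&1; then
    sudo cl-update-system   # fetch the updated configuration pointing at the new storage
    sudo reboot             # reboot so the home directory is remounted from the new server
else
    echo "cl-update-system not found: run this on your dev-gpu/dev-cpu VM" >&2
fi
```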
Some users are missing a file named "libcusparse.so.12" or "libcusparse.so.12.3.1.170" (generally within a Conda environment or Python virtualenv). You can simply recreate this file in the same way that you originally installed it, e.g. by reinstalling cuSPARSE. We will also be restoring most of these files soon from a pristine NVIDIA/PyPI copy wherever possible.
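For example (a hedged sketch -- these are the usual NVIDIA distribution channels; match the package version to your environment):

```shell
# Inside the affected Python virtualenv (pip-managed environment);
# the version shown matches the missing libcusparse.so.12.3.1.170:
pip install --force-reinstall nvidia-cusparse-cu12==12.3.1.170

# Or inside the affected Conda environment:
conda install -c nvidia libcusparse
```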
- Update
Home directories on the GPU cluster are being switched across to an alternate server. Avoid rebooting/starting VMs, for the moment. The shared servers dev-gpu-1/dev-cpu-1 will be intermittently available whilst they are reconfigured. If you see an empty home directory, don't panic.
- Update
Personal dev-gpu/cpu VMs will be paused for a few minutes so that we can reconfigure their temporary storage server to also serve home directories.
- Update
Migration of home directories is ongoing. This is taking longer than anticipated due to software issues on the systems used for data recovery: although we have an almost-complete copy of the data, some time-consuming additional work is needed to bring it into service. Apologies for the ongoing disruption.
- Update
/anfs/gpuscratch is now on an alternate server, and is writable again. If you are still unable to write to it, please try restarting autofs, or contact service-desk@cst.cam.ac.uk and we'll help.
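If /anfs/gpuscratch still appears stale or read-only, a hedged sketch of the autofs restart on a systemd-based VM:

```shell
sudo systemctl restart autofs   # re-read the automount maps
ls /anfs/gpuscratch             # trigger a fresh mount
# Confirm the filesystem is writable again:
touch /anfs/gpuscratch/.write-test && rm /anfs/gpuscratch/.write-test
```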
We have nearly completed relocation of GPU cluster home directories (and /anfs/gpucluster) to an alternate server and will post another update later this afternoon.
- Update
VM disks have been moved to an alternate storage server; VMs should now be working as usual and you are welcome to start them using Xen Orchestra.
Home directories on the GPU VM cluster are still read-only, for now. These are being migrated off the failing server.
- Update
Personal dev-gpu/cpu VMs will now be shut down in order to complete the transfer of virtual disk storage to an alternate server. We hope that VMs will be available again within a few hours.
Home directories will remain accessible, read-only, via dev-cpu-1 and dev-gpu-1.
- Update
Migration of VM disks will not take place this evening, as an ongoing copy of the recovered data to a suitable server will not finish until tomorrow at the earliest.
dev-gpu-1 and dev-cpu-1's disks (system and scratch) have been migrated to alternative storage, so these shared servers should be a bit more stable now.
Please leave your personal VM shut down if at all possible. If you need to copy data elsewhere, use dev-cpu-1, which should have access to the same home directory in most cases.
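If you do need to copy data elsewhere, a minimal sketch (the destination host and both paths are placeholders -- substitute your own backup location):

```shell
# Run on dev-cpu-1; copies a project directory to another machine over SSH.
# "backup-host" and both paths below are placeholders.
rsync -avP ~/important-project/ backup-host:backups/important-project/
```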
Home directories on the whole GPU cluster remain unstable, and may be gradually corrupting further over time. Home directories have been made read-only in an attempt to reduce further corruption.
At the time of writing only one file in users' home directories is known to be corrupted/unreadable -- specifically, any copy of libcusparse.so.12.3.1.170 in any location. (Every copy of that file points to the same data on disk due to filesystem deduplication, and that data is corrupted.) We expect to be able to recover this file in due course.
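To check whether any of your environments contain a copy of the affected file, a sketch:

```shell
# List every copy of the corrupted library under your home directory;
# each hit is an environment that will need the file restored or reinstalled.
find ~ -name 'libcusparse.so.12.3.1.170' 2>/dev/null
```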
We are in the process of copying home directories to another temporary server so that further corruption does not happen and a stable service can be restored pending arrival of new hardware for a long-term fix.
- Update
It is likely that dev-gpu/cpu VMs' disks will be reverted to the state they were in at midday today, 2024-11-18, as we have a seemingly intact snapshot from that time on a disaster recovery server. If you keep using your VM, which is not advised, then any changes made to its local filesystem will probably be rolled back.
VMs will be shut down this evening in order to switch to an alternate storage system.
- Update
svr-compilers0 has been moved to temporary alternate hardware and storage, so is no longer affected by this outage. The outage affects dev-gpu-*, dev-cpu-*, /anfs/gpucluster and /anfs/gpuscratch.
Storage is currently available but is unstable, with some files unreadable. We are continuing to work on migrating affected data (VMs first, then home directories) off the affected server and on to temporary hardware, pending arrival of a replacement storage server.
- Identified
The NFS service locked up this morning; we are rebooting the server.
We plan to move the GPU VM disk storage service onto disaster-recovery hardware later today, in order to try to keep VMs stable even if home directories are not. However, performance will be reduced.
We are considering options for relocation of the GPU cluster home directory service onto alternate hardware.
- Monitoring
Storage is available again; however, the current situation is fragile:
* A very small number of files in home directories on the GPU cluster are unreadable due to corruption
* Some filesystem safety features have been disabled
* There is a chance that the server may spontaneously reboot and/or that more corruption will occur, causing further disruption
* Performance may be slightly degraded
* More disruptive maintenance will be needed soon, as the current solution is temporary
If you have important data on the GPU cluster, you are reminded to take your own backups of this data on another system.
- Identified
Further urgent maintenance is needed to investigate the filesystem fault, which seems to be getting worse. Running VMs have been paused.
- Monitoring
Some VMs on the cluster may fail to start due to filesystem issues. You may see an "(initramfs)" prompt on the console. If your VM does not start, contact service-desk@cst.cam.ac.uk.
- Identified
A filesystem (ZFS) malfunction on the GPU cluster storage server was detected last night and is being investigated. The data is believed to be almost completely intact and available, but there is some minor corruption, which we hope affects only archived data belonging to users who have left the department. However, while investigating this issue (running "zpool scrub") the server froze for a few minutes, leading to timeouts on virtual machines' disks.
VMs on this cluster (mostly named dev-cpu-* / dev-gpu-*) will have failed. Shared VMs have been rebooted. Personal VMs will be shut down and can be restarted from Xen Orchestra.
At this stage we cannot rule out further disruption, unfortunately.
- Investigating
We are investigating a fault with storage on the GPU cluster, which has caused some GPU virtual machines including the shared servers dev-gpu-1 and dev-cpu-1 to fail.