Caelum Console (server management) - Operational
Request Tracker - Operational
Other Internal Services - Operational
External Services - Operational
Network - Operational
GN09 - Operational
WCDC - Operational
Main VM Pool (WCDC) - Operational
GPUs - Operational
Secondary VM Hosts - Operational
Xen Orchestra - Operational
Filer - Operational
Archive Server - Operational
Data Replication - Operational
Other Secondary Storage Systems - Operational
Notice history
Dec 2024
- Resolved
This incident has been resolved.
- Update
Personal dev-gpu / dev-cpu VMs can now be started via Xen Orchestra.
Some VMs may need some maintenance in order to start:
VMs that were running during the incident may have unclean filesystems that need a repair. Generally you will see the boot process end with "(initramfs)" on the console. Contact service-desk@cst.cam.ac.uk for help.
VMs that have not been booted for a long time may need a manual update to /etc/fstab. If your VM appears to start but you have no home directory or your home directory is read-only, either run "sudo cl-update-system" then reboot, or contact service-desk@cst.cam.ac.uk for help.
The shared servers dev-gpu-1 and dev-cpu-1 will be unavailable for a little while longer.
- Update
Access to GPU cluster home directories and scratch space has been restored using the new storage server; these are accessible from Lab-managed Linux systems outside the GPU VM cluster via /anfs/gpucluster/$USER and /anfs/gpuscratch respectively. You can access this data via SSH to slogin.cl.cam.ac.uk.
dev-gpu-acs will be available shortly, for ACS students' use only.
GPU/CPU development VMs and the shared servers dev-gpu-1 and dev-cpu-1 remain unavailable; copying their VM disks will take longer. They should be restored to service later this evening.
- Update
Please do not attempt to start or stop any dev-gpu or dev-cpu VM at this time. It won't be successful, and might cause your VM to get into a more broken state.
- Identified
As the GPU cluster is currently unusable anyway due to a fault with the temporary storage server, and we have a replacement storage server ready to go into service, we will take this opportunity to migrate data to the new server. This may take a few hours.
We believe that no data has been lost. The temporary storage is functioning, but the NFS service is not.
- Update
This issue is now also affecting clients which already have the filesystem mounted. They may see a permission error. Most dev-gpu/dev-cpu VMs have probably frozen as they can no longer access their disks.
- Investigating
We are investigating a problem whereby dev-gpu/dev-cpu home directories are failing to mount. The likely symptom is that GPU VMs will hang during boot, but VMs that are already running will keep working. Also, access to 'gpuscratch' paths may cause the client system to lock up.
This is due to a suspected Linux kernel bug on a storage server.
It is possible that some disruption will occur whilst we try to fix this.
Nov 2024
- Completed, November 29, 2024 at 3:18 PM
This maintenance has been completed. Some dev-gpu/dev-cpu VMs may need rebooting if they lost write access to their filesystems. Please contact service-desk@cst.cam.ac.uk if you encounter any problems.
- Update (in progress), November 29, 2024 at 3:04 PM
We now think that a storage server reboot will be required. VMs will be paused for a short while. Home directory and gpuscratch access from shared servers will be interrupted for a few minutes.
- In progress, November 29, 2024 at 2:00 PM
Maintenance is now in progress.
- Planned, November 29, 2024 at 2:00 PM
In the aftermath of the GPU cluster storage incident on 17 November, one of the servers that we are using as a temporary host for GPU cluster data needs an urgent power supply upgrade. During the data recovery we added additional SSDs to this server, which unexpectedly caused the system to report that its power usage could now theoretically exceed the capacity of its power supplies (even though in practice it does not).
We will be replacing the power supplies whilst the server is in use, as we believe this will not be disruptive, but there is a chance of a server reboot which would cause some temporary disruption to the GPU cluster.
- Resolved
We consider this incident resolved (albeit with temporary solutions); please contact service-desk@cst.cam.ac.uk if you are aware of any ongoing problems. Thanks again for your patience during this highly disruptive outage.
- Update
All copies of libcusparse, the one remaining corrupted file, have been restored from matching published versions (PyPI, NVIDIA, Conda). We therefore believe that no data has been lost during this filesystem corruption incident. If you think you may be missing any data, or are still unable to use your VM or access your home directory, please contact service-desk@cst.cam.ac.uk as soon as possible.
Further work will be needed in due course to reinforce the resilience of the temporary storage servers, and then to migrate storage from its temporary servers to a new permanent storage server once that arrives. We will be in contact about that in due course.
- Monitoring
Home directory storage has been migrated to a new server, and should be writable again. dev-gpu-1 and dev-cpu-1 are connected to the new storage. If your VM is currently running it may still be using the old server, i.e. you may still see a read-only home directory or other problems, but it can be connected to the new storage.
If your VM's home directory is still read-only: run "sudo cl-update-system" on your VM, wait for it to complete, then reboot the VM.
If your VM's home directory appears empty: don't panic, your data still exists but your VM has not mounted the right storage. Run "sudo cl-update-system" on your VM, wait for it to complete, then reboot the VM; if it still appears empty, email service-desk@cst.cam.ac.uk -- your VM's configuration probably just needs updating to point at the new storage, and our scripts to automatically do this for you didn't work on your VM for some reason, but we can fix it for you.
Some users are missing a file named "libcusparse.so.12" or "libcusparse.so.12.3.1.170" (generally within a Conda environment or Python virtualenv). You can simply download or recompile this file in the same way that you originally obtained it, e.g. by reinstalling cusparse. We'll be restoring most of these files soon from a pristine NVIDIA/PyPI copy wherever possible.
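The two home directory symptoms described above (read-only, or apparently empty) can be checked from inside a VM with a short script. The following is only a minimal sketch, not an official Lab tool: the function name `home_status` and its return values are made up for illustration, and it assumes that a read-only home directory shows up as a read-only mount flag on the client, which may not hold if write access is being refused on the server side only. An "empty" result is also expected for a genuinely new account.

```python
import os
from pathlib import Path


def home_status(home: Path) -> str:
    """Classify a directory as 'missing', 'read-only', 'empty' or 'ok'.

    Hypothetical helper for illustration only.
    """
    if not home.is_dir():
        return "missing"
    # ST_RDONLY is set in f_flag when the filesystem is mounted read-only.
    if os.statvfs(home).f_flag & os.ST_RDONLY:
        return "read-only"
    # An empty directory suggests the VM has not mounted the right storage.
    if not any(home.iterdir()):
        return "empty"
    return "ok"


if __name__ == "__main__":
    status = home_status(Path.home())
    print(f"{Path.home()}: {status}")
    if status in ("read-only", "empty"):
        print('Suggested next step: run "sudo cl-update-system", '
              "wait for it to complete, then reboot the VM.")
```

If the script still reports "empty" or "read-only" after the update and reboot, email service-desk@cst.cam.ac.uk as described above.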
- Update
Home directories on the GPU cluster are being switched across to an alternate server. Avoid rebooting/starting VMs, for the moment. The shared servers dev-gpu-1/dev-cpu-1 will be intermittently available whilst they are reconfigured. If you see an empty home directory, don't panic.
- Update
Personal dev-gpu/cpu VMs will be paused for a few minutes in order to make some changes to their temporary storage server, needed in order to have it also serve home directories.
- Update
Migration of home directories is ongoing. This is taking longer than anticipated due to a combination of software issues on the systems used for data recovery; even though we have an almost-complete copy of the data, some time-consuming additional work is needed in order to bring that into service. Apologies for the ongoing disruption.
- Update
/anfs/gpuscratch is now on an alternate server, and is writable again. If you are still unable to write to it, please try restarting autofs, or contact service-desk@cst.cam.ac.uk and we'll help.
We have nearly completed relocation of GPU cluster home directories (and /anfs/gpucluster) to an alternate server and will post another update later this afternoon.
- Update
VM disks have been moved to an alternate storage server; VMs should now be working as usual and you are welcome to start them using Xen Orchestra.
Home directories on the GPU VM cluster are still read-only, for now. These are being migrated off the failing server.
- Update
Personal dev-gpu/cpu VMs will now be shut down in order to complete the transfer of virtual disk storage to an alternate server. We hope that VMs will be available again within a few hours.
Home directories will remain accessible, read-only, via dev-cpu-1 and dev-gpu-1.
- Update
Migration of VM disks will not take place this evening as an ongoing copy of the recovered data to a suitable server will not finish until tomorrow at the soonest.
dev-gpu-1 and dev-cpu-1's disks (system and scratch) have been migrated to alternative storage, so these shared servers should be a bit more stable now.
Please leave your personal VM shut down if at all possible. If you need to copy data elsewhere, use dev-cpu-1, which should have access to the same home directory in most cases.
Home directories on the whole GPU cluster remain unstable, and may be gradually corrupting further over time. Home directories have been made read-only in an attempt to reduce further corruption.
At the time of writing, only one file in users' home directories is known to be corrupted/unreadable -- specifically, any copy of libcusparse.so.12.3.1.170 in any location. (Any copy of that file is actually a pointer to the same data on disk due to filesystem deduplication, and that data is corrupted.) We expect to be able to recover this file in due course.
We are in the process of copying home directories to another temporary server so that further corruption does not happen and a stable service can be restored pending arrival of new hardware for a long-term fix.
- Update
It is likely that dev-gpu/cpu VMs' disks will be reverted to the state they were in at midday today, 2024-11-18, as we have a seemingly intact snapshot from that time on a disaster recovery server. If you keep using your VM, which is not advised, then any changes made to its local filesystem will probably be rolled back.
VMs will be shut down this evening in order to switch to an alternate storage system.
- Update
svr-compilers0 has been moved to temporary alternate hardware and storage, so is no longer affected by this outage. The outage affects dev-gpu-*, dev-cpu-*, /anfs/gpucluster and /anfs/gpuscratch.
Storage is currently available but is unstable, with some files unreadable. We are continuing to work on migrating affected data (VMs first, then home directories) off the affected server and on to temporary hardware, pending arrival of a replacement storage server.
- Identified
The NFS service locked up this morning; we are rebooting the server.
We plan to move the GPU VM disk storage service onto disaster-recovery hardware later today, in order to try to keep VMs stable even if home directories are not. However, performance will be reduced.
We are considering options for relocation of the GPU cluster home directory service onto alternate hardware.
- Monitoring
Storage is available again; however the current situation is fragile:
A very small number of files in home directories on the GPU cluster are unreadable due to corruption
Some filesystem safety features have been disabled
There is a chance that the server may spontaneously reboot and/or that more corruption will occur, causing further disruption
Performance may be slightly degraded
More disruptive maintenance will be needed soon as the current solution is temporary
If you have important data on the GPU cluster, you are reminded to take your own backups of this data on another system.
- Identified
Further urgent maintenance is needed to investigate the filesystem fault, which seems to be getting worse. Running VMs have been paused.
- Monitoring
Some VMs on the cluster may fail to start due to filesystem issues. You may see an "(initramfs)" prompt on the console. If your VM does not start, contact service-desk@cst.cam.ac.uk.
- Identified
A filesystem (ZFS) malfunction on the GPU cluster storage server was detected last night and is being investigated. The data is believed to be almost completely intact and available but there is some minor corruption which is hopefully only affecting archived data belonging to users who have left the department. However, when investigating this issue (running "zpool scrub") the server froze for a few minutes, leading to timeouts on virtual machines' disks.
VMs on this cluster (mostly named dev-cpu-* / dev-gpu-*) will have failed. Shared VMs have been rebooted. Personal VMs will be shut down and can be restarted from Xen Orchestra.
At this stage we cannot rule out further disruption, unfortunately.
- Investigating
We are investigating a fault with storage on the GPU cluster, which has caused some GPU virtual machines including the shared servers dev-gpu-1 and dev-cpu-1 to fail.