University of Cambridge Computer Laboratory - Notice history


Caelum Console (server management) - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

Request Tracker - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

Other Internal Services - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

External Services - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

Network - Operational

Uptime: Oct 2024 · 99.95%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

GN09 - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

WCDC - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

Main VM Pool (WCDC) - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

GPUs - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 98.48%, Dec 2024 · 99.02%

Secondary VM Hosts - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

Xen Orchestra - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

Filer - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

Archive Server - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

Data Replication - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 100.0%

Other Secondary Storage Systems - Operational

Uptime: Oct 2024 · 100.0%, Nov 2024 · 100.0%, Dec 2024 · 99.93%

Third Party: Fastmail → General Availability - Operational

Third Party: Fastmail → Mail delivery - Operational

Third Party: Fastmail → Web client and mobile app - Operational

Third Party: Fastmail → Mail access (IMAP/POP) - Operational

Third Party: Fastmail → Login & sessions - Operational

Third Party: Fastmail → Contacts (CardDAV) - Operational

Notice history

Dec 2024

GPU cluster storage fault
  • Resolved
    This incident has been resolved.
  • Update

    Personal dev-gpu / dev-cpu VMs can now be started via Xen Orchestra.

    Some VMs may need some maintenance in order to start:

    • VMs that were running during the incident may have unclean filesystems that need a repair. Generally you will see the boot process end with "(initramfs)" on the console. Contact service-desk@cst.cam.ac.uk for help.

    • VMs that have not been booted for a long time may need a manual update to /etc/fstab. If your VM appears to start but you have no home directory or your home directory is read-only, either run "sudo cl-update-system" then reboot, or contact service-desk@cst.cam.ac.uk for help.

    The shared servers dev-gpu-1 and dev-cpu-1 will be unavailable for a little while longer.

  • Update

    Access to GPU cluster home directories and scratch space has been restored using the new storage server; these are accessible from Lab-managed Linux systems outside the GPU VM cluster via /anfs/gpucluster/$USER and /anfs/gpuscratch respectively. You can access this data via SSH to slogin.cl.cam.ac.uk.

    dev-gpu-acs will be available shortly, for ACS students' use only.

    GPU/CPU development VMs and the shared servers dev-gpu-1 and dev-cpu-1 remain unavailable; copying their VM disks will take longer. They should be restored to service later this evening.

  • Update

    Please do not attempt to start or stop any dev-gpu or dev-cpu VM at this time. It won't be successful, and might cause your VM to get into a more broken state.

  • Identified

    As the GPU cluster is currently unusable anyway due to a fault with the temporary storage server, and we have a replacement storage server ready to go into service, we will take this opportunity to migrate data to the new server. This may take a few hours.

    We believe that no data has been lost. The temporary storage is functioning, but the NFS service is not.

  • Update

    This issue is now also affecting clients which already have the filesystem mounted. They may see a permission error. Most dev-gpu/dev-cpu VMs have probably frozen as they can no longer access their disks.

  • Investigating

    We are investigating a problem whereby dev-gpu/dev-cpu home directories are failing to mount. The likely symptom is that GPU VMs will hang during boot, but VMs that are already running will keep working. Also, access to 'gpuscratch' paths may cause the client system to lock up.

    This is due to a suspected Linux kernel bug on a storage server.

    It is possible that some disruption will occur whilst we try to fix this.

Nov 2024

Urgent storage server maintenance
  • Completed
    November 29, 2024 at 3:18 PM

    This maintenance has been completed. Some dev-gpu/dev-cpu VMs may need rebooting if they lost write access to their filesystems. Please contact service-desk@cst.cam.ac.uk if you encounter any problems.

  • Update
    November 29, 2024 at 3:04 PM

    We now think that a storage server reboot will be required. VMs will be paused for a short while. Home directory and gpuscratch access from shared servers will be interrupted for a few minutes.

  • In progress
    November 29, 2024 at 2:00 PM
    Maintenance is now in progress
  • Planned
    November 29, 2024 at 2:00 PM

    In the aftermath of the GPU cluster storage incident on 17 November, one of the servers that we are using as a temporary host for GPU cluster data needs an urgent power supply upgrade. During the data recovery we added additional SSDs to this server, which unexpectedly caused the system to report that its power usage could now theoretically exceed the capacity of its power supplies (even though in practice it does not).

    We will be replacing the power supplies whilst the server is in use, as we believe this will not be disruptive, but there is a chance of a server reboot which would cause some temporary disruption to the GPU cluster.

GPU cluster storage fault
  • Resolved

    We consider this incident resolved (albeit with temporary solutions); please contact service-desk@cst.cam.ac.uk if you are aware of any ongoing problems. Thanks again for your patience during this highly disruptive outage.

  • Update

    All copies of libcusparse, the one remaining corrupted file, have been restored from matching published versions (PyPI; NVIDIA; Conda). Therefore, we believe no data has been lost during this filesystem corruption incident. If you think you may be missing any data, or are still unable to use your VM or access your home directory, please contact service-desk@cst.cam.ac.uk as soon as possible.

    Further work will be needed in due course to reinforce the resilience of the temporary storage servers, and then to migrate storage from its temporary servers to a new permanent storage server once that arrives. We will be in contact about that in due course.

  • Monitoring

    Home directory storage has been migrated to a new server, and should be writable again. dev-gpu-1 and dev-cpu-1 are connected to the new storage. If your VM is currently running it may still be using the old server, i.e. you may still see a read-only home directory or other problems, but it can be connected to the new storage.

    If your VM's home directory is still read-only: run "sudo cl-update-system" on your VM, wait for it to complete, then reboot the VM.

    If your VM's home directory appears empty: don't panic, your data still exists but your VM has not mounted the right storage. Run "sudo cl-update-system" on your VM, wait for it to complete, then reboot the VM. If the home directory still appears empty, email service-desk@cst.cam.ac.uk: your VM's configuration probably just needs updating to point at the new storage, and our automated scripts evidently didn't manage this on your VM, but we can fix it for you.

    Some users are missing a file named "libcusparse.so.12" or "libcusparse.so.12.3.1.170" (generally within a Conda environment or Python virtualenv). You can regenerate this file in the same way you originally created it, e.g. by reinstalling cusparse. We'll be restoring most of these files soon from a pristine NVIDIA/PyPI copy wherever possible.
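If you want to check the read-only symptom described above for yourself, the mount state can be inspected programmatically. Below is a minimal sketch in Python that parses /proc/mounts-format entries; the sample line, server name, and paths are illustrative placeholders, not your actual configuration:

```python
# Detect whether a mount point is mounted read-only by parsing
# /proc/mounts-format lines: device mountpoint fstype options dump pass.
def is_readonly(mountpoint: str, mounts_text: str) -> bool:
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[1] == mountpoint:
            return "ro" in parts[3].split(",")
    return False

# Illustrative sample entry (server name and path are placeholders):
sample = "filer:/homes/user /home/user nfs ro,vers=3,addr=10.0.0.1 0 0"
print(is_readonly("/home/user", sample))  # True: mounted read-only
```

On a live VM you would pass the contents of open("/proc/mounts").read() instead of the sample string.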

  • Update

    Home directories on the GPU cluster are being switched across to an alternate server. Avoid rebooting/starting VMs, for the moment. The shared servers dev-gpu-1/dev-cpu-1 will be intermittently available whilst they are reconfigured. If you see an empty home directory, don't panic.

  • Update

    Personal dev-gpu/cpu VMs will be paused for a few minutes in order to make some changes to their temporary storage server, needed in order to have it also serve home directories.

  • Update

    Migration of home directories is ongoing. This is taking longer than anticipated due to a combination of software issues on the systems used for data recovery; even though we have an almost-complete copy of the data, some time-consuming additional work is needed in order to bring that into service. Apologies for the ongoing disruption.

  • Update

    /anfs/gpuscratch is now on an alternate server, and is writable again. If you are still unable to write to it, please try restarting autofs, or contact service-desk@cst.cam.ac.uk and we'll help.

    We have nearly completed relocation of GPU cluster home directories (and /anfs/gpucluster) to an alternate server and will post another update later this afternoon.

  • Update

    VM disks have been moved to an alternate storage server; VMs should now be working as usual and you are welcome to start them using Xen Orchestra.

    Home directories on the GPU VM cluster are still read-only, for now. These are being migrated off the failing server.

  • Update

    Personal dev-gpu/cpu VMs will now be shut down in order to complete the transfer of virtual disk storage to an alternate server. We hope that VMs will be available again within a few hours.

    Home directories will remain accessible, read-only, via dev-cpu-1 and dev-gpu-1.

  • Update

    Migration of VM disks will not take place this evening as an ongoing copy of the recovered data to a suitable server will not finish until tomorrow at the soonest.

    dev-gpu-1 and dev-cpu-1's disks (system and scratch) have been migrated to alternative storage, so these shared servers should be a bit more stable now.

    Please leave your personal VM shut down if at all possible. If you need to copy data elsewhere, use dev-cpu-1, which should have access to the same home directory in most cases.

    Home directories on the whole GPU cluster remain unstable, and may be gradually corrupting further over time. Home directories have been made read-only in an attempt to reduce further corruption.

    At the time of writing only one file in users' home directories is known to be corrupted/unreadable -- specifically, any copy of libcusparse.so.12.3.1.170 in any location. (Any copy of that file is actually a pointer to the same data on disk due to filesystem deduplication, and that data is corrupted.) We expect to be able to recover this file in due course.

    We are in the process of copying home directories to another temporary server so that further corruption does not happen and a stable service can be restored pending arrival of new hardware for a long-term fix.
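The deduplication effect described above (every copy of the file pointing at one block of data on disk) can be illustrated with a toy content-addressed store. This is a conceptual sketch only, not how ZFS is actually implemented:

```python
# Toy block-level deduplication: identical content is stored once, keyed by
# its hash, so corrupting the single stored block breaks every "copy".
import hashlib

store = {}  # content hash -> stored data block

def write_file(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store.setdefault(key, data)   # identical write dedups to the same block
    return key                    # each file keeps only this pointer

a = write_file(b"libcusparse contents")
b = write_file(b"libcusparse contents")  # a "second copy" of the same file
assert a == b                            # both point at one stored block

store[a] = b"CORRUPTED"                  # corrupt the shared block once...
print(store[a] == store[b])              # True: ...and every copy is affected
```

This is why a single on-disk fault made every copy of libcusparse unreadable at once, and why restoring one pristine copy of the data fixes them all.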

  • Update

    It is likely that dev-gpu/cpu VMs' disks will be reverted to the state they were in at midday today, 2024-11-18, as we have a seemingly intact snapshot from that time on a disaster recovery server. If you keep using your VM, which is not advised, then any changes made to its local filesystem will probably be rolled back.

    VMs will be shut down this evening in order to switch to an alternate storage system.

  • Update

    svr-compilers0 has been moved to temporary alternate hardware and storage, so is no longer affected by this outage. The outage affects dev-gpu-*, dev-cpu-*, /anfs/gpucluster and /anfs/gpuscratch.

    Storage is currently available but is unstable, with some files unreadable. We are continuing to work on migrating affected data (VMs first, then home directories) off the affected server and on to temporary hardware, pending arrival of a replacement storage server.

  • Identified

    The NFS service locked up this morning; we are rebooting the server.

    We plan to move the GPU VM disk storage service onto disaster-recovery hardware later today, in order to try to keep VMs stable even if home directories are not. However the performance will be reduced.

    We are considering options for relocation of the GPU cluster home directory service onto alternate hardware.

  • Monitoring

    Storage is available again; however the current situation is fragile:

    • A very small number of files in home directories on the GPU cluster are unreadable due to corruption

    • Some filesystem safety features have been disabled

    • There is a chance that the server may spontaneously reboot and/or that more corruption will occur, causing further disruption

    • Performance may be slightly degraded

    • More disruptive maintenance will be needed soon as the current solution is temporary

    If you have important data on the GPU cluster, you are reminded to take your own backups of this data on another system.

  • Identified

    Further urgent maintenance is needed to investigate the filesystem fault, which seems to be getting worse. Running VMs have been paused.

  • Monitoring

    Some VMs on the cluster may fail to start due to filesystem issues. You may see an "(initramfs)" prompt on the console. If your VM does not start, contact service-desk@cst.cam.ac.uk.

  • Identified

    A filesystem (ZFS) malfunction on the GPU cluster storage server was detected last night and is being investigated. The data is believed to be almost completely intact and available but there is some minor corruption which is hopefully only affecting archived data belonging to users who have left the department. However, when investigating this issue (running "zpool scrub") the server froze for a few minutes, leading to timeouts on virtual machines' disks.

    VMs on this cluster (mostly named dev-cpu-* / dev-gpu-*) will have failed. Shared VMs have been rebooted. Personal VMs will be shut down and can be restarted from Xen Orchestra.

    At this stage we cannot rule out further disruption, unfortunately.

  • Investigating

    We are investigating a fault with storage on the GPU cluster, which has caused some GPU virtual machines including the shared servers dev-gpu-1 and dev-cpu-1 to fail.

Oct 2024 to Dec 2024
