University of Cambridge Computer Laboratory - Status Page

GPU cluster storage fault
  • Update

    Migration of home directories is ongoing. This is taking longer than anticipated due to software issues on the systems used for data recovery; although we have an almost-complete copy of the data, some time-consuming additional work is needed to bring it into service. Apologies for the ongoing disruption.

  • Update

    /anfs/gpuscratch is now on an alternate server, and is writable again. If you are still unable to write to it, please try restarting autofs, or contact service-desk@cst.cam.ac.uk and we'll help.
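
    If you are unsure how to restart autofs, a minimal sketch follows, assuming a systemd-based machine and root access (e.g. via sudo); the unit name "autofs" is the usual default but has not been confirmed for this setup:

        #!/usr/bin/env python3
        # Restart the autofs service so stale automounts of /anfs/gpuscratch
        # are dropped and re-mounted from the alternate server. Assumes
        # systemd and root privileges; the "autofs" unit name is an assumption.
        import subprocess

        subprocess.run(["systemctl", "restart", "autofs"], check=True)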

    We have nearly completed the relocation of GPU cluster home directories (and /anfs/gpucluster) to an alternate server and will post another update later this afternoon.

  • Update

    VM disks have been moved to an alternate storage server; VMs should now be working as usual and you are welcome to start them using Xen Orchestra.

    Home directories on the GPU VM cluster are still read-only, for now. These are being migrated off the failing server.

  • Update

    Personal dev-gpu/cpu VMs will now be shut down in order to complete the transfer of virtual disk storage to an alternate server. We hope that VMs will be available again within a few hours.

    Home directories will remain accessible, read-only, via dev-cpu-1 and dev-gpu-1.

  • Update

    Migration of VM disks will not take place this evening, as an ongoing copy of the recovered data to a suitable server will not finish until tomorrow at the earliest.

    The disks (system and scratch) of dev-gpu-1 and dev-cpu-1 have been migrated to alternative storage, so these shared servers should be a bit more stable now.

    Please leave your personal VM shut down if at all possible. If you need to copy data elsewhere, use dev-cpu-1, which should have access to the same home directory in most cases.

    Home directories across the whole GPU cluster remain unstable and may be corrupting further over time; they have been made read-only in an attempt to limit further corruption.

    At the time of writing only one file in users' home directories is known to be corrupted/unreadable -- specifically, any copy of libcusparse.so.12.3.1.170 in any location. (Every copy of that file is actually a pointer to the same data on disk due to filesystem deduplication, and that data is corrupted.) We expect to be able to recover this file in due course.
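
    If you want to check whether your own directories contain copies of the affected file, the sketch below walks a tree and prints matches; the file name is taken from this update, while the starting directory is an assumption:

        #!/usr/bin/env python3
        # Walk a directory tree and print every copy of the known-corrupted
        # file named above. Starting from the home directory is an assumption.
        import os

        CORRUPTED_NAME = "libcusparse.so.12.3.1.170"

        for dirpath, _dirs, files in os.walk(os.path.expanduser("~")):
            if CORRUPTED_NAME in files:
                print(os.path.join(dirpath, CORRUPTED_NAME))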

    We are in the process of copying home directories to another temporary server so that no further corruption occurs and a stable service can be restored, pending the arrival of new hardware for a long-term fix.

  • Update

    It is likely that dev-gpu/cpu VMs' disks will be reverted to the state they were in at midday today, 2024-11-18, as we have a seemingly intact snapshot from that time on a disaster recovery server. If you keep using your VM, which is not advised, then any changes made to its local filesystem will probably be rolled back.

    VMs will be shut down this evening in order to switch to an alternate storage system.

  • Update

    svr-compilers0 has been moved to temporary alternate hardware and storage, so is no longer affected by this outage. The outage affects dev-gpu-*, dev-cpu-*, /anfs/gpucluster and /anfs/gpuscratch.

    Storage is currently available but unstable, with some files unreadable. We are continuing to migrate affected data (VMs first, then home directories) off the failing server and onto temporary hardware, pending the arrival of a replacement storage server.

  • Identified

    The NFS service locked up this morning; we are rebooting the server.

    We plan to move the GPU VM disk storage service onto disaster-recovery hardware later today, in order to try to keep VMs stable even if home directories are not. However, performance will be reduced.

    We are considering options for relocation of the GPU cluster home directory service onto alternate hardware.

  • Monitoring

    Storage is available again; however the current situation is fragile:

    • A very small number of files in home directories on the GPU cluster are unreadable due to corruption

    • Some filesystem safety features have been disabled

    • There is a chance that the server may spontaneously reboot and/or that more corruption will occur, causing further disruption

    • Performance may be slightly degraded

    • More disruptive maintenance will be needed soon as the current solution is temporary

    If you have important data on the GPU cluster, you are reminded to take your own backups of this data on another system.
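
    As one possible approach, the sketch below writes a timestamped tar.gz of a directory to storage elsewhere; both paths are placeholders, and archiving with tar is an assumption rather than a recommendation of a specific tool:

        #!/usr/bin/env python3
        # One-shot backup: archive a directory into a timestamped tar.gz on
        # a destination that lives on different storage. Paths are placeholders.
        import datetime
        import tarfile
        from pathlib import Path

        src = Path.home() / "important-project"   # placeholder: data to protect
        dest = Path("/backups/elsewhere")          # placeholder: off-cluster storage
        dest.mkdir(parents=True, exist_ok=True)

        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        archive = dest / f"{src.name}-{stamp}.tar.gz"

        with tarfile.open(archive, "w:gz") as tar:
            tar.add(src, arcname=src.name)
        print(f"Wrote {archive}")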

  • Identified

    Further urgent maintenance is needed to investigate the filesystem fault, which seems to be getting worse. Running VMs have been paused.

  • Monitoring

    Some VMs on the cluster may fail to start due to filesystem issues. You may see an "(initramfs)" prompt on the console. If your VM does not start, contact service-desk@cst.cam.ac.uk.

  • Identified

    A filesystem (ZFS) malfunction on the GPU cluster storage server was detected last night and is being investigated. The data is believed to be almost completely intact and available, but there is some minor corruption, which hopefully affects only archived data belonging to users who have left the department. However, while investigating this issue (running "zpool scrub") the server froze for a few minutes, leading to timeouts on virtual machines' disks.

    VMs on this cluster (mostly named dev-cpu-* / dev-gpu-*) will have failed. Shared VMs have been rebooted. Personal VMs will be shut down and can be restarted from Xen Orchestra.

    At this stage we cannot rule out further disruption, unfortunately.
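
    For context, "zpool scrub" asks ZFS to read and verify the checksum of every block in a pool, which is how latent corruption like the above is surfaced. A minimal sketch of starting a scrub and checking the result, with a placeholder pool name (root access on the storage server itself is required, so this is illustrative only):

        #!/usr/bin/env python3
        # Start a ZFS scrub, then print the pool's status report, which shows
        # scrub progress and any detected errors. "tank" is a placeholder name.
        import subprocess

        POOL = "tank"  # placeholder: the real pool name is not public

        subprocess.run(["zpool", "scrub", POOL], check=True)
        print(subprocess.run(["zpool", "status", POOL],
                             capture_output=True, text=True).stdout)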

  • Investigating

    We are investigating a fault with storage on the GPU cluster, which has caused some GPU virtual machines including the shared servers dev-gpu-1 and dev-cpu-1 to fail.

100% - uptime

Caelum Console (server management) - Operational

100% - uptime

Request Tracker - Operational

100% - uptime

Other Internal Services - Operational

100% - uptime

External Services - Operational

100% - uptime

Network - Operational

100% - uptime

GN09 - Operational

100% - uptime

WCDC - Operational

100% - uptime

Main VM Pool (WCDC) - Operational

100% - uptime

GPUs - Under maintenance

100% - uptime

Secondary VM Hosts - Operational

100% - uptime

Xen Orchestra - Operational

100% - uptime

Filer - Operational

100% - uptime

Archive Server - Operational

100% - uptime

Data Replication - Operational

100% - uptime

Other Secondary Storage Systems - Operational

100% - uptime

Third Party: Fastmail → General Availability - Operational

Third Party: Fastmail → Mail delivery - Operational

Third Party: Fastmail → Web client and mobile app - Operational

Third Party: Fastmail → Mail access (IMAP/POP) - Operational

Third Party: Fastmail → Login & sessions - Operational

Third Party: Fastmail → Contacts (CardDAV) - Operational
