University of Cambridge Computer Laboratory – GPU cluster storage fault – Incident details

GPU cluster storage fault

Resolved
Major outage
Started 26 days ago · Lasted about 10 hours

Affected

Virtual Machine Hosting

Partial outage from 4:51 PM to 5:15 PM, Major outage from 5:15 PM to 11:27 PM, Partial outage from 11:27 PM to 2:34 AM

GPUs

Partial outage from 4:51 PM to 5:15 PM, Major outage from 5:15 PM to 11:27 PM, Partial outage from 11:27 PM to 2:34 AM

Data Storage

Partial outage from 4:51 PM to 6:37 PM, Operational from 6:37 PM to 2:34 AM

Other Secondary Storage Systems

Partial outage from 4:51 PM to 6:37 PM, Operational from 6:37 PM to 2:34 AM

Updates
  • Resolved
    This incident has been resolved.
  • Update

    Personal dev-gpu / dev-cpu VMs can now be started via Xen Orchestra.

    Some VMs may need maintenance before they will start:

    • VMs that were running during the incident may have unclean filesystems that need repair. Typically the boot process will end with "(initramfs)" on the console. Contact service-desk@cst.cam.ac.uk for help.

    • VMs that have not been booted for a long time may need a manual update to /etc/fstab. If your VM appears to start but your home directory is missing or read-only, either run "sudo cl-update-system" and then reboot, or contact service-desk@cst.cam.ac.uk for help.
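
    For illustration only, the kind of /etc/fstab change described above might look like the sketch below. The server names, export paths, and mount options here are hypothetical, not the Lab's actual values; "sudo cl-update-system" applies the correct entry automatically, so prefer that over hand-editing.

    ```
    # HYPOTHETICAL example only - real server names, exports and options may differ.
    # Stale entry pointing at the retired temporary storage server (commented out):
    # old-storage.example.cam.ac.uk:/gpucluster/home  /home  nfs  defaults  0  0
    # Replacement entry pointing at the new storage server:
    new-storage.example.cam.ac.uk:/gpucluster/home  /home  nfs  defaults,_netdev  0  0
    ```

    After editing, "sudo mount -a" (or a reboot) picks up the new entry; a wrong entry here is one way a VM can come up with a missing or read-only home directory.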

    The shared servers dev-gpu-1 and dev-cpu-1 will remain unavailable for a while longer.

  • Update

    Access to GPU cluster home directories and scratch space has been restored using the new storage server; these are accessible from Lab-managed Linux systems outside the GPU VM cluster via /anfs/gpucluster/$USER and /anfs/gpuscratch respectively. You can access this data via SSH to slogin.cl.cam.ac.uk.

    dev-gpu-acs will be available shortly, for ACS students' use only.

    GPU/CPU development VMs and the shared servers dev-gpu-1 and dev-cpu-1 remain unavailable, as copying their VM disks will take longer. They should be restored to service later this evening.

  • Update

    Please do not attempt to start or stop any dev-gpu or dev-cpu VM at this time. The attempt will fail, and may leave your VM in a more broken state.

  • Identified

    Since the GPU cluster is currently unusable due to the fault on the temporary storage server, and a replacement storage server is ready to go into service, we will take this opportunity to migrate the data to the new server. This may take a few hours.

    We believe that no data has been lost. The temporary storage is functioning, but the NFS service is not.

  • Update

    This issue is now also affecting clients that already have the filesystem mounted; they may see a permission error. Most dev-gpu/dev-cpu VMs have probably frozen because they can no longer access their disks.

  • Investigating

    We are investigating a problem in which dev-gpu/dev-cpu home directories are failing to mount. The likely symptom is that GPU VMs hang during boot; VMs that are already running will keep working. Access to 'gpuscratch' paths may also cause the client system to lock up.

    This is due to a suspected Linux kernel bug on a storage server.

    Some disruption is possible while we work on a fix.