University of Cambridge Computer Laboratory - Archive server: temporarily reduced performance following electrical fault – Incident details

Archive server: temporarily reduced performance following electrical fault

Resolved
Degraded performance
Started over 1 year ago
Lasted 1 day

Affected

Data Storage

Degraded performance from 4:39 PM to 8:58 PM

Archive Server

Degraded performance from 4:39 PM to 8:58 PM

Updates
  • Resolved
    Resolved

    We believe the performance impact of the ongoing data integrity checks is minimal, so we are closing the incident.

  • Monitoring
    Update

    The first, and smallest, of archive's three storage pools (CIFS data) has finished its integrity check ("scrub") without encountering any problems. Scrubbing its 32 TB took 12.5 hours.

    Still to be checked are the NFS data pool (156 TB) and then the NFS backup pool (186 TB). Extrapolating, it will take weeks to finish checking the data integrity.

    We believe that the performance impact is not excessive. We will continue to monitor the impact whilst we start to scan the busier NFS data pool, and if the performance remains acceptable we will probably then close this incident whilst the remainder of the data is checked.

  • Monitoring
    Monitoring

    NFS shares should now have been restored.

    Slightly impaired performance will continue, potentially for a couple of days, due to ongoing data integrity checks.

  • Identified
    Identified

    Following the restart of the archive server, some configuration has been lost; in particular, some filesystems are no longer shared via NFS. This problem is being investigated.

  • Monitoring
    Monitoring

    The archive server's filesystems are now available again; however, performance will be degraded for a while on some filesystems whilst the server verifies data integrity.

  • Identified
    Identified

    At 16:39 on 2023-02-08, the West Cambridge Data Centre was hit by a surge on the power grid that affected many systems across the University. In our case, the power fault caused two disk enclosures forming part of the archive server (archive.cl.cam.ac.uk) to briefly disconnect from the server. All volumes on this server are currently unavailable. We are working on restoring service.
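
The scrub figures quoted in the updates above (32 TB checked in 12.5 hours, with 156 TB and 186 TB pools still to go) lend themselves to a simple back-of-the-envelope estimate. The sketch below is illustrative only: it assumes scrub time scales linearly with pool size, which is our simplification and not something the updates state, and the names used (estimate_hours, REMAINING_POOLS_TB) are hypothetical.

```python
# Naive linear extrapolation of scrub time from the figures in the updates.
# ASSUMPTION: scrub time is proportional to pool size. A busier or fuller
# pool may well scrub more slowly, so treat the result as optimistic.

MEASURED_TB = 32.0       # CIFS data pool size, from the update
MEASURED_HOURS = 12.5    # time its scrub took, from the update

# Pools still to be checked, sizes in TB as given in the update.
REMAINING_POOLS_TB = {"NFS data": 156.0, "NFS backup": 186.0}


def estimate_hours(size_tb: float, rate_tb_per_hour: float) -> float:
    """Return the hours needed to scrub size_tb at a constant rate."""
    return size_tb / rate_tb_per_hour


# Observed throughput of the completed scrub, in TB per hour.
rate = MEASURED_TB / MEASURED_HOURS

estimates = {
    name: estimate_hours(size, rate)
    for name, size in REMAINING_POOLS_TB.items()
}
total_hours = sum(estimates.values())

for name, hours in estimates.items():
    print(f"{name}: about {hours:.0f} hours ({hours / 24:.1f} days)")
print(f"Total: about {total_hours:.0f} hours ({total_hours / 24:.1f} days)")
```

At the observed rate this constant-throughput model gives a little under six days for the two remaining pools. The update itself anticipates a longer period, so the linear figure is best read as an optimistic estimate that ignores contention from normal use of the busier NFS pools.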