University of Cambridge Computer Laboratory - gatwick (WGB core network) crashed – Incident details

gatwick (WGB core network) crashed

Resolved
Partial outage
Started 6 months ago
Lasted 5 days

Affected

Network

Partial outage from 1:27 PM to 2:23 PM, Operational from 2:23 PM to 12:58 PM

Updates
  • Resolved
    Resolved
    This incident has been resolved.
  • Monitoring
    Update

    gatwick crashed and rebooted again at around 06:22, once more triggered by a routine configuration update.

    Earlier in the night we had attempted to install a software update, but because of an unrelated issue the routers refused to perform an 'In Service Software Upgrade', i.e. the upgrade would have caused more disruption. We therefore chose to roll back and delay the update until Cisco has published their notes about this particular version.

  • Monitoring
    Update

    The routers have remained stable since the crash, but we're going to do some further testing out-of-hours, and install a software update. There may be some further disruption whilst that happens.

  • Monitoring
    Monitoring

    The core router and switch in the William Gates Building (gatwick) seemingly crashed and rebooted at around 14:27.

    This appears to have been due to a software bug triggered by a routine configuration change. Although gatwick is a virtual switch/router comprising two independent physical systems, it seems that the entire virtual switch/router (both physical systems) rebooted simultaneously.

    This type of device takes a long time to reboot; in this case there would have been a little over 12 minutes during which the William Gates Building office network was cut off from the University network and the internet (followed by a further few minutes of instability). This would also have affected connectivity to filer and a few other core services hosted in the WGB.

    Investigation is ongoing into the reason for this outage.