University of Cambridge Computer Laboratory - Disruption to Morello cluster network – Incident details

Disruption to Morello cluster network

Resolved
Under maintenance
Started 9 months ago
Lasted about 22 hours

Affected

Network

Under maintenance from 6:46 PM to 4:25 PM the following day

Updates
  • Resolved
    Resolved

    This incident has been resolved. Note that, after the earlier maintenance, the entire cluster was power-cycled because of a datacentre-wide power outage.

    A further incident will be opened for the planned replacement of the temporary switch.

  • Monitoring
    Monitoring

    We believe the temporary network setup is working. The faulty switch has been powered down. Please contact sys-admin if anything is not right.

  • Identified
    Update

    Repatching of servers is complete; all Morello systems in rack D9 are now connected to the same port on temporary switch wcdc-d9-sw2 that they were previously connected to on wcdc-d9-sw1.

    There will now be a brief outage to routing as we shut off the layer-3 functionality of wcdc-d9-sw1 and allow wcdc-d10-sw1 to take over.

    (The temporary setup involves D9 being daisy-chained from rack D10. The temporary switch is layer-2 only, so all layer-3 functionality - routing, DHCP - will be handled by wcdc-d10-sw1.)

  • Identified
    Update

    Repatching of Morello servers over to the new temporary switch is about to commence.

  • Identified
    Update

    Replacement of the faulty switch is likely to begin at 12:45. I hope to be able to set up the new temporary switch in parallel with the faulty switch to minimise disruption. Once the new switch is configured, each system will be replugged from the old switch to the new one, hopefully with only a few seconds of disruption to each.

    A permanent replacement is being obtained (under warranty) and after that arrives we will plan another short outage to install it.

  • Identified
    Update

    The maintenance on the rack D10 switch is believed to have been successful, though work is ongoing to verify this.

    The replacement of the rack D9 switch will probably take place tomorrow (18 Oct) late morning / early afternoon.

    For the time being, we believe there is no service-affecting outage.

  • Identified
    Identified

    The Morello cluster network requires urgent disruptive maintenance; it will imminently experience an outage affecting roughly half of the machines.

    The other machines will experience a longer outage tomorrow, as the switch serving those machines has a fault and needs replacement.

    This only affects the Morello cluster; if you don't know what that is, this incident does not affect you.