University of Cambridge Computer Laboratory Status - Incident history

Filer NFS reconnections required

Tue, 20 Aug 2024 16:50:19 +0000

Aug 20, 16:50:19 GMT+0
Identified - Due to an unintended consequence of a minor configuration change, NFS connections to filer have been disrupted on some systems. The likely symptom is an "Invalid argument" error when trying to access any directory on filer. A reboot should solve the problem, or if you have root access to an affected client, "sudo systemctl restart autofs" may also solve the problem.
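The symptom described above can be checked from a script before deciding whether to restart autofs. The following is a minimal sketch rather than an official tool: the function name `filer_access_state` is hypothetical, and it assumes the broken mount surfaces as `errno.EINVAL` ("Invalid argument") when a directory is listed.

```python
import errno
import os


def filer_access_state(path, lister=os.listdir):
    """Classify access to a directory as 'ok', 'stale' or 'other-error'.

    'stale' means that listing the directory failed with EINVAL
    ("Invalid argument"), the symptom described in this incident.
    The `lister` parameter exists only so the check can be tested
    without a real NFS mount.
    """
    try:
        lister(path)
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            return "stale"
        return "other-error"
    return "ok"
```

If the function returns "stale" for a directory that you normally reach on filer, rebooting or running "sudo systemctl restart autofs" (with root access) is the suggested remedy.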

Aug 20, 17:24:12 GMT+0
Resolved - There is no ongoing problem with filer, but a change at approximately 17:35 caused some clients to lose existing connections to filer as a one-off. Filer access has been fixed on the main departmental servers (slogin, web, mail, etc.). If you encounter problems accessing filer on another system and cannot fix it yourself, please contact service-desk@cl.cam.ac.uk.

William Gates Building planned power outage

Sat, 17 Aug 2024 07:00:00 +0000

Aug 17, 07:00:00 GMT+0
Identified - The William Gates Building will be without power for part of Saturday 17th August 2024, due to further planned work on our electrical switch gear on the connection to the building's new solar panels. This additional shutdown is needed to rectify a problem with one of the components installed during the January shutdown. **Nearly all IT services in the William Gates Building will be unavailable for most of the day.** Telephones, office networking and wifi will be unavailable all day (but the building is likely to be closed in any case). Please make sure that all computers in offices are shut down - not just asleep - when you leave on Friday. We will start shutting down servers at 8am ready for the power to be turned off at around 10am. We expect the power to come back on at approximately 1pm but it will then take some time to bring all systems back into operation. We will unfortunately need to shut down all servers in GN09 except for a very small number of critical services such as filer and network infrastructure (which will be powered from a temporary generator), as the cooling system will be offline for several hours and temperatures would otherwise climb to unsafe levels. **This includes nearly all research servers and all GPU servers (including GPU VMs).** GN09 holds almost all of our server hardware; if you are unsure where your server is located, it is probably in GN09 and will probably be affected. (A very small number of research systems are in the West Cambridge Data Centre, and will not be affected.) **The outage is not expected to affect core infrastructure, administrative systems or small VMs** as these are hosted in the West Cambridge Data Centre. However there is a risk that access to filer from these systems will be disrupted; we don't plan to turn filer off, but it is in GN09, its temporary electrical supply is at risk, and we may have to turn it off if it gets too hot. 
Where a service is replicated between multiple sites, only one instance of the service may be available (this affects most core services such as LDAP, Active Directory and VPN2). VMs hosted by the department will stay running unless they are on the GPU VM clusters (this applies both to VMs with GPUs, and VMs with a lot of CPU cores - generally with names that contain "gpu", "cpu" or "dev"). Services hosted externally to the department, for example by UIS, will not be affected; these include Moodle, CamSIS, HPC, Exchange email, Fastmail email and the main departmental (CST) website.

Aug 17, 07:00:01 GMT+0
Identified - Maintenance is now in progress.

Aug 19, 16:50:27 GMT+0
Completed - The issue affecting power control of servers in GN09 rack 6 has been rectified.

Aug 17, 11:54:37 GMT+0
Identified - The electrical supply has been restored. We are in the process of restoring infrastructure and then GN09 servers. This is likely to take an hour or two.

Aug 16, 11:09:32 GMT+0
Identified - Reminder that the William Gates Building's electrical supply will be turned off tomorrow. Please fully shut down office PCs before you leave today. Research and teaching servers in GN09 will be turned off tomorrow morning, except where already agreed.

Aug 17, 13:34:54 GMT+0
Completed - All infrastructure is believed to be operational again after this morning's electrical work, with the exception of power control for a small number of research servers in GN09 rack 6, where a PDU has developed a hardware fault. Power is still being supplied, but cannot be turned off or on remotely. Servers can still be turned off or on via their BMCs, but if you need a server power-cycled, please contact service-desk@cl.cam.ac.uk.

Database server maintenance

Wed, 14 Aug 2024 17:31:50 +0000

Aug 14, 17:31:50 GMT+0
Identified - We are applying urgent security updates to the departmental database server and dbwebserver.

Aug 14, 18:47:13 GMT+0
Completed - The user-visible impact of this maintenance has ended; some behind-the-scenes servers continue to be updated but this should not affect any use of the database.

Wifi issue under investigation

Tue, 30 Jul 2024 10:04:56 +0000

Jul 30, 10:04:56 GMT+0
Investigating - We are aware of reports that some people are experiencing delays when trying to connect to Internal-CL and eduroam. If you keep trying, you should eventually be able to connect; alternatively you could _temporarily_ switch to the open "wgb" network and then connect to the VPN -- but note that this is less secure. The issue has been reported to UIS to investigate.

Jul 30, 11:14:18 GMT+0
Monitoring - UIS implemented a fix at 11am. Please report any ongoing issues to service-desk@cl.cam.ac.uk.

Jul 30, 16:41:43 GMT+0
Resolved - We have had no further reports of wifi connection problems since UIS's fix at 11am.

UIS firewall maintenance

Mon, 29 Jul 2024 05:00:00 +0000

Jul 29, 05:00:01 GMT+0
Identified - Maintenance is now in progress.

Jul 29, 05:00:00 GMT+0
Identified - UIS will be carrying out network maintenance on Monday 29 July from 6am to 8:30am (to physically reconnect a data centre firewall to a new network). The central IT services listed below will be unavailable for 10–30 minutes during this period:

* CamSIS
* CHRIS
* CUFS
* Research dashboard
* X5
* University DNS Service

We recommend waiting until after 8:30am before logging in to the services listed above. They may come back online earlier than 8:30am, so you can try to log in if you have urgent work, but please be aware that you may experience connectivity issues. If you experience issues accessing the services after the maintenance period, try logging out and back in. If problems persist, please [contact the UIS Service Desk](https://help.uis.cam.ac.uk/contact-us).

Jul 29, 07:30:00 GMT+0
Completed - Maintenance has completed successfully.

Partial wifi outage in GS corridor west

Mon, 22 Jul 2024 10:22:01 +0000

Jul 22, 10:22:01 GMT+0
Investigating - We are aware that the wifi access point covering the western end of GS corridor stopped working last night, and are investigating.

Jul 22, 14:21:39 GMT+0
Identified - The wireless access point in GS corridor west has apparently suffered a catastrophic hardware failure, and we are arranging a replacement.

Jul 22, 14:39:47 GMT+0
Identified - A replacement wireless access point will be delivered and installed tomorrow.

Jul 23, 10:11:46 GMT+0
Identified - We've taken delivery of a replacement wireless access point for the western end of the GS corridor; however some facilities work is needed to install it on the ceiling.

Jul 23, 16:40:17 GMT+0
Resolved - The replacement wireless access point for GS corridor west has now been installed. The signal may be variable over the next day or two whilst the system automatically calibrates the new hardware. After that, please report any wifi signal issues to [service-desk@cl.cam.ac.uk](mailto:service-desk@cl.cam.ac.uk). We have also separately been planning a major upgrade to the wireless network in the building, expected to take place in a few months' time. That should improve the wifi signal and wireless network performance throughout the building.

Some mail into Fastmail is bouncing

Mon, 22 Jul 2024 09:59:04 +0000

Jul 22, 09:59:04 GMT+0
Investigating - We are aware that Fastmail are bouncing some mail to users of fm.cl.cam.ac.uk with an error message such as "Error: bare \<LF\> received". We will ask Fastmail to investigate.

Jul 23, 08:30:39 GMT+0
Investigating - We have still not heard back from Fastmail, but mail seems to have stopped bouncing around 14:20 yesterday. We will continue to press Fastmail for an update and will continue to monitor the situation ourselves. If you are a Fastmail user and would like us to check our logs for any mail to you that bounced, please contact service-desk@cl.cam.ac.uk.

Jul 23, 08:39:11 GMT+0
Monitoring - Fastmail has now confirmed that they have implemented a fix for the problem.

Jul 23, 10:33:20 GMT+0
Monitoring - Fastmail has clarified that they have fixed only one of two issues that we reported, and it may be the case that mail continues to bounce with "bare \<LF\>" errors. We are in communication with Fastmail engineers to investigate the remaining problem, which appears to be a complex software bug. Nevertheless mail has not been bouncing for the past 21 hours so the problem may have been fixed by accident.

Jul 30, 17:06:01 GMT+0
Resolved - We have not seen any recurrence of this issue since 2024-07-22 at 14:20, which coincides with when Fastmail fixed another problem, so we hypothesise that they fixed this one too by accident. However they are continuing to investigate.

GPU cluster storage maintenance

Tue, 16 Jul 2024 16:00:00 +0000

Jul 16, 16:54:50 GMT+0
Identified - This work is ongoing and is likely to overrun due to an unexpected hardware problem.

Jul 16, 17:36:15 GMT+0
Identified - The outage has overrun due to a problem encountered during the storage server's RAM upgrade. Progress is being made; we can still upgrade the RAM and restore service, just not in the way we expected to. Current estimate for restoration of service: 19:00-19:15.

Jul 16, 18:02:10 GMT+0
Completed - This maintenance has been completed. Personal VMs can be started via Xen Orchestra as needed.

Jul 16, 11:00:58 GMT+0
Identified - Reminder: this maintenance is taking place at 17:00 today and will require all dev-gpu-\* and dev-cpu-\* VMs to be shut down.

Jul 16, 16:00:00 GMT+0
Identified - The server that hosts storage for the departmental GPU cluster needs an urgent security update and a reboot. This will necessitate **shutting down all GPU and CPU development VMs**, dev-gpu-\* and dev-cpu-\*, including the shared servers dev-gpu-1 and dev-cpu-1. These VMs' disks, as well as associated data directories (GPU home directories and shared "gpuscratch" space), will be unavailable for about half an hour. As it will take time to shut down and restart the VM infrastructure, VMs will be unavailable for longer: approximately an hour. We will take the opportunity to add RAM to the storage server too, to improve performance.

Jul 16, 16:00:01 GMT+0
Identified - Maintenance is now in progress.

archive-smb maintenance

Tue, 9 Jul 2024 21:00:00 +0000

Jul 9, 21:00:00 GMT+0
Identified - As mentioned in last week's incident, we were awaiting an urgent update to the software on the new archive server. That update is now available, and will be installed on archive-smb this evening. Expect approximately 15 minutes' outage to \\\\archive-smb.cl.cam.ac.uk (as well as scy27\_traces and corelab\_datasets). This update is necessary to ensure the security of the new archive server.

Jul 9, 21:00:01 GMT+0
Identified - Maintenance is now in progress.

Jul 9, 21:23:21 GMT+0
Completed - Maintenance has completed successfully.

Archive SMB service migration

Thu, 4 Jul 2024 09:00:00 +0000

Jun 17, 18:00:55 GMT+0
Identified - _Text of the email announcement:_

This message is meant for researchers who store data on the **archive server, archive.cl.cam.ac.uk**. Professional services staff can ignore this announcement. Researchers who aren't sure whether they use archive can see below for some guidance on how to tell.

**The archive server, archive.cl.cam.ac.uk, is being replaced.** The server will be unavailable on one or two occasions (first one: **4th-5th July**) whilst this work takes place. **You may then need to change how you access the server.**

### **How do I know if I am using archive?**

Most people are **not** using archive. However if you contacted us asking for backup space or more than a few hundred gigabytes of storage at any point during the last several years, we might have provided you with storage on archive. We would generally have discussed this with you when creating the storage, though in some cases that would have been many years ago. The archive server is **not** related to filer or bigdisc.

Generally any access to archive will be through the hostname **“archive.cl.cam.ac.uk” or simply “archive”**, or paths including the component **"archive" or "gfxdisp"**. (The server name "berilia" is also involved; if you are using anything referring to "berilia" please contact [sys-admin@cl.cam.ac.uk](mailto:sys-admin@cl.cam.ac.uk) as this will stop working.)

Archive provides two separate services: NFS (Linux/UNIX style) shares and SMB (Windows-style) shares. (SMB may also be referred to as CIFS. SMB can be accessed from Linux clients too, though mostly it's intended for Windows.) At the moment, both services use the same server name, archive.cl.cam.ac.uk. We will be migrating both services to a new server, but on different days. However, **all** users of archive will be impacted to some extent on 4th July, and **almost all** users of archive will need to change how they use the service.

**You are using archive’s SMB (Windows-style) service, a.k.a. CIFS, if any of the following applies:**

* You are using paths that look similar to any of the following:
  * **\\\\archive\\_something_** or **\\\\archive.cl.cam.ac.uk\\_something_** (Windows UNC paths)
  * **smb://archive.cl.cam.ac.uk/_something_** or **smb://archive/_something_** (Linux/Mac file browser SMB paths)
  * **//archive.cl.cam.ac.uk/_something_** or **//archive/_something_** (UNC paths as used by some Linux clients such as “mount -t smbfs” or “mount -t cifs”)
* You are using “**corelab\_datasets**” (this is a special case; see below)
* You are using “**scy27\_traces**” (this is a special case; see below)

If any of these apply, see "Impact on SMB users" below; expect a long period of disruption on 4th July, perhaps extending into 5th July, and expect to have to change how you access archive.

**You are using archive’s NFS (Linux/UNIX-style) service if any of the following applies:**

* You are manually mounting storage from **archive.cl.cam.ac.uk:/export/_something_ or archive:/export/_something_**, or have one of those paths in **/etc/fstab**
* You are using paths under **/auto/archive or /net/archive or /anfs/gfxdisp or /auto/anfs/gfxdisp**

If any of these apply, see "Impact on NFS users" below; expect a short disruption on 4th July, then another email about a longer disruption a month or two later.

### **Impact on SMB users**

The SMB service will move first, on **4th July**. It will be unavailable for several hours on that day, potentially with disruption extending to the next day due to the large amount of work needed to reinstate this service on a new server.

**Immediately after the work on 4th July, the archive SMB service will move to archive-smb.cl.cam.ac.uk.** You will need to update any path that you use that currently refers to the SMB service on archive.cl.cam.ac.uk.

**Access permissions on SMB shares will be reset!** For each SMB share, we have identified the current users and will grant those users read and write access to the entire share. Every other member of the department will be given read access (in order to minimise disruption, as we believe there is no confidential data on archive). **If you have confidential data on archive that must not be readable by other users, please contact** [**sys-admin@cl.cam.ac.uk**](mailto:sys-admin@cl.cam.ac.uk) **immediately.** Any custom permissions that may have been set on individual files and folders will be lost. Unfortunately it is not feasible to transfer permissions from the old server to the new one, as we are switching software platform (from Spectra Verde to TrueNAS) and the two platforms store Windows-style ACLs in different, incompatible ways.

Whilst the permissions are being rebuilt on 4th-5th July (which is a manual process) you may find that you are unable to access your storage. We will try to restore read access first; write access may take longer.

**If you are using corelab\_datasets**: This is a special case; it is an SMB (Windows) share, but several members of your group access it from managed Linux systems via a temporary mechanism. We plan to make this share available via both SMB and NFS, to better support access from both operating systems. We will reconfigure each managed Linux machine on which we’ve set up access to this dataset to use NFS, once this volume is available via NFS. Until then, you will be unable to access corelab\_datasets from Linux.

**If you are using scy27\_traces**: As discussed with the users (RT ticket #135996), this is currently an NFS share but is being migrated along with the SMB shares so that we can enable SMB access in future. So this will experience the same disruption as SMB shares, described above, including the reset of permissions. On 4th July it will also start to require a Kerberos ticket for NFS access.

### **Impact on NFS users**

Firstly, on **4th July** the NFS service will be unavailable for a short while (approximately 30-60 minutes) as the whole archive server will need to be shut down in order to move the disks holding SMB data to another server. Preparatory work has already been completed to try to expedite this process; however there is a chance of unexpected problems with bringing the NFS service back online on the old server (due to complications with the old server's obsolete software platform). We will prioritise bringing the NFS service back up as quickly as possible on 4th July.

There will be a more disruptive outage to archive’s NFS service in a month or two when we move that service to the new hardware; the timeline for this is not yet decided. But if you are accessing an archive NFS filesystem from a Lab-managed Linux/UNIX system, you can **prepare now** for the migration by ensuring that you are **not using a path starting /net/archive**. If you are using /net/archive/export/_SHARE_ please switch to **/auto/archive/_SHARE_** as soon as convenient.

If you are accessing an archive NFS filesystem from a non-Lab-managed computer by manually mounting archive\[.cl.cam.ac.uk\]:/export/SHARE, we suggest that you email [sys-admin@cl.cam.ac.uk](mailto:sys-admin@cl.cam.ac.uk) now to ask that we help you to set up the Lab automounter on your computer. If you don’t, you’ll have to change to a different path when we move the NFS service.

At the moment, NFS shares on archive only support “sec=sys” (IP-address-based access); they do not use any form of strong authentication. After the upgrade, we will be in a position to optionally authenticate NFS using Kerberos, as we do on filer. Contact [sys-admin@cl.cam.ac.uk](mailto:sys-admin@cl.cam.ac.uk) if you are interested in switching your archive NFS share to Kerberos.

### **Updates during the work**

Updates on progress during the work (as well as a copy of this announcement) will be posted on the Computer Laboratory status page. You can subscribe on that page if you would like to receive updates by email.

Please contact [sys-admin@cl.cam.ac.uk](mailto:sys-admin@cl.cam.ac.uk) if you have any questions. Thanks.
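The path patterns listed in the announcement can be turned into a quick self-check. The sketch below is illustrative only: the function names `archive_service` and `prepare_nfs_path` are hypothetical, the regular expressions are one reading of the patterns quoted above, and the two special-case shares are reported as "smb" because the announcement groups them with the SMB migration.

```python
import re

# Shares that the announcement treats as part of the SMB migration,
# regardless of how they are currently accessed.
SPECIAL_SHARES = ("corelab_datasets", "scy27_traces")

# "archive" or "archive.cl.cam.ac.uk" after each SMB-style prefix.
SMB_PATTERNS = [
    r"^\\\\archive(\.cl\.cam\.ac\.uk)?\\",   # Windows UNC paths
    r"^smb://archive(\.cl\.cam\.ac\.uk)?/",  # file browser SMB paths
    r"^//archive(\.cl\.cam\.ac\.uk)?/",      # UNC paths used by mount -t cifs
]

NFS_PATTERNS = [
    r"^archive(\.cl\.cam\.ac\.uk)?:/export/",  # manual mounts or /etc/fstab
    r"^/(auto|net)/archive(/|$)",              # automounter paths
    r"^(/auto)?/anfs/gfxdisp(/|$)",
]


def archive_service(path):
    """Return 'smb', 'nfs' or None for a path, per the announcement's lists."""
    if any(share in path for share in SPECIAL_SHARES):
        return "smb"
    if any(re.search(p, path) for p in SMB_PATTERNS):
        return "smb"
    if any(re.search(p, path) for p in NFS_PATTERNS):
        return "nfs"
    return None


def prepare_nfs_path(path):
    """Rewrite /net/archive/export/SHARE to /auto/archive/SHARE.

    Any other path is returned unchanged.
    """
    prefix = "/net/archive/export/"
    if path.startswith(prefix):
        return "/auto/archive/" + path[len(prefix):]
    return path
```

A result of `None` only means that the path does not match any pattern quoted in the announcement; it does not prove that the data lives elsewhere.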

Jul 4, 09:00:00 GMT+0
Identified - The SMB (Windows-style) storage service on [archive.cl.cam.ac.uk](http://archive.cl.cam.ac.uk) will be moving to a new server and a new hostname. Users will need to change how they access the service. NFS users will be briefly impacted too at this time (and then the NFS service will move to the new server later). Details will be sent out by email and posted here in due course. The work will begin on 4th July, with follow-up work possibly extending into 5th July. Times are approximate at this stage but you should consider the service to be unavailable all day.

Jul 4, 09:00:01 GMT+0
Identified - Maintenance is now in progress.

Jul 4, 12:07:08 GMT+0
Identified - The part of today's work affecting NFS users of archive (other than scy27\_traces which is a special case) is now complete. NFS volumes are available again. If you have problems with the archive NFS service, contact [sys-admin@cl.cam.ac.uk](mailto:sys-admin@cl.cam.ac.uk). SMB / Windows-style volumes are currently unavailable and will be set up on the new archive server (archive-smb) during the rest of today and tomorrow.

Jul 4, 20:54:27 GMT+0
Identified - All affected shares / volumes should now be available again. As previously communicated, the server name for SMB volumes has changed. Where you previously used a path starting with \\\\archive.cl.cam.ac.uk, you will now need to use **\\\\archive-smb.cl.cam.ac.uk**. Also as previously communicated, ownership of files and directories has been reset. Each share has been assigned an owning user or group, based on who we believe is using the share. The owning user or group has full control over the contents of the share; anyone not in the group has read-only access. Detailed permissions (ACLs) and other extended attributes have been removed. If you find that the new permissions are not behaving as you expect, please contact [sys-admin@cl.cam.ac.uk](mailto:sys-admin@cl.cam.ac.uk). You are welcome to add ACLs to your shares again if you wish. They will be stored in a new format behind the scenes, but should work the same as they always did. Some further work is ongoing, for example to restore the use of snapshots. Please note that we anticipate having to install a software update on the new archive server within the next few days (unfortunately it was not available in time to install today), which will cause another brief outage. **Users of corelab\_datasets** can now access their volume from Linux via NFS through the path **/auto/archive/corelab\_datasets**. If it doesn't work, you may need to restart autofs once on your client: if you have root access, use "sudo systemctl restart autofs". Please discontinue the use of the old SMB-based path (/smb/...). In due course we will tidy up the SMB-based setup from those clients on which it has been set up. Thanks for your patience whilst we worked to improve the use of this volume for Linux users. Windows and Mac users can continue to use SMB via the new path **\\\\archive-smb.cl.cam.ac.uk\\corelab\_datasets**. 
**Users of scy27\_traces** can continue to use **/auto/archive/scy27\_traces** but will now need a Kerberos ticket to do so. If it doesn't work, you may need to restart autofs once on your client: if you have root access, use "sudo systemctl restart autofs". You can also access the volume via SMB if needed.
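For scripts or configuration files that still refer to the old SMB server name, the rename described in this update could be applied mechanically. This is a minimal sketch under stated assumptions: the helper name `to_new_smb_path` is hypothetical, and mapping the short hostname "archive" to the full new name is an assumption, since the update only mentions paths starting with \\\\archive.cl.cam.ac.uk.

```python
import re

NEW_HOST = "archive-smb.cl.cam.ac.uk"

# Matches the old host ("archive" or "archive.cl.cam.ac.uk") directly after
# one of the three SMB-style prefixes, and only when a path separator follows.
_OLD_SMB = re.compile(
    r"^(?P<prefix>\\\\|smb://|//)archive(?:\.cl\.cam\.ac\.uk)?(?=[\\/])",
    re.IGNORECASE,
)


def to_new_smb_path(path):
    """Replace the old SMB host with the new one; leave other paths as-is."""
    return _OLD_SMB.sub(lambda m: m.group("prefix") + NEW_HOST, path, count=1)
```

Paths that already use archive-smb.cl.cam.ac.uk, and NFS automounter paths such as /auto/archive/..., are returned unchanged because they do not match the pattern.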

Jul 6, 14:31:20 GMT+0
Completed - Maintenance has completed successfully.

DS-Print maintenance

Mon, 20 May 2024 06:30:00 +0000

May 20, 06:30:00 GMT+0
Identified - UIS has announced scheduled maintenance on the DS-Print service on the morning of Monday 20th May. Printing to the DS-Print printers in the Department will not be possible between 7:30am and 8:30am, and multi-function devices will be unresponsive. The Department's locally managed printers will be unaffected. UIS are replacing a security certificate and installing an update for one of the technologies we use to provide the service.

May 20, 06:30:01 GMT+0
Identified - Maintenance is now in progress.

May 20, 07:30:00 GMT+0
Completed - Maintenance has completed successfully.

WGB emergency network maintenance

Mon, 6 May 2024 22:15:00 +0000

May 6, 22:15:00 GMT+0
Identified - We will be updating the software on the core router/switch in the William Gates Building (gatwick) in order to attempt to mitigate the ongoing crashes. This upgrade cannot be performed "live", so there will be approximately 20-30 minutes' outage of the William Gates Building office network, and of filer. Other servers in GN09 should be largely unaffected.

May 6, 22:15:01 GMT+0
Identified - Maintenance is now in progress.

May 6, 22:42:06 GMT+0
Completed - Maintenance has completed successfully.

WGB network problem under investigation

Mon, 6 May 2024 18:10:10 +0000

May 6, 18:10:10 GMT+0
Monitoring - The core switch/router in the William Gates Building (gatwick) appears to have crashed and rebooted; this is perhaps a recurrence of the issues seen a month ago. Networking should have returned (initially via one switch of the redundant pair that constitutes gatwick, whilst the other switch restarts).

May 6, 22:42:23 GMT+0
Resolved - This incident has been resolved.

RT upgrade

Fri, 19 Apr 2024 16:00:00 +0000

Apr 19, 16:00:01 GMT+0
Identified - Maintenance is now in progress.

Apr 19, 16:00:00 GMT+0
Identified - We will be upgrading RT during this time. Email to sys-admin, building-services and other email addresses that use ticket numbers will be delayed and not acted upon until the upgrade is complete. The upgrade may take all weekend as it involves a time-consuming database conversion.

Apr 19, 23:08:21 GMT+0
Completed - Maintenance has completed successfully.

gatwick (WGB core network) crashed

Mon, 8 Apr 2024 13:27:00 +0000

Apr 8, 14:23:33 GMT+0
Monitoring - .

Apr 8, 13:27:00 GMT+0
Monitoring - The core router and switch in the William Gates Building (gatwick) seemingly crashed and rebooted at around 14:27. This appears to have been due to a software bug triggered by a routine configuration change. Although gatwick is a virtual switch/router comprising two independent physical systems, it seems that the entire virtual switch/router (both physical systems) rebooted simultaneously. This type of device takes a long time to reboot; in this case there would have been a little over 12 minutes during which the William Gates Building office network was cut off from the University network and the internet (followed by a further few minutes of instability). This would also have affected connectivity to filer and a few other core services hosted in the WGB. Investigation is ongoing into the reason for this outage.

Apr 8, 17:34:29 GMT+0
Monitoring - The routers have remained stable since the crash, but we're going to do some further testing out-of-hours, and install a software update. There may be some further disruption whilst that happens.

Apr 9, 10:26:23 GMT+0
Monitoring - gatwick crashed and rebooted again at around 06:22, again triggered by a routine configuration update. Earlier in the night we had attempted to install a software update, but due to an unrelated issue the routers refused to do an 'In Service Software Upgrade' (i.e. the upgrade would have caused more disruption), so we chose to roll back and delay this update until Cisco published their notes about this particular version.

Apr 13, 12:58:09 GMT+0
Resolved - This incident has been resolved.

Reduced Caelum Console functionality

Fri, 15 Mar 2024 12:23:08 +0000

Mar 15, 21:02:37 GMT+0
Resolved - .

Mar 15, 12:23:08 GMT+0
Identified - Some functions of Caelum Console are temporarily unavailable due to a server problem which will be rectified as soon as possible.

Mar 15, 19:10:59 GMT+0
Monitoring - A fix has been implemented for the misbehaving Caelum live state database (which had failed, filled the disk with error logs, and caused more problems).

Lab GPU clusters: poor I/O performance

Mon, 4 Mar 2024 16:54:16 +0000

Mar 4, 16:54:16 GMT+0
Monitoring - We are aware of performance problems on the storage server that hosts dev-gpu/dev-cpu VM disks and data. One cause, a user VM generating a very high I/O load, has been found and temporarily mitigated. We will continue to monitor the situation as load still seems higher than expected.

Mar 4, 18:48:39 GMT+0
Resolved - We believe this issue has been resolved, and are in communication with the user who accidentally caused the poor performance. We're also considering hardware upgrades for the storage server.

Intermittent internet connectivity fault - UK-wide DDoS

Mon, 19 Feb 2024 15:43:52 +0000

Feb 19, 15:43:52 GMT+0
Identified - We are aware of an issue affecting the University's connections to the internet. This appears to be a widespread fault outside our control, affecting several institutions.

Feb 19, 16:02:52 GMT+0
Identified - UIS has posted an update: "The University is currently experiencing a DDoS attack. CSIRT are liaising with JISC - we'll update when we know more."

Feb 19, 16:20:44 GMT+0
Monitoring - Jisc have confirmed that the attacks are across multiple UK universities. We observe that Jisc are implementing strategies to reduce the impact of the attacks, and the network is becoming more usable, though performance is likely to be impaired and some things may not work.

Feb 20, 10:39:44 GMT+0
Resolved - UIS has declared the DDoS-related disruption to be resolved, though they note that there may be some residual issues this morning, including delays to email delivery. We observe that some email was delayed for several hours overnight; Jisc's mitigation of the DDoS attack seems to have blocked some legitimate connections into and out of the University, particularly email-related ones.

Filer inaccessibility under investigation

Thu, 1 Feb 2024 15:02:41 +0000

Feb 1, 18:07:06 GMT+0
Resolved - .

Feb 1, 15:02:41 GMT+0
Investigating - We are aware of reports that some users are unable to access directories on filer, and are investigating.

Feb 1, 15:23:41 GMT+0
Monitoring - We believe the disruption has ended; please report any ongoing issues with filer access to sys-admin.

GPU cluster filesystem outage

Thu, 18 Jan 2024 12:01:32 +0000

Jan 18, 12:01:32 GMT+0
Investigating - GPU VMs (and other VMs running on the same cluster) may have experienced a filesystem fault due to a problem with the storage server, which is currently under investigation.

Jan 18, 12:16:16 GMT+0
Identified - Running VMs will have failed due to their disks briefly becoming inaccessible. Once the server has been stabilised, affected VMs will be shut down and can be started again from XO. However, work is ongoing first to stabilise the server, which experienced an unexpectedly disruptive problem with its network configuration.

Jan 18, 12:40:13 GMT+0
Identified - The storage system has been stabilised. Some VMs need a filesystem repair due to the ungraceful disconnection of their disks. Each VM that was running during this incident will be started to ensure that its filesystems are clean, repaired if necessary, then (for GPU VMs) shut down again.

Jan 18, 13:15:52 GMT+0
Resolved - This incident has been resolved. Affected VMs have either been started again (in the case of CPU VMs and shared servers) or shut down (in the case of GPU VMs). The latter can be started again as needed from https://xo.cl.cam.ac.uk. The cause of this disruption was that the server on which VM disks and user data are stored failed to start with the correct network configuration after Sunday's planned shutdown (we suspect due to a bug causing it to make some incorrect changes to the configuration during boot). An attempt to rectify this problem 'live' caused about a minute's disruption, unfortunately just long enough to cause Linux VMs to time out I/O operations and shut down their filesystems.

William Gates Building planned power outage

Sat, 13 Jan 2024 17:00:00 +0000

Jan 13, 17:00:00 GMT+0
Identified - The William Gates Building will be without power all day on Sunday 14th January 2024, due to planned work on our electrical switch gear to connect our new solar panels. This is the second and final shutdown planned as part of the solar panel installation. **Nearly all IT services in the William Gates Building will be unavailable for roughly 24 hours, perhaps longer.** We will start shutting systems down on the evening of Saturday 13th January ready for the power to be turned off the following morning; we expect the power to come back on during the evening of Sunday 14th January but it will then take some time to bring all systems back into operation. We expect most services to be available by Monday morning, but there is a small chance that a few things won't initially be working properly on Monday. Telephones, office networking and wifi will be unavailable all day on Sunday (but the building is likely to be closed in any case). Please make sure that all computers in offices are shut down (not just asleep) before Saturday evening. Due to the longer outage this time, we will unfortunately need to shut down all servers in GN09 except for a very small number of critical services such as filer, as the cooling system will be offline all day and temperatures would otherwise climb to unsafe levels. **This includes nearly all research servers and all GPU servers (including GPU VMs).** GN09 holds almost all of our server hardware; if you are unsure where your server is located, it is probably in GN09 and will probably be affected. (A very small number of research systems are in the West Cambridge Data Centre, and will not be affected.) **The outage is not expected to affect core infrastructure, administrative systems or small VMs** as these are hosted in the West Cambridge Data Centre. However there is a risk that access to filer from these systems will be disrupted; we don't plan to turn filer off, but it is in GN09 and we may have to act if it gets too hot. 
Where a service is replicated between multiple sites, only one instance of the service may be available (this affects most core services such as LDAP, Active Directory and VPN2). VMs hosted by the department will stay running unless they are on the GPU VM clusters (this applies both to VMs with GPUs, and VMs with a lot of CPU cores - generally with names that contain "gpu" or "cpu"). Services hosted externally to the department, for example by UIS, will not be affected - for example Moodle, CamSIS, HPC, Exchange email, Fastmail email and the main departmental (CST) website..

Jan 14, 08:00:00 GMT+0
Identified - The electrical work is in progress. Systems in GN09 including GPU VMs will remain off until the work is complete, tentatively estimated for 16:00. After that it will take some hours to fully restore all systems.

Jan 14, 17:15:17 GMT+0
Identified - Power has been restored to the building. It will now take some time, perhaps hours, to restore all systems starting with core infrastructure. Please be patient if your system remains unavailable.

Jan 14, 15:54:20 GMT+0
Identified - Revised estimate on the restoration of power to the building: 17:00-17:30.

Jan 14, 20:03:02 GMT+0
Identified - Datacentre infrastructure has been restored. Owners of servers can now start them via the Caelum console (if access is set up); owners of GPU/CPU development VMs can start them via Xen Orchestra as usual. Contact sys-admin if any needed system is down or misbehaving.

Jan 15, 13:44:02 GMT+0
Completed - We think that (except where we're already in communication with the affected users about a specific issue) everything is back to normal after the planned electrical shutdown. Please contact sys-admin if you notice any issues.

Loss of UDN resilience

Sun, 7 Jan 2024 16:52:00 +0000

Jan 7, 18:24:13 GMT+0
Monitoring - Connectivity via dist-mwest appears to have been restored.

Jan 8, 11:24:24 GMT+0
Resolved - This incident has been resolved.

Jan 7, 16:52:00 GMT+0
Identified - An hour after the William Gates Building's electrical disruption, we lost two of our four connections to the University Data Network (those connected via UIS's dist-mwest router in the Maths department). We suspect that there is an ongoing electrical issue affecting that router, and its UPS just ran out of battery. This incident is not currently causing us any disruption. However, we are now relying on a single upstream UIS router (dist-west in Physics), so we are at risk of disruption if that router fails. Because that router is close to our building, it is possible that it too was affected by the site electrical issues.

Unplanned reboots due to brief unplanned power outage in WGB

Sun, 7 Jan 2024 15:59:11 +0000

Jan 8, 17:36:33 GMT+0
Resolved - This incident has been resolved.

Jan 7, 15:59:11 GMT+0
Investigating - We believe that the William Gates Building just suffered a brief power outage. It is likely that many systems in GN09 and offices have rebooted.

Jan 7, 16:17:19 GMT+0
Monitoring - The power outage appears to have been very brief. We believe that all UPS-protected equipment, in particular all core infrastructure, stayed running. Most, but not all, non-UPS-protected equipment rebooted. Please contact sys-admin if you are unable to access a system after this outage.

Archive server maintenance

Tue, 19 Dec 2023 15:00:00 +0000

Dec 19, 16:00:00 GMT+0
Completed - Maintenance has completed successfully.

Dec 19, 15:00:00 GMT+0
Identified - The 'archive' storage server will be unavailable for a few minutes. We will be reorganising the disks in preparation for a future migration of services onto new hardware. Since several disks will be moving between chassis, this must be done with the service shut down.

Dec 19, 15:00:01 GMT+0
Identified - Maintenance is now in progress.

WebDAV unavailable

Fri, 15 Dec 2023 15:52:10 +0000

Feb 1, 18:06:47 GMT+0
Resolved - .

Dec 15, 15:52:10 GMT+0
Monitoring - The WebDAV service has suffered a catastrophic failure. WebDAV is no longer recommended for use anyway, so we have decided not to rebuild the service. For remote access it is better to access filer directly via SMB (\\filer.cl.cam.ac.uk\...). This requires a VPN connection. You can contact sys-admin for help setting this up.

Some GPU cluster VMs crashed

Fri, 15 Dec 2023 02:17:28 +0000

Dec 15, 02:17:28 GMT+0
Identified - Due to a problem during network maintenance, VMs on the departmental GPU cluster briefly lost access to their disks. This caused some VMs to crash. Affected VMs will be rebooted (if CPU VMs) or shut down (if GPU VMs); the latter can be started again from XO.

Dec 15, 03:16:45 GMT+0
Resolved - This incident has been resolved.

Filer maintenance

Tue, 12 Dec 2023 11:00:00 +0000

Dec 12, 12:57:09 GMT+0
Completed - Maintenance has completed successfully.

Dec 12, 11:00:00 GMT+0
Identified - We are planning a major software update on the filer on Tuesday 12th December, with some possible preparatory changes to be made the previous day. Some brief pauses to file access are expected; besides this, we do not expect any disruption, but the risk is slightly elevated. Our NetApp support contractor will be on hand to help with the upgrade and mitigate any issues arising. The exact timing of the work on Tuesday 12th December may be adjusted nearer to the time.

William Gates Building planned power outage

Sat, 25 Nov 2023 08:00:00 +0000

Nov 24, 18:12:05 GMT+0
Identified - Shutdown of GPU VMs will begin at 22:00 today (Friday). Shutdown of affected physical servers will begin at 07:00 on Saturday morning.

Nov 25, 12:46:44 GMT+0
Identified - Power to the building is gradually being restored. GN09 servers will gradually be powered back up when the cooling has reached temperature. Office power and networking in parts of the building will remain down as the maintenance work in wiring cupboards is ongoing.

Nov 25, 15:35:27 GMT+0
Completed - This maintenance has been completed successfully. If any IT systems are still not working, please contact sys-admin. If any electrical circuits remain down, please contact building-services.

Nov 24, 22:14:34 GMT+0
Identified - VMs on the GPU cluster are now shutting down. They can be started again from Xen Orchestra once the electrical work is finished, tomorrow afternoon.

Nov 25, 07:08:07 GMT+0
Identified - Shutdown of listed servers is now beginning.

Nov 25, 08:00:00 GMT+0
Identified - The William Gates Building will be without power for the morning of 25th November, due to planned work on our electrical switch gear to facilitate the upcoming commissioning of a substantial amount of solar power generation. Electrical circuits in the William Gates Building datacentre, GN09, which are connected via the UPS should remain powered throughout the maintenance, running from a backup generator. Other electrical circuits will go down; in particular, a few research servers are connected only to non-UPS circuits. A list of these will be circulated. GN09 will also be without active cooling for the duration of this work; we will have temporary air blowers in place to reduce the buildup of hot air but we may need to shut down high-powered servers (for example GPU and FPGA servers) depending on the weather and temperature. The following list of machines in GN09 will lose power during this maintenance. If possible, please shut them down before the maintenance starts (otherwise we will try to shut them down by pressing the power button). Once we have announced that the maintenance is complete, you can start them again from the [Caelum Console](https://console.caelum.cl.cam.ac.uk/). Please wait for an announcement that the maintenance is complete before attempting to do so. Some other servers may be shut down as well, in particular GPU and FPGA servers, to reduce the electrical and/or thermal load in GN09.

- virtual machines on the GPU cluster (dev-gpu-..., dev-cpu-... and others as notified separately)
- grumpy
- gxp06
- tarawera
- ngongotaha
- all quorum servers
- stix
- story
- L51 Raspberry Pi cluster
- godzilla
- tiger
- baume
- ctsrd-slave2
- rama
- cat
- chericloud-switch
- rado
- wenger
- wolf0/1/2
- edale
- glencoe
- sakura
- ran
- nana
- momo
- gilling
- sigyn
- idun
- heimdall
- nikola01/02/03/04
- acritarch
- morello101-dev/102-dev/103-dev
- sleepy
- doc
- sherwood
- behemoth
- leviathan
- excalibur
- bam
- kinabalu
- daintree
- marpe
- iphito
- doris
- asteria
- all POETS servers
- mauao
- any other GPU or FPGA server observed to be drawing a lot of power on Friday evening or Saturday morning.

Nov 25, 10:03:23 GMT+0
Identified - The start of the electrical work was delayed by two hours due to a generator fault. The work is now under way but is likely to run past 12:00.

GN09 rack 7: serial consoles unavailable

Wed, 15 Nov 2023 16:33:00 +0000

Nov 15, 16:33:00 GMT+0
Identified - Due to a hardware failure, access to the serial consoles of servers in GN09 rack 7 is currently unavailable. Contact sys-admin if you have problems with the research servers in this rack. This affects:

* chervil
* serrano1 and 2
* piquin1 and 2
* jar
* medellin
* nikola05 to nikola08 inclusive.

Nov 17, 18:52:49 GMT+0
Resolved - This incident has been resolved.

UIS wifi maintenance

Fri, 3 Nov 2023 08:00:00 +0000

Nov 3, 08:00:00 GMT+0
Identified - UIS, who operates the wifi networks in the William Gates Building and across the University, has announced urgent maintenance:

> We will be carrying out urgent maintenance on the University Wireless Service between 08:00 and 09:00 on Friday 3 November (tomorrow morning) to move services back to the primary clusters in the West Cambridge Data Centre, following the completion of remedial work to the centre’s power system. This will fully restore wireless system resilience.
>
> All services, including eduroam, UniOfCam-Guest, UniOfCam-IoT and local SSID Wi-Fi connectivity [including Internal-CL and wgb] will be unavailable for 10–20 minutes during the work, but we expect all service to be restored before 09:00.

Nov 3, 08:00:01 GMT+0
Identified - Maintenance is now in progress.

Nov 3, 09:00:00 GMT+0
Completed - Maintenance has completed successfully.

West Cambridge Data Centre: loss of power resilience

Tue, 31 Oct 2023 18:15:03 +0000

Oct 31, 18:15:03 GMT+0
Identified - One of the two power feeds to each of our racks in WCDC has suffered another unannounced outage. Services are currently unaffected as everything can operate from the remaining power feed, but should be considered at-risk. We believe this is planned remedial work, but received no warning of the impact.

Oct 31, 18:23:50 GMT+0
Monitoring - Power has been restored but we believe maintenance is ongoing.

Oct 31, 18:47:01 GMT+0
Resolved - This incident has been resolved.

WCDC remedial works: ATS replacement

Tue, 24 Oct 2023 13:30:00 +0000

Oct 24, 13:30:00 GMT+0
Identified - We are planning to replace an Automatic Transfer Switch in one of our racks in the West Cambridge Data Centre, to reduce the likely impact of future partial power outages. This device automatically switches other devices (servers, network and management infrastructure) between the data centre's two resilient electrical supplies to allow these to remain powered during an outage of one of the supplies. Though this would not have helped with last week's power outages (which affected both supplies simultaneously), in general the power systems in WCDC are designed to limit outages to a single supply at once. We know that the old ATS that we are currently using is not as reliable as it should be, and have a replacement ready. The replacement will result in a loss of power to a few servers (those without their own dual-input power supplies), all of which are part of a resilient service so no user-visible outage is expected:

- adsrv07 (one of three Active Directory servers for DC.CL.CAM.AC.UK)
- adsrv03 (one of three Active Directory servers for AD.CL.CAM.AC.UK)
- sxp12 (one of two DHCP servers)

It will also result in loss of networking to a few things for about 10 minutes, as the 1Gbps switches will be power-cycled:

- verex01
- cctv01
- tfc-app{1,2,4,5}
- management of servers in WCDC (BMCs etc.)

Besides that, no user-visible outage is expected.

Oct 24, 13:30:01 GMT+0
Identified - Maintenance is now in progress.

Oct 24, 14:07:55 GMT+0
Completed - Maintenance has completed successfully.

Major outage: West Cambridge Data Centre

Wed, 18 Oct 2023 15:00:00 +0000

Oct 18, 15:00:00 GMT+0
Identified - The UIS West Cambridge Data Centre lost power at around 16:00. Power was restored at around 16:45. Much of our infrastructure was affected and is now restarting. We hope that most services will be back online very soon.

Oct 18, 17:36:48 GMT+0
Monitoring - All services are now believed to be back up, with the possible exception of a few individuals' / research groups' virtual servers, but there is a chance that a few things did not start properly if the servers started before the network and storage infrastructure was ready for them. We are continuing to check services but please now contact sys-admin@cl.cam.ac.uk if you are experiencing any problems.

Oct 18, 17:42:19 GMT+0
Monitoring - UIS has noted that although power was restored, there are ongoing electrical problems in the data centre and services should still be considered at risk of further disruption. They have an engineer en route to investigate.

Oct 18, 18:41:28 GMT+0
Identified - The data centre lost power again.

Oct 18, 20:46:13 GMT+0
Identified - We observe that the power has come back on in our WCDC racks, but with no announcement from UIS we are holding off on starting to restore services for a little while as we believe the power to still be unreliable. Those machines that automatically powered themselves back on already may experience network outages, as we are taking this opportunity to do a little bit of maintenance that would ordinarily be service-affecting.

Oct 18, 22:24:18 GMT+0
Identified - We have had no further information from UIS to the University IT community, but we have learned via an external party (JISC) that the second outage at 19:32 was a deliberate controlled power-down and that as of 20:48 no further outage is expected. We are therefore starting to restore services now.

Oct 18, 23:13:51 GMT+0
Monitoring - All services have been brought back up, with the exception (as before) of some personal or group virtual servers where we're not sure which are supposed to be running. If anything is down that should not be (or vice versa) please contact sys-admin. Services should be considered at risk, still; we will continue to monitor.

Oct 19, 11:06:38 GMT+0
Monitoring - UIS has said that all University services have been restored (with the exception of Research Computing Services / HPC which is expected to be online again this afternoon). We likewise believe that all departmental services are online and stable. Please contact [sys-admin@cl.cam.ac.uk](mailto:sys-admin@cl.cam.ac.uk) if anything is not as it should be. UIS are planning further remedial work in the data centre next week, which may impact services. Thank you for your patience during this incident..

Oct 20, 09:13:51 GMT+0
Monitoring - There are indications from a third party that UIS is planning remedial work for Monday, which may again be disruptive. We will update this incident page as more information becomes available.

Oct 20, 14:17:26 GMT+0
Monitoring - UIS have announced that the main circuit breaker on power supply B will be replaced next week, provisionally on Tuesday at 10am lasting no more than an hour. All our systems *should* automatically switch over to the alternate power supply (A) and be able to operate from only that supply during this work. However we know that a few older systems may reboot during the transition, or may experience a few minutes' network outage as their local switch reboots. We will add a separate scheduled-maintenance incident with more information once UIS has confirmed the timing..

Oct 26, 14:57:18 GMT+0
Resolved - Power to our racks in the West Cambridge Data Centre has been stable since the major electrical incident on 18th October, and following replacement of the Automatic Transfer Switch in our core infrastructure rack, our resilience to future **partial** power outages is improved. The data centre as a whole is however running with reduced power capacity, so the central HPC systems are mostly unavailable. The HPC team currently estimates a return to full service no later than Wednesday 1st November. All HPC users should already be receiving updates from the team by email. UIS's remedial works to restore the data centre's power capacity are ongoing. We have had no specific information about these other than that the next such works are tentatively scheduled for next week. We are closing this incident now as we have no information to suggest that we will experience further disruption, however due to ongoing repairs to the electrical distribution infrastructure, things may of course change..

Disruption to Morello cluster network

Tue, 17 Oct 2023 18:46:20 +0000

Oct 17, 18:46:20 GMT+0
Identified - The Morello cluster network requires urgent disruptive maintenance; it will imminently experience an outage affecting roughly half of the machines. The other machines will experience a longer outage tomorrow, as the switch serving those machines has a fault and needs replacement. This only affects the Morello cluster; if you don't know what that is, this incident does not affect you.

Oct 17, 19:16:12 GMT+0
Identified - The maintenance on the rack D10 switch is believed to have been successful, though work is ongoing to verify. The replacement of the rack D9 switch will probably take place tomorrow (18 Oct) late morning / early afternoon. For the time being, we believe there is no service-affecting outage.

Oct 18, 11:25:33 GMT+0
Identified - Replacement of the faulty switch is likely to begin at 12:45. I hope to be able to set up the new temporary switch in parallel with the faulty switch to minimise disruption. Once the new switch is configured, each system will be replugged from the old switch to the new, with hopefully only a few seconds of disruption to each. A permanent replacement is being obtained (under warranty) and after that arrives we will plan another short outage to install it.

Oct 18, 12:30:55 GMT+0
Identified - Repatching of Morello servers over to the new temporary switch is about to commence.

Oct 18, 12:39:48 GMT+0
Identified - Repatching of servers is complete; all Morello systems in rack D9 are now connected to the same port on temporary switch wcdc-d9-sw2 that they were previously connected to on wcdc-d9-sw1. There will now be a brief outage to routing as we shut off the layer-3 functionality of wcdc-d9-sw1 and allow wcdc-d10-sw1 to take over. (The temporary setup involves D9 being daisy-chained from rack D10. The temporary switch is layer-2 only, so all layer-3 functionality - routing, DHCP - will be handled by wcdc-d10-sw1.)

Oct 18, 12:46:49 GMT+0
Monitoring - We believe the temporary network setup is working. The faulty switch has been powered down. Please contact sys-admin if anything is not right.

Oct 18, 16:25:13 GMT+0
Resolved - This incident has been resolved (though the entire cluster was power-cycled after the earlier maintenance due to a datacentre-wide power outage). A further incident will be opened for the planned replacement of the temporary switch.

archive storage server (berilia) at risk

Tue, 17 Oct 2023 11:57:26 +0000

Oct 18, 11:03:32 GMT+0
Resolved - This incident has been resolved.

Oct 17, 11:57:26 GMT+0
Investigating - The archive storage server has experienced a power supply failure and is operating without resilience until we can source a replacement. Currently there is no impact to service but a single further power fault would take down the service. This is unrelated to the ongoing UPS issue (this server is in a different building).

Oct 17, 13:04:33 GMT+0
Identified - A replacement power supply has been sourced and is expected to be delivered this afternoon.

GN09 UPS fault under investigation

Tue, 17 Oct 2023 09:06:13 +0000

Oct 17, 09:06:13 GMT+0
Investigating - cr1-ups, our main UPS, is indicating an internal fault. We believe there is no impact to service at this time, but many systems in GN09 are currently at risk. We will be engaging the manufacturer to investigate.

Oct 17, 09:43:16 GMT+0
Investigating - In consultation with the UPS maintenance engineer, we believe the alarm is bogus (the result of a previous engineer's mistake) and that there is no risk to service; however, an engineer will be attending as a precaution.

Oct 17, 13:44:52 GMT+0
Resolved - Our UPS contractor's engineer has attended and cleared the fault.

Wifi authentication outage

Thu, 12 Oct 2023 14:28:00 +0000

Oct 12, 14:28:00 GMT+0
Identified - UIS (who operate our Wifi network and many others around Cambridge) have announced a major incident affecting University wireless services. New connections to eduroam, UniOfCam-Guest, UniOfCam-IoT and Internal-CL are currently failing. Established connections are working, but users are unable to establish new connections or are losing their connection when roaming. In the William Gates Building, as a **temporary** workaround, you can connect to the network named wgb. This is a lower-performance network intended for temporary visitors to the building, with connectivity provided through a commercial ISP (so you may also need to connect to the [VPN](https://www.cl.cam.ac.uk/local/sys/vpn2) to access internal University or Department systems). Please remember to switch back to Internal-CL or eduroam after the outage is over as wgb is less secure, slower and more expensive for us to operate..

Oct 12, 20:49:20 GMT+0
Identified - UIS update as of 21:16:

> Our engineers have worked through the evening but have been unable to identify the cause of the connectivity issues with University Wireless services.
>
> We will liaise with our equipment providers’ technical support teams and provide an update by 08:30 tomorrow morning.
>
> We regret any inconvenience caused by this issue and appreciate your patience as we work to resolve it.

Oct 12, 21:46:29 GMT+0
Monitoring - UIS update at 22:44: > We believe earlier issues are now resolved but are monitoring the situation.

Oct 13, 10:24:44 GMT+0
Resolved - We believe the connectivity issues with University Wireless services are resolved.

West Cambridge Data Centre cooling fault

Sat, 7 Oct 2023 09:45:52 +0000

Oct 7, 09:45:52 GMT+0
Investigating - We have observed a rapid increase in temperature of our servers in the West Cambridge Data Centre. We have reported this to UIS who manage the facility. We are not currently suffering any outages but may have to turn servers off if the temperature continues to rise.

Oct 7, 10:47:58 GMT+0
Monitoring - UIS have mitigated the cooling fault and temperatures have returned to normal. We are awaiting confirmation of whether the fix is permanent or whether there is still a fault.

Oct 7, 13:15:00 GMT+0
Resolved - UIS have confirmed that the problem is not expected to reoccur; however the WCDC cooling system is running with reduced resiliency until permanent repairs can be made.

berilia (archive) storage server crashed

Thu, 5 Oct 2023 23:10:00 +0000

Oct 5, 23:10:00 GMT+0
Identified - The archive storage server, berilia, ran out of memory and crashed around midnight. It will take some time to come back online.

Oct 6, 00:41:36 GMT+0
Resolved - This incident has been resolved.

UIS urgent network maintenance 17:00-17:30

Thu, 10 Aug 2023 15:53:43 +0000

Aug 10, 15:53:43 GMT+0
Identified - UIS have announced urgent unplanned network maintenance between 17:00 and 17:30 which is expected to affect many UIS, UAS and UIS-hosted services, potentially including the Computer Laboratory's networking. Affected UIS/UAS systems include:

- University Data Network (UDN)
- Identity Management
- CHRIS
- CamSIS
- CUFS
- Remote Access
- Servers and Hosting Network
- Research Administration Systems
- Drupal Content Management System
- VDI (AppsAnywhere)
- Data Centre Network (DCN)
- Virtual hosting
- Digital Admissions
- PG Funding

Systems wholly within the Computer Laboratory should be unaffected, except for their connectivity to the internet and other departments.

Aug 10, 16:40:43 GMT+0
Resolved - We believe UIS's urgent maintenance has been completed, though we are awaiting confirmation.

GN09 UPS maintenance

Wed, 7 Jun 2023 12:30:00 +0000

Jun 7, 12:30:00 GMT+0
Identified - Our UPS contractor, Kohler Uninterruptible Power, will be doing routine maintenance on the two UPSes powering parts of GN09. No outage is expected, but services (and in particular research servers with a single power supply) should be considered at-risk.

Jun 7, 13:11:33 GMT+0
Completed - Maintenance has completed successfully.

Wifi hardware replacement

Fri, 2 Jun 2023 07:30:00 +0000

May 26, 13:21:37 GMT+0
Identified - Updated with correct date.

Jun 2, 07:30:00 GMT+0
Identified - UIS will be replacing five wireless access points in the William Gates Building with upgraded models. The following areas will experience periods of disrupted wifi coverage:

- Lecture theatres
- FN corridor west
- SW02.

Jun 2, 12:39:19 GMT+0
Completed - Maintenance has completed successfully.

cr1-sw0/gatwick software upgrade

Wed, 24 May 2023 19:00:00 +0000

May 24, 19:00:01 GMT+0
Identified - Maintenance is now in progress.

May 24, 20:00:00 GMT+0
Completed - Maintenance has completed successfully.

May 24, 19:00:00 GMT+0
Identified - We are upgrading the Cisco IOS XE software on cr1-sw0 aka gatwick, our core switch for the William Gates Building and one of our main routers. There may be brief periods of packet loss, but no major disruption is expected.

William Gates Building electrical shutdown

Tue, 11 Apr 2023 06:30:00 +0000

Apr 11, 14:12:31 GMT+0
Completed - The second floor switches have been fixed (in one case, replaced with spare hardware). All is now believed to be back to normal. As usual, contact sys-admin if anything seems wrong.

Apr 11, 06:30:01 GMT+0
Identified - Maintenance is now in progress.

Apr 11, 06:31:05 GMT+0
Identified - Server shutdown in GN09 will begin shortly. Power outage expected in 30 minutes.

Apr 11, 07:03:24 GMT+0
Identified - Slight delay to the power outage as the engineer from UK Power Networks has been delayed. Please leave machines off, for now.

Apr 11, 07:22:30 GMT+0
Identified - Generator transfer and power outage commencing shortly.

Apr 11, 10:30:00 GMT+0
Completed - Maintenance has completed successfully.

Apr 11, 10:33:55 GMT+0
Identified - A fault affecting the office network on parts of the SC corridor, arising after the power outage, is being investigated.

Apr 5, 10:41:18 GMT+0
Identified - The list of affected machines in GN09 that will be powered down prior to the electrical maintenance can be found at: http://www.wiki.cl.cam.ac.uk/rowiki/SysInfo/20230411AffectedMachines.

Apr 11, 10:44:26 GMT+0
Identified - A switch serving part of the SN and SW corridor also has a fault under investigation. Servers in GN09 should now be back to normal.

Apr 11, 11:13:36 GMT+0
Identified - One of the WC2B switches serving about 50% of connections in the northwest corner of the second floor (SN, SW) appears to have completely failed; a replacement will be set up and installed but this will take a couple of hours. The WC2D switch problem (part of SC) is still under investigation.

Apr 11, 06:30:00 GMT+0
Identified - Power to the William Gates Building will be shut down for approximately 3 hours on the morning of Tuesday 11th April, for routine maintenance of the 11kV substation. Generators will be connected to maintain power to the GN09 UPS-protected circuits and core University/Janet network infrastructure *only*.

- Servers in GN09 which are powered solely from mains circuits will be shut down prior to the work and will remain powered off until the maintenance has completed. A list of affected servers is being prepared and will be published as soon as possible.
- Servers in GN09 powered by the UPS are at risk of disruption as well, because the process of transferring the UPS to a generator supply runs a small risk of tripping RCDs protecting individual rack circuits within GN09. (Last time we tested this, one circuit out of approximately 48 tripped.)
- Core services such as filer and the GN09 core network are expected to remain up, since they are protected by two independent UPSes.
- The office network, wifi and phones will be down for the duration of the maintenance; the UPS batteries in the wiring cupboards will not last long enough to sustain service. However we anticipate that the building will be closed during this work anyway as there will be no lighting or other basic services.
- We will take the opportunity to upgrade firmware on the office network switches, so there is a small chance of further disruption once the power returns if there is any problem with the firmware update on a switch.
- Regardless, once power returns it may take 30 minutes or more for the office network, wifi and phones to be restored to service. This is because the switches take a long time to start up, especially during a firmware update, plus we may have to manually turn wiring cupboard circuits back on throughout the building.

Apr 10, 23:27:57 GMT+0
Identified - Reminder: the William Gates Building's electrical supply will be shut down at 08:00 BST. Please shut down computers in offices before that time. Affected servers in the datacentre, GN09, will be shut down for you starting at 07:30. The list of affected servers in GN09 can be found at http://www.wiki.cl.cam.ac.uk/rowiki/SysInfo/20230411AffectedMachines .

Apr 11, 10:20:47 GMT+0
Identified - The power maintenance has been completed; power was restored at around 10:49 (after a brief previous restoration). The office network (including wifi and phones) is now starting back up in sequence: ground floor initially, then first floor, then second floor. Each switch will go through an update process which may take 20 minutes or so. We expect that office networking is starting to come back up now but may take a little longer in some parts of the building. Servers in GN09 that were shut down can be restarted using the Caelum Console, https://console.caelum.cl.cam.ac.uk/ . Some servers known to be required are being started for you now..

Apr 11, 12:43:29 GMT+0
Identified - The WC2D switch serving part of the SC corridor is up for the moment, now that we have manually rolled back its failed firmware update, but will reboot again (~15 minute outage) later this afternoon to redo the update. Preparation of a replacement switch for WC2B (a few rooms on SN/SW) is ongoing.

wcdc-sw0/heathrow software upgrade

Sun, 2 Apr 2023 16:00:00 +0000

Apr 2, 16:00:00 GMT+0
Identified - We are upgrading the Cisco IOS XE software on wcdc-sw0 aka heathrow, our core switch for servers in the West Cambridge Data Centre and one of our main routers. No outage is expected.

Apr 2, 16:00:10 GMT+0
Identified - Maintenance is now in progress.

Apr 2, 20:32:44 GMT+0
Completed - Maintenance has completed successfully.

GN09 electrical resilience testing

Fri, 3 Mar 2023 10:00:00 +0000

Mar 3, 10:00:00 GMT+0
Identified - We are planning on testing our recent electrical resilience upgrade to the GN09 datacentre in the William Gates Building by running the datacentre from a mobile generator for a few minutes on Friday morning. No disruption is expected; the UPS will power the datacentre for the very brief time that it takes our new automatic transfer switch to transfer the load to the generator supply. In the event of the generator failing, the UPS will pick up the load again whilst the load is transferred back to the mains electrical supply. We only anticipate disruption in the unlikely event of a UPS battery failure; however the UPS and its batteries have been serviced recently..

Mar 3, 10:00:01 GMT+0
Identified - Maintenance is now in progress.

Mar 3, 10:24:36 GMT+0
Completed - Maintenance has completed successfully.

GBN fibre diversion

Mon, 6 Feb 2023 09:00:00 +0000

Feb 6, 09:00:00 GMT+0
Identified - UIS need to divert two of the fibre circuits connecting the William Gates Building to the West Cambridge Data Centre and to the University Data Network. No disruption is expected as traffic will automatically use alternate paths, but we will be operating with no redundancy for two 10-15 minute periods during 6th February and networking should be considered at risk.

Feb 6, 09:00:01 GMT+0
Identified - Maintenance is now in progress.

Feb 6, 10:52:07 GMT+0
Identified - One of the two affected circuits (one of two from the WGB to the University Data Network) has been successfully diverted and brought back up. The other affected circuit (one of two connecting the two halves of our internal network in WGB and the West Cambridge Data Centre) will be migrated around 11:15. Again, no disruption is expected.

Feb 6, 11:49:10 GMT+0
Identified - Fibre diversion works have been completed. We will continue to monitor the links to verify that they're stable.

Feb 6, 17:00:00 GMT+0
Completed - Maintenance has completed successfully.

GN09 datacentre electrical upgrade

Fri, 13 Jan 2023 09:00:00 +0000

Jan 13, 08:47:43 GMT+0
Identified - Shutdown of affected servers will begin shortly.

Jan 13, 09:00:00 GMT+0
Identified - Work to upgrade power resilience in GN09 will take place from 10th until 13th January 2023, requiring a shutdown of all circuits supplied by our primary UPS for most of 13th January. Some servers used for research are supplied solely from those circuits, and will need to be turned off for the day unless alternative arrangements are made. We have some limited capacity to provide alternative power feeds for a subset of servers. Owners of affected machines have been contacted, and must get in touch as instructed if they require their machine to have a temporary power feed on 13th January. Otherwise, such machines will be powered off for the duration. A larger set of servers will lose networking for the duration unless an alternative power feed is set up for their rack switch. Owners of such affected machines have also been contacted and must get in touch if the network outage would be too disruptive. Servers' out-of-band management (Caelum Console) will be partially unavailable. Besides the above, most other machines in GN09 (except the core network and filer, which have a secondary UPS) will be without power resilience for the day. Servers with redundant PSUs will temporarily be vulnerable to a single PSU failure. Immediate widespread disruption is possible in the event of a problem with the mains power.

Jan 13, 13:52:34 GMT+0
Identified - The UPS electrical maintenance work has finished; we will now start re-energising circuits within the GN09 datacentre. This will be a gradual process, so expect the disruption to continue for the time being. An update will be posted when all rack power feeds are live.

Jan 13, 18:14:13 GMT+0
Completed - Maintenance has been completed successfully. All affected servers and network switches are back up (most as of several hours ago).

Reduced capacity for GPU VMs this week

Wed, 11 Jan 2023 00:00:00 +0000

Jan 11, 00:00:00 GMT+0
Identified - In order to reduce the electrical load in preparation for [Friday's electrical works](https://cl.instatus.com/clc93nfpv0163hioayo1cq05i), we have had to turn off some of our GPU VM hosts until Friday evening. If you need to use your GPU VM and are unable to start it, contact sys-admin as we may be able to find capacity for you.

Jan 13, 13:28:30 GMT+0
Completed - We have been able to provide sufficient GPU capacity again.