Caelum Console (server management) - Operational
Request Tracker - Operational
Other Internal Services - Operational
External Services - Operational
Network - Operational
GN09 - Operational
WCDC - Operational
Main VM Pool (WCDC) - Operational
GPUs - Operational
Secondary VM Hosts - Operational
Xen Orchestra - Operational
Filer - Operational
Archive Server - Operational
Data Replication - Operational
Other Secondary Storage Systems - Operational
Notice history
Dec 2023
Nov 2023
- Completed: November 25, 2023 at 3:35 PM
This maintenance has been completed successfully.
If any IT systems are still not working, please contact sys-admin. If any electrical circuits remain down, please contact building-services.
- Update (In progress): November 25, 2023 at 12:46 PM
Power to the building is gradually being restored. GN09 servers will gradually be powered back up when the cooling has reached temperature.
Office power and networking in parts of the building will remain down as the maintenance work in wiring cupboards is ongoing.
- In progress: November 25, 2023 at 10:03 AM
The start of the electrical work was delayed by two hours due to a generator fault. The work is now under way but is likely to run past 12:00.
- Planned: November 25, 2023 at 8:00 AM
The William Gates Building will be without power on the morning of 25th November, due to planned work on our electrical switchgear to facilitate the upcoming commissioning of a substantial amount of solar power generation.
Electrical circuits in the William Gates Building datacentre, GN09, which are connected via the UPS should remain powered throughout the maintenance, running from a backup generator.
Other electrical circuits will go down; in particular, a few research servers are connected only to non-UPS circuits. A list of these will be circulated.
GN09 will also be without active cooling for the duration of this work; we will have temporary air blowers in place to reduce the buildup of hot air but we may need to shut down high-powered servers (for example GPU and FPGA servers) depending on the weather and temperature.
The machines in GN09 listed below will lose power during this maintenance. If possible, please shut them down before the maintenance starts (otherwise we will try to shut them down by pressing the power button). Once we have announced that the maintenance is complete, you can start them again from the Caelum Console; please do not attempt to do so before that announcement.
Some other servers may be shut down as well, in particular GPU and FPGA servers, to reduce the electrical and/or thermal load in GN09.
- virtual machines on the GPU cluster (dev-gpu-…, dev-cpu-… and others as notified separately)
- grumpy
- gxp06
- tarawera
- ngongotaha
- all quorum servers
- stix
- story
- L51 Raspberry Pi cluster
- godzilla
- tiger
- baume
- ctsrd-slave2
- rama
- cat
- chericloud-switch
- rado
- wenger
- wolf0/1/2
- edale
- glencoe
- sakura
- ran
- nana
- momo
- gilling
- sigyn
- idun
- heimdall
- nikola01/02/03/04
- acritarch
- morello101-dev/102-dev/103-dev
- sleepy
- doc
- sherwood
- behemoth
- leviathan
- excalibur
- bam
- kinabalu
- daintree
- marpe
- iphito
- doris
- asteria
- all POETS servers
- mauao
- any other GPU or FPGA server observed to be drawing a lot of power on Friday evening or Saturday morning
- Update (In progress): November 25, 2023 at 7:08 AM
Shutdown of listed servers is now beginning.
- In progress: November 24, 2023 at 10:14 PM
VMs on the GPU cluster are now shutting down. They can be started again from Xen Orchestra once the electrical work is finished, tomorrow afternoon.
- Planned: November 24, 2023 at 6:12 PM
Shutdown of GPU VMs will begin at 22:00 today (Friday).
Shutdown of affected physical servers will begin at 07:00 on Saturday morning.
Oct 2023
- Resolved
Power to our racks in the West Cambridge Data Centre has been stable since the major electrical incident on 18th October, and following replacement of the Automatic Transfer Switch in our core infrastructure rack, our resilience to future partial power outages is improved.
The data centre as a whole is, however, running with reduced power capacity, so the central HPC systems are mostly unavailable. The HPC team currently estimates a return to full service no later than Wednesday 1st November. All HPC users should already be receiving updates from the team by email.
UIS's remedial works to restore the data centre's power capacity are ongoing. We have had no specific information about these other than that the next such works are tentatively scheduled for next week.
We are closing this incident now as we have no information to suggest that we will experience further disruption. However, because repairs to the electrical distribution infrastructure are ongoing, things may of course change.
- Update
UIS has announced that the main circuit breaker on power supply B will be replaced next week, provisionally on Tuesday at 10am, with the work lasting no more than an hour. All our systems should automatically switch over to the alternate power supply (A) and be able to operate from only that supply during this work. However, we know that a few older systems may reboot during the transition, or may experience a few minutes' network outage as their local switch reboots.
We will add a separate scheduled-maintenance incident with more information once UIS has confirmed the timing.
- Update
There are indications from a third party that UIS is planning remedial work for Monday, which may again be disruptive. We will update this incident page as more information becomes available.
- Update
UIS has said that all University services have been restored (with the exception of Research Computing Services / HPC, which is expected to be online again this afternoon). We likewise believe that all departmental services are online and stable.
Please contact sys-admin@cl.cam.ac.uk if anything is not as it should be.
UIS is planning further remedial work in the data centre next week, which may impact services.
Thank you for your patience during this incident.
- Monitoring
All services have been brought back up, with the exception (as before) of some personal or group virtual servers where we're not sure which are supposed to be running. If anything is down that should not be (or vice versa) please contact sys-admin.
Services should still be considered at risk; we will continue to monitor.
- Update
We have had no further information from UIS to the University IT community, but we have learned via an external party (JISC) that the second outage at 19:32 was a deliberate controlled power-down and that as of 20:48 no further outage is expected. We are therefore starting to restore services now.
- Update
We observe that the power has come back on in our WCDC racks, but with no announcement from UIS we are holding off on restoring services for a little while, as we believe the power is still unreliable.
Those machines that automatically powered themselves back on already may experience network outages, as we are taking this opportunity to do a little bit of maintenance that would ordinarily be service-affecting.
- Identified
The data centre lost power again.
- Update
UIS has noted that although power was restored, there are ongoing electrical problems in the data centre and services should still be considered at risk of further disruption. They have an engineer en route to investigate.
- Monitoring
All services are now believed to be back up, with the possible exception of a few individuals' / research groups' virtual servers. There is, however, a chance that a few things did not start properly if the servers started before the network and storage infrastructure was ready for them. We are continuing to check services, but please now contact sys-admin@cl.cam.ac.uk if you are experiencing any problems.
- Identified
The UIS West Cambridge Data Centre lost power at around 16:00. Power was restored at around 16:45. Much of our infrastructure was affected and is now restarting. We hope that most services will be back online very soon.