Equinix LD8 Data Centre POP Outage

Incident Report for Giganet Status Page

Resolved

We have now observed the final remaining affected leased line circuits re-establish. This included a few TTB circuits as well, which we omitted in our previous update.

Equinix have provided the following statement on their global Twitter account at 21:55 (BST).
https://twitter.com/Equinix/status/1295826590677504001

Therefore we shall mark this incident as 'resolved'.
Of course our NOC shall be continuing to monitor the network closely.

A full detailed RFO (reason for outage) shall be provided once this has been made available to us by Equinix and their partners. This may take a few days or weeks to produce; we don't know yet.

We are sorry for the outage today, especially the extended nature of the outage during the business day for those with single-homed services who were worst affected.

This is far from ideal, and we'll be looking for ways that we can improve our resiliency and capabilities to remain as immune to these types of incidents as we can be. However, today's incident was truly unprecedented in nature and duration.

The RFO will be provided here in due course.

Posted Aug 18, 2020 - 22:39 BST

Update

Apologies for the slightly delayed update.
The power was transferred successfully at 19:25, shortly after the previous message.
All equipment is now being powered via the expected PDUs within the rack, and both our LD8 racks are on the new power infrastructure & UPS.
All our equipment is dual-fed from an A & B feed within each rack - as it was before the start of this incident.
We are hoping the new infrastructure is more resilient than the previous - and we await the RFO in this regard.

We continue to see good network performance, and all routers/switches/systems/applications hosted in LD8 have been online since 16:40 (core routing) and 17:10 (DBX Managed connectivity).

However, we are aware of a continuing outage affecting some Virgin Media Business (VMB) leased line circuits that appear to route via VMB infrastructure in LD8 - so this is affecting circuits that are terminating in THN as well as those in LD8.
VMB are aware of this incident, and the current feedback we're receiving is that VMB LD8 racks are still currently offline due to the power issue. They sound like they are in the position that we have been throughout the day, eagerly awaiting their rack to be switched onto the new PDU system. This issue with VMB circuits sounds like it's affecting other ISPs as well as us. The current fix time estimate has changed from 21.00, to 21.30, and now that looks to have been missed. We're expecting the next update at 22.00.

All other circuits/systems are online however.

If any customers have any issues, please contact support on the usual channels.

Posted Aug 18, 2020 - 21:38 BST

Update

In the past 15 minutes, Equinix techs have transferred our power over to the new PDU and UPS system.
So our affected rack is now re-energised.

We now need to transfer over the temporary A-feed power to our MX router to the appropriate rack power. This work is considered 'at-risk'. We will provide the all clear once completed.

Posted Aug 18, 2020 - 19:17 BST

Update

We continue to observe good network availability since our LD8 core router has been connected to a temporary power feed from our adjacent rack (that was unaffected by today’s issue).

We have just restored connectivity to DBX Managed customers whose systems are hosted in this data centre. As such they should be fully operational once more.

From our network monitoring, we believe that Virgin Media Business and Colt leased line circuits are still experiencing problems in LD8.
So a few customers are still affected. This includes some Virgin Media Business circuits which terminate in THN but where the carrier routing appears to take them via LD8. We will continue to investigate this now that we have our connectivity restored.

Please note that the network in LD8 is considered ‘at-risk’, as is the rest of the network due to the ‘blips’ that can occur during incidents such as this.
We are still without our usual power feeds in our core networking rack.

We are currently hearing that Equinix plan to restore our rack’s main power feeds by 19:00.

Our engineer will remain on-site until all our power feeds have restored.

Posted Aug 18, 2020 - 17:15 BST

Monitoring

Our engineer has gained access to the DC and has connected a temporary power feed from the adjacent rack, where we have power, to our LD8 core Juniper MX router to re-establish our main POP in LD8.

We have since seen many customer connections restore and are monitoring.

Our LD8 POP is still ‘at-risk’, as our core MX router power is single fed from an ATS (automatic transfer switch) ‘dual feed’ from our adjacent rack.
We still don’t have DC power to our impacted rack.

Please stand by for further updates.

Posted Aug 18, 2020 - 16:54 BST

Update

Our on-site engineer has managed to re-route some power from a different location to bring services back online. We are seeing most, but not all, services restored.

Full power restoration at this data centre is expected by 21:00.

Posted Aug 18, 2020 - 16:49 BST

Update

We have just been given notification from Equinix that they are targeting to have power restored to the entire facility by 21:00 this evening.

We have an engineer on-site waiting to be let into the DC. This is more of a precaution in case there are issues with power restoration, as clearly we need power online first. We also have our service vendors on standby in case we have failed hardware.

We are continuing to hear that other ISPs operating out of LD8 are having their power restored, so there’s a chance power may be restored before 21:00.

Posted Aug 18, 2020 - 16:03 BST

Update

Equinix have now publicly confirmed they have a problem at LD8, 7 1/2 hours after the incident commenced.

https://twitter.com/equinixuk/status/1295677974944129024?s=21

Posted Aug 18, 2020 - 14:08 BST

Update

We're still waiting for our network rack to regain power while Equinix and their contractors migrate power supplies onto the new infrastructure following the earlier fault.

There is sadly still no estimated fix time, which is most frustrating. They have assured us that they will provide this information when they can.
Equinix are being continually chased for updates.
As you can appreciate, this is a P1 issue affecting many hundreds of other carriers/ISPs, so it's been given the maximum priority.

Summary:
1. We lost both A+B feeds to 1 of our 2 Equinix LD8 racks at approximately 4.23am. This follows a UPS failure, which then triggered the fire alarm in the data centre according to reports from Equinix. The rack that we have lost power to houses our core Juniper MX router and Cisco LNS. The Juniper MX router is our core device which is needed for everything in LD8 to function, including terminating a number of leased line connections as well as providing connectivity to our vDC platform. All our equipment power supplies are dual fed with 'diverse' A+B power feeds provided by the data centre. However, after this incident we suspect that there is a lack of resiliency, and we will be sure to raise this after the incident is resolved, as it is clearly unacceptable to experience a power outage of this gravity.

2. Customers with diverse/resilient leased line circuits should have been operational throughout this incident, as their traffic will have re-routed via our THN alternate data centre. If your circuit is still down and you have managed resiliency, please let us know so we can investigate.
Customers with single-fed leased lines that terminate in LD8 will be offline at this time. We are aware that many of our carriers in LD8 are also offline, so even if our rack was on, our NNI interconnects would be down.

3. All 'broadband' customers (Openreach/CityFibre/Glide ADSL2+/FTTC/G.Fast/FTTP) should have remained online throughout, as our THN data centre is our primary broadband termination location, with LD8 being backup.

4. Customers with single-homed DBX private cloud phone systems that are hosted in LD8 will be down, however we have been arranging network diverts to ensure inbound calls are still routed to customers. We have a rough 50/50 split of DBX Private cloud systems being hosted in LD8 & THN. So some will be unaffected. The vDC infrastructure powering our DBX platform is powered on, as this is the rack unaffected by the outage, however we cannot communicate to this as it's networked via the other rack.

5. Customers with managed PWANs all have N+1 resiliency throughout their design. So the HA firewall, backup connectivity all re-routed automatically where applicable to THN. So the PWAN core, internet breakout, and traffic routing to sites is 100% operational, albeit potentially on reduced bandwidth to sites where backup circuits are lower bandwidths than the primary.

6. Our secondary core application services such as DNS03, NTP03, SMTP02, RADIUS02 are currently down as these are in LD8. The primary ones are in THN and operational, so all customers should be unaffected. The vDC infrastructure powering our core applications is powered on, as this is the rack unaffected by the outage, however we cannot communicate to this as it's networked via the other rack.

7. Some M12/Giganet hosted services such as our Giganet availability checker are currently down as these are hosted in LD8. The vDC infrastructure powering our availability checker is powered on, as this is the rack unaffected by the outage, however we cannot communicate to this as it's networked via the other rack.

8. As reported in the previous update, we have seen two brief interruptions in service affecting broadband and leased line circuits routing via THN. We suspect this is caused by our carriers'/suppliers' network equipment powering back up/re-learning routes in LD8, and potential downstream effects on any traffic passing over these links & devices. So we advise customers that there could still be some brief outages/packet loss as services are restored.

We are absolutely focused on restoring services as soon as we can, however we're at the mercy of Equinix and their contractors.
You can find out more about Equinix here: https://www.equinix.co.uk/

We are sorry for the continued disruption. We're doing all we can to put the pressure on to get a swift resolution, and of course there will be a lot of analysis later.

We will continue to post regular updates as we learn more.

Posted Aug 18, 2020 - 12:22 BST

Update

Although this incident is affecting Equinix LD8, we are seeing signs of intermittent packet loss/increased latency at varying times for leased line & broadband circuits terminating in our other data centre - Telehouse North.

We are speculating, but due to the scale of this outage and the carriers & suppliers affected, there could be knock-on impacts across the carrier ecosystem/networks.

For instance, we are seeing some carriers report their racks in LD8 are being powered back up, and when this happens there could be increased routing table changes on their network devices that could cause our circuits to be affected that traverse their network.
There is unfortunately not a lot that we can do to mitigate this effect until the conclusion of this incident.

To give a sense of the scale of this outage, the London Internet Exchange (LINX) are currently reporting that 150 of their members are affected.

We have had no further update from Equinix about a fix time.
Equinix are also preventing any customers from entering the building.

Posted Aug 18, 2020 - 10:37 BST

Update

Sadly we're not getting too many updates from Equinix, or any indication of a fix time, however the latest update suggests to us that the existing data centre UPS battery backup system has failed, and they are expediting the replacement right now!

The failure appears to have caused the VESDA fire alarm to trigger, which undoubtedly would have caused a full building evacuation and delayed any troubleshooting works.

We're continuing to hear reports from other operators located in LD8 that their power is being restored one by one.
We're yet to see power return to our core LD8 rack.
We don't know where we are in the list to be switched over to the new supply.

1. There was a maintenance window on 11th August where Equinix were migrating one of our two racks to new power infrastructure. This work completed without incident.
2. There was meant to be a similar maintenance window tonight where they were completing the migration to our other rack. This is the rack that is currently down.

From Equinix's latest update it seems that they are expediting point 2 above. Pure speculation on this, but one can imagine that the existing UPS has failed, and this offers the fastest way to restore power.

Posted Aug 18, 2020 - 09:46 BST

Update

We have just received the following update from Equinix:
"Equinix IBX Site Staff reports that IBX Engineers and the specialist vendor have begun restoring services to customers by migrating to newly installed and commissioned infrastructure. IBX Engineers continue to work towards restoring services to all customers and further updates will follow when more information is available."

Power maintenance work affecting our rack was already scheduled (before this incident) for this evening from 21:00.
See https://status.giga.net.uk/incidents/d13072pxgc2x
We can only assume from the above update that they are now bringing this work forward.

Posted Aug 18, 2020 - 08:51 BST

Update

Equinix have provided the following update:

"Equinix IBX Site Staff reports that IBX Engineers and the specialist vendor have begun restoring services to customers by migrating to newly installed and commissioned infrastructure. IBX Engineers continue to work towards restoring services to all customers and further updates will follow when more information is available."

Posted Aug 18, 2020 - 08:48 BST

Update

Unfortunately there has been no further update from Equinix about the ongoing power outage affecting Equinix LD8.
We are aware that the fire alarm incident is still ongoing, and people are unable to enter the building.
We are hearing third party reports that some power is being restored, however we are yet to see our equipment restore.

We apologise for the disruption this is causing, and are working with individual affected customers to mitigate as much as we can.

Posted Aug 18, 2020 - 08:42 BST

Update

We are still waiting for a fix. However, we know from our partners and suppliers that network engineers are still unable to access the building for fire safety reasons.

Once we are able to access the building we will attempt to bypass the issue by powering equipment off a different rack.

Posted Aug 18, 2020 - 08:25 BST

Identified

Equinix have provided the following update:
"Equinix IBX Site Staff reports that fire alarm was triggered by the failure of output static switch from Galaxy UPS system supporting levels 1, 2, 3, 4 in building 8/9 at LD8. This has resulted in a loss of power for multiple customers and IBX Engineers are working to resolve the issue."

Posted Aug 18, 2020 - 07:02 BST

Update

We have raised a trouble ticket with Equinix and await their reply.
However, according to third party reports, there is a known power issue, as well as a possible fire alarm.
It sounds as though power has been lost to certain areas of the data centre, and not the entire facility.
However our entire POP rack has been affected.

Our alternate core POP - THN - is fully operational and supporting the load.
However, any circuits and services which exclusively terminate at LD8 are currently down.

Posted Aug 18, 2020 - 05:36 BST

Investigating

We are currently aware of a mass service outage (MSO) affecting our LD8 POP.
We suspect that we have lost power to the entire rack as no devices, including out-of-band management devices, are accessible.
We have raised this as an urgent ticket with Equinix, the data centre provider.

Start Time: 18/08/20 04:24:27

Service Impact: We have lost complete access to our LD8 core POP. Services routing via this POP are currently down.
Our THN core POP is operational.

Further updates will be provided as we learn more.

Posted Aug 18, 2020 - 05:09 BST

This incident affected: Giganet - Broadband and Internet (Giganet Core - Routing, Giganet Core - Broadband (BNG/LNS), Giganet Core - IP Transit, Giganet Core - Peering, Giganet Core - MPLS PWANs, Giganet Core - Hosted Firewalls, Carrier - ELITE/IGNITE (Leased line)), Giganet - Data Centres & Points of Presence (Giganet Core - London Docklands (LD8)), Giganet - Voice Services (Giganet DBX - SpliceCom), and Giganet Network 1 - Core Applications (DNS03 Recursive Server, RADIUS02 Server, SMTP02 Relay Server, NTP02 Server).