LD8 Core Router Failure - Thursday 1st April 2021

Incident Report for Giganet Status Page

Resolved

Firstly, thank you for your patience whilst we resolved our LD8 core router failure of 1st April (sadly, no April Fool!).

It has taken us much longer than anticipated to restore the router to full working condition, and the reasons for this shall be shared in an upcoming reason for outage.

Thankfully, our network engineering teams devised a disaster recovery solution to avoid a total loss of service for affected customers. Although our core router in LD8 has been down since the overnight maintenance scheduled on 1st April, we have continued to provide full services hosted from LD8 throughout this period. As far as we are aware, from both our monitoring systems and the lack of support tickets relating to this, those mitigation steps proved successful in maintaining customer connections.

As previously mentioned, a replacement LD8 core router was installed last night (https://status.giga.net.uk/incidents/571qmv2dsxnj), and continued monitoring has shown the network to be stable and performing nominally.

As such we will mark this incident resolved.

We shall publish a reason for outage in the next week; however, it may take longer for the vendor to provide the root cause analysis (RCA) on what triggered the original failure on 1/4/21, and why the first replacement router was dead on arrival (DOA) for the maintenance scheduled on 2/4/21. We will update the reason for outage should we receive further information.

Posted Apr 06, 2021 - 10:11 BST

Update

Tonight's maintenance was successful.
The LD8 router is now fully operational with peering and transit, in addition to leased line carrier NNIs, vDC interconnects and hosted firewalls.
We shall continue to monitor the router closely for the next few hours before marking this incident as resolved.
A further update shall be provided in the morning.

Posted Apr 06, 2021 - 01:24 BST

Update

The replacement router was successfully installed this evening and services have been switched over to using this new device.

We are waiting for a MAC address change before our peering and transit come online. We expect this to happen shortly, which will return all services to a fully operational state.

We are sorry for any inconvenience this may have caused.

Kind Regards

Networks Team

Posted Apr 05, 2021 - 23:31 BST

Update

A replacement router was delivered yesterday, and work shall commence tomorrow afternoon to prepare it for installation.

Emergency maintenance shall be advertised tomorrow confirming the installation timings.

At this time, all customer services remain online with the DR process, although the network remains 'at-risk'.

Posted Apr 04, 2021 - 19:47 BST

Update

Tonight's emergency maintenance to implement the replacement router did not succeed. Further details on this here: https://status.giga.net.uk/incidents/kgzvny2npk9k

A further replacement router is being scheduled for urgent delivery to our data centre.

Emergency maintenance shall be advertised once we have received the replacement router.

Customers routing out of our LD8 data centre (DC) location are continuing to be provided services via our DR scenario and THN DC.

If customers have any problems, please do report these via the usual channels.

Posted Apr 03, 2021 - 02:40 BST

Update

We have received confirmation from our support vendor that arrangements are being made for Juniper to provide a replacement unit directly from their UK stock.

We are awaiting a further update on the ETA.

At this time, services hosted/routing via LD8 continue to operate via the Telehouse North POP.

We shall post a further update once the ETA is known, or if the stability of the network changes.

Posted Apr 01, 2021 - 18:39 BST

Update

We continue to escalate to our support provider's TAC team for a replacement router. As a reminder, our service contract is for a 4hr replacement (24x7x365).

Our support provider are currently making arrangements for a replacement router to be shipped from deeper in their warehousing, specifically from the EU, and are also engaging directly with the device manufacturer to obtain a replacement device ASAP. This is due to supply problems with their UK maintenance pool for this router model.

We currently have no ETA for the replacement router.

We have been informed of the reasons for the supply challenges, and we'll be able to share further in a 'reason for outage' follow-up.

As the arrival of the replacement router is undetermined, we invoked our DR policy earlier this morning to re-provision as much of our customer and core services as possible across to our Telehouse North POP. The majority of this work was completed just before 8am. This meant that the majority of business leased line, MPLS, hosted firewall and private cloud phone system services routing via LD8 remained online.

Shortly before 14:30, we made further progress to re-establish our LD8-based core applications/servers (DNS02/03, NTP02, SMTP01, SIP SBC Gateway 2, RADIUS02). These servers are the duplicated/resilient set of our Telehouse North servers. Due to the extended nature of the outage, we have taken the step of restoring access to them via THN; as such, we report their status as 'operational'.

The entire core network remains 'at-risk' due to the LD8 core router failure, and associated loss of resiliency across the network.

Broadband services have remained fully operational throughout this network incident.

We apologise to our customers for any inconvenience this disruption has caused. We are doing whatever we can to minimise this disruption until such time that we can re-establish our LD8 core router presence. We believe we have now restored normal operations to a large extent.

If any customers have questions about this, please raise them through the usual support channels.

Posted Apr 01, 2021 - 14:58 BST

Monitoring

Shortly after 7:45am, our network engineers completed the bulk of the Disaster Recovery (DR) configuration to extend NNIs and services from LD8 to THN.

This has re-established full connectivity to our leased line carrier NNIs in LD8, and as such we can declare that leased line services, including MPLS services, are operational.

For our MPLS customers, the LD8 hosted firewalls are still down due to the LD8 core router outage; however, the THN high availability devices are operational, and we therefore declare this service operational.

Broadband customers were already routing via Telehouse North, and remain online.

Our core router in LD8 remains offline due to a hardware failure, and thus peering, transit, and other hosted resilient core servers connected through this router are offline. We do have resilient and duplicated infrastructure in Telehouse North which is taking the current load.

The network remains in an 'at-risk' status.

We have once more escalated the replacement of our core router to our support vendor and await their updates.

If customers have any problems, please contact our support team via the usual channels.

Posted Apr 01, 2021 - 08:10 BST

Identified

Following on from tonight’s planned maintenance work (https://status.giga.net.uk/incidents/ccqjtm4301ww), as recommended by our Vendor TAC, our LD8 core router failed during the restart procedure. As a result, our LD8 core router is currently down, along with most (but not all) services hosted here.

Customers with services routing via our Telehouse North site will be unaffected and those with managed backup/failover will be operating via these circuits.

Vendor TAC has recommended a hardware replacement of our LD8 core router, as all attempts so far to recover the router have failed.

Although we have a mission-critical 4hr onsite hardware replacement SLA for this device, and we initiated the request for support shortly after 2am, our support partner are struggling to provide a replacement router within this contracted 4hr timeframe.

As such, we are mitigating the disruption as much as possible by reconfiguring and rerouting traffic including NNIs across our WDM ring to our THN POP. This is invoking our DR scenario.

The time needed to invoke this DR scenario and restore services will depend on how long it takes to reconfigure services across this WDM link to our THN router. We therefore expect that customers whose services terminate in LD8 may experience an outage until late morning on Thursday 1st April.
We are doing everything we can to bring this forward.

We have escalated the hardware replacement with our support vendor to provide a replacement router ASAP and will be making further such escalations in the coming hours.

Services are gradually being restored, including those for DBX hosted voice customers.
Leased line NNIs will follow.
All broadband connections remain unaffected as these route via Telehouse North, our other core data centre.
The entire network is ‘at-risk’ due to this outage.

We will continue to post updates here as our DR plan progresses and Vendor TAC update us.

We apologise for any inconvenience this causes.

Posted Apr 01, 2021 - 06:20 BST

This incident affected: Giganet - Broadband and Internet (Giganet Core - Routing, Giganet Core - Broadband (BNG/LNS), Giganet Core - IP Transit, Giganet Core - Peering, Giganet Core - MPLS PWANs, Giganet Core - Hosted Firewalls, Carrier - ELITE/IGNITE (Leased line)), Giganet - Data Centres & Points of Presence (Giganet Core - London Docklands (LD8)), Giganet - Voice Services (Giganet SIP Trunks - LD8 SBC), Giganet Network 1 - Core Applications (DNS03 Recursive Server, RADIUS02 Server, SMTP02 Relay Server, NTP02 Server), and Contacting us/ Tools/ Portals (Giganet - Partner Portal).