MSO - Telehouse North (THN) Core Router Instability and Outage
Incident Report for Giganet Status Page
Resolved
After a further period of monitoring, we have determined this incident to be resolved.

Juniper TAC have now taken over the investigations of the router log files.

A postmortem of this problem will be published here once those investigations and our own have concluded.

We apologise once again for the outage this morning.
Posted Feb 24, 2021 - 20:50 GMT
Update
We are continuing to see good traffic levels and performance since the THN Juniper MX router process restart.

Preliminary timeline of events (subject to final postmortem confirmation) all times GMT:
10:52: THN core router problems start, alerts triggered
10:57: MSO raised, and communicated on our Network Status Page
11:10: Some routing and connections start to recover in THN by themselves.
11:27: Decision taken to restart the core routing process on the THN router, as memory utilisation was still far too high and not all traffic was routing optimally. Logs and information were captured prior to this, ready for the JTAC case.
11:28: THN routing processes restarted.
11:35: THN core routing for transit, peering and THN-homed leased line sessions reestablished, and traffic levels normalise.
11:55: Final broadband sessions restored.

We are now following up with Juniper TAC to understand what caused the memory issues and this incident.
Some broadband sessions did not restore until 11:55 and did not fail over to our secondary data centre; this is a separate problem from the router incident. We identified a configuration problem responsible for this, and resolved it immediately once aware.

A postmortem covering all aspects of today's incident will be published in the coming days, after our engagement with Juniper TAC (JTAC).

We will continue to monitor our network closely today, and we'll post further updates here over the day to confirm the status.

---------
We're sorry for the interruption to your Giganet and M12 Solutions services today, caused by this incident.

This was clearly an unprecedented outage, affecting one of our core data centre sites and its equipment.
The affected THN Juniper MX core router had been operational for 629 days prior to this incident, and was otherwise performing well beforehand.
This is the first unplanned core router outage we've experienced in over 7 years of operating a Juniper core network. (Disclaimer: apart from that caused by last August's Equinix LD8 complete power outage, which affected 100s of other ISPs and cloud service providers - https://status.giga.net.uk/incidents/3zcfz8s8g43h).

Clearly any outage is a bad outage, and so we're going to be taking lots of key learnings from this incident.
Being connected is even more important right now, so we understand the impact of this incident and hope to restore your trust in Giganet.

If you have any questions or concerns, please do raise these with your account manager or reach out to our support team.

Matthew Skipsey
CTO
Posted Feb 24, 2021 - 13:29 GMT
Monitoring
We restarted the THN core routing process around 35 minutes ago, and within 5-10 minutes of this we observed more stability and reestablishment of connections.

We saw the final broadband connections on some VRF MPLS connections restore 5 minutes ago. Sadly these didn't appear to gracefully fail over to LD8, so we shall investigate this as a separate issue.

If you are still experiencing problems with your connection, please reboot your connection.
If this fails to restore service, please contact us.

We apologise for the inconvenience this outage caused. It's our priority to ensure stability right now.
Posted Feb 24, 2021 - 12:01 GMT
Update
We have manually steered broadband sessions across to our LD8 router and are in the process of attempting a recovery of our THN router.
Posted Feb 24, 2021 - 11:28 GMT
Update
We are seeing things restore, but our engineers are still investigating.
We are currently investigating a potential memory issue.
Posted Feb 24, 2021 - 11:13 GMT
Investigating
We are currently investigating a major service outage affecting our Telehouse North Site.

Engineers are currently investigating.
Posted Feb 24, 2021 - 10:57 GMT
This incident affected: Giganet - Broadband and Internet (Giganet Core - Routing, Giganet Core - Broadband (BNG/LNS)) and Giganet - Data Centres & Points of Presence (Giganet Core - London Docklands (THN)).