2025-06-04 Loss of Communication to GRR Server
Affected Services
- Grand Rapids (GRR) Voice
Event Summary
Communication failures to the GRR server resulted in two brief service interruptions, during which device registrations and inbound calls automatically failed over to alternate servers. The root cause was degraded network performance from the upstream ISP (Internet Service Provider), specifically packet loss on their Lumen/Level3 peer connection, which destabilized the site-to-site VPN (Virtual Private Network) tunnels and routing protocols.
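For context on the failover behavior described above, the following is a minimal sketch of how a device might prefer its primary registration server and fall back to alternates when the primary is unreachable. The hostnames, port, timeout, and the simple TCP reachability check are illustrative assumptions, not our production SIP configuration.

```python
# Illustrative only: simplified model of registration failover between a
# primary voice server and ordered alternates. Hostnames, port, and timeout
# are assumptions for this example, not production values.
import socket

PRIMARY = "grr.example.net"            # hypothetical primary (GRR) server
ALTERNATES = ["alt1.example.net",      # hypothetical alternate servers
              "alt2.example.net"]
SIP_PORT = 5060
TIMEOUT_SECONDS = 2


def reachable(host: str) -> bool:
    """Return True if a TCP connection to the server succeeds within the timeout."""
    try:
        with socket.create_connection((host, SIP_PORT), timeout=TIMEOUT_SECONDS):
            return True
    except OSError:
        return False


def choose_registrar() -> str:
    """Prefer the primary server; fall back to the first reachable alternate."""
    for host in [PRIMARY] + ALTERNATES:
        if reachable(host):
            return host
    raise RuntimeError("no registrar reachable")


if __name__ == "__main__":
    print("registering to", choose_registrar())
```

Because the primary is always tried first, devices naturally return to GRR once communication is restored, which matches the re-registration behavior noted in the timeline below.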
Event Timeline
June 4, 2025
10:30 AM ET – Our monitoring system alerted us that the GRR server was unreachable for two minutes, resulting in device registrations and inbound calls failing over to other servers.
10:36 AM ET – Communication to the GRR server was restored and all device registrations returned to it.
10:51 AM ET – Communication with the GRR server was lost a second time.
10:54 AM ET – Communication with the GRR server was restored and we began seeing devices register back to the GRR server.
11:01 AM ET – All devices had returned to the GRR server. The decision was made to continue monitoring rather than taking GRR offline, as our vendor confirmed actions had been taken on their end to prevent further communication loss.
11:30 AM ET – The GRR server remained stable.
12:22 PM ET – Working with our vendor, we determined that the loss of communication was due to a degraded circuit. The affected circuit was removed from the routing profile (a conceptual sketch of this step follows the timeline), and communication with the GRR server remained stable after this change. We moved to a monitoring phase.
12:50 PM ET – GRR continued to show stability. Monitoring continued.
5:00 PM ET – We received notice from our vendor that the data center would be performing maintenance and repair to the MPLS (Multiprotocol Label Switching) configuration overnight.
June 5, 2025
2:00 AM ET – The vendor and data center began maintenance on the MPLS path.
5:32 AM ET – We received confirmation that maintenance had been completed and the MPLS path had been returned to service with all components showing healthy status.
8:30 AM ET – We confirmed continued stability of the GRR server.
1:00 PM ET – After 24 hours of sustained stability, the incident was marked as resolved.
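To illustrate the 12:22 PM step of removing the degraded circuit from the routing profile, below is a minimal conceptual sketch that excludes any path whose measured packet loss exceeds a threshold. The circuit names, loss figures, and 5% threshold are assumptions for illustration; the actual change was made in the vendor's routing configuration, not in software like this.

```python
# Illustrative only: conceptual model of removing a degraded circuit from the
# routing profile by excluding any path whose measured packet loss exceeds a
# threshold. Circuit names, loss figures, and the threshold are assumptions.
MEASURED_LOSS_PCT = {
    "circuit-a": 0.1,    # hypothetical healthy path
    "circuit-b": 12.4,   # hypothetical degraded path toward the upstream peer
}
LOSS_THRESHOLD_PCT = 5.0


def routing_profile(loss_by_circuit: dict[str, float]) -> list[str]:
    """Return only the circuits that should remain in the routing profile."""
    return [name for name, loss in loss_by_circuit.items()
            if loss <= LOSS_THRESHOLD_PCT]


if __name__ == "__main__":
    print("circuits kept in the routing profile:", routing_profile(MEASURED_LOSS_PCT))
```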
Root Cause
The upstream ISP for the GRR site experienced degraded network performance in the region. The ISP reported packet loss on its upstream Lumen/Level3 peer connection, which required BGP (Border Gateway Protocol) changes to reroute traffic and restore connectivity.
The packet loss cascaded into the site-to-site VPN tunnels and routing protocols: instability in the underlying MPLS label-switched paths destabilized the BGP sessions, which in turn caused the intermittent loss of communication with the GRR server.
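As an illustration of the kind of signal involved, the following is a minimal sketch of measuring packet loss toward a path's far-end address and flagging the path as degraded. The target address, probe count, 5% threshold, and reliance on a Unix-like ping command are assumptions for the example, not the vendor's actual tooling.

```python
# Illustrative only: measure packet loss toward a path's far-end address with
# ICMP ping and flag the path as degraded above a loss threshold. The target
# address, probe count, and 5% threshold are assumptions for this example.
import re
import subprocess

TARGET = "203.0.113.1"      # hypothetical far-end address of the circuit (TEST-NET-3)
PROBES = 20
LOSS_THRESHOLD_PCT = 5.0


def measure_loss_pct(target: str, probes: int) -> float:
    """Run ping and parse the 'X% packet loss' figure from its summary output."""
    result = subprocess.run(
        ["ping", "-c", str(probes), target],
        capture_output=True, text=True,
    )
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", result.stdout)
    if not match:
        raise RuntimeError("could not parse ping output")
    return float(match.group(1))


if __name__ == "__main__":
    loss = measure_loss_pct(TARGET, PROBES)
    status = "DEGRADED" if loss > LOSS_THRESHOLD_PCT else "healthy"
    print(f"{TARGET}: {loss:.1f}% loss -> {status}")
```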
Impact Summary
Device registration and voice services on the GRR server experienced two brief interruptions due to communication loss with the server. During these periods, automatic failover successfully redirected services to alternate servers, maintaining service continuity for affected users. Total combined downtime was approximately 9 minutes across the two interruptions (roughly 10:30 to 10:36 AM ET and 10:51 to 10:54 AM ET).
Future Preventative Action
Immediate preventative actions taken:
- The degraded circuit was immediately removed from the routing profile to prevent further instability
- Enhanced monitoring was implemented during the recovery period to ensure sustained stability
Long-term actions:
- We are working with our ISP and data center partners to implement additional redundancy measures and improved monitoring of upstream peer connections
- We are reviewing our current failover thresholds and timing to optimize automatic recovery processes
- We are enhancing our alerting protocols to provide earlier notification of upstream network degradation (a sketch of one such check follows this list)
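As a rough illustration of the threshold-and-timing idea in the last two items, below is a minimal sketch of an alerting check that fires only after several consecutive degraded samples. The sample values, 5% loss threshold, and three-sample streak are assumptions, not our production tuning.

```python
# Illustrative only: a small threshold/hysteresis model for deciding when to
# alert on a degraded path. The loss samples and the three-consecutive-failure
# streak are assumptions for this example, not production tuning.
from dataclasses import dataclass


@dataclass
class DegradationDetector:
    """Raise an alert only after N consecutive bad samples to avoid flapping."""
    consecutive_bad_needed: int = 3
    _bad_streak: int = 0

    def observe(self, loss_pct: float, loss_threshold_pct: float = 5.0) -> bool:
        """Feed one packet-loss sample; return True when an alert should fire."""
        if loss_pct > loss_threshold_pct:
            self._bad_streak += 1
        else:
            self._bad_streak = 0
        return self._bad_streak >= self.consecutive_bad_needed


if __name__ == "__main__":
    detector = DegradationDetector()
    samples = [0.0, 1.2, 6.5, 7.1, 8.0, 0.5]   # hypothetical loss percentages per minute
    for minute, loss in enumerate(samples):
        if detector.observe(loss):
            print(f"minute {minute}: ALERT - sustained packet loss ({loss:.1f}%)")
```

Requiring a short streak of bad samples is a common way to balance earlier notification against alert noise from transient loss.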