2025-07-08 Inbound and Outbound call failures on GRR
Affected Services
- Core Services
  - Grand Rapids (GRR) Voice
Event Summary
A process within a critical system service on the GRR (Grand Rapids) core server became unresponsive, disrupting inbound calls terminating on GRR and outbound calls originating from it. The affected Network Management Service (NMS) process stopped responding at 2:37 PM ET on July 8th, 2025.
Our monitoring system did not trigger alerts for this specific failure, which delayed identification and resolution. Once the root cause was identified at 3:42 PM ET, the NMS service was immediately restarted to clear the stuck process and restore normal call processing. Automatic device registration failover did not trigger because the process that controls registration remained fully operational, so devices stayed registered to the affected server.
Event Timeline
July 8th, 2025
2:37 PM ET - We began seeing an influx of tickets reporting inbound call failures and started investigating.
2:57 PM ET - Our engineers began gathering data to identify commonalities and the source of the call failures.
3:14 PM ET - Our engineers determined that the call failures were isolated to the GRR server and declared a major incident.
3:42 PM ET - We identified the root cause of the call processing failures and restarted the NMS service.
3:43 PM ET - The GRR NMS service fully restarted and we observed inbound calls beginning to process successfully.
4:00 PM ET - We continued to observe successful call processing on the GRR server and maintained monitoring.
July 9th, 2025
8:00 AM ET - We continued to observe successful call processing.
4:00 PM ET - After monitoring stable call processing for 24 hours, we declared the major incident resolved.
Root Cause
A process within the NMS service that handles the setup and teardown of voice calls had become deadlocked (frozen) and was no longer able to process requests. This caused inbound calls to route directly to voicemail and outbound calls to fail completely.
The process within the NMS service that controls device registration was unaffected. Because the automatic device failover mechanism triggers only on registration failures, it was never invoked, and devices remained registered to the affected server, which extended the duration of the call failures.
Impact Summary
- Voice Services Disruption: Voice services on the GRR server were completely unavailable for 65 minutes (2:37 PM - 3:42 PM ET) due to a deadlocked process within a critical system service
- Inbound Call Impact: Inbound calls routed directly to voicemail instead of connecting to users
- Outbound Call Impact: Outbound calls failed completely for users registered to the GRR server
- Monitoring Gap: The incident was not detected by our automated monitoring systems, delaying identification and resolution
Preventative Action
Immediate Preventative Action
- Enhanced Monitoring: Implemented additional alerts for the call processing thread within the NMS service to detect when it becomes deadlocked, enabling quicker identification and response
- Process Validation: Added health checks for critical NMS service processes to catch similar deadlock scenarios early (a minimal sketch of this kind of check follows this list)
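To make the process-level checks concrete, below is a minimal sketch of the kind of probe we have in mind: a synthetic application-level request that must be answered within a short timeout, so a deadlocked call processing component is caught even while the service itself still appears to be running. The hostname, port, probe interval, alert threshold, and the use of a SIP OPTIONS message as the probe are illustrative assumptions, not details of the actual NMS interface.

```python
#!/usr/bin/env python3
"""Minimal application-level health probe for a call-processing service.

Illustrative sketch: the host, port, and the choice of SIP OPTIONS as the
probe are assumptions, not a description of the actual NMS interface.
"""
import socket
import sys
import time
import uuid

TARGET_HOST = "grr-core.example.internal"  # placeholder hostname
TARGET_PORT = 5060                         # assumed signaling port
PROBE_TIMEOUT_S = 5                        # how long a healthy service may take to answer
FAIL_THRESHOLD = 3                         # consecutive failures before alerting
PROBE_INTERVAL_S = 30                      # seconds between probes


def build_options_probe(local_ip: str, local_port: int) -> bytes:
    """Build a bare-bones SIP OPTIONS request used purely as a liveness ping."""
    branch = "z9hG4bK" + uuid.uuid4().hex
    return (
        f"OPTIONS sip:probe@{TARGET_HOST} SIP/2.0\r\n"
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch={branch}\r\n"
        f"Max-Forwards: 70\r\n"
        f"From: <sip:healthcheck@{local_ip}>;tag={uuid.uuid4().hex[:8]}\r\n"
        f"To: <sip:probe@{TARGET_HOST}>\r\n"
        f"Call-ID: {uuid.uuid4().hex}@{local_ip}\r\n"
        f"CSeq: 1 OPTIONS\r\n"
        f"Content-Length: 0\r\n\r\n"
    ).encode()


def probe_once() -> bool:
    """Return True only if the service answers the probe within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(PROBE_TIMEOUT_S)
        try:
            sock.connect((TARGET_HOST, TARGET_PORT))
            local_ip, local_port = sock.getsockname()
            sock.send(build_options_probe(local_ip, local_port))
            return bool(sock.recv(4096))  # any response counts as "alive"
        except OSError:
            return False  # timeout or network error: treat as unhealthy


def main() -> None:
    consecutive_failures = 0
    while True:
        consecutive_failures = 0 if probe_once() else consecutive_failures + 1
        if consecutive_failures >= FAIL_THRESHOLD:
            # Replace with a real pager/alerting integration.
            print(f"ALERT: call processing unresponsive on {TARGET_HOST}", file=sys.stderr)
        time.sleep(PROBE_INTERVAL_S)


if __name__ == "__main__":
    main()
```

In practice a probe like this would feed the same alerting pipeline as our existing service-level checks, so a stuck process pages the on-call engineer even when the service itself still reports as running.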
Long-term Action
- Vendor Collaboration: Vendor support will investigate the diagnostic data we provided from the time of the process deadlock and implement a fix in a future version of the NMS service
- Failover Enhancement: Working with vendor support to improve the automatic failover mechanism to account for partial service failures where registration remains active but call processing is impaired (see the sketch after this list)
- Monitoring Expansion: Expanding our monitoring coverage to include process-level health for all critical services, not just service-level availability
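As a rough illustration of the failover enhancement, the sketch below captures the decision logic for the partial-failure case seen in this incident: fail over not only when registration drops, but also when registration stays healthy while call processing has been unresponsive for several consecutive checks. The health-signal names, the check history, and the threshold are hypothetical placeholders, not a description of the vendor's mechanism.

```python
"""Sketch of a failover decision that accounts for partial service failures."""
from dataclasses import dataclass


@dataclass
class ServerHealth:
    registration_ok: bool      # devices can register and stay registered
    call_processing_ok: bool   # call setup and teardown respond in time


def should_fail_over(history: list[ServerHealth], threshold: int = 3) -> bool:
    """Fail over when registration is down, or when registration still looks
    healthy but call processing has failed for `threshold` consecutive checks
    (the partial-failure case seen in this incident)."""
    if not history:
        return False
    if not history[-1].registration_ok:
        return True  # classic case: the existing mechanism already covers this
    recent = history[-threshold:]
    return len(recent) == threshold and all(
        h.registration_ok and not h.call_processing_ok for h in recent
    )


if __name__ == "__main__":
    # Registration stayed up while call processing was stuck,
    # which mirrors the GRR failure mode.
    checks = [
        ServerHealth(registration_ok=True, call_processing_ok=True),
        ServerHealth(registration_ok=True, call_processing_ok=False),
        ServerHealth(registration_ok=True, call_processing_ok=False),
        ServerHealth(registration_ok=True, call_processing_ok=False),
    ]
    print(should_fail_over(checks))  # True under the improved policy
```

The second branch is the gap this incident exposed: because the existing mechanism keys only on registration health, a server can keep its devices registered while failing every call.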