2025-07-08 Inbound and Outbound call failures on GRR
Affected Services
- Core Services
  - Grand Rapids (GRR) Voice
Event Summary
A process within a critical system service on the GRR (Grand Rapids) core server became unresponsive, disrupting inbound calls terminating on GRR and outbound calls originating from it. The affected Network Management Service (NMS) process stopped responding at 2:37 PM ET on July 8th, 2025.
Our monitoring system did not trigger alerts for this specific failure, which delayed identification and resolution. Once the root cause was identified at 3:42 PM ET, the NMS service was immediately restarted to clear the stuck process and restore normal call processing. Automatic device registration failover did not trigger because the process that controls registration remained fully operational, so devices stayed registered to the affected server.
Event Timeline
July 8th, 2025
2:37 PM ET - We began seeing an influx of tickets reporting inbound call failures and started investigating.
2:57 PM ET - Our engineers began gathering data to identify commonalities and the source of the call failures.
3:14 PM ET - Our engineers determined that the call failures were isolated to the GRR server and declared a major incident.
3:42 PM ET - We identified the root cause of the call processing failures and restarted the NMS service.
3:43 PM ET - The GRR NMS service fully restarted and we observed inbound calls beginning to process successfully.
4:00 PM ET - We continued to observe successful call processing on the GRR server and maintained monitoring.
July 9th, 2025
8:00 AM ET - We continued to observe successful call processing.
4:00 PM ET - After monitoring stable call processing for 24 hours, we declared the major incident resolved.
Root Cause
A process within the NMS service that handles the setup and teardown of voice calls had become deadlocked (frozen) and was no longer able to process requests. This caused inbound calls to route directly to voicemail and outbound calls to fail completely.
The process within the NMS service that controls device registration was unaffected. Because the automatic device failover mechanism triggers only on registration failures, it was never invoked, and devices remained registered to the affected server, which extended the duration of the call failures.
Impact Summary
- Voice Services Disruption: Voice services on the GRR server were completely unavailable for 65 minutes (2:37 PM - 3:42 PM ET) due to a deadlocked process within a critical system service
- Inbound Call Impact: Inbound calls routed directly to voicemail instead of connecting to users
- Outbound Call Impact: Outbound calls failed completely for users registered to the GRR server
- Monitoring Gap: The incident was not detected by our automated monitoring systems, delaying identification and resolution
Preventative Action
Immediate Preventative Action
- Enhanced Monitoring: Implemented additional alerts for the call processing thread within the NMS service to detect when it becomes deadlocked, enabling quicker identification and response
- Process Validation: Added health checks for critical NMS service processes to catch similar deadlock scenarios early (a minimal sketch of this kind of check follows this list)
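To make the process-level checks concrete, below is a minimal sketch of the kind of probe we have in mind: a synthetic application-level request that must be answered within a short timeout, so a deadlocked call processing component is caught even while the service itself still appears to be running. The hostname, port, probe interval, alert threshold, and the use of a SIP OPTIONS message as the probe are illustrative assumptions, not details of the actual NMS interface.

```python
#!/usr/bin/env python3
"""Minimal application-level health probe for a call-processing service.

Illustrative sketch: the host, port, and the choice of SIP OPTIONS as the
probe are assumptions, not a description of the actual NMS interface.
"""
import socket
import sys
import time
import uuid

TARGET_HOST = "grr-core.example.internal"  # placeholder hostname
TARGET_PORT = 5060                         # assumed signaling port
PROBE_TIMEOUT_S = 5                        # how long a healthy service may take to answer
FAIL_THRESHOLD = 3                         # consecutive failures before alerting
PROBE_INTERVAL_S = 30                      # seconds between probes


def build_options_probe(local_ip: str, local_port: int) -> bytes:
    """Build a bare-bones SIP OPTIONS request used purely as a liveness ping."""
    branch = "z9hG4bK" + uuid.uuid4().hex
    return (
        f"OPTIONS sip:probe@{TARGET_HOST} SIP/2.0\r\n"
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch={branch}\r\n"
        f"Max-Forwards: 70\r\n"
        f"From: <sip:healthcheck@{local_ip}>;tag={uuid.uuid4().hex[:8]}\r\n"
        f"To: <sip:probe@{TARGET_HOST}>\r\n"
        f"Call-ID: {uuid.uuid4().hex}@{local_ip}\r\n"
        f"CSeq: 1 OPTIONS\r\n"
        f"Content-Length: 0\r\n\r\n"
    ).encode()


def probe_once() -> bool:
    """Return True only if the service answers the probe within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(PROBE_TIMEOUT_S)
        try:
            sock.connect((TARGET_HOST, TARGET_PORT))
            local_ip, local_port = sock.getsockname()
            sock.send(build_options_probe(local_ip, local_port))
            return bool(sock.recv(4096))  # any response counts as "alive"
        except OSError:
            return False  # timeout or network error: treat as unhealthy


def main() -> None:
    consecutive_failures = 0
    while True:
        consecutive_failures = 0 if probe_once() else consecutive_failures + 1
        if consecutive_failures >= FAIL_THRESHOLD:
            # Replace with a real pager/alerting integration.
            print(f"ALERT: call processing unresponsive on {TARGET_HOST}", file=sys.stderr)
        time.sleep(PROBE_INTERVAL_S)


if __name__ == "__main__":
    main()
```

In practice a probe like this would feed the same alerting pipeline as our existing service-level checks, so a stuck process pages the on-call engineer even when the service itself still reports as running.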
Long-term Action
- Vendor Collaboration: Vendor support will investigate the diagnostic data we provided from the time of the process deadlock and implement a fix in a future version of the NMS service
- Failover Enhancement: Working with vendor support to improve the automatic failover mechanism to account for partial service failures where registration remains active but call processing is impaired (see the sketch after this list)
- Monitoring Expansion: Expanding our monitoring coverage to include process-level health for all critical services, not just service-level availability
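As a rough illustration of the failover enhancement, the sketch below captures the decision logic for the partial-failure case seen in this incident: fail over not only when registration drops, but also when registration stays healthy while call processing has been unresponsive for several consecutive checks. The health-signal names, the check history, and the threshold are hypothetical placeholders, not a description of the vendor's mechanism.

```python
"""Sketch of a failover decision that accounts for partial service failures."""
from dataclasses import dataclass


@dataclass
class ServerHealth:
    registration_ok: bool      # devices can register and stay registered
    call_processing_ok: bool   # call setup and teardown respond in time


def should_fail_over(history: list[ServerHealth], threshold: int = 3) -> bool:
    """Fail over when registration is down, or when registration still looks
    healthy but call processing has failed for `threshold` consecutive checks
    (the partial-failure case seen in this incident)."""
    if not history:
        return False
    if not history[-1].registration_ok:
        return True  # classic case: the existing mechanism already covers this
    recent = history[-threshold:]
    return len(recent) == threshold and all(
        h.registration_ok and not h.call_processing_ok for h in recent
    )


if __name__ == "__main__":
    # Registration stayed up while call processing was stuck,
    # which mirrors the GRR failure mode.
    checks = [
        ServerHealth(registration_ok=True, call_processing_ok=True),
        ServerHealth(registration_ok=True, call_processing_ok=False),
        ServerHealth(registration_ok=True, call_processing_ok=False),
        ServerHealth(registration_ok=True, call_processing_ok=False),
    ]
    print(should_fail_over(checks))  # True under the improved policy
```

The second branch is the gap this incident exposed: because the existing mechanism keys only on registration health, a server can keep its devices registered while failing every call.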