
2025-07-08 Inbound and Outbound call failures on GRR

Written by Steven Spaulding

Updated at July 11th, 2025

Table of Contents

  • Affected Services
  • Event Summary
  • Event Timeline
    • July 8th, 2025
    • July 9th, 2025
  • Root Cause
  • Impact Summary
  • Future Preventative Action
    • Immediate Preventative Action
    • Long-term Action

Affected Services

  • Core Services
    • Grand Rapids (GRR) Voice

Event Summary

A process within a critical system service on the GRR (Grand Rapids) core server became unresponsive, disrupting calls terminating to (inbound) and originating from (outbound) the GRR server. The Network Management Service (NMS) process stopped responding at 2:37 PM ET on July 8th, 2025.
Our monitoring system did not trigger alerts for this specific failure, which delayed identification and resolution. Once the root cause was identified at 3:42 PM ET, the NMS service was immediately restarted to clear the stuck process and restore normal call processing functionality. Device registration failover was not automatically triggered because the process controlling this function remained fully operational.

Event Timeline

July 8th, 2025

2:37 PM ET - We began seeing an influx of tickets reporting inbound call failures and started investigating.
2:57 PM ET - Our engineers began gathering data to identify commonalities and the source of the call failures.
3:14 PM ET - Our engineers isolated the call failures to the GRR server and declared a major incident.
3:42 PM ET - We identified the source of the call processing failures and restarted the NMS service.
3:43 PM ET - The GRR NMS service fully restarted and we observed inbound calls beginning to process successfully.
4:00 PM ET - We continued to observe successful call processing on the GRR server and maintained monitoring.

July 9th, 2025

8:00 AM ET - We continued to observe successful call processing.
4:00 PM ET - After 24 hours of stable call processing, we declared the major incident resolved.

Root Cause

A process within the NMS service that facilitates the setup and breakdown of voice calls had become deadlocked (frozen) and was no longer able to process requests. This caused inbound calls to route directly to voicemail and outbound calls to fail completely.
The process within the NMS service that controls device registration was unaffected, so the automatic device failover mechanism, which keys off registration health, never triggered. This contributed to the extended duration of call failures, as devices remained registered to the affected server.
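
For context on what a deadlock means here: two threads each hold a resource the other needs, so both block indefinitely while the rest of the process keeps running. The internals of the NMS service are vendor-proprietary, so the short Python sketch below (all names hypothetical) only illustrates the general failure mode: the call-setup and call-teardown workers freeze each other, while a registration worker in the same process keeps reporting healthy.

    import threading
    import time

    # Two shared resources inside one (hypothetical) service process.
    routing_table_lock = threading.Lock()
    media_session_lock = threading.Lock()

    def call_setup_worker():
        # Acquires the locks in one order...
        with routing_table_lock:
            time.sleep(0.1)                    # other work interleaves here
            with media_session_lock:           # blocks forever once the
                pass                           # teardown worker holds it

    def call_teardown_worker():
        # ...while this worker acquires them in the opposite order.
        with media_session_lock:
            time.sleep(0.1)
            with routing_table_lock:           # blocks forever: deadlock
                pass

    def registration_worker():
        # Touches neither lock, so it keeps answering as if nothing is wrong,
        # which is why devices never failed over to another core server.
        while True:
            time.sleep(5)
            print("registration heartbeat OK")

    for worker in (call_setup_worker, call_teardown_worker, registration_worker):
        threading.Thread(target=worker, daemon=True).start()

    time.sleep(12)
    # The process is still "up" and registration heartbeats keep printing,
    # yet no new calls can be set up, so a process-level check still reports healthy.
    print("process alive, call-setup and call-teardown threads frozen")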

Impact Summary

  • Voice Services Disruption: Voice services on the GRR server were completely unavailable for 65 minutes (2:37 PM - 3:42 PM ET) due to a critical system service process malfunction
  • Inbound Call Impact: Inbound calls routed directly to voicemail instead of connecting to users
  • Outbound Call Impact: Outbound calls failed completely for users registered to the GRR server
  • Monitoring Gap: The incident was not detected by our automated monitoring systems, delaying identification and resolution

Future Preventative Action

Immediate Preventative Action

  • Enhanced Monitoring: Implemented additional alerts specifically for the NMS process thread to detect when it becomes deadlocked, enabling quicker identification and response (a sketch of this type of check follows this list)
  • Process Validation: Added health checks for critical NMS service processes to prevent similar deadlock scenarios
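
For illustration only (the actual monitoring tooling and addresses are not described in this article, so the host, port, and IP values below are placeholders), a functional health check of this kind can send a SIP OPTIONS request to the signaling service and alert when no reply arrives within a timeout, catching a deadlocked call-processing path even when the process itself is still running.

    import socket
    import sys
    import uuid

    # Placeholder values; substitute the real core server and probe source.
    TARGET_HOST, TARGET_PORT = "grr-core.example.net", 5060
    LOCAL_IP, LOCAL_PORT = "203.0.113.10", 5070
    TIMEOUT_SECONDS = 3

    def probe_call_processing() -> bool:
        """Send a SIP OPTIONS ping; return True if any SIP reply arrives in time."""
        branch = "z9hG4bK" + uuid.uuid4().hex[:16]
        request = (
            f"OPTIONS sip:{TARGET_HOST} SIP/2.0\r\n"
            f"Via: SIP/2.0/UDP {LOCAL_IP}:{LOCAL_PORT};branch={branch}\r\n"
            f"Max-Forwards: 70\r\n"
            f"From: <sip:probe@{LOCAL_IP}>;tag={uuid.uuid4().hex[:8]}\r\n"
            f"To: <sip:{TARGET_HOST}>\r\n"
            f"Call-ID: {uuid.uuid4().hex}\r\n"
            f"CSeq: 1 OPTIONS\r\n"
            f"Content-Length: 0\r\n\r\n"
        )
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", LOCAL_PORT))
        sock.settimeout(TIMEOUT_SECONDS)
        try:
            sock.sendto(request.encode(), (TARGET_HOST, TARGET_PORT))
            reply, _ = sock.recvfrom(4096)
            return reply.startswith(b"SIP/2.0")
        except socket.timeout:
            return False
        finally:
            sock.close()

    if __name__ == "__main__":
        if not probe_call_processing():
            # Exiting non-zero lets a scheduler or monitoring agent raise the alert.
            print("ALERT: GRR signaling did not answer the OPTIONS probe in time")
            sys.exit(1)
        print("GRR signaling answered the OPTIONS probe")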

Long-term Action

  • Vendor Collaboration: Vendor support will investigate the provided data from the time of the process deadlock and implement a fix in a future version of the NMS service
  • Failover Enhancement: Working with vendor support to improve the automatic failover mechanism to account for partial service failures where registration remains active but call processing is impaired (illustrated in the sketch after this list)
  • Monitoring Expansion: Expanding our monitoring coverage to include all critical service processes, not just service-level availability
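
The failover mechanism itself is vendor code, so the following is only a sketch of the behavior described above: the decision to move devices off a core server should consider call-processing health as well as registration health, so that a partial failure like the July 8th deadlock still triggers failover.

    from dataclasses import dataclass

    @dataclass
    class CoreHealth:
        registration_ok: bool      # e.g. REGISTER refreshes still being answered
        call_processing_ok: bool   # e.g. OPTIONS / test-call probe succeeding

    def should_fail_over(health: CoreHealth) -> bool:
        # Previously only a registration failure moved devices, so a deadlocked
        # call-setup process never triggered failover. The enhanced rule fails
        # over when either signal is unhealthy.
        return not (health.registration_ok and health.call_processing_ok)

    # July 8th scenario: registration up, call processing deadlocked -> fail over.
    print(should_fail_over(CoreHealth(registration_ok=True, call_processing_ok=False)))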


Related Articles

  • 2025/01/29 - Inbound and Outbound calls failing on the LAS and GRR servers (Resolved)
  • 2025-04-25 Web Socket connections for ATL, GRR and LAS servers failing to connect
