Corrective Action Update for the Heroku June 10th Outage
- Last Updated: September 05, 2025
Starting at 06:00 UTC on Tuesday, June 10, 2025, Heroku customers experienced a platform service disruption caused by an unintended system update that our vendor applied to our production infrastructure. To compound the issue, the Heroku Status site was itself affected by the outage. Shortcomings in site design and API latency resulted in timeouts, so the Status site appeared to show no active incidents.
On June 15th we published a summary of our initial investigation, mitigation, and root cause analysis. We also identified the following post-incident remediation objectives:
- Ensuring immutable infrastructure
- Increasing resilience of communication channels
- Accelerating investigation and recovery
As promised, we are providing a status update on our continued corrective actions.
Ensuring Immutable Infrastructure
Our June 15th commitment to customers
The root cause of this outage was an unexpected change to our running environment. We disabled the automated upgrade service during the incident (June 10), with permanent controls coming early the next week. No system changes will occur outside our controlled deployment process going forward. Additionally, we’re auditing all base images for similar risks and improving our network services to handle graceful service restarts.
Where we are today
To ensure that future system changes occur only in a controlled manner, we:
- Implemented a permanent halt on all unattended vendor operating system upgrades
- Audited our images to rule out any other sources of mutation
- Developed a risk-based strategy for what types of changes could be safely automated as an attended upgrade
For network resiliency, we added automated startup scripts for our networking services. We are also actively working with our colleagues to help maintain and validate our system images.
Increasing Resilience of Communication Channels
Our June 15th commitment to customers
Our status page failed you when you needed it most because our primary communication tools were affected by the outage. We are building backup communication channels that are fully independent to ensure we can always provide timely and transparent updates, even in a worst-case scenario.
Our approach
Our objective is to move as quickly as possible while providing a smooth transition for customer Status site integrations and without compromising our internal operational safeguards.
Where we are today
We immediately added CDN caching to the Heroku Status site for resiliency and optimized our page load state to eliminate the appearance of false negatives. We are methodically migrating our internal and customer-facing integrations to the Salesforce Trust site, including internal release gating, CLI, and App Metrics integrations. We are also working on a formalized backup incident communications channel for business continuity. On the process side, new Trust site templates and incident commander protocols have been prepared. Heroku has aligned with global incident commander protocols, which require at least one update every 30 minutes for active Sev-0 incidents and at least one update every 60 minutes for Sev-1 and Sev-2 incidents. The Heroku Status site configuration will be fully migrated to the Salesforce Trust site. Beginning on Oct 10th, the Salesforce Trust site will serve as the primary channel for all incident and maintenance communications.
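For customers who track incidents in their own tooling, the update cadence above can be expressed as a small check. The following is a minimal sketch only: the severity labels, function names, and data layout are illustrative assumptions, not part of any Heroku or Salesforce API.

```python
from datetime import datetime, timedelta, timezone

# Maximum allowed gap between incident updates, taken from the cadence
# stated above. The "sev-N" keys are an assumed labeling scheme.
MAX_UPDATE_GAP = {
    "sev-0": timedelta(minutes=30),
    "sev-1": timedelta(minutes=60),
    "sev-2": timedelta(minutes=60),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return the latest time by which the next update should be posted."""
    return last_update + MAX_UPDATE_GAP[severity.lower()]


def is_update_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """Return True if the cadence for this severity has been missed."""
    return now > next_update_due(severity, last_update)


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    last = datetime(2025, 6, 10, 6, 0, tzinfo=timezone.utc)
    now = last + timedelta(minutes=45)
    print(is_update_overdue("sev-0", last, now))
    print(is_update_overdue("sev-1", last, now))
```

A check like this could feed an internal alert when an expected update has not arrived; how incident severity and timestamps are obtained is left open here.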
What customers should expect
Incident and Maintenance Notifications
Customers who are currently subscribed to the Heroku Status site will be sent an email to confirm their intent to remain subscribed to incident notifications. Any Status site subscribers who don’t explicitly opt out will be automatically subscribed to the new Trust site.
Status API Migration
We are working on a longer-term Status API migration strategy to minimize disruption for customers with Status API integrations. We will keep Heroku customers informed of future migration expectations, provide migration guidance, and ensure that a minimum of 30 days is provided for customers to migrate their Status API integrations.
Migration instructions and guidance
We will provide Status site migration updates and guidance through the following communication channels:
- Emails to Status site subscribers
- Heroku changelog
- Heroku DevCenter Status site page
- Heroku Status site scheduled maintenance event
Accelerating Investigation and Recovery
Our June 15th commitment to customers
The time it took to diagnose and resolve this incident was unacceptable. To address this, we are overhauling our incident response tooling and processes. This includes building new tools and improving existing ones to help engineers diagnose issues faster and run queries across our entire fleet at scale. We are also streamlining our “break-glass” procedures to ensure teams have rapid access to critical systems during an emergency and enhancing our monitoring to detect complex issues much earlier.
Where we are today
We enhanced our testing and monitoring to more effectively prevent, detect, and diagnose issues, including the addition of:
- Automated regression testing for dyno-to-dyno communications
- Canaries for long-running applications
- Additional monitoring and alerting for our monitoring and alerting orchestration service
- Streamlined monitoring for our platform log drain collection and forwarding service
- Improved introspection and monitoring for our customer notifications service
We are investigating the feasibility of monitoring operating system drift. Additionally, we plan to add canaries for dyno network connectivity.
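As a rough illustration of what operating system drift monitoring involves, the sketch below compares a host's observed package versions against a known-good baseline and reports any differences. It is not a description of Heroku's tooling; the function name, the package names, and the versions are hypothetical, and a real system would also need to collect the manifests from each host.

```python
from typing import Dict, Optional, Tuple

# Maps a package name to its (expected, observed) versions.
# None means the package is absent on that side.
Drift = Dict[str, Tuple[Optional[str], Optional[str]]]


def find_drift(baseline: Dict[str, str], observed: Dict[str, str]) -> Drift:
    """Return packages whose observed version differs from the baseline.

    Covers changed versions, packages missing from the host, and
    packages present on the host but not in the baseline.
    """
    drift: Drift = {}
    for name in sorted(baseline.keys() | observed.keys()):
        expected = baseline.get(name)
        actual = observed.get(name)
        if expected != actual:
            drift[name] = (expected, actual)
    return drift


if __name__ == "__main__":
    # Hypothetical manifests, for illustration only.
    baseline = {"openssl": "3.0.2", "systemd": "249.11"}
    observed = {"openssl": "3.0.2", "systemd": "249.12", "curl": "7.81.0"}
    for name, (expected, actual) in find_drift(baseline, observed).items():
        print(f"{name}: expected {expected}, observed {actual}")
```

An empty result means the host matches its baseline; any entry indicates a change that happened outside the controlled deployment process.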
To reduce the time to issue detection and remediation, we streamlined authorized engineers’ access to Private Spaces and dynos to conduct investigations. We are also working on safe processes at scale to expedite the detection and remediation of configuration-caused incidents.
We streamlined our “break-glass” tooling, and are in the process of revising related procedures for all core services.
Our ongoing commitment
We greatly appreciate the opportunity to serve our customers, and we are committed to ensuring that an outage of this magnitude, and the accompanying lapse in communications, never happens again. We will continue to improve our processes, platform monitoring, performance, and resilience even after we have completed our identified corrective actions. We will keep you informed on the progress of pending corrective actions, including the Trust site migration.