
Thank you for stopping by to read The Common Reality. Every Sunday, I drop a new article on Engineering Management, AI in the workplace, or navigating the corporate world. Engagement with these articles helps me out quite a bit, so if you enjoy what I wrote, please consider liking, commenting, or sharing.
Today’s post is a chapter from my new field manual on Technical Incident Management. Dazzle your colleagues with your elite-level skills during your team’s next incident. All you need to do is invest about an hour. The plan is to publish by the end of the month, so stay tuned!
Chapter 4: Ongoing Communication
Once the initial wave of communication has gone out and the team is heads-down working, the Incident Manager’s role shifts toward maintaining a steady, predictable cadence of updates (outlined in Chapter 3, Step 3: Communicate Early and Often). These updates are not just a formality; they are the primary mechanism for building trust and reducing noise across the organization.
The Incident Manager has several other types of communication that must be distributed both during the incident and after the incident has concluded. Strong communication skills are a hallmark of great leadership. By prioritizing these different messages, the Incident Manager can turn even the most embarrassing incident into a career-advancing opportunity.
Your mantra must be: silence is the enemy.
Executive Notifications
Amazon’s Correction of Error (“COE”) process is a detailed, complete mechanism for studying adverse events and improving the operational posture of an organization. While COEs can cover a range of scenarios, business and technical, they all start with a briefing that is distributed either during an incident or immediately after the incident concludes. This piece of communication is commonly referred to as a “Quick Hit.”
The target audience for this type of communication is the immediate leadership chain of the Incident Manager. To make it easy, consider the chain to be the next two levels up from the Incident Manager’s position on the organizational chart. The Incident Manager should assume that these two individuals will not be the only ones reading this briefing. It will likely be forwarded further up the management chain as well as laterally across departments and business units.
Write these briefings with facts in mind. This is not the time to be political or point fingers. Peer teams will appreciate your attention to the facts as well as the ownership of the incident being demonstrated. Do not burn their trust or ruin relationships from a place of frustration.
A well-structured briefing includes:
The current understanding of the time bounds of the incident, expressed with date, time, and timezone (e.g., from 7:31 AM ET 4/1/2025 to 9:42 AM ET 4/1/2025).
The scope of the impact, defined in both percentage terms and real terms so that there is no misinterpretation of the event (e.g., “25% of users unable to log in; 4 of 16 total attempts”).
A plain-language summary of the issue. Again, this is not the time to render judgment or dazzle readers with technical terminology. Assume that the issue will need to be understood by an executive with no background in technology.
Whether customer data is at risk (yes/no)
Whether external communications have been triggered and, if so, when the first communication was distributed
An initial determination of whether or not a root cause analysis needs to be performed. If the determination is “yes,” the Incident Manager should link to where the analysis will be posted, along with a timeline to publish.
The following example illustrates a briefing that aligns with the above requirements:
On April 20, 2025, from 11:15 AM ET to 12:05 PM ET, approximately 25% of requests to the Payment API (195 of 783) had response times greater than 3 seconds. The payment application times out API requests after three seconds, so affected customers were unable to complete the checkout experience. Customers were presented with an error banner indicating that they would need to try the transaction again. 54 customers were impacted during the event; no customer data was lost or exposed. An email to impacted customers has been drafted and is being reviewed for distribution with a target of April 20, 2025, at 5:00 PM ET.
The issue was mitigated by rolling back a recent change to the Payment API’s database configuration, which was designed to optimize database connections. The team will investigate why these configuration changes did not produce a similar error in the QA environment during testing. A report of the findings will be available on May 2, 2025, here.1
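The briefing checklist above could also be captured as a small template so that no field is forgotten under pressure. The sketch below is one hypothetical way to do that in Python. The `QuickHit` class, its field names, and the fixed `ET` offset are illustrative assumptions, not part of Amazon’s COE process, and the report link is left as a placeholder.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: a fixed UTC-4 offset labeled "ET" keeps the sketch dependency-free.
# A real tool would likely use a proper timezone database.
ET = timezone(timedelta(hours=-4), "ET")


@dataclass
class QuickHit:
    """Hypothetical container for the briefing fields listed above."""
    start: datetime
    end: datetime
    affected: int                  # e.g. slow or failed requests
    total: int                     # e.g. all requests in the window
    unit: str                      # what is being counted
    summary: str                   # plain-language description
    customer_data_at_risk: bool
    external_comms_sent: Optional[datetime]  # None if nothing has gone out
    rca_required: bool
    rca_link: Optional[str] = None
    rca_due: Optional[str] = None

    def impact(self) -> str:
        """Scope of impact in both percentage and real terms."""
        pct = round(100 * self.affected / self.total)
        return f"{pct}% ({self.affected} of {self.total} {self.unit})"

    def render(self) -> str:
        fmt = "%I:%M %p %Z %m/%d/%Y"
        comms = (self.external_comms_sent.strftime(fmt)
                 if self.external_comms_sent else "not yet sent")
        rca = "no"
        if self.rca_required:
            rca = f"yes, due {self.rca_due} at {self.rca_link}"
        return "\n".join([
            f"Window: {self.start.strftime(fmt)} to {self.end.strftime(fmt)}",
            f"Impact: {self.impact()}",
            f"Summary: {self.summary}",
            f"Customer data at risk: {'yes' if self.customer_data_at_risk else 'no'}",
            f"External communication: {comms}",
            f"Root cause analysis: {rca}",
        ])


# Values taken from the example briefing above.
brief = QuickHit(
    start=datetime(2025, 4, 20, 11, 15, tzinfo=ET),
    end=datetime(2025, 4, 20, 12, 5, tzinfo=ET),
    affected=195,
    total=783,
    unit="requests",
    summary="Payment API responses exceeded the 3-second timeout at checkout.",
    customer_data_at_risk=False,
    external_comms_sent=None,
    rca_required=True,
    rca_link="(link to report location)",
    rca_due="May 2, 2025",
)
print(brief.render())
```

Requiring both `affected` and `total` forces the author to state the real numbers behind the percentage, which is the point of reporting scope in both forms.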
Stakeholder Engagement
Stakeholders come in many forms: customer support, product owners, marketing, sales, partner teams, or external vendors. The Incident Manager is not responsible for answering everyone’s questions in real-time, but they are responsible for ensuring updates flow in the right direction.
Some tips:
Post updates in known, visible places (Slack/Teams channels, status pages, email distribution lists).
Keep a list of primary internal stakeholders for each service. Update this list quarterly.
Do not rely on other teams to “check the ticket.” Communication is the sender’s job. The ticket exists as an audit record.
If updates are delayed or uncertain, say so. Your stakeholders would rather hear “no new updates” than hear nothing at all. There is no magic here. Most of your stakeholders either understand what dealing with an incident is like or assume some massive technical situation is happening. They just don’t want to be ignored. Being proactive is a superpower here, and the bonus is that it is easy to wield.
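The tip about keeping a per-service stakeholder list and reviewing it quarterly lends itself to a simple check. The following is a minimal sketch, assuming the list lives in a plain dictionary; the service names, contact labels, and the 90-day reading of “quarterly” are all hypothetical.

```python
from datetime import date, timedelta

REVIEW_WINDOW = timedelta(days=90)  # assumption: "quarterly" read as 90 days

# Hypothetical registry: service name -> (date last reviewed, contacts)
stakeholders = {
    "payment-api": (date(2025, 1, 6), ["support-lead", "product-owner"]),
    "login-service": (date(2025, 4, 1), ["support-lead", "partner-team"]),
}


def stale_services(registry, today):
    """Return the services whose stakeholder list is overdue for review."""
    return sorted(
        name
        for name, (reviewed, _contacts) in registry.items()
        if today - reviewed > REVIEW_WINDOW
    )


print(stale_services(stakeholders, date(2025, 4, 20)))  # ['payment-api']
```

Running a check like this on a schedule keeps the list from silently going stale between incidents.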
Emotional Balance During Communication
The tone of the updates matters. A chaotic-sounding update creates more stress. A dry, vague one can come across as dismissive. The Incident Manager’s goal is to sound calm, factual, and confident, even if everything behind the scenes is still uncertain. The Incident Manager needs to control the narrative.
Avoid being the author of an update who has already decided they are going to be fired for this incident. Incidents can be very stressful events, but very rarely does a single incident result in termination. Even the Apple executive directly responsible for the disaster that was the initial rollout of Apple Maps wasn’t going to be fired for that incident until he refused to apologize for the rollout.2
Firing off ad hoc email updates with incomplete information and emotional undertones will be career-limiting for those who opt to go down that path. The Incident Manager who demonstrates ownership over the situation, details the steps that are being taken to move toward resolution, acknowledges the impact on other teams and customers, and regularly updates their stakeholders will absolutely be viewed as someone who needs to be retained.
Here’s what to avoid:
Overly technical updates that mean nothing to non-engineers
Updates that are too casual for serious incidents (“still poking around…”)
Panic-inducing updates that exaggerate severity or uncertainty
Handling Long-running or Ambiguous Incidents
Some incidents won’t follow a clean arc. Most will, but for a small percentage of incidents, the path to resolution will be very unclear and may take days to unravel and mitigate. These can be some of the most painful and frustrating experiences of an Incident Manager’s career, as the longer an incident takes to resolve, the greater the pressure to resolve it will become.
Maybe the team has reduced the blast radius of the incident, but doesn’t have a root cause that enables them to fully mitigate the issue. Maybe the impact is intermittent or tied to a third-party vendor. This is where communication discipline matters most.
In these situations:
The Incident Manager should be transparent about the current state, what’s known, and what’s unknown.
There should be a clear summary of what has been ruled out.
The Incident Manager should be clear about containment and reduction in customer impact.
Extend the update interval only after stating so clearly. For example, if updates will shift to the end of the day, explain why that interval is now acceptable.
A lack of resolution does not excuse a lack of communication. The Incident Manager must continue to be proactive with their communication and consistent with their cadence even when the path forward is unclear.
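The points above can be sketched as a structured update with one guard: the gap between updates may only grow if a reason is stated. This is a minimal illustration under assumed names (`StatusUpdate`, `validate_cadence`), not a prescribed tool.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass
class StatusUpdate:
    """Hypothetical shape for a long-running incident update."""
    known: List[str]
    unknown: List[str]
    ruled_out: List[str]
    containment: str                       # current customer impact
    next_update_in: timedelta
    interval_change_reason: Optional[str] = None


def validate_cadence(previous_interval: timedelta, update: StatusUpdate) -> None:
    """Reject a longer gap between updates unless the reason is stated."""
    longer = update.next_update_in > previous_interval
    if longer and not update.interval_change_reason:
        raise ValueError("A longer update interval requires a stated reason")


update = StatusUpdate(
    known=["Errors are intermittent"],
    unknown=["Root cause"],
    ruled_out=["Recent deployment"],
    containment="Customer impact reduced; issue not fully mitigated",
    next_update_in=timedelta(hours=8),
    interval_change_reason="Impact is contained; waiting on vendor analysis",
)
validate_cadence(timedelta(hours=1), update)  # passes: a reason is given
```

Dropping `interval_change_reason` from the example would cause `validate_cadence` to raise, which mirrors the rule that a slower cadence must be announced and justified.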
Remember…
Great communication is great leadership. When incidents occur (and we know that they will), the Incident Manager who is proactive, clear, and concise with their communication is going to demonstrate to their management chain that they are a leader worth investing in. Further, consistently great communication will drive both customer and stakeholder satisfaction over the long run. This is one of the easiest ways to build trust; take advantage of it.
1. A link to where the report is going to be published is critical. By providing the link before the report is ready, the Incident Manager is building trust with the audience that the issue is under control.
2. https://www.cbsnews.com/news/report-apple-exec-refused-to-apologize-for-maps/