
Thank you for stopping by to read The Common Reality. Every Sunday, I drop a new article on Engineering Management, AI in the workplace, or navigating the corporate world. Engagement with these articles helps me out quite a bit, so if you enjoy what I wrote, please consider liking, commenting, or sharing.
The Philosophy of Readiness
Accept that your systems will go down. They will, and for causes that will frustrate you with how obvious they seem in retrospect. Accepting this inevitability will allow you to start your journey. And deep down inside, you know that your systems will go down; that’s why you are here right now.
Or you are here because your systems have already gone down. Maybe more often than you’d like, and you want to build new skills and processes to handle those outages. Our goal is to instill within your team a culture of resilience and readiness.
Incidents are not a sign of failure. They’re a reality of operating complex, interdependent systems at scale.
Preparation isn’t just about having tools or documentation; it’s about mindset. A prepared team treats every quiet week as a chance to sharpen their tools. The best teams rehearse responses, improve alerts, and audit knowledge gaps.
When the incident begins, they are ready.
Roles and Responsibilities
In the heat of an incident, confusion about who is doing what can be just as damaging as the technical issue itself. You need a clear, practiced structure. The most effective teams assign key roles before anything breaks:
Incident Manager (“IM”): Owns the response, delegates tasks, and maintains focus. This should be the Manager of the team that owns the impacted service. The Incident Manager is responsible for internal communication.
Scribe: Takes notes, maintains the timeline, and documents decisions in real time. The scribe can be Copilot (if you are using Microsoft Teams) or some other tool that transcribes Incident calls in real time.
Technical Leads/Subject Matter Experts (“SMEs”): Diagnose the system, propose theories, and take corrective action.
Rotate the team members in the Technical Lead/SME role regularly to avoid burnout and ensure redundancy. Don’t wait until a high-severity incident begins to figure out who knows or doesn’t know your systems.
On-Call Hygiene
Solid incident response starts with clear on-call coverage. Teams in a purer DevOps model support their own systems, and all engineers are in the on-call rotation. Teams in a more traditional “Development & Operations” model or a Site Reliability model are responsible for knowledge transfer to the supporting teams.
If the organization has a line between Development and Operations, the Development team must prioritize building tools, documentation, and runbooks to appropriately equip the Operations team to support the service(s). Otherwise, all of the hard incidents will come right back to the Development team for triage.
Everyone on call should know:
When their shift starts and ends.
What systems they’re responsible for, and where the runbooks for those systems are.
How to escalate to the right people.
Equally important is alert hygiene. Reduce noise aggressively; no one can focus if they’re being paged 40 times a day for low-value signals. Alerts should be (see the sketch after this list):
Actionable.
Clear in intent.
Routed to the right team.
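To make that concrete, here is a minimal sketch in Python of what those three rules can look like as a lint check. Every name in it is hypothetical (the Alert fields, the lint_alert thresholds, the example alerts), but the idea transfers: anything that pages a human should carry a clear summary, a runbook link, and an owning team.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert definition used to illustrate the three hygiene rules."""
    name: str
    summary: str      # clear in intent: what happened and why it matters
    runbook_url: str  # actionable: the responder's next step is one click away
    owning_team: str  # routed to the right team, not a catch-all channel
    severity: str     # e.g., "page" vs. "ticket" -- not everything deserves a page


def lint_alert(alert: Alert) -> list[str]:
    """Return a list of hygiene problems; an empty list means the alert is page-worthy."""
    problems = []
    if not alert.runbook_url:
        problems.append(f"{alert.name}: no runbook link -- responder has no next step")
    if not alert.owning_team:
        problems.append(f"{alert.name}: no owning team -- the page will bounce around")
    if len(alert.summary) < 20:
        problems.append(f"{alert.name}: summary too vague to act on at 2 am")
    return problems


if __name__ == "__main__":
    noisy = Alert("cpu-high", "CPU high", "", "", "page")
    clean = Alert(
        "checkout-api-5xx",
        "Checkout API 5xx rate above 2% for 10 minutes; customers cannot pay",
        "https://wiki.example.com/runbooks/checkout-5xx",
        "payments-oncall",
        "page",
    )
    for alert in (noisy, clean):
        print(alert.name, lint_alert(alert) or "OK")
```

Running a check like this over your alert catalog during a quiet week is exactly the kind of sharpening a prepared team does.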
Lastly, ensure on-call engineers have access to everything they might need: dashboards, runbooks, credentials, logging tools, and the org chart. Before a new member of the team is eligible for the on-call rotation, they should work through a checklist of all resources and perform at least one shadow rotation (meaning they partner with the on-call engineer and attempt every activity that comes up to confirm the access is actually in place).
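The shadow rotation works best when the checklist is explicit. Below is a rough Python sketch of that idea; the resource names are invented, and the point is simply that eligibility is verified during the shadow shift rather than assumed.

```python
# Hypothetical readiness checklist -- swap in the resources your team actually uses.
REQUIRED_ACCESS = [
    "can open the primary service dashboard",
    "can run a log query against production",
    "can log in to the paging tool",
    "knows where the runbook repository lives",
    "can find the escalation contacts / org chart",
]


def readiness_report(engineer: str, verified: set[str]) -> bool:
    """Print whatever is still missing before the engineer joins the rotation."""
    missing = [item for item in REQUIRED_ACCESS if item not in verified]
    for item in missing:
        print(f"{engineer}: NOT READY -- {item}")
    return not missing  # True only when every item was exercised during the shadow shift


if __name__ == "__main__":
    verified_during_shadow = {
        "can open the primary service dashboard",
        "can log in to the paging tool",
    }
    print("eligible:", readiness_report("new-engineer", verified_during_shadow))
```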
Tooling Setup
The tooling owned by a team in support of their services is a strong indicator of the operational readiness of that team. Leading teams have thought about, and then built, the core tools they need to support their services, and they regularly discuss enhancements to these tools as new incidents occur.
Tools can be foundational (see below) or purpose-built scripts and functions that live in code repositories. Regardless of the nature of the tool, each should be well documented and referenced within the Team’s runbooks.
At a minimum:
A paging system (e.g., PagerDuty or Opsgenie) with clear escalation policies.
Runbook repositories linked directly from alerts or dashboards (a sketch of a simple audit for this follows).
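As a sketch of that second point, the snippet below audits a hypothetical alert catalog exported as JSON and fails if any alert is missing a runbook link or an escalation policy. The field names are made up; most paging tools expose something similar through their APIs, so adapt the shape to whatever yours returns.

```python
import json

# Hypothetical export of an alert catalog; the field names are illustrative only.
ALERT_CATALOG = json.loads("""
[
  {"name": "checkout-api-5xx",
   "runbook_url": "https://wiki.example.com/runbooks/checkout-5xx",
   "escalation_policy": "payments-primary"},
  {"name": "orders-queue-depth", "runbook_url": "", "escalation_policy": ""}
]
""")


def audit(catalog: list[dict]) -> int:
    """Flag any alert that would fire without a runbook or an escalation path."""
    failures = 0
    for alert in catalog:
        if not alert.get("runbook_url"):
            print(f"{alert['name']}: missing runbook link")
            failures += 1
        if not alert.get("escalation_policy"):
            print(f"{alert['name']}: missing escalation policy")
            failures += 1
    return failures


if __name__ == "__main__":
    # A non-zero exit code makes this easy to wire into CI as a readiness gate.
    raise SystemExit(1 if audit(ALERT_CATALOG) else 0)
```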
Small details matter: Make sure dashboards load fast, logs are not delayed, and systems don’t require a dozen logins under pressure. These impediments should always be discussed in Team meetings as opportunities for improvement.
Runbooks and Documentation
Runbooks are your operational safety net. They are the difference between a responder scrambling through Slack messages or tribal knowledge and a confident, focused response to a known failure mode. A good runbook doesn’t just save time; it reduces stress, prevents mistakes, and enables newer team members to take meaningful action during a high-pressure event.
Good runbooks are:
Short: Nobody reads a novel at 2 am.
Searchable: Organize them in a shared wiki or tool like Backstage, Confluence, or Git.
Current: Assign owners and require periodic reviews.
Keep them version-controlled and accessible, ideally one click away from alerts or dashboards (many of our alerts at Amazon carried a link to the runbook directly in the alert payload). Tools like Backstage, Confluence, or GitHub Pages work well as central sources of truth. Whenever possible, link to live dashboards, log queries, or scripts directly inside the runbook to reduce context switching.
Just as important as creating runbooks is maintaining them. Outdated or inaccurate docs can actively worsen an incident by sending responders in the wrong direction.
Assign ownership to each runbook and schedule quarterly reviews to ensure they stay aligned with evolving systems. But don’t wait for a quarter to go by just to review the runbooks! Any runbook used by any engineer during an incident/on-call rotation should be evaluated post-incident. Make this a ritual that the Team always performs.
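One way to keep the quarterly review honest is to make staleness visible. Here is a small Python sketch that flags runbooks whose last review is older than roughly a quarter; the metadata shape and the example entries are hypothetical, and in practice the dates could live in front matter inside each runbook file.

```python
from datetime import date

# Hypothetical runbook metadata -- owners and review dates here are illustrative.
RUNBOOKS = [
    {"title": "Checkout API 5xx", "owner": "payments", "last_reviewed": date(2025, 1, 10)},
    {"title": "Orders queue backlog", "owner": "fulfillment", "last_reviewed": date(2024, 6, 2)},
]

REVIEW_INTERVAL_DAYS = 90  # roughly quarterly


def stale_runbooks(runbooks: list[dict], today: date) -> list[dict]:
    """Return runbooks whose last review falls outside the review interval."""
    return [rb for rb in runbooks if (today - rb["last_reviewed"]).days > REVIEW_INTERVAL_DAYS]


if __name__ == "__main__":
    for rb in stale_runbooks(RUNBOOKS, date.today()):
        print(f"STALE: '{rb['title']}' (owner: {rb['owner']}) "
              f"-- last reviewed {rb['last_reviewed'].isoformat()}")
```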
Fire Drills and Simulation
Your Runbooks and Documentation will look great on paper, but nothing prepares the team for real-world events like formal practice. Fire drills and incident simulations are the dress rehearsals where teams can pressure-test the quality of their alerts and their ability to respond to incidents.
The most effective simulations are realistic, time-boxed, and designed to provoke useful learning. This is not the time to crush the team’s confidence or embarrass a new engineer in the on-call rotation.
If the team is newly supporting a service, it may not be appropriate to jump straight into an outage situation. Instead, facilitate a tabletop exercise that prompts discussion amongst the entire team (recall, you don’t want only a handful of your engineers to know how to run an incident). Start with a simple prompt: “It is 8:17 AM and an alarm has triggered indicating a drop in traffic to our primary API. What do we do?”
Once the team has been able to establish their responses in low-pressure situations, multiple exercises can be performed that will provide high-quality data and feedback to the team.
Performance Testing: Running a performance test against a QA/P-1 type environment will primarily help identify bottlenecks and performance issues. Just as important, though, it will confirm that your alerts and dashboards trigger as the team expected. The same is true for burst testing and endurance testing.
Simulated Outage: For these tests, work with another engineer or a dependent team to introduce a known issue into the environment (do this in the QA/P-1 environment to remove the chance of impacting real customers). This could be the removal of a DNS record, taking a system offline, or anything else that disrupts the normal flow of activity within the environment (a bare-bones drill runner is sketched below).
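Below is a bare-bones sketch of that drill runner in Python. The fault injection is deliberately a placeholder, since removing a DNS record or stopping a container depends entirely on your QA/P-1 environment; the parts worth keeping are the time box, the measured time to detect, and the guaranteed restore.

```python
import time

SCENARIO = "Primary API traffic drops to zero (simulated DNS record removal)"
TIME_BOX_MINUTES = 30


def inject_fault() -> None:
    print(f"[facilitator] Injecting fault: {SCENARIO}")
    # Substitute your own tooling here to disrupt the QA/P-1 environment.


def restore() -> None:
    print("[facilitator] Restoring the environment to a healthy state")
    # Undo whatever inject_fault() changed.


def run_drill() -> None:
    start = time.monotonic()
    inject_fault()
    try:
        # The responder "detects" the issue when their alert fires and they page in.
        input("[responder] Press Enter once the alert has been detected and acknowledged... ")
        minutes_to_detect = (time.monotonic() - start) / 60
        print(f"[facilitator] Time to detect: {minutes_to_detect:.1f} minutes "
              f"(time box: {TIME_BOX_MINUTES} minutes)")
    finally:
        restore()  # never leave the drill environment broken


if __name__ == "__main__":
    run_drill()
```

The time-to-detect number is a starting point for the debrief, not a grade.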
After the simulation, meet with the full team to debrief on all observations. Enabling the team to lead the discussion will produce higher-quality feedback; don’t make this an evaluation!
Teams that take the time to run through these activities before an incident stand a far better chance of preventing, detecting, and mitigating problems with their services. Incidents will happen.
We must accept that they will and take advantage of being “First Movers.” In this case, the first move is to be prepared for the incident: define roles and responsibilities, maintain hygiene in the on-call rotation, ensure the tools are set up and configured, ensure runbooks exist, and drill on simulated incidents.