Development teams that strive to continuously improve the reliability of their code should perform an After Action Review (AAR) after any incident or unscheduled outage, regardless of whether it affected customers.

AARs go by many names (post-mortem is a common one), but the objective behind them is always the same:

Put together a blame-free, detailed description of exactly what went wrong, along with a list of steps to take to prevent a similar incident from occurring again.

The only way to avoid repeating the same mistake is to recognize what we are doing right and where we can improve. A single action rarely causes an incident; more often, a series of events together creates a problem significant enough to cause disruption.

Owner Responsibilities

Following an incident, the team member (incident owner) who ran point is generally responsible for populating the AAR, looking up logs, managing the follow-up investigation, and keeping all interested parties in the loop.

The incident owner is responsible for the following:

  • Scheduling the post-mortem meeting and inviting the relevant people
  • Investigating the cause of the incident
  • Updating the AAR page with all of the necessary content
  • Creating follow-up tickets (responsible only for their creation)
  • Ensuring that the Scrum Master, Product Owner, and Team Leads are aware of the ticket numbers so that the tickets are prioritized appropriately
  • Reviewing the post-mortem content with the appropriate parties before the meeting
  • Running through the topics at the post-mortem meeting:
      • Recap the timeline to make sure everyone agrees and is on the same page
      • Recap important points and any unusual items
      • Discuss how the problem could have been prevented (for example, could it have been caught in testing?)
      • Discuss customer impact
      • Review the action items that have been created, discuss whether they are appropriate, and add more if needed
  • Communicating the results of the post-mortem internally

Writing the AAR

Writing an effective after-action review allows teams to learn quickly from mistakes and improve systems and processes for everyone.

When writing an AAR, make the material detailed and accurate so that the team gets the most benefit from it. This guide lists some of the things we can do to make sure our AARs are effective.

Do's

  • Make sure the timeline is an accurate representation of events.
  • Define any technical jargon or acronyms that newcomers may not understand.
  • Discuss how the incident fits into our understanding of the health and resiliency of the affected services:
      • How did we view the health of the service involved before the incident?
      • Did this incident teach us something that should change our views about this service's health?
      • Was this an isolated, specific bug (a failure in a class of problems we anticipated), or did it uncover a type of issue we did not architecturally expect in the service?
      • Do we think an incident akin to this one will happen again if we don't take more extensive systemic action beyond the action items captured here?
      • Will this class of issue get worse or become more likely as we continue to grow and scale the use of the service?
      • Was there a previous incident that showed early signs pointing to this one?
  • Clarify what we think we are accomplishing with the action items we are taking:
      • Are we dealing with a specific issue immediately in a narrow, targeted way?
      • Are we taking action to eliminate what we see as an entire class of potential issues?
      • Are we not taking action because more significant efforts are already underway and will quickly make a targeted fix obsolete? (If so, those more significant efforts should be called out!)
      • Are we not taking significant action because we don't think it's justified?
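
One way to keep the timeline accurate is to record each event with an explicit UTC timestamp and check the ordering before the AAR is shared. The following is only a minimal sketch: the `TimelineEvent` record, the `out_of_order` helper, and the sample events are all hypothetical and not part of any existing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    """One entry in the AAR timeline (hypothetical structure for illustration)."""
    at: datetime   # when it happened; should be timezone-aware
    summary: str   # what was observed or done, stated as fact


def out_of_order(events):
    """Return (previous, current) pairs where the listed order goes back in time."""
    problems = []
    for prev, cur in zip(events, events[1:]):
        if cur.at < prev.at:
            problems.append((prev, cur))
    return problems


def utc(hour, minute):
    # Illustrative helper: all sample events fall on one made-up date.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Made-up sample data, deliberately listed out of order.
timeline = [
    TimelineEvent(utc(9, 0), "Alert fired for elevated error rate"),
    TimelineEvent(utc(9, 20), "Rollback started"),
    TimelineEvent(utc(9, 5), "On-call engineer acknowledged the alert"),
]

for prev, cur in out_of_order(timeline):
    print(f"Check order: '{cur.summary}' is listed after '{prev.summary}' but happened earlier")
```

A check like this only catches ordering mistakes; whether each entry matches what actually happened still has to be confirmed against logs and the people involved.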

Don'ts

  • Don't use the word "outage" unless it was an outage. We want to be sure we accurately reflect the impact of an incident, and "outage" is usually too broad a term to use. It can lead customers to think we were entirely unavailable when that likely was nowhere near the case.
  • Don't change details or events to make things "look better". We need to be honest in our post-mortems, even with ourselves; otherwise, they lose their effectiveness.
  • Don't name and shame anyone. We keep our post-mortems blameless. If someone deployed a change that broke things, it's not their fault; it's our fault for having a system that allowed a breaking change to be deployed.

Suggestions

  • Avoid the concept of "human error". This is related to the point above about "naming and shaming," but there's a subtle difference. Very rarely is the mistake "rooted" in a human performing an action. There are often several contributing factors (the script the human ran didn't have rate limiting, the documentation was out of date, and so on) that can and should be addressed.
  • Avoid the "alternate reality" discussion in the timeline or description sections. Below are two examples that blend a description of the actual problem with a hypothetical fix. Keep the improvements separate from the description so that each can be discussed appropriately.

e.g. "Service X started seeing elevated traffic early this morning and stopped responding to requests. If service X had rate limited the requests from the customer, it would not have failed."

e.g. "Service X began responding slowly to requests this evening; there was insufficient monitoring to detect the elevated CPU usage."
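
A rough way to catch this kind of phrasing before review is to scan the timeline and description text for counterfactual wording. The sketch below is an assumption-laden heuristic, not an established tool: the pattern list and the `flag_counterfactuals` helper are hypothetical, and a match only means a sentence deserves a second look.

```python
import re

# Hypothetical phrase list; tune it to your team's writing habits.
COUNTERFACTUAL_PATTERNS = [
    r"\bwould (?:not )?have\b",
    r"\bcould have\b",
    r"\bshould have\b",
    r"\bif\b.*\bhad\b",
    r"\binsufficient\b",
]


def flag_counterfactuals(text):
    """Return the sentences in `text` that match any counterfactual pattern."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        s for s in sentences
        if any(re.search(p, s, re.IGNORECASE) for p in COUNTERFACTUAL_PATTERNS)
    ]


description = (
    "Service X started seeing elevated traffic early this morning and "
    "stopped responding to requests. If service X had rate limited the "
    "requests from the customer, it would not have failed."
)

for sentence in flag_counterfactuals(description):
    print("Move to the improvements section:", sentence)
```

Flagged sentences usually belong with the action items, which leaves the description as a plain account of what happened.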


Reviewing

Get the input of coworkers before the review meeting.

Here are some things to ask:

  • Does it provide enough detail?
  • Rather than just pointing out what went wrong, does it drill down to the underlying causes of the issue?
  • Does it separate “What Happened?” from “How to Fix It”?
  • Do the proposed action items make sense? Are they scoped well enough?
  • Is the post-mortem well written and understandable?
  • Does the external message resonate well with customers, or is it likely to cause outrage?

Reviewing an AAR isn't about nit-picking typos (although we should make sure our external message isn't littered with spelling errors). It's about providing constructive feedback that leads to valuable changes, so that we get the most benefit from the AAR.

Other Resources

  • "Blame. Language. Sharing." by Lindsay Holmwood (Fractional): failure can lead to blame or inquiry in your organisation.