Let’s Talk About On Call

Nobody likes being startled awake by the dreaded klaxon in the middle of the night. When the call does ring out desperately for a human’s help, what do you do? After dealing with 24/7 on call for the better part of 18 years, here are a few practices I follow to keep my mind in the game during on call rituals.

Step 1: Don’t Panic

It is easy to succumb to panic in a heightened situation, especially if there are 100 alarms going off at the same time. The best thing to do in this situation is to not panic. A mind that is overcome by panic will make hasty decisions that may not solve problems and, worse, may create new ones. A clear mind is one that can see all the pieces on the board at once, even if it’s 100 alerts.

Step 2: Find Simple Common Denominators

Let’s briefly touch on a situation where you are receiving a dozen alerts. They all fire in rapid succession and they don’t all seem related at first glance. Spend the first 5 minutes of your incident response looking for a commonality between the alerts. Coincidences rarely happen at the same time, so the odds are that something related is happening. It may be a common DNS server having issues, or an issue with an adjacent container or pod. It could be something more complex, but ruling out the simple common denominators first makes the job smoother than diving deep and wasting time troubleshooting everything else.
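As a rough illustration of that first five minutes, here is a minimal Python sketch that counts which label/value pairs are shared across a burst of alerts. Everything in it is hypothetical: the alert names, the `labels` dictionary shape, the `common_denominators` helper, and the 50% threshold are assumptions for the example, not the output format of any particular monitoring platform.

```python
from collections import Counter


def common_denominators(alerts, min_share=0.5):
    """Return (label, value, count) tuples shared by at least min_share of alerts."""
    if not alerts:
        return []
    counts = Counter()
    for alert in alerts:
        # Count each label/value pair once per alert.
        counts.update(set(alert["labels"].items()))
    threshold = min_share * len(alerts)
    shared = [(k, v, n) for (k, v), n in counts.items() if n >= threshold]
    # Most widely shared first; ties broken alphabetically for stable output.
    return sorted(shared, key=lambda t: (-t[2], t[0], t[1]))


# Hypothetical alerts; real ones would come from your alerting tool's export or API.
alerts = [
    {"name": "api_latency_high", "labels": {"cluster": "prod-east", "node": "node-7", "resolver": "dns-1"}},
    {"name": "checkout_5xx", "labels": {"cluster": "prod-east", "node": "node-3", "resolver": "dns-1"}},
    {"name": "queue_backlog", "labels": {"cluster": "prod-east", "node": "node-7", "resolver": "dns-1"}},
    {"name": "cert_expiring", "labels": {"cluster": "prod-west", "node": "node-9", "resolver": "dns-2"}},
]

for key, value, count in common_denominators(alerts):
    print(f"{key}={value} appears in {count} of {len(alerts)} alerts")
```

In this made-up data, the shared cluster and resolver float to the top, which is a hint about where to look first. It is not proof of a root cause, and an alert that shares nothing with the others may simply be unrelated.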

Step 3: Know When to Call for Backup

Engineers shouldn’t feel alone when the call from the monitoring platform rings out. Having a decent backup plan in place for when additional troubleshooting support is required is a critical component of a successful operation. Engineers work better in team settings, where ideas, talking through problems, and brainstorming can flow freely in an open troubleshooting space.

Recognize when to ask for help. If you feel like the issue is eluding you and your troubleshooting methods are not working, sometimes it’s better to whiteboard, mind map, or just rubber duck the problem out with someone. Another point of view typically adds new perspective to the issue, and the solution may just lie in that interaction.

Step 4: Know Your Monitoring Toolchain

Chances are you are running some sort of application with many moving parts on the Internet. Knowing your monitoring and alerting toolchain well gives you an edge in observing issues within your application. It can also help you narrow down problems faster, giving your platform a better response time and recovery time. It’s important to know how to find the problems as the alarms come in, so you can triage the issue and spot any simple trends in the data or common denominators between issues. Agility and familiarity with the monitoring toolchain bring a fluidity to troubleshooting with the tools available, rather than struggling to find things on the monitoring platform.

Step 5: Know Your Stack

Let’s face it. Your software stack probably has a lot of intricacies. Most cloud products do. Create diagrams as you learn about your infrastructure. Document and take notes on as much as you can for future reference. Come up with decent information filing methods that your team can adopt and contribute to. A well-documented infrastructure can be an invaluable tool. It adds an extra layer of agility during problem solving time, brings forth new ideas because more questions are answered in the first step, and adds sanity to infrastructure. Knowing your software stack well is a critical component for success.
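One way to make that documentation usable mid-incident is to keep a small dependency map next to the diagrams. The Python sketch below is only an illustration under stated assumptions: the service names and the `DEPENDS_ON` map are invented, and `impacted_by` is a hypothetical helper, not part of any tool. It answers one question: if this component is unhealthy, which services could be affected?

```python
from collections import deque

# Hypothetical stack: each service maps to the components it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["db", "cache", "dns"],
    "worker": ["queue", "db"],
    "queue": ["dns"],
    "db": [],
    "cache": [],
    "dns": [],
}


def impacted_by(component, depends_on):
    """Return every service that directly or transitively depends on component."""
    # Invert the map: component -> services that depend on it directly.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk up the dependency chain; `seen` guards against cycles.
    seen = set()
    todo = deque([component])
    while todo:
        current = todo.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                todo.append(service)
    return sorted(seen)


print(impacted_by("dns", DEPENDS_ON))
```

A map like this is only as good as the notes behind it, so it works best when the team updates it whenever something new is learned about the infrastructure.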

I truly hope I have added a bit of sanity to a chaotic concept called on call. It is certainly not for the faint of heart. After 18 years of potentially being called to action at any moment, I believe the 5 steps above should help you have a smooth and (somewhat) enjoyable on call.

2 thoughts on “Let’s Talk About On Call”

/dev/nall says:

March 3, 2023 at 2:42 pm

Great read, Nick! My experiences very much line up with yours.
Two thoughts that occurred to me:
1. “Coincidences do not happen randomly at the same time” this is true…until it isn’t. It’s unlikely that two unrelated incidents pop up at the same time but it’s definitely not impossible. I’ve gotten burned once or twice trying to find the link between two seemingly unrelated issues only to eventually realize that they were, in fact, unrelated. The probability is low but as the system/platform grows more complex, it increases.
2. I agree that it’s important to “Know Your Stack” but one side effect of large, complex systems is that it becomes increasingly difficult for a single person to hold all of the knowledge. I’d also posit that the popularity of microservice architectures has made this even more difficult. But I think you hit the nail on the head when you talk about the importance of documentation (and having it be searchable/findable in a crisis). In addition to the architecture docs and diagrams that you mention, I think one of the best things that you can do for on-call is to constantly be improving the documentation/explanation inside the pages themselves as well as runbooks for common failure conditions (ideally linked from within the page).

1. nick says:
  
  March 3, 2023 at 5:05 pm
  
  Thanks for contributing your thoughts on this! One thing that has always been certain in my career is the great resource your peers can bring to the table in “5 alarm fire” situations. The culture of collaboration is king in the engineering world and I am thrilled to have spent some time collaborating with you.