Nobody likes being startled awake by the dreaded klaxon in the middle of the night. When the call rings out desperately for a human's help, what do you do? After dealing with 24/7 on call for the better part of 18 years, here are a few practices I use to keep my mind in the game during on-call rituals.
Step 1: Don’t panic
It is easy to succumb to panic in a heightened situation, especially if there are 100 alarms going off at the same time. The best thing to do in this situation is to not panic. A mind subdued by panic will make hasty decisions that may not solve problems and, worse, may create new ones. A clear mind is one that can see all the pieces on the board at once. Even if it's 100 alerts.
Step 2: Find Simple Common Denominators
Let's briefly touch on a situation where you are receiving a dozen alerts. They all fire in rapid succession, and at first glance they don't all seem related. Spend the first 5 minutes of your incident response looking for a commonality between the alerts. True coincidences are rare, so the odds are something related is happening. It may be a shared DNS server having issues, or a problem with an adjacent container or pod. It could be something more complex, but ruling out the simple common denominators first makes the job smoother before diving deeper and wasting time troubleshooting everything else.
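To make the idea concrete, here is a minimal sketch of hunting for common denominators programmatically. The alert payloads and field names (`service`, `host`, `dns_server`) are hypothetical, not from any particular monitoring platform; the point is simply to count which (field, value) pairs every alert shares:

```python
from collections import Counter

# Hypothetical alert payloads -- field names are illustrative only.
alerts = [
    {"service": "checkout", "host": "web-01", "dns_server": "ns1.internal"},
    {"service": "search",   "host": "web-02", "dns_server": "ns1.internal"},
    {"service": "payments", "host": "web-03", "dns_server": "ns1.internal"},
]

def common_denominators(alerts):
    """Return (field, value) pairs that appear in every alert."""
    counts = Counter(
        (field, value) for alert in alerts for field, value in alert.items()
    )
    # Pairs shared by all alerts are candidate common causes.
    return [pair for pair, n in counts.items() if n == len(alerts)]

print(common_denominators(alerts))  # [('dns_server', 'ns1.internal')]
```

Here every alert points at the same DNS server, which is exactly the kind of simple common denominator worth checking before troubleshooting each service individually.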
Step 3: Know When to Call for Backup
Engineers shouldn't feel alone when the call from the monitoring platform rings out. Having a decent backup plan in place for when additional troubleshooting support is required is a critical component of a successful operation. Engineers work better in team settings, where ideas, talking through problems, and brainstorming can flow freely in an open troubleshooting space.
Recognize when to ask for help. If the issue is eluding you and your troubleshooting methods are not working, sometimes it's better to whiteboard, mind map, or just rubber-duck the problem out. Another point of view brought in through these methods typically adds new perspective to the issue, and the solution may lie in that interaction.
Step 4: Know Your Monitoring Toolchain
Chances are you are running some sort of application with many moving parts on the Internet. Knowing your monitoring and alerting toolchain well brings an edge to observability of issues within your application. It can also help you narrow down problems faster, improving response and recovery times for your platform. It's important to know how to find the problems as the alarms come in, triage the issue, and spot simple trends in the data or common denominators between issues. Agility and familiarity with the monitoring toolchain bring fluidity to troubleshooting with the tools at hand, rather than struggling to find your way around the monitoring platform.
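One small habit that familiarity with the toolchain enables is quickly asking "which alert fired first?" during an alert storm, since the earliest alert is often closest to the root cause. A minimal sketch, assuming hypothetical alert records with ISO-8601 timestamps (a real platform would supply these via its API or webhook payloads):

```python
from datetime import datetime

# Hypothetical alerts; names and timestamps are illustrative only.
alerts = [
    {"name": "high_latency", "at": "2024-05-01T03:12:05"},
    {"name": "dns_timeout",  "at": "2024-05-01T03:12:01"},
    {"name": "5xx_spike",    "at": "2024-05-01T03:12:40"},
]

def first_alert(alerts):
    """Return the earliest-firing alert -- often closest to the root cause."""
    return min(alerts, key=lambda a: datetime.fromisoformat(a["at"]))

print(first_alert(alerts)["name"])  # dns_timeout
```

Most monitoring platforms can sort or filter alerts this way in their UI; knowing where that view lives before the incident starts is the real skill.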
Step 5: Know Your Stack
Let's face it: your software stack probably has a lot of intricacies. Most cloud products do. Create diagrams as you learn about your infrastructure. Document and take notes on as much as you can for future reference. Come up with a decent information filing method that your team can adopt and contribute to. A well-documented infrastructure is an invaluable tool: it adds an extra layer of agility during problem solving, surfaces new ideas because more questions are answered up front, and adds sanity to your infrastructure. Knowing your software stack well is a critical component of success.
I truly hope I've added a bit of sanity to the chaotic practice that is on call. It is certainly not for the faint of heart. After 18 years of potentially being called to action at any moment, I can say the 5 steps above should help you have a smooth and (somewhat) enjoyable on call.