Evidence-based IT Operations

A significant part of the time we spend working with computers, in whatever role, goes to troubleshooting: working out why something happened that differs from what we intended.

Working with less experienced colleagues and frequenting web forums populated by technicians with mixed levels of experience, I see some common anti-patterns repeat. This post is my attempt at presenting a framework to help get into the correct mindset for troubleshooting.

It may not come as a surprise that the correct mindset for troubleshooting an engineering problem is a scientific one. In this post I walk through some commonly agreed steps of the scientific method and apply them to troubleshooting an IT system. Please note that what I describe is a single iteration. In practice the process is likely to repeat, sometimes over an extended period of time.

The scientific method

The scientific method is often described as a sort of feedback loop:

  • You observe something
  • You research the topic
  • You develop a hypothesis about what you’ve seen and how it works
  • You test your hypothesis experimentally
  • You analyze the data from your test
  • You report your conclusions

Let’s go through each of these steps in terms of running an IT environment.

Observation

This is usually what starts a troubleshooting session: “Something is wrong”. But realizing that is just the first step of making an observation: Exactly what is wrong?

I can’t overstate the importance of thinking of the obvious in this step. Time and time again we find ourselves thinking “it’s such a serious symptom, the problem must be complex and hard to identify”. Always start out by looking at the obvious: Is there an error message on the screen? What does it say? What do the system and application logs say? In most cases the problem you’ve encountered does not have a rare or obscure root cause. The application or system developers will have thought of adding error messages for when common things go wrong. Follow those up.
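As a rough illustration of starting with the obvious, here is a minimal Python sketch that counts the error and warning lines in a log and keeps the first example of each. The severity keywords and the sample log lines are assumptions made for the example; real applications log in many different formats, so adjust the pattern to what your system actually writes.

```python
import re
from collections import Counter

# Assumed severity keywords; adjust to what your application actually logs.
SEVERITY = re.compile(r"\b(ERROR|CRITICAL|FATAL|WARN(?:ING)?)\b", re.IGNORECASE)


def summarize_log(lines):
    """Return (count per severity, first matching line per severity)."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = SEVERITY.search(line)
        if match is None:
            continue
        level = match.group(1).upper()
        if level == "WARN":
            level = "WARNING"  # treat WARN and WARNING as the same level
        counts[level] += 1
        first_seen.setdefault(level, line.rstrip("\n"))
    return counts, first_seen


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    sample = [
        "2024-01-01 10:00:00 INFO service started\n",
        "2024-01-01 10:00:05 ERROR cannot open /var/lib/app/data: permission denied\n",
        "2024-01-01 10:00:06 WARN retrying in 5 seconds\n",
    ]
    counts, first_seen = summarize_log(sample)
    for level, count in counts.most_common():
        print(f"{level}: {count} line(s), first: {first_seen[level]}")
```

The first occurrence of an error is often more informative than the last one, which is why the sketch keeps it.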

The other big thing to do here is to note, in writing if possible, what actually happens. What are the symptoms you’re seeing? What do you notice that surprises you or doesn’t appear to be normal?

Both of these categories of observations follow us into the next step.

Research

A huge part of troubleshooting consists of comparing the observed behavior to the expected one.

That sometimes puts us in a bit of a conundrum: What, actually, is the expected behavior? This is where we start digging into documentation and scouring the web for explanations.

Even seasoned veterans of IT Operations will never be experts in every field they touch. What we gain from experience is a good gut feeling for how a developer might put a system together to solve a certain real-world problem, or an understanding of the underlying technologies and protocols. In the latter case, an experienced technician knows what a system that depends on or manipulates these has to be doing under the hood. But having this gut feeling or knowledge does not eliminate the need to verify that our mental model of how the program works actually fits reality.

Developing a hypothesis

Based on our observations of what the program seems to be doing and our research into what it should be doing, we can begin drafting a hypothesis about the issue we’ve encountered. We’ve surely already built a gut feeling about what the underlying problem is, but a good hypothesis goes further than that. The observations and knowledge we’ve gathered so far let us rule out some places where the problem could lie, and rank the remaining ones in a sort of weighted list of the most likely areas to investigate.

Our hypothesis is an informed guess about what the underlying problem is, and how to resolve it.
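The weighted list can be as simple as a note on paper, but here is a minimal Python sketch of the idea. The `Candidate` type, the area names and the weights are all made up for illustration; the weights are rough personal estimates, not measured probabilities.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    area: str                # where the cause might be
    weight: float            # rough likelihood estimate, not a measured probability
    ruled_out: bool = False  # set once an observation eliminates this area


def rank(candidates):
    """Drop eliminated areas and order the rest from most to least likely."""
    remaining = [c for c in candidates if not c.ruled_out]
    return sorted(remaining, key=lambda c: c.weight, reverse=True)


if __name__ == "__main__":
    # Hypothetical candidates for a service that fails to start.
    candidates = [
        Candidate("file permissions", 0.6),
        Candidate("DNS resolution", 0.2),
        Candidate("network connectivity", 0.5, ruled_out=True),
    ]
    for candidate in rank(candidates):
        print(f"{candidate.weight:.1f}  {candidate.area}")
```

The point is not the numbers themselves, but that writing them down forces us to state which areas we have actually eliminated and which we merely consider unlikely.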

Testing our hypothesis

Once we have a hypothesis to work from, we should ensure that our test of it actually tests the hypothesis and nothing else. When working on the problematic system we should remember that what we’ve built so far is only a hypothesis: We have no guarantee that it will fix the issue, and we don’t know that we won’t make things worse in some way. It’s important that we save, or at the very least document, the current state of the system before making any changes. We should also make one minimal change at a time, so as not to introduce additional variables into our original issue. Finally, we should document which changes we made and when we made them, as that will be important to know later.
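Saving the current state can be as simple as copying the file you are about to edit. The following Python sketch does that with a timestamp in the file name. The `snapshot` helper and the paths in the usage example are hypothetical, and a single file copy is of course only one small part of a system’s state.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot(path, backup_dir):
    """Copy `path` into `backup_dir` under a timestamped name; return the copy's path."""
    source = Path(path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # UTC timestamp, so the name also documents when the snapshot was taken.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = target_dir / f"{source.name}.{stamp}.bak"
    shutil.copy2(source, target)  # copy2 also tries to preserve file metadata
    return target
```

Calling something like `snapshot("/etc/myapp/app.conf", "/root/backups")` (both paths made up) before each change leaves a trail of known states to compare against or roll back to.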

Analyzing the data from our test

In its simplest form, this step consists of doing whatever we know (or hypothesize) triggers the issue, and checking whether the change we made based on our hypothesis has resolved the problem, or at least whether it has made any difference to the outcome.

If we’re in a technologically mature environment, this step may be aided by monitoring software, log collectors and other fancy tools that gather the relevant data and make it all available at a glance. If we lack those, it’s important to understand that all of those tools are only quality of life improvements: The data you’re looking for does exist, but it may be more or less of a chore to gather it and make sense of it.

If the analysis step shows that the problem was not resolved, bring the result of your analysis into the next iteration as part of your observations.

Reporting our conclusions

“The difference between screwing around and science is writing it down.”

Adam Savage, MythBusters

Keep track of the issue, your observations, your hypotheses, what actions you’ve taken, and their results. Also try to timestamp the events. This will help you understand what went wrong and remember how you fixed it, which matters because in most cases you, or a current or future colleague, will encounter a similar situation later on. If you work in a team, it’s a good idea to establish a culture of blameless postmortems: Walk through what happened step by step. Don’t exaggerate your heroics, and don’t sweep your mistakes under the rug. The point of this exercise is to learn from what happened, to extend your documentation and procedures as necessary, and to become better at what you do.
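A text file and a clock are enough for this, but if you prefer something slightly more structured, here is a minimal Python sketch of a timestamped troubleshooting journal. The `record` helper, the entry kinds and the file name are assumptions made for the example, not an established format.

```python
import json
from datetime import datetime, timezone


def record(journal_path, kind, text):
    """Append one timestamped entry to a JSON Lines journal and return it."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "kind": kind,  # e.g. "observation", "hypothesis", "action", "result"
        "text": text,
    }
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(json.dumps(entry) + "\n")
    return entry


if __name__ == "__main__":
    # Hypothetical entries written to a made-up file name.
    record("incident.jsonl", "observation", "Service fails to start after reboot")
    record("incident.jsonl", "hypothesis", "Data directory has wrong ownership")
```

One entry per line keeps the journal easy to append to in the middle of an incident and easy to read back in order when writing the postmortem.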