I have recently finished reading Sidney Dekker’s “The Field Guide To Understanding Human Error” and I found it very interesting.
The book is generally about safety, but I believe the lessons apply equally well to incident management in IT systems.
He first talks about the “Bad Apple” theory: the view that human error is the cause of incidents. People do the wrong thing, don’t try hard enough, or don’t pay enough attention and miss some significant detail. Dekker calls this the “old view”, and its popularity doesn’t surprise me. We, as humans, tend to see our creations as perfect, and this especially applies to the processes we develop. We develop these complex steps that need to be followed, and when someone makes a mistake, it is their fault for not following the process. It is their fault for not trying hard enough. It is their fault for missing that crucial bit of information that could have prevented the incident.
So when you have this mindset, what is the solution? Reprimand or fire the person who made the mistake! Give the person more training? Increase adherence to the process, or make the processes stricter? Add more contingencies and paths to handle any situation? Or, especially in IT (and I’m very guilty of this one), add more technology?
According to Dekker, these steps do not work. They stop learning in its tracks. We blame the person rather than look at the system as being imperfect and try to fix that. So why do we blame people? Simply because it is easy and quick. You did something wrong, therefore it is “your fault”. It saves face: “The process I developed is perfect! You just didn’t follow it correctly.” Basically, it “feels right” to blame someone.
The problem with blaming someone is that it stops learning. The investigation stops, and you’re done. Whereas if you continue to investigate, you will find the underlying problems that caused the incident in the first place.
In this view, which Dekker dubs the “new view”, human error is a symptom of some underlying problem. Something more systemic. The incident that occurred is but the start of the investigation, not the end.
First things first: you need to assume that when people come to work, they come to do a good job. If someone comes to work to cause havoc, that is something different, but even then you can still investigate: why did that person start causing trouble? What drove them to it? This can be especially important for a long-term employee. For a short-term employee, you might instead ask: what did we miss in our hiring process?
Hindsight bias
One reason the old view is popular is hindsight. You know what the effect of the actions were, because the incident has already happened. We know that running “chown -R root /*” recursively makes root the owner of essentially every file on the system, completely screwing it up (yes, I have done that early in my career). No, it does not save time when you are in a hurry to change ownership on several directories. (At least, that is what I think I thought 20 years ago.)
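For contrast, a scoped version of what I was presumably trying to do might look something like this (the paths and user are hypothetical, purely for illustration):

    # Change ownership only where it was actually needed:
    chown -R appuser /srv/app/uploads /srv/app/cache

    # The shortcut, by comparison, expands the glob to every top-level
    # directory and recursively hands the whole filesystem to root:
    # chown -R root /*     <- do not run this

The point isn’t the specific command; it’s that the catastrophic version and the correct one differ by only a few characters, which is exactly the kind of slip the “new view” asks us to design against.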
The thing is, when a person takes an action, they don’t know at the time what will happen. If they did – they would not have taken the action that caused the incident in the first place.
What’s more, you don’t know what was going through the person’s mind at the time they were performing the action. Could they have been concentrating on something else they deemed important at the time? Could their priorities have been elsewhere?
For example, hypothetically, a pilot flies a plane through a storm and crashes (no one is injured, but there is a lot of damage to the aircraft). Should they have flown through the storm? Obviously not, now that we know the consequences. But let’s say they were already several hours delayed. Their priority was to get the passengers back on time. Had they diverted around the storm, they would have been late and reprimanded. Had they flown through the storm and nothing happened, they would have been heroes to the passengers. At the time the decision had to be made – without knowing the consequences – what would you have done?
That is one of the recurring questions in the book – if you were in the same situation, under the same conditions as the people in the incident, would you have done the same thing?
If not – why not? Why are you different?
If, on the other hand, you would have done the same thing – what would have prevented you from doing it?
Another example Dekker goes through: during WWII, a certain type of aircraft suffered a large number of crash landings. Pilots were pulling the lever for the flaps when they meant to pull the one for the landing gear; the two levers were near each other. Everything was tried – reprimanding pilots, re-training pilots – but the crashes continued. It wasn’t fixed until an engineer looked at the problem in a different way. He glued a small flap shape onto the flap lever and a small wheel onto the landing gear lever. You see, pilots found the levers by touch, because during landing their attention was focused elsewhere. Adding a tactile indicator for which lever was which pretty much eliminated the crash landings.
Recording Errors
How many of you record incidents? Do you put them into Jira or some other bug tracker? Do you do anything with the incidents you record? Or do you just fix the immediate issue and move on, only for the incident to occur again and again?
So much so that it becomes part of your work?
Recording incidents, whether minor or large, does not by itself fix the problem. Also, given that we work in a complex environment – software development – it is hard (but not impossible) to make something error free, especially given the demands of the job. And even if something is error free, circumstances can still cause issues: hardware failures, or scenarios that were never thought of in the first place. Users do tend to find ways to break things.
Here Dekker has another analogy. During WWII, planes were coming back from runs shot up quite a bit; some made it back, some didn’t. The question was asked: should we add armor to the areas that had holes? Armor adds weight, so it has to be used sparingly.
The answer was “No”: you add armor to the areas that didn’t have holes, because the planes being examined were the ones that made it back. The planes that didn’t make it back were most likely hit in the very places where the survivors were unscathed.
So where should you focus your efforts in fixing problems? In the areas that cause you the most pain! Fix those, remove the pain. This should free you up for more valuable work.
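As a concrete (if simplistic) illustration: if your tracker can export incidents to CSV, even a shell one-liner can show where the repeated pain is. This is only a sketch; the file name and column layout are assumptions:

    # Count recurring incident categories from a hypothetical CSV export,
    # where the second column holds the category. The most frequent
    # categories are the candidates for a systemic fix, not another patch.
    cut -d',' -f2 incidents.csv | sort | uniq -c | sort -rn | head -5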
Technology
Sometimes (and I’m a big sucker for this one) we think that replacing a person with technology will prevent the issue. The problem is that it may fix the immediate issue, but what about the boundary conditions that were never thought of? Automation can turn a minor issue that a human could have resolved into a catastrophic one. Dekker doesn’t say “don’t automate”, but he does say to be careful what you automate. Make sure it augments the person rather than replaces them.
Technology is good at repetitive problems, but not so good at changing conditions; adapting to those is something only a human can do.
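A minimal sketch of what “augmenting rather than replacing” might look like in practice: automate only the failure modes you understand, and hand everything else to a person. The condition names, paths, and on-call address here are all hypothetical:

    #!/bin/sh
    # Remediate the known, escalate the unknown.
    case "$1" in
      disk_full)
        # A repetitive, well-understood fix: safe to automate.
        find /var/log/app -name '*.gz' -mtime +30 -delete
        ;;
      stale_cache)
        systemctl restart app-cache
        ;;
      *)
        # A changing or novel condition: give it to a human with context.
        echo "Unrecognized condition: $1" | mail -s "needs a human" oncall@example.com
        ;;
    esac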
Conclusion
Look at the error someone made as the start of an investigation. Look at what caused the issue in the first place: what state of mind was the person in? What were their incentives at the time? What can be done to alleviate those issues and prevent the problem in the first place? Amazon does this. While browsing Hacker News, I came across an article that goes through a talk by Jeff Bezos on how Amazon looks at incidents that occur.
The thing is, to prevent incidents you have to rely on the people you have. Make sure they have the knowledge, experience, and support to handle incidents. Make sure that they are the ones who figure out how to permanently solve the issue; they are the experts on it, because it is part of their job. If someone makes a mistake, don’t chastise them. Congratulate them for finding a flaw in the process, then look at ways to patch or repair that flaw. This may start a culture where mistakes are not hidden but brought out into the open, and when they are out in the open they can be fixed – permanently. This can only make your organization better. This, in my opinion, is the crux of Agile: finding the problems in the system and fixing them to make you faster, make you better, make you more knowledgeable, and make you feel safe enough to expose more problems in the system.
Edit: Found the article I referred to and added the link. I came across it on Hacker News, but it refers to a thread on Quora.