Today Bear Grylls posted these six simple words to Twitter, and lots of people, including me, retweeted them. What is so significant (or perhaps "resonant" is a better word) about this idea?
Finding fault is often associated with blame, appointing a scapegoat, and generally negative outcomes. Focusing on finding a remedy, by contrast, usually relies on cooperation, collaboration, and communication, and tends to produce more positive experiences and results. While I fully agree with Bear Grylls' philosophy of finding a remedy BEFORE finding fault, there is also value in understanding exceptions, faults, or mistakes so that we can learn from them and avoid repeating them.
In IT, and especially as ITIL best practices have become more prevalent, the emphasis has been on resolving incidents and getting business clients and customers back on track and working. But when you step back and ask why the incident occurred in the first place, it helps to consider the cause, or the "fault".
In my career I've spent significant time, energy, and effort determining the probable or root cause of IT service exceptions, often with valuable results. Any time something unexpected disrupts normal IT infrastructure operations (something we often refer to as an exception), it has the potential to negatively affect users and/or customers. In the worst cases, this can mean real damage to reputation or profits. Business leaders recognize these risks to reputation and finances, and want to know what will be done to avoid or mitigate a repeat occurrence.
In order to understand how to avoid another costly incident, we do indeed need to identify the fault, or root cause. There are good frameworks available, including ITIL and MOF (Microsoft Operations Framework), to guide a team through the discovery and investigation processes. There are also numerous software companies that provide a variety of tools to automate and accelerate these processes. One element I believe is essential to success in these endeavors is capturing the early details of the incident along with thorough records of how it was resolved. The goal is not to identify anyone to be reprimanded, but to understand objectively, and as precisely as possible, what transpired so that it can be corrected.
Finding fault in software or a system is ideally done before it "goes live", but in practice that doesn't always happen. If we work together to quickly uncover what happened without fear of blame or repercussion, we can partner with others to find a remedy while simultaneously learning what led to the "fault", so that the cause can be clearly documented and remediated.