In the nuclear power industry, the primary mission of a root-cause investigation is to understand how and why a failure or a condition adverse to quality has occurred so that it can be prevented from recurring. This is a good practice for many reasons—and a lawful requirement mandated by 10CFR50, Appendix B, Criterion XVI.
To successfully carry out this mission, a root-cause investigation needs to be evidence-driven in accordance with a rigorous application of the bedrock of all root-cause methodologies: the Scientific Method. Consistent with the Scientific Method, underlying assumptions have to be questioned and conclusions have to be consistent with the available evidence, as well as with proven scientific facts and principles.
Sometimes root-cause investigations fail to fulfill their primary mission and the failure recurs. In that regard, diagnosing the root cause of root-cause investigation failures is, in itself, an interesting topic. Here are three common reasons why some root-cause investigations fail their mission.
Reason #1: The Tail Wagging the Dog
As a root-cause investigation proceeds and information about the failure event accumulates, some initial hypotheses can be readily falsified by the preliminary evidence and dismissed from consideration. The diminished pool of remaining hypotheses will likely have some attributes in common. More work is then usually needed to uncover additional evidence to discriminate which of the remaining hypotheses specifically apply.
At this point in the investigation, it may become apparent what the final root cause might be—especially if the remaining pool of hypotheses is small and they all share several important attributes. At the same time, it also becomes apparent what the corresponding corrective actions might be.
By anticipating which corrective actions are more palatable to the client or management, the investigator may begin to unconsciously—or perhaps even consciously—steer the remainder of the investigation to arrive at a root cause whose corresponding corrective actions are less troublesome.
Evidence that appears to support the root cause and lead to more palatable corrective actions is actively sought, while evidence that might falsify the favored root cause is not actively sought. Evidence that could falsify a favored root cause may be dismissed as being irrelevant or not needed. It may be tacitly assumed to not exist, to have disappeared or to be too hard or too expensive to find. It may even just be ignored because so much evidence already exists to support the favored root cause that the investigator presumes he already has the answer.
In logic, this is defined as an a priori methodology. This is where an outcome or conclusion is decided beforehand, and the subsequent investigation is conducted to find support for the foregone conclusion. In this case, the investigator has decided what corrective actions he wants based on convenience to his client or management. Subsequently, he uses the remainder of the investigation to seek evidence that points to a root-cause that corresponds to the corrective actions he desires.
What Really Happened: Failure Of A Zener Diode
This X-ray radiograph shows a 1N752A-type Zener diode that was manufactured without a die-attach at one end of the die, and with only marginal die-attach at the other end. This die-attach defi ciency caused the component to fail unexpectedly in an intermittent fashion. In turn, this led to a failure in the voltage regulator system of an emergency diesel generator system, causing it to be temporarily taken out of service.
The failure of this Zener diode occurred in a circuit board that had seen less than 40 hours of actual service time, although the circuit board itself was over 27 years old. It had been a spare board kept in inventory.
Going to this level of detail to gather evidence might seem extreme. This particular evidence, however, was fundamental to validating the hypothesis that the rootcause in this case was a random failure due to a manufacturing defect, and falsifying the hypothesis that the failure was caused by an infant mortality type failure. In the nuclear power industry, this distinction is significant.
Here is an example: A close-call accident involved overturning a large, heavy, lead-lined box mounted on a relatively tall, small-wheeled cart. The root-cause investigation team found that the box and wheeled cart combination was intrinsically unstable. The top-heavy cart easily tipped when the cart was moved and the front wheels had to swivel, or when the cart was rolled over a carpet edge or floor expansion joint.
The investigation team also found that the personnel who moved the cart in the course of doing cleaning work in the area had done so in violation of an obviously posted sign. The sign stated that prior to moving the cart a supervisor was to be contacted. The personnel, however, inadvertently moved the cart—without contacting a supervisor—in order to clean under and around it.
The easy corrective actions in this case would be to chastise the personnel for not following the posted rules and to strengthen work rule adherence through training and administrative permissions. There is ample evidence to back-fit a root cause to support these actions. Also, such a root-cause finding—and its corresponding corrective actions—are consistent with what everyone else in the industry has done to address the problem, as noted in ample operational experience reports. In the nuclear power industry, the "bandwagon" effect of doing what other plants are doing is very strong.
In short, the aforementioned corrective actions are attractive because they appeal to notions of personal accountability, are cheap to do and can quickly dispose of the problem. Consequently, the root cause of the close-call accident was that the workers failed to follow the rules.
Unfortunately, when the cart and box combination is rolled to a new location, the same problem could recur. The procedure change and additional training might not have fixed the instability problem. While the new administrative permissions and additional training could reduce the probability of recurrence, they would not necessarily eliminate it. When the cart is rolled many times to new locations, it is probable that the problem will eventually recur and perhaps cause a significant injury. This situation is similar to the hockey analogy of "shots on goal." Even the best goalkeeper can be scored upon if there are enough shots on goal.
Reason #2: Putting Lipstick on a Corpse
In this instance, a failure event has already been successfully investigated. A root cause supported by ample evidence has been determined. Vigorous attempts to falsify the root-cause conclusion have failed. Ok…so far, so good.
On the other hand, perhaps the root-cause conclusion is related to a deficiency involving a friend of the investigator, a manager known to be vindictive and sensitive to criticism or some company entity that, because of previous problems, can't bear criticism. The latter could include an individual that might get fired if he is found to have caused the problem, an organization that might be fined or sued for violating a regulation or law or a department that might be re-organized or eliminated for repeatedly causing problems. In other words, the root-cause investigator is aware that the actual consequences of identifying and documenting the root cause may be greater than just the corrective actions themselves.
When faced with this dilemma, some investigators attempt to "word-smith" the root-cause report in an eff ort to minimize perceived negative findings and to emphasize perceived positive findings. Instead of using plain, factually descriptive language to describe what occurred, less precise and more positive- sounding language is used. This is called "word-smithing" a report.
"Word-smithed" reports are relatively easy to spot. Instead of using plain modifiers like "deficient" or "inadequate" to describe a process, euphemistic phrases like "less than sufficient" or "less than adequate" are used. Instead of reporting that a component has failed a surveillance test, the component is reported to have "met 95% of its expected goals." Likewise, instead of reporting that a fire occurred, it is reported that there was a "minor oxidation-reduction reaction that was temporarily unsupervised."
In such cases, the root-cause report becomes a quasi-public relations document that sometimes has conflicting purposes. Since it is a root-cause report, its primary purpose is supposed to be a no-nonsense, fact-based document that details what went wrong and how to fix it. However, a secondary, perhaps conflicting, purpose is introduced when the same document is used to convince the reader that the failure event and its root cause are not nearly as significant or serious as the reader might otherwise think.
With respect to recurrence, there are two problems with "word-smithing" a root-cause report. Corrective actions work best when they are specific and targeted. A diluted or minimized root-cause, however, is oft en matched to a diluted or minimized corrective action. There is a strong analogy to the practice of medicine in this instance. When a person has an infection, if the degree of infection is underestimated, the medicine dose may be insufficient and the infection may come back.
The second problem is that by putting a positive "spin" on the problem, management may not properly support what needs to be done to fix the problem. Thus, the report succeeds in convincing its audience that the failure event is not a serious problem.
Reason #3: Elementary My Dear Watson
In some ways, root-cause investigations are a lot like "whodunit" novels. Some plant personnel simply can't resist making a guess about what caused the failure in the same way that mystery buffs often try to second guess who will be revealed to be the murderer at the end of the story. It certainly is fun for a person—and perhaps even a point of pride—if his/her initial guess turns out to be right. Unfortunately, there are circumstances when such a guess can jeopardize the integrity of a root-cause investigation.
The circumstances are as follows:
The deficiency in this scenario that can lead to recurrence is the fact that falsification of the favored hypothesis was not pursued. Once a cause was presumed to have been found, significant evidence gathering ceased. (Why waste resources when we already have the answer?) As a result, evidence that may have falsified the hypothesis, or perhaps supported an alternate hypothesis, was left in the field. Again, this is another example of an a priori methodology: where the de facto purpose of the investigation is to gather information that supports the favored hypothesis.
In this regard, there is a famous experiment about directed observation that applies. Test subjects in the experiment were told to watch a volleyball game carefully because they would be questioned about how many times the volleyballs would be tipped into to air by the participants. This they did.
In fact, the test subjects did this so well, they ignored a person dressed in a gorilla suit who sauntered through the gaggle of volleyball players as they played. When the test subjects were asked about what they had observed, they all reported dutifully the number of times the ball was tipped but no one mentioned the gorilla. When they were told about the gorilla, they were incredulous and did not believe that they had missed seeing a gorilla…until they were shown the tape a second time. At that point, they all observed the gorilla. MT