Fault Trees and Related Models
This page examines fault trees and related models. It is part of the section on Causal Models, which belongs to the section on Model Based Reasoning in the white paper A Guide to Fault Detection and Diagnosis.
Some binary-valued, directed graph cause/effect models have the additional restriction that they are in the form of a tree. That is, no cycles are allowed (even ignoring directions on the arcs). One example is the Ishikawa “fishbone” diagram used in Statistical Quality Control. The “Root Cause Analysis” (RCA) popular in maintenance organizations is another example. Fault tree models are popular in safety analysis and nuclear plant monitoring. They generally have the additional feature that probabilities are calculated at each node.
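As a rough illustration of calculating a probability at each node, the sketch below evaluates a small tree built from AND and OR gates. The `Node` class, the event names, and the failure probabilities are all hypothetical, and the arithmetic assumes the basic events are statistically independent and that no basic event appears more than once in the tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One fault tree node (illustrative structure, not from any specific tool)."""
    name: str
    gate: str = "basic"   # "basic", "and", or "or"
    prob: float = 0.0     # used only when gate == "basic"
    children: List["Node"] = field(default_factory=list)


def probability(node: Node) -> float:
    """Probability that the node's event occurs, assuming independent children."""
    if node.gate == "basic":
        return node.prob
    child_probs = [probability(child) for child in node.children]
    if node.gate == "and":
        # All children must occur: multiply their probabilities.
        result = 1.0
        for p in child_probs:
            result *= p
        return result
    if node.gate == "or":
        # At least one child occurs: 1 minus the chance that none occur.
        none_occur = 1.0
        for p in child_probs:
            none_occur *= 1.0 - p
        return 1.0 - none_occur
    raise ValueError(f"unknown gate type: {node.gate!r}")


# Hypothetical example: cooling is lost if both pumps fail or a valve sticks.
pump_a = Node("pump A fails", prob=0.01)
pump_b = Node("pump B fails", prob=0.01)
valve = Node("valve stuck closed", prob=0.001)
both_pumps = Node("both pumps fail", gate="and", children=[pump_a, pump_b])
top = Node("loss of cooling", gate="or", children=[both_pumps, valve])

print(f"{top.name}: {probability(top):.7f}")
```

Because the tree has no repeated events, a single bottom-up pass is enough here.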
The “top” node in these models is often used to indicate a problem such as loss of a critical function like cooling (for instance, as indicated by a temperature). Diagnosis then proceeds to test each of the possible causes, going through the fault tree nodes until a root cause is identified.
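The top-down procedure just described might be sketched as follows. This is a minimal illustration, not a production diagnostic engine: the `causes` dictionary, the event names, and the `confirmed` callback (standing in for whatever test or sensor check confirms an event) are assumptions made for the example.

```python
from typing import Callable, Dict, List


def diagnose(node: str,
             causes: Dict[str, List[str]],
             confirmed: Callable[[str], bool]) -> List[str]:
    """Walk down from `node`, following only confirmed events.

    Returns the deepest confirmed events found: leaves are root causes,
    and a non-leaf is returned when none of its causes can be confirmed.
    """
    if not confirmed(node):
        return []
    children = causes.get(node, [])
    if not children:
        return [node]  # a leaf of the tree: a root cause
    found: List[str] = []
    for child in children:
        found.extend(diagnose(child, causes, confirmed))
    # If no cause was confirmed, the trail ends at this node.
    return found or [node]


# Hypothetical tree: each event maps to its possible causes.
causes = {
    "loss of cooling": ["pump failure", "valve stuck closed"],
    "pump failure": ["motor fault", "loss of power"],
}

# Hypothetical test results for each event.
observations = {
    "loss of cooling": True,
    "pump failure": True,
    "valve stuck closed": False,
    "motor fault": False,
    "loss of power": True,
}

result = diagnose("loss of cooling", causes,
                  lambda name: observations.get(name, False))
print(result)
```

In this example the search confirms "pump failure", rules out "motor fault" and "valve stuck closed", and stops at "loss of power".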
Fault trees are popular in industries such as the nuclear power industry. They are considered for use in online diagnosis partly because they are already developed and reviewed for safety and general risk analysis reasons.
Fault trees are commonly used for offline analysis. A main purpose of constructing an Ishikawa “fishbone” diagram or an RCA diagram is to provide a mechanism for organizing and recording the results of a team effort in analyzing a single problem that has already occurred.
One problem with using these same fault trees for online diagnosis is that there is a separate tree for each defined top-level problem. The result is that some of the same variables and logic are represented in multiple trees, often without any guarantee that they are consistent. This presents an application maintenance headache even if the duplication is noticed. It is better for each possible fault, each propagation path, and each sensor value to exist in just one place, rather than being duplicated across multiple fault trees.
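One way to at least detect this kind of duplication is to compare how each event is defined across the trees. The sketch below assumes a hypothetical representation in which every tree maps an event name to a gate type and a set of inputs. Real fault tree tools store their models differently, so this shows only the idea.

```python
from typing import Dict, FrozenSet, Tuple

Definition = Tuple[str, FrozenSet[str]]   # (gate type, input events)
Tree = Dict[str, Definition]              # event name -> its definition


def find_conflicts(trees: Dict[str, Tree]) -> Dict[str, Dict[str, Definition]]:
    """Return events that are defined differently in different trees."""
    seen: Dict[str, Dict[str, Definition]] = {}
    for tree_name, tree in trees.items():
        for event, definition in tree.items():
            seen.setdefault(event, {})[tree_name] = definition
    # Keep only events with more than one distinct definition.
    return {event: defs for event, defs in seen.items()
            if len(set(defs.values())) > 1}


# Two hypothetical trees that both describe "pump failure", inconsistently.
trees = {
    "loss of cooling": {
        "loss of cooling": ("or", frozenset({"pump failure", "valve stuck closed"})),
        "pump failure": ("or", frozenset({"motor fault", "loss of power"})),
    },
    "low flow": {
        "low flow": ("or", frozenset({"pump failure", "line blocked"})),
        "pump failure": ("or", frozenset({"motor fault"})),
    },
}

conflicts = find_conflicts(trees)
for event, definitions in conflicts.items():
    print(event, "differs between:", sorted(definitions))
```

A check like this only flags the inconsistency. Resolving it still requires deciding which definition is correct, which is the maintenance burden a single unified model avoids.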
Tools may be provided to help manage this. But it may still be hard to make full use of all the sensor information available, because cross-links between the trees are not represented, even though these links exist in a full cause/effect model. Those links could provide cross-checks and help avoid dependence on single sensors for all conclusions.
Fault trees can have problems (such as in calculating probabilities) with "common mode" failures, where a single underlying failure enters the trees at more than one point and the linkage between those entry points is not really "understood" within the diagram. In nuclear applications, there are always possible common mode failures, generally due to utility problems such as loss of power or cooling water. Processing outside of the main fault tree algorithms may be used to account for this.
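A small worked example shows why common mode failures distort the probability calculation. The numbers and event names below are hypothetical. Two redundant pumps each fail on their own motor fault or on a shared loss of power. Treating the two pump branches as independent is compared with an exact calculation that enumerates every combination of the independent basic events. Brute-force enumeration is used here only because the example is tiny; it is not meant to represent how nuclear risk analysis tools handle the problem.

```python
from itertools import product
from typing import Callable, Dict

# Hypothetical basic event probabilities; "loss of power" is the common mode.
basic = {"motor A fault": 0.01, "motor B fault": 0.01, "loss of power": 0.001}


def top_event(state: Dict[str, bool]) -> bool:
    """Both pumps fail; each pump fails on its own motor fault or loss of power."""
    pump_a = state["motor A fault"] or state["loss of power"]
    pump_b = state["motor B fault"] or state["loss of power"]
    return pump_a and pump_b


def exact_probability(basic: Dict[str, float],
                      top: Callable[[Dict[str, bool]], bool]) -> float:
    """Sum the probability of every basic-event combination that causes `top`."""
    names = list(basic)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        state = dict(zip(names, values))
        if top(state):
            weight = 1.0
            for name in names:
                weight *= basic[name] if state[name] else 1.0 - basic[name]
            total += weight
    return total


def naive_probability(basic: Dict[str, float]) -> float:
    """Treat the two pump branches as independent, ignoring the shared cause."""
    p_power = basic["loss of power"]
    p_a = 1.0 - (1.0 - basic["motor A fault"]) * (1.0 - p_power)
    p_b = 1.0 - (1.0 - basic["motor B fault"]) * (1.0 - p_power)
    return p_a * p_b


exact = exact_probability(basic, top_event)
naive = naive_probability(basic)
print(f"naive: {naive:.7f}  exact: {exact:.7f}")
```

With these numbers the naive calculation underestimates the chance of losing both pumps by roughly a factor of nine, because the shared power failure dominates the exact result.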
When fault trees are already available, they often emphasize the extreme cases, since those have been the most studied because of safety or risk analysis requirements. They might not already be developed for more routine, less critical problems that may be more common and harder to diagnose. An example was in the initial fault management ideas for the Iridium satellite-based global phone system. Because many fault trees were available for the satellites, they were built into the system. But those fault trees were available because of risk analysis. The most-studied cases mostly had the same conclusion: de-orbit the satellite. These extreme cases were not terribly useful for more routine operations!
In cases where the fault trees don't already exist, it may be better to develop more general cause/effect models that allow full use of data and interactions, and that provide more opportunity to recognize bad sensors based on all the conclusions that would be derived from them. Even in cases where the fault trees do exist, it may be worth the effort to use them as the basis for a more complete, unified causal model where every variable and link is represented just once. If this implies that the online diagnosis might not have the same rigorous probability calculations as the risk analysis fault trees, that might be worth the tradeoff.
Note that the structure of fault trees can easily be derived from causal models, but that assembling a causal model from a collection of fault trees would not be as easy, and would likely be missing significant links not represented in the fault trees.
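The easy direction can be sketched as a backward walk over the causal graph from a chosen top event. The example assumes a hypothetical, acyclic causal model stored as a dictionary from each effect to its direct causes. The function and event names are illustrative only.

```python
from typing import Dict, List, Tuple


def derive_tree(effect: str,
                causes_of: Dict[str, List[str]],
                path: Tuple[str, ...] = ()) -> tuple:
    """Unroll the causal graph upstream of `effect` into a nested tree.

    Each tree node is (event name, list of subtrees). A cause shared by
    several effects is copied into every branch that reaches it.
    """
    if effect in path:
        raise ValueError(f"cycle through {effect!r}; a tree cannot represent it")
    subtrees = [derive_tree(cause, causes_of, path + (effect,))
                for cause in causes_of.get(effect, [])]
    return (effect, subtrees)


# Hypothetical causal model: effect -> direct causes.
causes_of = {
    "loss of cooling": ["pump A failure", "pump B failure"],
    "pump A failure": ["loss of power"],
    "pump B failure": ["loss of power"],
}

tree = derive_tree("loss of cooling", causes_of)
print(tree)
```

In the derived tree, "loss of power" appears once under each pump, and nothing in the tree records that the two copies are the same event. That lost identity is part of what makes the reverse direction, rebuilding a causal model from a collection of trees, harder.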
The classic references covering fault trees focus on nuclear applications. Besides methodology and nuclear examples, they also give excellent overviews of the probability calculations associated with fault trees.
The definitive, classic textbook is still available:
Probabilistic Risk Assessment and Management for Engineers and Scientists, Second Edition, Kumamoto and Henley, IEEE Press, 1996.
The US NRC (Nuclear Regulatory Commission) has made its complete 1981 text available online for free as a PDF file:
Fault Tree Handbook (NUREG-0492).
Copyright 2010 - 2013, Greg Stanley