Single Fault Assumption vs. Multiple Faults in Practice
This page examines diagnosis in the presence of multiple faults as part of the white paper
A Guide to Fault Detection and Diagnosis.
In large complex operations (such as the control centers for refineries, or for network management), there are usually multiple outstanding problems. Walk into almost any control center and you will see multiple red messages, and those are just the ones related to the critical outstanding problems.
While the probability of simultaneous failure is slim, the probability is high that there hasn’t been time to fix all the previous problems. This can create serious problems in isolating faults because no single fault will result in all of the abnormal symptoms that are observed. (Fault detection will still be achieved.) Fault isolation is the most difficult when the multiple faults result in overlapping symptoms.
Multiple faults also introduce the problem of “fault masking”: the presence of one fault may make it impossible to even see symptoms to detect or isolate some other faults. For example, when diagnosing automobile faults, if the fuel line is partly blocked, you will not be able to detect or diagnose most engine faults that are only observable at high speed, because the engine will not be able to achieve high speed. Masking is often possible when diagnosing faults that have a major impact on system operation, or when the fault is a sensor failure. An example of masking and the effects of a single fault assumption is given in the discussion of causal models.
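The masking effect can be sketched in a few lines of code. The following is a minimal illustration, not a real diagnostic tool: the fault names, symptom names, and the `masks` table are all hypothetical, and it assumes a fault becomes undetectable only when every one of its symptoms is hidden by the active fault.

```python
# Hypothetical fault signatures: the symptoms each fault would normally produce.
SIGNATURES = {
    "high_speed_misfire": {"misfire_at_high_rpm"},
    "valve_float": {"power_loss_at_high_rpm"},
    "dead_battery": {"no_crank"},
}

# Hypothetical masking table: symptoms that cannot be observed
# while the given fault is present.
MASKS = {
    "fuel_line_partly_blocked": {"misfire_at_high_rpm", "power_loss_at_high_rpm"},
}


def undetectable_faults(active_fault, signatures, masks):
    """List faults whose entire signature is hidden while active_fault is present."""
    hidden = masks.get(active_fault, set())
    return sorted(
        fault
        for fault, signature in signatures.items()
        if fault != active_fault and signature and signature <= hidden
    )


masked = undetectable_faults("fuel_line_partly_blocked", SIGNATURES, MASKS)
```

In this sketch the two faults observable only at high speed end up in `masked`, while the battery fault remains detectable.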
When diagnosing multiple faults, systems often attempt to pick the smallest number of faults that explain the observed symptoms. This is a form of “Occam’s razor”, also called the “parsimony principle”.
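One simple way to apply the parsimony principle is a brute-force search over fault combinations of increasing size. The sketch below is an illustration under stated assumptions, not a production method: the fault and symptom names are hypothetical, each fault is assumed to have a fixed symptom set, and a fault set "explains" the observations when its combined symptoms cover every observed symptom.

```python
from itertools import combinations

# Hypothetical fault-to-symptom map, for illustration only.
FAULT_SYMPTOMS = {
    "pump_failure": {"low_flow", "low_pressure"},
    "valve_stuck": {"low_flow", "high_level"},
    "sensor_drift": {"high_temp_reading"},
}


def smallest_explanations(observed, fault_symptoms):
    """Return every minimum-size fault set whose symptoms cover `observed`."""
    observed = set(observed)
    faults = sorted(fault_symptoms)
    # Try 0 faults, then 1, then 2, ... and stop at the first size that works.
    for size in range(len(faults) + 1):
        found = []
        for combo in combinations(faults, size):
            covered = set().union(*(fault_symptoms[f] for f in combo))
            if observed <= covered:
                found.append(set(combo))
        if found:
            return found
    return []  # No combination of known faults explains the symptoms.


explanations = smallest_explanations(
    {"low_flow", "low_pressure", "high_level"}, FAULT_SYMPTOMS
)
```

Here no single fault covers all three symptoms, so the search moves on to pairs. The exhaustive search grows quickly with the number of faults, so a real system would need a smarter search; this version only shows the idea.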
Diagnostic systems that make a single fault assumption do poorly at fault isolation when faced with multiple faults. This includes pattern matching using classification of fault signatures, and systems based on Bayes’ rule. The reason is that no single fault will explain all of the observed abnormal symptoms, so no single fault signature will be close to the combined fault signature of two faults.
When using diagnostic techniques based on a single fault assumption, special consideration needs to be given to ameliorate the problem. At a minimum, the system should recognize and notify users that fault isolation results may be wrong when no single fault signature is matched. Detecting the situation is easy when explicitly using fault signatures or quantitative models, but may be difficult when using black box classifiers.
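With explicit fault signatures, recognizing the "no single signature matches" situation can be as simple as checking the best similarity score against a threshold. The sketch below assumes set-valued signatures and uses Jaccard similarity; the signature table, the function name, and the 0.8 threshold are all illustrative assumptions.

```python
# Hypothetical single-fault signatures, for illustration only.
SIGNATURES = {
    "pump_failure": {"low_flow", "low_pressure"},
    "valve_stuck": {"low_flow", "high_level"},
    "sensor_drift": {"high_temp_reading"},
}


def match_single_fault(observed, signatures, min_similarity=0.8):
    """Return (best_fault, similarity, trustworthy) for the closest signature.

    `trustworthy` is False when even the best match falls below
    `min_similarity`, which suggests the single fault assumption
    may not hold and the user should be warned.
    """
    observed = set(observed)
    best_fault, best_score = None, 0.0
    for fault, signature in signatures.items():
        union = observed | signature
        # Jaccard similarity: shared symptoms / all symptoms involved.
        score = len(observed & signature) / len(union) if union else 1.0
        if score > best_score:
            best_fault, best_score = fault, score
    return best_fault, best_score, best_score >= min_similarity


clean = match_single_fault({"low_flow", "low_pressure"}, SIGNATURES)
mixed = match_single_fault({"low_flow", "low_pressure", "high_level"}, SIGNATURES)
```

In the `mixed` case the best candidate matches only two of the three observed symptoms, so the result is flagged as untrustworthy rather than reported as a confident isolation.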
There are many possible approaches to partly work around the limitations of fault isolation based on a single fault assumption. One approach is to explicitly treat each combination of multiple failures as another single fault. This does not scale well to many faults, though, and will violate any assumptions about fault independence.
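A quick count shows why treating fault combinations as extra single faults scales poorly. The function below is only a back-of-the-envelope sketch; its name and parameters are hypothetical, and it assumes every combination of up to `max_simultaneous` faults needs its own class.

```python
from math import comb


def combined_fault_classes(n_faults, max_simultaneous):
    """Count classes when each combination of 1..max_simultaneous faults is one class."""
    return sum(comb(n_faults, k) for k in range(1, max_simultaneous + 1))


singles = combined_fault_classes(100, 1)
up_to_pairs = combined_fault_classes(100, 2)
up_to_triples = combined_fault_classes(100, 3)
```

With 100 single faults, allowing pairs raises the count to 5,050 classes, and allowing triples raises it to 166,750.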
The following approaches are more practical. First, multiple problem diagnosers might each be limited to a very small portion of the overall system, giving up knowledge based on interactions with other portions in the hope of improving robustness. A second approach is to have multiple observers running in parallel, each one sensitive to symptoms of particular faults, and only using “positive” evidence of their particular fault (ignoring symptoms that are evidence against it, because those symptoms might be due to another fault). A third approach is to account for the fact that multiple failures normally occur in sequence: modify the structure of the fault isolator after the first failure is diagnosed, so that it ignores symptoms which are already explained.
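Two of these ideas, parallel observers using only positive evidence and sequential diagnosis that ignores already-explained symptoms, can be sketched as follows. This is a simplified illustration: the signature table, function names, and threshold are hypothetical, and symptoms are treated as simple present/absent flags.

```python
# Hypothetical single-fault signatures, for illustration only.
SIGNATURES = {
    "pump_failure": {"low_flow", "low_pressure"},
    "valve_stuck": {"high_level", "valve_alarm"},
    "sensor_drift": {"high_temp_reading"},
}


def positive_evidence_observers(observed, signatures, threshold=1.0):
    """Run one observer per fault, scoring only that fault's own symptoms.

    Symptoms outside a fault's signature are ignored rather than counted
    against it, because they might be caused by a different fault.
    """
    observed = set(observed)
    flagged = {}
    for fault, signature in signatures.items():
        if not signature:
            continue  # An empty signature gives no positive evidence.
        score = len(observed & signature) / len(signature)
        if score >= threshold:
            flagged[fault] = score
    return flagged


def residual_symptoms(observed, known_faults, signatures):
    """Drop symptoms already explained by previously diagnosed faults."""
    explained = set()
    for fault in known_faults:
        explained |= signatures[fault]
    return set(observed) - explained


observed = {"low_flow", "low_pressure", "high_level", "valve_alarm"}
flagged = positive_evidence_observers(observed, SIGNATURES)
remaining = residual_symptoms(observed, ["pump_failure"], SIGNATURES)
```

In this example both observers fire even though neither signature matches the full symptom set, and once the pump failure is known, the remaining symptoms match the valve signature exactly. Note that removing explained symptoms can hide a later fault that shares a symptom with an earlier one, which is the overlapping-symptom case described earlier.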
If it weren't for these partial workarounds, fault isolation techniques based on a single fault assumption would be essentially useless for large-scale systems such as in process monitoring or network management. Many systems have a number of small-scale, ad-hoc diagnosers, for example. Also, model-based systems that support predictions (such as causal models) should at least support sequential failures. For detection, they would at a minimum account for duplicate messages due to already-known failures even if fault isolation for subsequent faults fails. That is also accomplished in alarm filtering and event correlation.
Copyright 2010 - 2013, Greg Stanley