The example above starts with all diagnostic states unknown. The first observation is that H is true. Since there is only a single input to H, F can be inferred to be true. As a result, there is an ambiguity group consisting of C1, C2, and C3: at least one of these suspects must be true. Next, the node I is observed to be false. Hence, both G and C3 can be inferred to be false. Finally, a direct observation indicates that C2 is false. Since the inputs from C2 and C3 to node F are now believed to be false, the only remaining possible true input to F is C1, so C1 is inferred to be true. Diagnosis is complete, and we can now predict that D must be true as well.
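The walkthrough above can be replayed with a small propagation sketch. The figure is not reproduced here, so the graph below is an assumption reconstructed from the text: F is an OR of C1, C2, and C3; H has F as its only input; C3 leads to G, which leads to I; and D is caused only by C1. The names `CAUSES` and `propagate` are hypothetical, and conflicting data is not handled.

```python
# Hypothetical graph reconstructed from the walkthrough (the figure is not shown here).
# Each effect maps to its direct causes; every node is treated as an OR of its inputs.
CAUSES = {
    "F": ["C1", "C2", "C3"],
    "H": ["F"],
    "G": ["C3"],
    "I": ["G"],
    "D": ["C1"],  # assumption: D is caused only by C1
}


def propagate(state):
    """Apply OR-node inference rules until nothing changes.

    `state` maps node name to True or False; missing nodes are unknown.
    Conflicting data is not handled in this sketch.
    """
    changed = True
    while changed:
        changed = False
        for effect, causes in CAUSES.items():
            values = [state.get(c) for c in causes]
            unknown = [c for c in causes if state.get(c) is None]
            updates = {}
            if effect not in state:
                # Downstream prediction: any true cause makes the effect true;
                # the effect is false only when every cause is known false.
                if any(v is True for v in values):
                    updates[effect] = True
                elif not unknown:
                    updates[effect] = False
            elif state[effect] is False:
                # Upstream: a false effect rules out every cause.
                updates = {c: False for c in unknown}
            elif len(unknown) == 1 and True not in values:
                # Upstream: a true effect with one remaining suspect implicates it.
                updates[unknown[0]] = True
            if updates:
                state.update(updates)
                changed = True
    return state


state = {}
state["H"] = True            # first observation
propagate(state)
suspects = [c for c in CAUSES["F"] if c not in state]  # ambiguity group

state["I"] = False           # second observation
propagate(state)

state["C2"] = False          # third observation
propagate(state)
```

After the first observation `suspects` holds C1, C2, and C3; after the third, `state` marks C1 and D as true, matching the conclusions in the text.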
Conflicting data
The examples above do not demonstrate what happens when there is conflicting data that is inconsistent with the model. For instance, in the above example, what if D were observed to be true, but F were observed to be false? Conflicts could arise because of model errors, measurement errors, or timing errors if there are incorrectly modeled time delays. Several strategies are possible. The approach taken in SymCure was to always believe the most recent input, and only override existing values when needed to resolve the conflict. Other approaches perform numerical combinations of data to estimate probabilities or other metrics for each possible root cause. Most of those make the additional assumptions that there are no time delays, and that there is only a single fault. For instance, one can calculate “distances” between symptom vectors and pick the single fault whose expected symptoms most closely match the observed symptoms. When some symptom values are “unknown”, the calculation can simply ignore those symptoms.
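As a minimal sketch of the distance idea, the following counts mismatches between observed symptoms and each fault's expected symptoms, skipping symptoms whose value is unknown. The fault names, symptom names, and signatures are invented for illustration, and a simple mismatch count (Hamming distance) is only one possible choice of metric.

```python
# Hypothetical expected symptom values for each single fault.
SIGNATURES = {
    "fault_A": {"s1": True, "s2": True, "s3": False},
    "fault_B": {"s1": True, "s2": False, "s3": True},
    "fault_C": {"s1": False, "s2": True, "s3": True},
}


def distance(observed, signature):
    """Count mismatches over symptoms whose observed value is known (not None)."""
    return sum(
        1
        for name, expected in signature.items()
        if observed.get(name) is not None and observed[name] != expected
    )


def closest_fault(observed, signatures):
    """Return the single fault whose signature is nearest to the observations."""
    return min(signatures, key=lambda fault: distance(observed, signatures[fault]))


# s2 is unknown, so it is ignored in every distance calculation.
observed = {"s1": True, "s2": None, "s3": True}
best = closest_fault(observed, SIGNATURES)
```

Here `best` is `"fault_B"`, the only signature with no mismatches on the known symptoms.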
Using “a priori” estimates of the probability of a fault
With some techniques (such as Bayes' rule), “a priori” probabilities of failure (prior estimates) can help in guessing the most likely root cause. For instance, consider the ambiguity group C1 and C2 in the “E2 observed true” case of the first picture of “Examples of diagnosis”. Suppose we can get no further data. We can still provide useful diagnostic information: If we know from historical failure data that one failure is twice as likely as the other, and we have to pick one, we pick the one with the higher a priori probability. Or, we can report estimated probabilities.
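The “report estimated probabilities” option can be sketched with Bayes' rule under a single-fault assumption. The prior values below are invented for illustration (C1 twice as likely as C2), and both faults are assumed to explain the observed symptom equally well, so the likelihoods are set to 1.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over mutually exclusive faults: normalize prior * likelihood."""
    unnormalized = {fault: priors[fault] * likelihoods[fault] for fault in priors}
    total = sum(unnormalized.values())
    return {fault: value / total for fault, value in unnormalized.items()}


# Hypothetical historical failure rates: C1 is twice as likely as C2.
priors = {"C1": 0.02, "C2": 0.01}
# Assumption: either fault would produce the observed symptom with certainty.
likelihoods = {"C1": 1.0, "C2": 1.0}

estimates = posterior(priors, likelihoods)
most_likely = max(estimates, key=estimates.get)
```

With these numbers the estimates are about 2/3 for C1 and 1/3 for C2, so C1 is the fault to pick if only one can be reported.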
Test planning
In all the examples above, we just inserted values into nodes and propagated values upstream and downstream. But unless the monitoring system is completely automated and completely passive, values will not just show up randomly or periodically based on scanning of data. They may also be the result of tests. Tests can be questions asked of end users, requiring a manual input. They can also be data acquisition or the result of automated workflows that are only run when needed. Planning which tests to request is an essential part of the engine for a diagnostic package. This is described in more detail in the section on tests.
Causality in reality implies time delays and lags
In physical systems, causality is in reality associated with some time delay or lags between cause and effect. This has to happen because mass or energy has to move, overcoming resistance by inertia, thermal inertia, inductance, or other physical phenomena. This is discussed on the page Causal Time Delays and Lags.
Commercial use of binary causal models
Binary qualitative cause/effect models are in widespread use. The SymCure CDG example was already cited. The SMARTS InCharge product that was popular in network management applications used binary causal models for application development, but then compiled them into fault signature patterns (vectors of symptoms present for a given root cause fault) for pattern matching at run time. The popular fault tree and related models are a special case discussed next.
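To illustrate the general idea of compiling a causal model into fault signatures, the sketch below walks downstream from each root cause and records which observable symptoms it can reach. This is not a description of how any particular product implements it; the graph, node names, and function names are hypothetical.

```python
# Hypothetical causal graph: each node maps to the nodes it directly causes.
EFFECTS = {
    "R1": ["A"],
    "R2": ["A", "B"],
    "A": ["S1", "S2"],
    "B": ["S3"],
}
SYMPTOMS = ["S1", "S2", "S3"]  # observable nodes, in signature order


def reachable(node):
    """Return every node downstream of `node` (iterative depth-first search)."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in EFFECTS.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def compile_signatures(root_causes):
    """Build one symptom vector per root cause for run-time pattern matching."""
    signatures = {}
    for root in root_causes:
        downstream = reachable(root)
        signatures[root] = tuple(symptom in downstream for symptom in SYMPTOMS)
    return signatures


signatures = compile_signatures(["R1", "R2"])
```

In this toy graph R1 yields `(True, True, False)` and R2 yields `(True, True, True)`, so observing S3 distinguishes the two root causes.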
Fault trees and related models
Some binary-valued, directed graph cause/effect models have the additional restriction that they are in the form of a tree. That is, no cycles are allowed (even ignoring directions on the arcs). One example is the Ishikawa “fishbone” diagram used in Statistical Quality Control. The “Root Cause Analysis” (RCA) popular in maintenance organizations is another example. Fault tree models are popular in safety analysis and nuclear plant monitoring. They generally have the additional feature that probabilities are calculated at each node. Fault trees are commonly used for offline analysis, and are discussed on the separate page Fault trees and related models.
Other variations of qualitative cause/effect models
There are further variations of qualitative cause/effect models. Some model just binary variables: the presence or absence of faults and the resulting symptoms, as in the case of the CDG (Causal Directed Graph) models. Others incorporate a sign: In the SDG (“Signed Directed Graph”) models, variables are either high, low, or normal. The propagation of faults from variable to variable then includes a sign to indicate whether a high input causes a high or a low effect. The AI community has developed techniques based around “qualitative physics”, which also are signed digraph models.
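A minimal sketch of signed propagation follows, assuming deviations are coded as +1 (high), -1 (low), and 0 (normal), and each arc carries a sign of +1 or -1. The variable names and arcs are invented, the graph is assumed to have no loops, and arcs are assumed to be listed so that all arcs into a node appear before any arc out of it. When two inputs push an effect in opposite directions, the sketch marks the result as ambiguous (`None`) rather than guessing.

```python
# Hypothetical signed arcs: (cause, effect, sign).
ARCS = [
    ("feed_flow", "level", +1),     # high feed flow tends to raise the level
    ("outlet_valve", "level", -1),  # a more open outlet valve tends to lower it
    ("level", "pressure", +1),      # a high level tends to raise the pressure
]


def propagate_signs(root_deviations, arcs):
    """Propagate +1/-1/0 deviations downstream through signed arcs."""
    state = dict(root_deviations)
    for cause, effect, sign in arcs:
        cause_dev = state.get(cause, 0)
        if cause_dev is None:
            state[effect] = None  # an ambiguous cause gives an ambiguous effect
            continue
        if cause_dev == 0:
            continue              # a normal cause contributes nothing
        contribution = sign * cause_dev
        previous = state.get(effect, 0)
        if previous == 0:
            state[effect] = contribution
        elif previous != contribution:
            state[effect] = None  # opposing influences: sign cannot be determined
    return state


single = propagate_signs({"outlet_valve": +1}, ARCS)
opposed = propagate_signs({"feed_flow": +1, "outlet_valve": +1}, ARCS)
```

In `single`, the level and the pressure both come out low. In `opposed`, the two influences on the level conflict, so the level and everything downstream of it are left undetermined.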
SDG models in principle could do a better job of fault isolation because more information is included than in the unsigned case. However, SDG techniques have more problems with loops in the model, and cancellation of effects, than the unsigned directed graph models. Process plants have a prevalence of material recycle loops; energy transfer in feed/product heat exchangers; feedback loops caused by exothermic reactions, temperature, and increasing catalyst activity with temperature; and most importantly, the presence of numerous feedback control loops. Probably partly because of that, SDG-based techniques have not been popular in practical applications, although there was quite a bit of research around them.
There has traditionally been an assumption that the theoretical increased fault isolation capability of SDG models (or Bayesian models) compared to simple binary models is always important. However, in application areas such as operations management in the process industries or network management, we have already noted that there is generally ample instrumentation for fault isolation. The major benefit of model-based reasoning is in providing a principle for organizing the diagnostic process automatically, given someone with the domain knowledge to build a model. For these industries, wider, reliable applicability without worrying about special cases is generally more important than the few extra measurements that might need to be automatically used.
There are further variations in how uncertainty or evidence combination is handled. For instance, the SymCure CDG example already cited supported propagation of fuzzy values through simple minimum and maximum operations for “AND” and “OR”, respectively.
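The minimum and maximum operations can be sketched as follows. The function names and the input values (degrees of belief between 0 and 1) are illustrative assumptions, not taken from any product.

```python
def fuzzy_and(*values):
    """Fuzzy AND: the result is only as strong as the weakest input."""
    return min(values)


def fuzzy_or(*values):
    """Fuzzy OR: the result is as strong as the strongest input."""
    return max(values)


# Hypothetical node: effect = (cause_a AND cause_b) OR cause_c
cause_a, cause_b, cause_c = 0.8, 0.6, 0.3
effect = fuzzy_or(fuzzy_and(cause_a, cause_b), cause_c)
```

Here `effect` is 0.6: the AND branch yields 0.6, which exceeds the 0.3 from the other input. With inputs restricted to 0 and 1, the same operations reduce to ordinary Boolean AND and OR.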
Quantitative causal models
Causal models can also be quantitative. Strictly speaking, models based on differential or difference equations are causal, as are the equivalent signal flow graphs. The causes are changes in the input variables, which then propagate over time through the equations, especially obvious when written as difference equations.
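As a small illustration of why causality is obvious in difference-equation form, consider a first-order lag written as y[k+1] = a*y[k] + b*u[k]. The coefficients and the input sequence below are arbitrary values chosen for the example. The output at step k+1 depends only on values at step k, so a change in the input can only affect later outputs.

```python
def simulate(u, a=0.5, b=0.5, y0=0.0):
    """Simulate y[k+1] = a*y[k] + b*u[k]; returns y[0] through y[len(u)]."""
    y = [y0]
    for u_k in u:
        y.append(a * y[-1] + b * u_k)
    return y


# The input steps from 0 to 1 at k = 1.
u = [0.0, 1.0, 1.0, 1.0]
y = simulate(u)
```

The result is `[0.0, 0.0, 0.5, 0.75, 0.875]`: the output does not move until one step after the input change, and then approaches its new value gradually.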
Algebraic models in the form g(x) = 0 (such as the constraints associated with data reconciliation) are definitely not causal. However, when these are rewritten in input/output form such as y = f(x), where x is a vector of inputs to some equipment or unit, and y is the vector of outputs, causality is represented in a way and can be used. The input/output form is really an approximation that ignores dynamics, since in reality there are time delays and lags between input changes and output changes.
An example approach to pipeline monitoring using causal models and quantitative models
The concept paper Pipeline Diagnosis Emphasizing Leak Detection: An Approach And Demonstration outlines an approach to pipeline leak detection that combines causal models of abnormal behavior with both static (algebraic) models and dynamic models.
Copyright 2010-2020, Greg Stanley