Passive System Monitoring vs. Active Testing
This page examines the distinction between passively monitored symptoms and actively requested, non-routine tests, as part of the white paper A Guide to Fault Detection and Diagnosis.
In online monitoring systems, many diagnostic techniques assume routine scanning of every variable of interest, or that subsystems such as agents send events when there is a problem. This can be called “passive monitoring”: once system integration is established, the data or events simply show up as input to the diagnostic system. However, there are many situations in which this will not be the case. We make a distinction between “symptoms”, which are based on routine scanning (and events generated from it), and non-routine “tests”, which must be requested.
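The symptom/test distinction can be sketched in code. This is a minimal, hypothetical model (all class and method names are illustrative, not from the paper): symptoms arrive unrequested through routine scanning or agent events, while tests must be explicitly requested before a result can come back.

```python
from dataclasses import dataclass, field

@dataclass
class DiagnosticSystem:
    """Illustrative sketch: passive symptoms vs. requested tests."""
    symptoms: dict = field(default_factory=dict)      # routinely scanned values/events
    pending_tests: set = field(default_factory=set)   # requests awaiting results
    test_results: dict = field(default_factory=dict)  # completed test results

    def on_symptom(self, name, value):
        """Passive monitoring: the data simply shows up as input."""
        self.symptoms[name] = value

    def request_test(self, name):
        """Non-routine test: must be explicitly requested, at some cost."""
        self.pending_tests.add(name)

    def on_test_result(self, name, value):
        """A requested test eventually returns a value."""
        self.pending_tests.discard(name)
        self.test_results[name] = value

dx = DiagnosticSystem()
dx.on_symptom("flow_high", True)        # arrives unrequested
dx.request_test("bypass_valve_open")    # must be asked for
dx.on_test_result("bypass_valve_open", True)
```

The asymmetry is the point: `on_symptom` is driven entirely by the monitored system, while `request_test` is driven by the diagnostic system's own strategy.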
These additional tests are only run when needed for fault isolation, because they require additional resources such as time, material cost, labor cost, CPU load, or network traffic, or because they introduce some disturbances into normal operations as the only way to gain information. If the tests were “free”, they would be run all the time in an online system. In the case of maintenance troubleshooting, however, much equipment testing can only be done when the equipment is already removed from service. Tests often include some manual (human) effort, although they also can be automated tests that are simply too expensive in some sense to run routinely.
Tests commonly return binary values. In fault propagation models, where true corresponds to a problem, a true result means that the test verified a particular problem. More realistically, tests have an additional possible value of unknown, since values are unknown before a test is run, and because a test may fail to return a value. In some methodologies, a value between 0 and 1 (or unknown) might be returned, indicating fault extent.
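These return values might be represented as below. This is only a sketch under the assumptions just stated (binary plus unknown, optionally a fault extent in [0, 1]); the names are illustrative, not a standard API.

```python
from enum import Enum

class TestResult(Enum):
    TRUE = "true"        # evidence that the suspected fault is present
    FALSE = "false"      # evidence against the suspected fault
    UNKNOWN = "unknown"  # not yet run, or the test failed to return a value

def as_extent(result, extent=None):
    """Map a test outcome onto the 0-to-1 fault-extent scale used by
    some methodologies. An explicit extent, if given, is clamped to
    [0, 1]; unknown maps to None."""
    if extent is not None:
        return max(0.0, min(1.0, extent))
    return {TestResult.TRUE: 1.0,
            TestResult.FALSE: 0.0,
            TestResult.UNKNOWN: None}[result]
```

Keeping unknown distinct from false matters: a test that has not run (or failed to return) provides no evidence, whereas a false result actively rules faults out.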
There are examples of the need for formalizing testing in almost every industry. This is very common in electronic equipment troubleshooting. For instance, a manual test might be “Measure the voltage between points A and B. Is it less than 0.1 volts?” A true value could provide evidence for a short circuit.
The need for manual or automated non-routine tests is very common in the process industries. For example, “Send a field operator to see if a bypass valve is open”. A true result could explain why a control valve cannot adequately reduce a flow rate. Due to cost, not every state affecting plant operations has a sensor associated with it, so manual observation may be needed. Another example is sending the field operator to verify a drum level by looking at a local gauge. This cross-check may be the only way to determine whether a particular sensor has failed.

Another common example is a non-routine lab test. If a particular form of catalyst deactivation is suspected as the cause of a detected problem, only then would special lab tests be run. Lab tests looking for corrosion products, or for salt and water in a hydrocarbon stream, might only be run if a leak is suspected in a heat exchanger cooled with salt water.

These examples require manual input, but even fully automated tests are not run routinely when the test itself would disturb normal operations. For instance, diagnosing problems in a control loop may require introducing some disturbances. This used to be done manually, but newer systems allow some automation of this testing. Finally, especially with older control systems, the control network and the network interface might not support the full load of all possible variables being analyzed on a remote supervisory computer. As a result, some values are only checked after initial fault detection. Although the mechanism for getting these test values might be identical to the mechanism for getting the routinely-scanned variables, the fact that they must be specially requested is what makes them tests.
The running of specific additional tests only after initial fault detection is also common in network management. A major reason is to minimize network traffic and load on both the diagnostic manager and the agents. Another reason is the disruptive nature of some of the tests. For instance, a test may be to reboot a device and see if the problem disappears.
In all of the examples above, there is no need to routinely perform these tests. They are run only after initial fault detection, when particular faults are suspected (“narrowing of diagnostic focus”). In all cases, the only tests requested are those that can provide evidence for or against the possible faults that could have led to the detected problem, and a testing strategy distinct from routine scanning reduces the “costs” of running the diagnostic system.
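The “narrowing of diagnostic focus” idea can be sketched as a relevance filter: given a set of suspected faults, only tests that bear on at least one suspect are candidates. The fault and test names below are hypothetical examples drawn loosely from the process-industry discussion above.

```python
# Which faults each candidate test can provide evidence for or against
# (illustrative mapping, not from any real system):
TEST_COVERAGE = {
    "check_bypass_valve":  {"bypass_open", "valve_stuck"},
    "local_gauge_reading": {"level_sensor_failed"},
    "lab_corrosion_assay": {"exchanger_leak"},
}

def relevant_tests(suspects):
    """Return only the tests that discriminate among suspected faults."""
    return {test for test, faults in TEST_COVERAGE.items() if faults & suspects}

# After detecting a flow anomaly, suppose these faults are suspected:
suspects = {"bypass_open", "level_sensor_failed"}
print(sorted(relevant_tests(suspects)))
# → ['check_bypass_valve', 'local_gauge_reading']
```

The lab assay is never requested here, because it provides no evidence about any suspected fault; that is exactly the cost saving the paragraph above describes.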
When testing is part of the fault isolation strategy, a portion of the overall diagnostic system is involved in test planning, interactions to request and receive test results, and re-planning. Since there will usually be many possible tests, there is a question of prioritizing which ones to request first, and how many to request at once. The system should only request tests that provide information about suspected faults that could explain the initially detected problem. Prioritization needs to consider the cost of testing, which might include materials cost, labor cost, time to get a result, and so on. That test cost can be balanced against the information value of the test and the severity of the detected problem or its potential root causes. If the problem is critical, more expensive tests are justified. There are parameters to set, such as the test costs, the number of allowed outstanding manual and automated test requests awaiting results, and so on. For instance, in the case of manual tests, you don’t want to overload the operator with too many questions at once, especially if some of the tests provide redundant information. (One test that returns “OK” may rule out some of the same fault suspects that other tests could rule out, in systems where test results are believed completely rather than combined in some way as evidence.)
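One plausible prioritization rule, as a sketch only (the paper does not prescribe a specific formula): rank candidate tests by information value per unit cost, scaled by problem severity, and cap the number of outstanding manual requests so the operator is not overloaded. All test names, costs, and values below are invented for illustration.

```python
def prioritize(tests, severity, max_outstanding=2):
    """Order tests by severity-weighted value/cost; limit manual requests.

    tests: list of dicts with keys 'name', 'cost', 'info_value', 'manual'.
    """
    ranked = sorted(tests,
                    key=lambda t: severity * t["info_value"] / t["cost"],
                    reverse=True)
    selected, manual_count = [], 0
    for t in ranked:
        if t["manual"]:
            if manual_count >= max_outstanding:
                continue  # skip: too many manual requests already pending
            manual_count += 1
        selected.append(t["name"])
    return selected

tests = [
    {"name": "reboot_device",     "cost": 5.0, "info_value": 0.9, "manual": False},
    {"name": "field_check_valve", "cost": 2.0, "info_value": 0.6, "manual": True},
    {"name": "lab_assay",         "cost": 8.0, "info_value": 0.8, "manual": True},
    {"name": "read_local_gauge",  "cost": 1.0, "info_value": 0.4, "manual": True},
]
print(prioritize(tests, severity=1.0, max_outstanding=2))
```

With these numbers the cheap local-gauge check outranks the expensive lab assay despite its lower information value, and the assay is held back by the manual-request cap; raising `severity` would not change the ordering here, but in a richer model severity could gate which cost tiers are considered at all.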
CDG/Symcure is an example of a diagnostic system that includes formal support for testing as well as passive monitoring. Classic expert systems based completely on manual question/answer sessions with humans (such as those for medical diagnosis) are also examples of systems based on testing. Any question asked of a person is a form of test, and a strategy is needed to determine what questions to ask.
(Return to A Guide to Fault Detection and Diagnosis)
Copyright 2010 - 2013, Greg Stanley