Passive System Monitoring vs. Active Testing
This page examines the distinction between passively monitored symptoms and strategies for determining non-routine tests, as a part of the white paper A Guide to Fault Detection and Diagnosis.
In the case of online monitoring systems, many diagnostic techniques assume routine scanning of every variable of interest, or that subsystems such as agents just send events when there is a problem. This can be called “passive monitoring”, where the data or event just shows up as input to the diagnostic system once the system integration is established. However, there are many situations in which this will not be the case. We make a distinction between “symptoms” that are based on routine scanning (and events generated based on that), and non-routine “tests” that must be requested. When tests must be requested by the diagnostic system, there must be a test planner to decide which tests to request.
A test is an operation that is requested, and after some time, returns a value based on the state of the system. That value might be binary - false/true or pass/fail. For fault propagation models, where true corresponds to a problem, true would mean that a test verified a particular problem. More realistically, tests have an additional possible value of unknown, because the test may fail to return a value. A test such as "Is the voltage less than 1.5 V" might be treated as true/false. However, since this test is based on a measurement with some measurement error, its result might instead be returned as a probability value, reflecting the uncertainty of the measurement. In some methodologies, a value between 0 and 1 (or unknown) might be returned, indicating a fault extent.
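As a rough illustration of how a threshold test could return a probability instead of a plain true/false, the sketch below assumes zero-mean Gaussian measurement error with a known standard deviation and no prior knowledge of the true voltage. The function names and the 0.05 V error figure are hypothetical; only the 1.5 V threshold comes from the example above. A missing measurement maps to None, standing in for "unknown".

```python
import math
from typing import Optional


def prob_below(measured: float, threshold: float, sigma: float) -> float:
    """Probability that the true value is below the threshold.

    Assumes true value = measured value plus zero-mean Gaussian error
    with standard deviation sigma (an illustrative assumption).
    """
    z = (threshold - measured) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))  # standard normal CDF


def voltage_test(measured: Optional[float],
                 threshold: float = 1.5,
                 sigma: float = 0.05) -> Optional[float]:
    """Return P(voltage < threshold), or None if the test returned no value."""
    if measured is None:
        return None  # "unknown": the test failed to return a value
    return prob_below(measured, threshold, sigma)
```

A reading exactly at the threshold gives 0.5, and readings far from it approach 0 or 1, so the plain true/false treatment is recovered as the measurement error becomes small relative to the distance from the threshold.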
These additional tests are only run when needed for fault isolation, because they require additional resources such as time, material cost, labor cost, CPU load, or network traffic, or because they introduce some disturbances into normal operations as the only way to gain information. If the tests were “free”, they would be run all the time in an online system. In the case of maintenance troubleshooting, however, much equipment testing can only be done when the equipment is already removed from service. Tests often include some manual (human) effort, although they also can be automated tests that are simply too expensive in some sense to run routinely.
The need for tests depends on the industry and application
There are examples of the need for formalizing testing in almost every industry. This is very common in electronic equipment troubleshooting. For instance, a manual test might be “Measure the voltage between point A and B. Is it less than 0.1 volts?” A true value could provide evidence for a short circuit. The running of specific additional tests only after initial fault detection is also common in network management. A major reason is to minimize network traffic and load on both the diagnostic manager and the agents. Another reason is the disruptive nature of some of the tests. For instance, a test may be to reboot a device and see if the problem disappears.
People in a maintenance department, starting with failed equipment, already know there is a failure. There won't be any online measurements, either. Everything that is done for fault isolation becomes a test. So, test planning is an essential part of the diagnostic process each time. However, once a test result arrives, the diagnostic algorithms must process the results, just as might be done in a purely online diagnoser that just passively monitors the system using sensors.
For monitoring online operations passively using only sensors, there might not be any "tests" in the sense above, and hence no runtime test planning. People building diagnostics for operational systems control centers, e.g., for network management, a manufacturing process (as opposed to the product), or building/HVAC management, tend to think in terms of passive systems without tests -- extensions of the alarm systems they are already familiar with.
But a full-capability online diagnostic system monitoring, say, a process plant like a refinery, should have "tests" as well as simple passive monitoring.
For instance, the system needs to be able to ask for manual field checks by operators. Due to cost, not every state affecting plant operations has a sensor associated with it, so that manual observation may be needed. For example, “Send a field operator to see if a bypass valve is open”. A true result could explain why a control valve cannot adequately reduce a flow rate. Another example is sending the field operator to verify a drum level by looking at a local gauge. This cross-check may be the only way to determine if a particular sensor has failed.
Another common example is a non-routine lab test. If a particular form of catalyst deactivation is suspected as a cause of a detected problem, only then would special lab tests be run. Lab tests looking for corrosion products or salt and water in a hydrocarbon stream might only be run if a leak is suspected in a heat exchanger cooled with salt water.
These examples required manual input, but even fully automated tests are not run routinely in cases where the test itself would disturb normal operations. For instance, in diagnosing problems in a control loop, it may be necessary to introduce some disturbances. The diagnostic system may need to request a step or impulse test to see if a valve, sensor, or controller is working properly (whether done automatically or by a person). This used to be done manually, but newer systems could allow some automation of this testing.
Finally, especially with older control systems, the control network and the network interface might not support the full load of all possible variables being analyzed on a remote supervisory computer. As a result, some values are only checked after initial fault detection. Although the mechanism for getting these test values might be identical to the mechanism for getting the routinely-scanned variables, the fact that they must be specially requested is what makes them tests.
In all of the examples above, there is no need to routinely perform these tests. Any automated collection and analysis of data that is not done on a routine basis, or any request for human input, can be treated as a test. It is only after initial fault detection, when particular faults are suspected (“narrowing of diagnostic focus”), that these particular tests would be run. In all cases, the only tests requested are those that can provide evidence for or against the possible faults that could have led to the detected problem. And in all these cases, a testing strategy distinct from routine scanning reduces the “costs” of running the diagnostic system.
When testing is part of the fault isolation strategy, a portion of the overall diagnostic system is involved in test planning, requesting one or more tests, receiving test results, and re-planning. Since there will usually be many possible tests, there is a question of prioritizing which ones to request first, and how many to request at once.
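The plan, request, receive, re-plan cycle described above might be sketched as follows. This is a minimal illustration, not a description of any particular product: it assumes a single fault, represents each test simply as the set of suspects it implicates when it returns true, and requests one test at a time. All names (isolate_fault, run_test, the example tests and suspects) are hypothetical.

```python
def isolate_fault(suspects, tests, run_test):
    """Narrow a set of suspected faults by requesting tests one at a time.

    suspects: iterable of suspected fault names.
    tests:    dict mapping test name -> set of suspects implicated when the
              test returns True (single-fault assumption, for illustration).
    run_test: callable(name) returning True, False, or None (unknown).
    Returns the remaining suspects and a log of (test name, result) pairs.
    """
    remaining = set(suspects)
    pending = dict(tests)
    log = []
    while len(remaining) > 1 and pending:
        # Plan: keep only tests that can still split the remaining suspects.
        useful = [name for name, implicated in sorted(pending.items())
                  if 0 < len(implicated & remaining) < len(remaining)]
        if not useful:
            break
        name = useful[0]
        implicated = pending.pop(name)
        result = run_test(name)          # request the test and wait for it
        log.append((name, result))
        if result is True:
            remaining &= implicated      # fault is among the implicated set
        elif result is False:
            remaining -= implicated      # implicated suspects are ruled out
        # result is None: unknown, nothing learned, so simply re-plan
    return remaining, log


example_tests = {
    "field_check_bypass": {"bypass_open"},
    "read_local_gauge": {"sensor_failed"},
}
example_results = {"field_check_bypass": False, "read_local_gauge": False}
remaining, log = isolate_fault(
    {"bypass_open", "valve_stuck", "sensor_failed"},
    example_tests,
    example_results.get,
)
```

In this example both field checks come back false, leaving the stuck valve as the only remaining suspect. The choice of which useful test to run first is arbitrary here (alphabetical); prioritization is a separate question.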
For automated systems, many test requests can be issued at once, to get faster results, rather than waiting for each individual result. Issuing a big batch of test requests doesn't work as well with tests that are really questions to humans. You don’t want to overload the end user with too many questions at once, especially if some of the tests provide redundant information. (One test that returns “OK” may rule out some of the same possible fault suspects as another test.)
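One way to picture this batching policy is sketched below. It assumes each candidate test is tagged as automated or manual and lists the suspects it would rule out on an “OK” result; the CandidateTest class, the next_batch function, and the limit of two questions per batch are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateTest:
    name: str
    automated: bool
    rules_out: frozenset  # suspects cleared if the test returns "OK"


def next_batch(tests, max_questions=2):
    """Issue every automated test at once, but only a few human questions.

    A human question is skipped when everything it could rule out is
    already covered by a question chosen earlier in the same batch.
    """
    batch = []
    covered = set()
    questions = 0
    for test in tests:
        if test.automated:
            batch.append(test.name)
            continue
        if questions >= max_questions or test.rules_out <= covered:
            continue  # too many questions already, or redundant
        batch.append(test.name)
        covered |= test.rules_out
        questions += 1
    return batch


candidates = [
    CandidateTest("ping_device", True, frozenset({"link_down"})),
    CandidateTest("check_bypass_valve", False, frozenset({"bypass_open"})),
    CandidateTest("recheck_bypass_valve", False, frozenset({"bypass_open"})),
    CandidateTest("read_local_gauge", False, frozenset({"sensor_failed"})),
    CandidateTest("inspect_pump", False, frozenset({"pump_failed"})),
]
batch = next_batch(candidates)
```

Here the automated ping is always included, the second bypass check is dropped as redundant, and the pump inspection waits for a later batch because the question limit has been reached.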
The system should request only tests that provide new information about suspected faults that could explain the initial detected problem. In a causal model, for instance, they would be "downstream" of any of the nodes in the suspected fault set. They should not be predictable based on previous tests. Tests whose value could be predicted from previous tests can be categorized as such, and shown in the user interface with their estimated probabilities in case anyone wants to know what the system currently believes. In a causal model, for instance, the new tests are often (but not always) "upstream" of current tests. The intent is to narrow the "diagnostic focus" - either ruling out some suspected faults or adding further evidence to a smaller subset of possible faults.
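The "downstream of the suspected fault set" constraint can be illustrated with a simple graph search. The sketch assumes the causal model is a directed graph stored as a dictionary from each cause to its direct effects, that each test observes exactly one node, and that a suspect node counts as downstream of itself. The graph, node names, and test names are invented for the example, and the check for predictability from earlier results is reduced to excluding tests that have already been run.

```python
from collections import deque


def downstream(graph, start_nodes):
    """Return all nodes reachable from start_nodes, including themselves."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for effect in graph.get(node, ()):
            if effect not in seen:
                seen.add(effect)
                queue.append(effect)
    return seen


def candidate_tests(graph, test_nodes, suspects, completed):
    """Tests that observe a node downstream of some suspect and are not done yet."""
    reachable = downstream(graph, suspects)
    return sorted(name for name, node in test_nodes.items()
                  if node in reachable and name not in completed)


causal_graph = {
    "bypass_valve_open": ["flow_too_high"],
    "control_valve_stuck": ["flow_too_high"],
    "flow_too_high": ["drum_level_high"],
    "exchanger_leak": ["salt_in_product"],
}
test_nodes = {
    "field_check_bypass": "bypass_valve_open",
    "read_local_level_gauge": "drum_level_high",
    "lab_salt_analysis": "salt_in_product",
}
requested = candidate_tests(
    causal_graph, test_nodes,
    suspects={"bypass_valve_open", "control_valve_stuck"},
    completed={"read_local_level_gauge"},
)
```

With the two valve faults as suspects, the lab salt analysis is never proposed because it is not downstream of either suspect, and the gauge reading is skipped because it has already been taken.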
After that constraint, you can optimize based on factors such as the test cost, expected information value, or even severity of the related problems and root causes. If the system predicts the test values (with a probability, if default estimates of root cause failures are available), one theoretical approach is to prefer tests with a predicted probability close to 0.5 -- implying that the test result will give the most information.
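The preference for probabilities near 0.5 can be made concrete with the binary entropy function, which measures the expected information (in bits) from a true/false outcome and peaks at exactly 0.5. The sketch below uses that measure as the ranking key; treating entropy as the "information value" is an assumption made for illustration, and the test names and probabilities are made up.

```python
import math


def binary_entropy(p):
    """Expected information, in bits, from a true/false result with P(true) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # outcome is certain, so the result tells us nothing new
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def rank_by_information(predicted):
    """Order test names from most to least informative.

    predicted: dict mapping test name -> predicted probability of a true result.
    """
    return sorted(predicted,
                  key=lambda name: binary_entropy(predicted[name]),
                  reverse=True)


ranking = rank_by_information({
    "lab_salt_analysis": 0.9,
    "field_check_bypass": 0.5,
    "step_test_valve": 0.2,
})
```

The field check (predicted at 0.5) ranks first, followed by the step test (0.2) and the lab analysis (0.9), since 0.2 is closer to 0.5 than 0.9 is.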
However, a simple and effective strategy is to run the relevant lowest-cost tests first. Test cost is a general approach to prioritizing. It might reflect materials cost, labor cost, time to get a result, required specialized tools or expertise, and so on. The importance of this may vary with the domain. It could matter even in automated systems. "Cost" could be translated to favor some tests that are faster or more accurate, or actually consume resources. It especially matters in systems involving humans (whether mixed with automated tests or purely human), where levels of expertise or other factors enter. For car repair, for example, think of the vastly different tests possible for "layman", "enthusiast with tools", and "auto mechanic", and the time and effort even for an auto mechanic to take off the cylinder head to visually inspect valves, as opposed to just looking at data.
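A lowest-cost-first ordering could be sketched by grouping the relevant tests into cost levels and working through the levels in ascending order. The single numeric "cost" per test, the function name, and the car-repair test names and values below are hypothetical; in practice the number would have to combine whichever factors matter in the domain.

```python
from itertools import groupby


def cost_levels(costs):
    """Group test names into cost levels, cheapest level first.

    costs: dict mapping test name -> numeric cost (arbitrary units).
    Returns a list of (cost, [test names]) pairs in ascending cost order.
    """
    ordered = sorted(costs.items(), key=lambda item: (item[1], item[0]))
    return [(cost, [name for name, _ in group])
            for cost, group in groupby(ordered, key=lambda item: item[1])]


levels = cost_levels({
    "read_diagnostic_codes": 1,
    "check_spark_plugs": 1,
    "compression_test": 3,
    "remove_cylinder_head": 10,
})
```

All tests in the cheapest level would be requested before any test in the next level, so removing the cylinder head is considered only after the cheaper checks fail to isolate the fault.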
If conflicting data is discovered, then for tests at a given cost level, even the "predictable" tests may be requested, because it now appears that they will contribute useful information to help resolve the conflicts.
CDG/Symcure is an example of a diagnostic system that includes formal support for testing as well as passive monitoring. Classic expert systems based completely on manual question/answer sessions with humans (such as those for medical diagnosis) are also examples of systems based on testing. Any question asked of a person is a form of test, and a strategy is needed to determine what questions to ask.
Copyright 2010 - 2020, Greg Stanley