Event Oriented Fault Detection, Diagnosis, and Correlation
This page examines event-oriented approaches as a part of the white paper A Guide to Fault Detection and Diagnosis.
Event-oriented diagnostics vs. variable-oriented diagnostics
An event represents a change of state of a monitored object. Alarms are examples of events. Events have properties that include at a minimum a time stamp, an associated object, and an event category. For instance, a high-temperature event occurs at a particular time, is associated with a particular object such as a sensor (“T101”), and has an event category such as “high-temperature”. Events typically have additional attributes. In some systems, an event also carries a value such as “true”. If the associated object attribute was used to name a sensor, then there might be another attribute to contain the name of the equipment monitored by the sensor (“Tank 5”). There may also be some associated text for user interface or analysis purposes. Events may also include information about additional related objects. For instance, in network management, an event may be associated with loss of communication from node A to node B, so references to both nodes are attributes of the event. If the system is not object-oriented, these attributes might mostly be embedded in a text message, ideally in some standard format for easy parsing.
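The event properties described above can be sketched as a small record type. This is only an illustration: the field names (`source`, `equipment`, `related`, and so on) are assumptions made for this example, not the schema of any particular product.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Event:
    # Minimum properties named in the text
    timestamp: datetime
    source: str                      # associated object, e.g. sensor "T101"
    category: str                    # e.g. "high-temperature"
    # Optional attributes (hypothetical names)
    value: Optional[str] = None      # some systems carry a value such as "true"
    equipment: Optional[str] = None  # equipment monitored by the sensor
    text: str = ""                   # free text for the user interface
    related: Tuple[str, ...] = ()    # additional related objects


high_temp = Event(
    timestamp=datetime(2013, 1, 1, 12, 0, 0),
    source="T101",
    category="high-temperature",
    value="true",
    equipment="Tank 5",
    text="High temperature on T101 (Tank 5)",
)

# Network management example: loss of communication from node A to node B
link_down = Event(
    timestamp=datetime(2013, 1, 1, 12, 0, 5),
    source="node A",
    category="link down",
    related=("node B",),
)
```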
Events do not need to be associated with sensors or specific process variables. For instance, in network management, there are many events that are just associated with a piece of equipment or software, such as “High CPU utilization in router 8”, or “Interface card 5 of router 8 shut down”. (Either might explain a “link down” category of event sent by a different router). These types of events are becoming more common in process plants as well, due to the integration of more intelligent subsystems that communicate diagnostic information. Smart instruments can send many types of events. Subsystems such as sequence controllers associated with licensed technology send events related to problems in the subsystems they control. For instance, in a refinery, there may be subsystem controllers for packaged dryer systems, or even major portions of process units such as catalytic reformers.
Data storage and diagnosis involving events are fundamentally different from data storage and diagnosis involving specific variables.
Diagnosis using equation-based models, for instance, requires a fixed set of state variables, measurements, and conclusions. Conceptually, this information could all be placed in one or more large vectors of fixed size, and every variable has a value (or is flagged as unknown). With matrix-oriented techniques in particular, all variables and conclusions are calculated each time the diagnostic analysis is run. That doesn’t scale well to large systems. Other approaches (as in the GDA case) reduce the computational load by only propagating significant new information, but even that system still has to define and maintain space in memory for every state variable and conclusion.
On the other hand, event-oriented systems do not need a fixed-size representation of state or conclusions. One event might be repeated numerous times (the same in every respect except for time stamp). Events that do not occur require no storage. The general approach in event-oriented systems is “management by exception”, where the only events stored and analyzed are those that might represent problems or return from problems.
Conceptually, events for event-oriented fault management are stored in a database rather than in fixed vectors. There might be an actual database, or the "database" might just be an event log. (In that case, analysis will require parsing of the event text to retrieve attribute values). When the highest performance is needed, recent events may be stored online in memory in a compact form for rapid retrieval.
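When the "database" is just an event log, attribute values have to be parsed out of the text. The sketch below assumes a hypothetical line format (an ISO time stamp followed by `key=value` pairs, with double quotes around values that contain spaces). Real logs vary widely, so the pattern would need to be adapted.

```python
import re
from datetime import datetime

# Assumed format: "<ISO time stamp> key=value key="quoted value" ..."
PAIR = re.compile(r'(\w+)=("([^"]*)"|\S+)')


def parse_event(line):
    """Split one log line into a dictionary of event attributes."""
    stamp, _, rest = line.partition(" ")
    attrs = {"timestamp": datetime.fromisoformat(stamp)}
    for key, raw, quoted in PAIR.findall(rest):
        # Use the text inside the quotes when the value was quoted
        attrs[key] = quoted if raw.startswith('"') else raw
    return attrs


line = ('2013-01-01T12:00:00 object=T101 '
        'category=high-temperature equipment="Tank 5"')
event = parse_event(line)
```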
An example of an event-oriented diagnostic system for management of an internet service provider is described in “Intelligent Management for Large Networks”. It used the Integrity framework.
Alarm management systems are examples of event-oriented systems in the process industries. However, the main goal of most of them is not online diagnosis. Instead, they support a business process to improve safety by analyzing the alarm history and modifying alarm limits or adding/removing alarms. A major goal is to minimize the number of alarms so that truly critical information is not lost among nuisance alarms, avoiding operator overload. These systems support a workflow based on offline analysis of data collected online. But some alarm management products also support event correlation, described next.
Drawing conclusions from multiple events is called event correlation. Event-oriented systems often provide event correlation functionality. Event correlation includes eliminating redundant messages and alarms (filtering), and aggregation. Aggregation is grouping messages that are related due to common causes or cause/effect (downstream effects of an existing message). This is often mainly for the GUI, to reduce operator overload and provide better analysis.
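A minimal sketch of filtering and aggregation follows. It assumes each event already carries a `subject` attribute naming the object the event is about (a hypothetical field), so that aggregation reduces to grouping by that attribute. Real correlation engines use richer rules to decide which events are related.

```python
from collections import defaultdict

# (time, reporter, category, subject); "subject" is an assumed attribute
events = [
    (0, "router 8", "high CPU utilization", "router 8"),
    (1, "router 8", "high CPU utilization", "router 8"),  # repeat
    (2, "router 3", "link down", "router 8"),
    (3, "router 5", "link down", "router 8"),
    (4, "router 3", "link down", "router 8"),              # repeat
    (5, "router 2", "link down", "router 6"),
]


def filter_repeats(events):
    """Filtering: drop events identical to an earlier one except for time."""
    seen = set()
    kept = []
    for time, reporter, category, subject in events:
        key = (reporter, category, subject)
        if key not in seen:
            seen.add(key)
            kept.append((time, reporter, category, subject))
    return kept


def aggregate(events):
    """Aggregation: group the remaining events by the object they concern."""
    groups = defaultdict(list)
    for event in events:
        groups[event[3]].append(event)
    return dict(groups)


unique = filter_repeats(events)
groups = aggregate(unique)
# One higher-level message per group instead of one message per event
summary = {subject: len(members) for subject, members in groups.items()}
```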
Event correlation overlaps detection and diagnosis. It may even include diagnosis of a root cause. However, it only needs to reduce the number of messages, recognizing that they are related even without a complete root cause analysis. This filtering and aggregation is a major goal of event correlation in network management, where the large number of events would otherwise overwhelm operators of large networks.
Event correlation is common in network management, and is supported in major network management products such as the HP software formerly known as OpenView, as well as specialized add-on products like the former SMARTS InCharge.
Some alarm management products for the process industries (such as those by UReason) also provide online alarm filtering and aggregation of exception events -- reducing the number of alarms presented to the operator in real time by packaging up related alarms into a single, higher-level alarm message. This event correlation can result in a diagnosis. But there often aren’t enough alarms implemented to support full diagnosis, because the alarms mainly represent extreme conditions.
Event correlation by query of the event history or related techniques
One type of event correlation used for diagnosis resembles database analysis. Each event includes the attributes needed for diagnosis. Correlation can then be based on a query of the "database". This performs the equivalent of searching for logical combinations of variables, but that combination is embedded in the query, rather than in a specific logic tree. For instance, if an event sent by node X in a communications network reports a “link down” category of event for node Y, a query can be done to see if there are “link down” events from other nodes also reporting loss of communications with node Y. If there are multiple events indicating loss of communication with node Y, then node Y is diagnosed as the cause of the failures. These queries are most useful if they can identify objects in the query using relationships between objects in the observed system. For instance, in the above example, an effective query would be over objects connected to Y. Queries also often include string pattern matching with wild cards.
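The “link down” example can be sketched as a query over an in-memory event list, restricted to nodes connected to the suspect. The topology table, the 60-second look-back, and the threshold of two distinct reporters are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkDown:
    time: float      # seconds
    reporter: str    # node that sent the event
    peer: str        # node it can no longer reach


# Assumed topology: which nodes are connected to which
neighbors = {"Y": {"X", "W", "Z"}, "Q": {"Z"}}

events = [
    LinkDown(100.0, "X", "Y"),
    LinkDown(105.0, "W", "Y"),
    LinkDown(110.0, "Z", "Q"),
]


def reporters_of_loss(events, suspect, neighbors, since):
    """Query: connected nodes that reported losing contact with the suspect."""
    connected = neighbors.get(suspect, set())
    return {e.reporter for e in events
            if e.peer == suspect
            and e.reporter in connected
            and e.time >= since}


def diagnose(trigger, events, neighbors, lookback=60.0, min_reporters=2):
    """Blame the peer node if several connected nodes lost contact with it."""
    reporters = reporters_of_loss(
        events, trigger.peer, neighbors, trigger.time - lookback)
    return trigger.peer if len(reporters) >= min_reporters else None


cause_y = diagnose(events[0], events, neighbors)
cause_q = diagnose(events[2], events, neighbors)
```

Here node Y is reported unreachable by two connected nodes, so it is diagnosed as the cause. Node Q has a single reporter, so no conclusion is drawn.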
Similar analysis could be done in process plants. For instance, suppose there is a low-pressure event in the light ends unit of a refinery. We need to determine if this is a sensor problem or a process problem. In an event-oriented system, we would issue a query looking for other low-pressure events and low-temperature events in nearby equipment. If there were multiple such events, we would declare a low pressure problem for the unit and investigate further. If there were few of those events, we might isolate the problem to that sensor.
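The refinery example might look like the sketch below, which also shows wild-card matching on event categories. "Nearby equipment" is approximated here as "same unit", and the patterns, field names, and support threshold of two are assumptions for illustration only.

```python
from fnmatch import fnmatchcase

events = [
    {"time": 10, "sensor": "P201", "unit": "light ends",
     "category": "low-pressure"},
    {"time": 12, "sensor": "P205", "unit": "light ends",
     "category": "low-pressure"},
    {"time": 15, "sensor": "T210", "unit": "light ends",
     "category": "low-temperature"},
    {"time": 16, "sensor": "F330", "unit": "reformer",
     "category": "low-flow"},
]


def classify(trigger, events, patterns=("low-pres*", "low-temp*"),
             min_support=2):
    """Count other sensors in the same unit with matching events."""
    support = {
        e["sensor"] for e in events
        if e["unit"] == trigger["unit"]
        and e["sensor"] != trigger["sensor"]
        and any(fnmatchcase(e["category"], p) for p in patterns)
    }
    if len(support) >= min_support:
        return "process problem"
    return "suspect sensor"


light_ends_result = classify(events[0], events)
reformer_result = classify(events[3], events)
```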
Often, specific trigger events are monitored, and their occurrence leads to the subsequent queries. This scales much better than approaches where the diagnostic manager needs to calculate information about every variable all the time. This is management by exception -- the manager only tracks the events associated with abnormal operation and events indicating a return to normal.
These sorts of analyses were pioneered for network management, since they scale well to very large systems. Examples are shown in “Using Expert Systems to Manage Diverse Networks and Systems”, including the graphical language OPAC that had explicit blocks to perform the queries.
The process industries have not used event correlation by query, even though the potential is good. Part of the reason may be that there is a strong tradition and infrastructure supporting events only as alarms. Alarms mainly represent extreme, unsafe conditions. Alarm thresholds need to be set far from normal operation to avoid distracting operators with nuisance alarms. Since alarms represent only extreme conditions, they aren’t of much value in detecting and diagnosing smaller problems, problems that are just getting started, or problems related to quality or optimal operation. And, the number of alarms is limited to avoid nuisance alarms and redundant alarms. But if events were generated on a more routine basis representing smaller deviations from normal operations, correlation engines could make good use of them. Several changes will have to be made for this to work:
- Significantly increase the number and types of events generated, so that there will be enough redundancy of information to detect and diagnose faults, and filter out events caused by noise or unimportant transient conditions
- Do not routinely show all these events to operators - keep them separate from the safety-related alarms and only display them as explanation of diagnostic conclusions when needed. Most of these events would be unimportant by themselves, and would be considered nuisance alarms if they were mixed in with traditional alarm messages.
There are related techniques addressing the same problems. They also follow a management by exception approach, but do not use an explicit event history query. Instead, they maintain exception events in memory and link them as new events arrive. CDG/Symcure is one example of that, used in both the process industries and in telecommunications networks. It links and diagnoses the events using a causal directed graph model of abnormal events.
Products for the process industries that perform alarm filtering (such as those from UReason) have the infrastructure to manage exception events and relate them for purposes of root cause diagnosis. However, they have been marketed mainly for alarm management.
Correlation or diagnosis based on sequences of events
Some systems, especially those engineered to follow a state transition diagram for normal behavior by design, may benefit from analyzing the sequence of observed events. For systems whose underlying behavior is described by continuous variables, diagnosis is also sometimes based on the sequence of observed events.
However, the time lags introduced by filtering, and the threshold selection for conversion to the binary form of events, can also make diagnosis based on a sequence of events problematic. The filtering, and the initial values before a failure, can change the order in which the events are generated. So the event order seen by the diagnostic system is hard to predict, and depends on the initial conditions at the time of failure. Unless the sequence occurs on a much slower time scale than the filtering of the underlying continuous variables, it is better instead to query for related events over a time period without specifying the exact order. The occurrence of specific trigger events is the signal to initiate the query for related events.
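An order-independent query of this kind can be sketched as follows. The trigger category, the expected related categories, and the symmetric 30-second window around the trigger are assumptions for this example. The point is only that two runs with different event orderings give the same result.

```python
def correlate(events, trigger, expected, window):
    """Return the expected categories seen near the trigger, ignoring order.

    events: list of (time, category) pairs.
    Returns None when the trigger has not occurred, so no query is run.
    """
    trigger_times = [t for t, c in events if c == trigger]
    if not trigger_times:
        return None
    t0 = trigger_times[0]
    return {c for t, c in events
            if c in expected and abs(t - t0) <= window}


EXPECTED = {"low-temperature", "low-flow"}

# Same fault, different initial conditions: the events arrive in another order
run_a = [(100, "low-flow"), (104, "low-pressure"), (109, "low-temperature")]
run_b = [(100, "low-temperature"), (103, "low-pressure"), (105, "low-flow")]

found_a = correlate(run_a, "low-pressure", EXPECTED, window=30)
found_b = correlate(run_b, "low-pressure", EXPECTED, window=30)
no_trigger = correlate([(100, "low-flow")], "low-pressure", EXPECTED, 30)
```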
Copyright 2010 - 2013, Greg Stanley