Fault Management - the Overall Process and Life Cycle of a Fault
This page examines the overall processes associated with fault management, as part of A Guide to Fault Detection and Diagnosis.
“Fault management” is a term used in network management. It describes the overall processes and infrastructure associated with detecting, diagnosing, and fixing faults, and returning to normal operations. The process industries use a roughly equivalent term, “Abnormal Condition Management” (ACM). Another equivalent, “Abnormal Situation Management” (ASM), is trademarked by Honeywell and hence avoided by other vendors.
The overall process of managing the complete life cycle of a fault generally consists of the following steps:
- Immediate discarding of data from sensors that have obviously failed, that are already known to have failed and are likely still under repair, or that are undergoing calibration
- Filtering to reduce high frequency noise
- Event generation (if needed, depending on the techniques used)
- Problem detection
- Problem diagnosis (isolation)
- Predicting the impact of the detected and diagnosed problem
- Event correlation - filtering alarms and grouping correlated messages for a simpler user interface
- Mitigation actions (steps taken while awaiting repairs, to minimize the impact of the problem)
- Corrective action (action to repair the problem)
- Return to normal operations after repairs are completed
- Postmortem analysis and corrective actions to prevent recurrence or optimize maintenance policy
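The first two steps in the list above (discarding data from failed sensors, then filtering noise) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the plausible-range check used to spot "obviously failed" sensors, and the choice of a first-order exponential filter are all hypothetical and are not prescribed by this guide.

```python
def valid_readings(readings, known_bad, low, high):
    """Return (sensor_id, value) pairs that are safe to pass downstream.

    known_bad is an assumed set of sensor ids already flagged as failed,
    under repair, or in calibration. A missing value or a value outside
    [low, high] is treated here as an "obviously failed" reading.
    """
    kept = []
    for sensor_id, value in readings:
        if sensor_id in known_bad:
            continue  # already known to be failed or in calibration
        if value is None or not (low <= value <= high):
            continue  # obviously failed: missing or physically implausible
        kept.append((sensor_id, value))
    return kept


def exponential_filter(values, alpha):
    """First-order low-pass filter: y[k] = alpha*x[k] + (1 - alpha)*y[k-1].

    Smaller alpha means heavier smoothing of high frequency noise.
    The first output is initialized to the first input.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    filtered = []
    previous = None
    for x in values:
        previous = x if previous is None else alpha * x + (1.0 - alpha) * previous
        filtered.append(previous)
    return filtered
```

Discarding bad data before filtering matters in this sketch: a single implausible spike passed into the filter would otherwise distort several subsequent filtered values.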
Problem detection and diagnosis cover sensor problems as well as problems in the monitored equipment and systems.
Event correlation includes eliminating redundant messages and alarms, and grouping messages that are related through a common cause or through cause and effect (downstream effects of an existing message). This is often done mainly for the GUI, to reduce operator overload and provide better analysis. It may include diagnosis of a root cause. However, it only needs to reduce the number of messages by recognizing that they are related, even without a complete root cause analysis. This step is common in network management, and is also becoming more popular as “alarm filtering” in the process industries.
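As a rough sketch of the grouping described above, the following Python function removes duplicate alarms and groups each alarm under the most upstream alarm that is currently active. The `caused_by` map (effect name to cause name) is a hypothetical stand-in for whatever cause/effect model a real system would use, and the alarm names are invented for illustration.

```python
from collections import defaultdict


def correlate(active_alarms, caused_by):
    """Group active alarms under the most upstream active alarm.

    caused_by maps an alarm name to the alarm that can cause it
    (an assumed, simplified cause/effect model with one cause per alarm).
    """
    unique = list(dict.fromkeys(active_alarms))  # drop redundant messages, keep order
    active = set(unique)
    groups = defaultdict(list)
    for alarm in unique:
        head = alarm
        seen = {alarm}
        # Walk upstream only while the possible cause is itself active.
        while caused_by.get(head) in active and caused_by[head] not in seen:
            head = caused_by[head]
            seen.add(head)
        groups[head].append(alarm)
    return dict(groups)
```

For example, with `caused_by = {"low_flow": "pump_trip", "high_temp": "low_flow"}`, the alarms `pump_trip`, `low_flow`, and `high_temp` collapse into one group headed by `pump_trip`. The head of a group is not necessarily the true root cause; the function only recognizes that the messages are related, which is all this step requires.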
The mitigation actions, corrective actions, and steps taken to return to normal operations are often done manually, but could potentially be automated through workflow management. There are additional workflow steps related to problem notifications and acknowledgment, problem escalation if problems are ignored, filing of work requests, assignment of repair tasks, and notifications of repair completion (or automatic detection of completion). These might be manual or automated steps.
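One of the workflow steps mentioned above, escalation of ignored problems, could be automated along the following lines. This is a minimal sketch: the `Notification` record, the 15-minute escalation step, and the cap of three levels are all assumptions chosen for illustration, not values taken from any particular workflow product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Notification:
    problem_id: str
    raised_at: float                        # time raised, in seconds
    acknowledged_at: Optional[float] = None  # None until someone acknowledges


def escalation_level(note, now, step_seconds=900.0, max_level=3):
    """Return 0 if acknowledged, else one level per elapsed step, capped.

    step_seconds and max_level are hypothetical policy parameters.
    """
    if note.acknowledged_at is not None:
        return 0  # acknowledged problems are not escalated
    elapsed = max(0.0, now - note.raised_at)
    return min(max_level, int(elapsed // step_seconds))
```

A scheduler could call `escalation_level` periodically and notify the next person in the chain whenever the returned level increases.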
Regarding the postmortem: asset management best practices in most industries include analysis of repeated failures, so that additional steps can be taken to reduce the number of repeated problems (such as changes in maintenance practices, or changes in event thresholds to reduce the number of false alarms). This is a separate, cyclic business process.
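A simple starting point for the analysis of repeated failures is to count how often each asset fails in the same way. The sketch below assumes a hypothetical failure log of (asset, failure mode) pairs and an arbitrary threshold of three occurrences; a real analysis would typically also consider time windows and repair costs.

```python
from collections import Counter


def repeat_offenders(failure_log, min_count=3):
    """Return {(asset, failure_mode): count} for pairs seen at least min_count times.

    failure_log is an assumed iterable of (asset, failure_mode) tuples,
    for example extracted from closed work requests.
    """
    counts = Counter(failure_log)
    return {key: n for key, n in counts.items() if n >= min_count}
```

Pairs returned by this function are candidates for a change in maintenance practice or, if the "failures" turn out to be false alarms, for a change in event thresholds.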
Copyright 2010-2013, Greg Stanley