


Fault Management - the Overall Process and Life Cycle of a Fault


This page examines the overall processes associated with fault management, as part of A Guide to Fault Detection and Diagnosis.

“Fault management” is a term used in network management. It describes the overall processes and infrastructure associated with detecting, diagnosing, and fixing faults, and returning to normal operations. The roughly equivalent term in the process industries is “Abnormal Condition Management” (ACM), or “Abnormal Situation Management” (ASM), a term trademarked by Honeywell and hence avoided by other vendors.

The overall process of managing the complete life cycle of a fault generally consists of the following steps:

Immediate discarding of data from sensors that have obviously failed, that are already known to have failed and are likely still under repair, or that are undergoing calibration
Filtering to reduce high frequency noise
Event generation (if needed, depending on the techniques used)
Problem detection
Problem diagnosis (isolation)
Predicting the impact of the detected and diagnosed problem
Event correlation - filtering alarms and grouping correlated messages for a simpler user interface
Mitigation actions (steps taken while awaiting repairs, to minimize the impact of the problem)
Corrective action (action to repair the problem)
Return to normal operations after repairs are completed
Postmortem analysis and corrective actions to prevent recurrence or optimize maintenance policy

Problem detection and diagnosis cover sensor problems as well as problems in the monitored equipment and systems.
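The first steps of the life cycle (discarding bad sensor data, filtering noise, and detecting a problem) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Reading` type, the function names, the range check used to spot an obviously failed sensor, the first-order exponential filter, and the simple high-limit test are stand-ins, not techniques prescribed by this guide. The sketch also assumes that, once out-of-service sensors are removed, the remaining readings form a single signal.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Reading:
    sensor_id: str
    value: Optional[float]  # None models a sensor that returned no signal


def valid_values(readings: Iterable[Reading], out_of_service: Set[str],
                 low: float, high: float) -> List[float]:
    """Discard readings from out-of-service sensors and implausible values."""
    kept = []
    for r in readings:
        if r.sensor_id in out_of_service:
            continue  # known failed, under repair, or in calibration
        if r.value is None or not (low <= r.value <= high):
            continue  # obviously failed: missing or outside the physical range
        kept.append(r.value)
    return kept


def exponential_filter(values: Iterable[float], alpha: float) -> List[float]:
    """First-order low-pass filter: y[k] = alpha*x[k] + (1 - alpha)*y[k-1]."""
    filtered: List[float] = []
    y: Optional[float] = None
    for x in values:
        y = x if y is None else alpha * x + (1 - alpha) * y
        filtered.append(y)
    return filtered


def detect_high(filtered: List[float], limit: float) -> List[int]:
    """Return the sample indices where the filtered signal exceeds the limit."""
    return [i for i, y in enumerate(filtered) if y > limit]


if __name__ == "__main__":
    stream = [
        Reading("T1", 50.0),
        Reading("T2", 55.0),    # T2 is being calibrated, so it is ignored
        Reading("T1", None),    # no signal
        Reading("T1", 9999.0),  # outside the plausible range
        Reading("T1", 60.0),
        Reading("T1", 100.0),
    ]
    clean = valid_values(stream, out_of_service={"T2"}, low=0.0, high=200.0)
    smooth = exponential_filter(clean, alpha=0.5)
    print(detect_high(smooth, limit=70.0))
```

A smaller `alpha` removes more high-frequency noise but delays detection, so the filter constant and the detection limit would normally be tuned together.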

Event correlation includes eliminating redundant messages and alarms, and grouping messages that are related through a common cause or through cause and effect (downstream effects of an existing message). This is often done mainly for the GUI, to reduce operator overload and support better analysis. It may include diagnosis of a root cause, but it does not have to: it is enough to reduce the number of messages by recognizing that they are related, even without a complete root cause analysis. This step is common in network management, and is also becoming more popular as “alarm filtering” in the process industries.
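As a rough illustration of event correlation, the sketch below removes duplicate alarms and groups each remaining alarm under the most upstream alarm that is currently active. The `correlate` function, the `upstream` cause map, and the alarm names are hypothetical. The group head is only the most upstream active alarm, which is not necessarily the root cause.

```python
from collections import defaultdict
from typing import Dict, List


def correlate(alarms: List[str], upstream: Dict[str, str]) -> Dict[str, List[str]]:
    """Group active alarms under the most upstream alarm that is also active.

    `upstream` maps an alarm to the alarm that can cause it (assumed known).
    """
    # Eliminate redundant alarms while keeping first-seen order.
    active = list(dict.fromkeys(alarms))
    active_set = set(active)

    groups: Dict[str, List[str]] = defaultdict(list)
    for alarm in active:
        head = alarm
        seen = {alarm}  # guards against cycles in the cause map
        # Walk up the cause chain for as long as the cause is itself active.
        while upstream.get(head) in active_set and upstream[head] not in seen:
            head = upstream[head]
            seen.add(head)
        groups[head].append(alarm)
    return dict(groups)


if __name__ == "__main__":
    cause_map = {"LOW_FLOW": "PUMP_TRIP", "HIGH_TEMP": "LOW_FLOW"}
    incoming = ["HIGH_TEMP", "PUMP_TRIP", "LOW_FLOW", "HIGH_TEMP", "DOOR_OPEN"]
    for head, members in correlate(incoming, cause_map).items():
        print(head, members)
```

With this grouping, a GUI could show one line per group head and expand the related messages on request, instead of listing every alarm separately.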

The mitigation actions, corrective actions, and steps taken to return to normal operations are often done manually, but could potentially be automated through workflow management. There are additional workflow steps related to problem notifications and acknowledgment, problem escalation if problems are ignored, filing of work requests, assignment of repair tasks, and notifications of repair completion (or automatic detection of completion). These might be manual or automated steps.
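One of these workflow steps, escalating a problem that nobody acknowledges, could be automated along the following lines. This is a sketch under stated assumptions: the `Notification` type, the `escalation_level` function, and the rule of one escalation level per fixed interval are illustrative choices, not part of any particular workflow product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Notification:
    problem: str
    raised_at: float                       # time raised, in seconds
    acknowledged_at: Optional[float] = None


def escalation_level(n: Notification, now: float,
                     step_seconds: float, max_level: int) -> int:
    """Return 0 if acknowledged, else one level per elapsed interval, capped."""
    if n.acknowledged_at is not None and n.acknowledged_at <= now:
        return 0  # someone has taken ownership, so no escalation is needed
    elapsed = now - n.raised_at
    level = int(elapsed // step_seconds)
    return max(0, min(max_level, level))


if __name__ == "__main__":
    note = Notification("Pump P-101 tripped", raised_at=0.0)
    # Unacknowledged for 650 s with 300 s steps.
    print(escalation_level(note, now=650.0, step_seconds=300.0, max_level=3))
    note.acknowledged_at = 700.0
    print(escalation_level(note, now=800.0, step_seconds=300.0, max_level=3))
```

Each level would map to a wider audience in practice, for example the operator, then the shift supervisor, then the maintenance manager; that mapping is left out here.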

Regarding the postmortem: asset management best practices in most industries include analysis of repeated failures, so that additional steps can be taken to reduce the number of repeated problems (such as changes in maintenance practices, or changes in event thresholds to reduce the number of false alarms). This is a separate, cyclic business process.
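The repeated-failure analysis can start from something as simple as counting entries in the failure history. The sketch below assumes a hypothetical log of `(asset, failure mode)` pairs and a hypothetical `repeat_offenders` function; real maintenance systems record far more detail.

```python
from collections import Counter
from typing import List, Tuple

Failure = Tuple[str, str]  # (asset tag, failure mode), both illustrative


def repeat_offenders(failure_log: List[Failure],
                     min_count: int) -> List[Tuple[Failure, int]]:
    """Return failures seen at least `min_count` times, most frequent first."""
    counts = Counter(failure_log)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    log = [
        ("P-101", "seal leak"),
        ("V-7", "stuck"),
        ("P-101", "seal leak"),
        ("P-101", "seal leak"),
        ("V-7", "stuck"),
        ("F-3", "plugged"),
    ]
    for (asset, mode), n in repeat_offenders(log, min_count=2):
        print(asset, mode, n)
```

The resulting list is only a starting point for review: a repeated entry might call for a change in maintenance practice, or it might indicate an event threshold that produces false alarms.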


