Using Expert Systems to Manage Diverse Networks & Systems

With a Focus on Operations

Greg Stanley

 


I. Overview

II. Representation of networks & applications

III. Architectures

IV. Case studies

I. Overview

Managing diverse networks

Major operational goals

Major components

Alarm filtering & correlation examples

Diverse networks & systems

Numerous device types & manufacturers

Circuit switching/packet switching hardware

Data vs. real-time for voice & video

(it's not all ATM yet...)

Different protocols (TCP/IP, CMIP, ...)

Connection-oriented vs. connectionless

LAN/WAN differences

Wireless vs. terrestrial

Changing topologies

(Portable computers, wireless, low-earth-orbit satellites)

Complex devices

Sub-objects with one IP address

Subsystem interfaces & proxies

Element management systems

Diverse enterprise network systems

Network, plus software processes and overall applications need to be managed

Software processes & overall application

New Client/Server applications

Applications, including legacy

Services

Resources

e.g., disk

Rapidly changing technology increases the need for flexible systems & rapid development

OSI Network management

Operations areas considered here

Fault management

Performance management

Other areas

Security management

Configuration management

Accounting management

How does a real-time, object-oriented expert system help?

Flexibility

Speed of development - overall development environment

Incremental development environment for rapid development & feedback, partial solutions

Representation power: modelling the systems for use in diagnostics, analysis, prediction, ...

Portability between platforms

Systems integration capabilities

Some operations issues addressed by real-time, object-oriented expert systems

Early detection of problems (proactive)

Predictions from performance or patterns

Alarm/message/event filtering

Suppression of repetitive alarms

Alarm correlation

Grouping of related alarms

Diagnosis

Pinpointing the causes of alarms

Procedure automation

Testing for diagnostic & filtering purposes

Resolving problems

Enforcing standard procedures

Semi-automatic - guiding operator

Online information

Help, topology, hierarchy, relations

"What-if" simulations for analysis or training

Alarm/message filtering examples

Alarm X occurs, then clears by itself within timeout. Suppress it (do not present to operator).

(Also log suppressed alarms for analysis)

Alarm X occurs. Further testing reveals this alarm to be false or to have cleared itself. Suppress it.

Alarm X is repeated n times. Present the first alarm only, and update a repetition counter.

Alarm X is not a real problem until it occurs n times within timeout. Present one alarm only, after the nth occurrence, then update the repetition counter.
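
The repetition and threshold cases above can be sketched in a few lines. This is a minimal illustration, not OPA or G2 code; the class name `AlarmFilter` and its parameters are hypothetical. With n = 1 it behaves like "present first alarm only, update a repetition counter".

```python
from collections import defaultdict, deque


class AlarmFilter:
    """Present an alarm only once it has occurred n times within `window` seconds."""

    def __init__(self, n, window):
        self.n = n
        self.window = window
        self.recent = defaultdict(deque)   # alarm id -> timestamps inside the window
        self.repeats = defaultdict(int)    # alarm id -> repetition counter
        self.presented = set()             # alarm ids already shown to the operator
        self.log = []                      # every raw alarm, kept for later analysis

    def on_alarm(self, alarm_id, t):
        """Return True if this occurrence should be presented to the operator."""
        self.log.append((t, alarm_id))
        times = self.recent[alarm_id]
        times.append(t)
        # Drop occurrences that have fallen out of the time window.
        while times and t - times[0] > self.window:
            times.popleft()
        if alarm_id in self.presented:
            self.repeats[alarm_id] += 1    # already shown: only count the repeat
            return False
        if len(times) >= self.n:
            self.presented.add(alarm_id)
            return True
        return False
```

Suppressed occurrences still land in `log`, matching the note about logging suppressed alarms for analysis.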

Alarm correlation examples

Alarm X and Alarm Y occur within timeout. Suppress these, present new message Z to operator

Alarms X1, X2, ..., X6, sent from different agents, are all complaining about "target" device Y. Acknowledge X1...X6, and send an alarm about Y.

Alarms X1, X2, ..., X8 were all sent by "sender" device Y. Send an alarm indicating suspicious behavior of Y.
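
The "target" case above (several agents complaining about one device) might look like this. A sketch only: the dictionary keys `id`, `sender`, `target` and the `min_agents` threshold are assumptions made for illustration.

```python
from collections import defaultdict


def correlate_by_target(alarms, min_agents=3):
    """Group alarms by target; summarize targets reported by several agents.

    `alarms` is a list of dicts with hypothetical keys "id", "sender", "target".
    Returns (summary_alarms, acknowledged_ids).
    """
    by_target = defaultdict(list)
    for alarm in alarms:
        by_target[alarm["target"]].append(alarm)

    summaries, acknowledged = [], []
    for target, group in by_target.items():
        senders = {a["sender"] for a in group}
        if len(senders) >= min_agents:
            # One summary alarm about the target replaces the individual alarms.
            summaries.append({"target": target, "reported_by": sorted(senders)})
            acknowledged.extend(a["id"] for a in group)
    return summaries, acknowledged
```

Grouping on `sender` instead of `target` gives the "suspicious sender" case in the same way.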

Model-based alarm correlation & diagnosis

Models are typically based on connectivity, part-of hierarchy, cause-effect failure models, individual device models such as state diagrams

Multiple failures have occurred on the same LAN segment. Poll the remaining devices - if all fail, then warn the operator that the segment as a whole has failed (e.g., cable break), and acknowledge the individual source alarms

Multiple devices X1, X2, ... are sending messages complaining that they cannot communicate with device Y. Send a message that device Y has failed, and acknowledge all the messages for X1, X2, ...

High-level services requiring particular interface cards X1, X2, ... are all failing. X1, X2, ..., are all plugged into a common backplane or have some other common failure mode. Diagnose and alarm on the common mode failure, and acknowledge X1, X2, ... .
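
The LAN-segment case above, as a rough sketch. The topology is reduced to a plain list of device names, and `poll` is a hypothetical test hook standing in for a real poll; neither comes from OPA.

```python
def diagnose_segment(segment_devices, failed, poll):
    """Decide whether a whole LAN segment has failed.

    segment_devices: names of all devices on the segment (from the topology model)
    failed: set of devices that have already raised alarms
    poll: callable(name) -> True if the device responds (hypothetical test hook)
    """
    remaining = [d for d in segment_devices if d not in failed]
    # Poll only the devices that have not alarmed yet.
    newly_failed = {d for d in remaining if not poll(d)}
    all_failed = set(failed) | newly_failed
    if all_failed >= set(segment_devices):
        # Every device is down: report the segment, acknowledge the source alarms.
        return {"alarm": "segment-failed", "acknowledge": sorted(failed)}
    return {"alarm": None, "acknowledge": []}
```

The same shape applies to the backplane case: replace "segment" with any common container taken from the part-of hierarchy.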

 

Procedure-based alarm correlation, diagnosis & resolution

Alarm X occurs. Wait 60 seconds. Check for symptoms again by polling, or log in to a computer and execute some UNIX commands (using remote shell). If the problem is still there, send an alarm; otherwise suppress it (except for an optional log entry).
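
The wait-and-retest procedure above, sketched with the symptom check passed in as a function. `check_symptom` is a hypothetical hook (a poll or a remote command); the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time


def confirm_alarm(check_symptom, delay=60, sleep=time.sleep, log=None):
    """Wait, re-test, and decide whether an alarm should reach the operator.

    check_symptom: callable() -> True if the problem is still present
    Returns "send" or "suppress".
    """
    sleep(delay)
    if check_symptom():
        return "send"
    if log is not None:
        # Optional log entry for alarms that cleared by themselves.
        log.append("suppressed: symptom cleared during wait")
    return "suppress"
```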

II. Representation of networks as a basis for applications

Knowledge management view

"Build yourself a graphical language" to more closely match your tool to your domain

Representation in OPA

Knowledge management view

Emphasizes representation of knowledge for applications

Not just data!

- System hardware & software models

- System topology

- Failure & fault propagation models

- Operating rules & procedures

Example: alarm correlation & diagnostics need process topology and device models, also usable in operator training and in planning.

Object-oriented, with graphical representation

Major characteristics of a KBES ("Knowledge-Based Expert System")

KBES represents both qualitative and quantitative models

Object orientation is the key part of modern expert systems

KBES represent information explicitly, rather than embedded in code

analogy: simultaneous mathematical equations vs. set of assignment statements and iterative procedure in FORTRAN, or schematics rather than as a set of statements generating the schematic

Emphasis on building "declarative" descriptions, independent of subsequent use, and easily inspectable by wide class of users

- Goal to simplify representation & re-use of knowledge for multiple purposes

Some KBESs (e.g., G2) have a strong graphics orientation as part of their declarative knowledge

Static vs. real-time KBES

Development environment

KBES provide powerful new high-level tools for modelling and re-use

High-level descriptions:

- Equipment class implies behavior

- Schematic drawings: connections imply fault propagation, data flow, reachability, reliability

- "Part-of" relation implies fault propagation model

- "Is-a-kind-of" specialization simplifies descriptions

all modems share some common properties

reachability analysis ignores differences between most devices, and may include software processes

- Generic statements utilize these high-level constructs to generate specific diagnosis or simulation, using common attributes

Model declarations are independent of ultimate usage

Qualitative models (e.g., cause-effect)

Portion of a class hierarchy

Portions of class hierarchy (indented form)

| | | TELECOM-DEVICE

| | | | ELECTRONIC-DEVICE

| | | | | LOGICAL-UNIT

| | | | | BUS-NODE -- 1 instance

| | | | | | TOKEN-RING-REPEATER -- 1 instance

| | | | | | LAN-TRANSCEIVER -- 1 instance

| | | | | | | ETHERNET-TRANSCEIVER -- 48 instances

| | | | | | CLUSTER-CONTROL-EXT -- 1 instance

| | | | | | IBM-CHANNEL-CONTROLLER -- 4 instances

| | | | | | Q-BUS-NODE -- 3 instances

| | | | | | | Q-BUS-RS-232-NODE -- 1 instance

| | | | | | HUB -- 1 instance

| | | | | COMM-TWO-PORT

| | | | | | MODEM

| | | | | | | REMOTE-LOOPBACK-MODEM -- 7 instances

| | | | | | | | REMOTE-LOOPBACK-MODEM-RS-232 -- 11 instances

| | | | | | | MANUAL-LOOPBACK-MODEM -- 1 instance

| | | | | | | | MANUAL-LOOPBACK-MODEM-RS-232 -- 5 instances

| | | | | | | IN-HOUSE-MODEM -- 1 instance

| | | | | | | | IN-HOUSE-MODEM-RS-232 -- 3 instances

| | | | | | | MODEM-NO-LOOPBACK -- 1 instance

| | | | | | | | MODEM-NO-LOOPBACK-RS-232 -- 1 instance

| | | | | | PROTOCOL-CONVERTER -- 4 instances

| | | | | | BRIDGE -- 1 instance

| | | | | | REPEATER -- 3 instances

| | | | | | GATEWAY -- 1 instance

| | | | | | ROUTER -- 2 instances

| | | | | | TRANS-LAN -- 3 instances

| | | | | CLUSTER-CONTROL

| | | | | | SMALL-CLUSTER-CONTROL

| | | | | | | IBM-3274-CLUSTER-CONTROLLER -- 2 instances

| | | | | SERVER

| | | | | | TERMINAL-SERVER

| | | | | | | RS-232-TERMINAL-SERVER -- 4 instances

| | | | | COMPUTER

| | | | | | MEDIUM-COMPUTER -- 5 instances

| | | | | | SMALL-COMPUTER -- 11 instances

| | | | | | BIG-COMPUTER -- 2 instances

| | | | | | WORKSTATION -- 27 instances

| | | | | | GMS-NODE -- 124 instances

| | | | | COMPUTER-PERIPHERAL-DEVICE

| | | | | | TERMINAL -- 4 instances

| | | | | | | RS-232-TERMINAL -- 8 instances

| | | | | | PRINTER -- 4 instances
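
A slice of the hierarchy above, mirrored with ordinary class inheritance to show "is-a-kind-of" and differential modelling. This is Python, not G2 class definitions; the attributes `ports` and `supports_loopback` and the `test` method are invented for illustration.

```python
class TelecomDevice:
    status = "ok"

    def __init__(self, name):
        self.name = name

    def test(self):
        """Default test method: report the current status."""
        return f"{self.name}: {self.status}"


class ElectronicDevice(TelecomDevice):
    pass


class CommTwoPort(ElectronicDevice):
    ports = 2                      # shared by every two-port device


class Modem(CommTwoPort):
    supports_loopback = False      # default for all modems


class RemoteLoopbackModem(Modem):
    supports_loopback = True       # only the difference from Modem is stated

    def test(self):
        return f"{self.name}: remote loopback test requested"
```

A generic statement written against `TelecomDevice` applies to every subclass, which is the point of keeping most knowledge high in the hierarchy.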

Sample class definitions

 

Using the editor to change the stubs for a class

Example: Icon editor

Example relation used in diagnosis

Example generic rule using connectivity

For any telecom-device D connected to any hub H

if the status of H is failed

then conclude that the status of D is failed

and conclude that the alarm-priority of D = the alarm-priority of H

 

(actual syntax)
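
The connectivity rule above, restated procedurally as a sketch. The dictionaries and the `(device, hub)` pair list are assumptions standing in for G2 objects and connections.

```python
def propagate_hub_failures(status, priority, connections):
    """For any device D connected to any hub H: if H is failed, D is failed.

    status, priority: dicts keyed by device name (updated in place)
    connections: list of (device, hub) pairs taken from the schematic
    """
    for device, hub in connections:
        if status.get(hub) == "failed":
            status[device] = "failed"
            # The device inherits the alarm priority of the failed hub.
            priority[device] = priority.get(hub)
```

This makes a single pass over the connections. An inference engine would re-apply the rule as conclusions change; here the caller would have to loop.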

Example generic rule used in alarm filtering

For any electrical-device D

whenever any message MSG becomes an-event-for D

and when the count of each message MSG2 that is an-event-for D > 4

then conclude that ....

and start multiple-message-filter(D)

 

(actual syntax)

 

Some benefits of KBES representation

Reduces gaps between system analysis, specification, design, implementation, run-time use, maintenance.

- Explicit models carried through all phases

- Inspectable by all classes of users, not just programmers

Common representation for multiple applications, with one consistent model for development & maintenance

Generic library: default behavior specified for given class of object, connections - no additional special lists to fill out unless object deviates from the defaults

Some features of G2 - the graphically-oriented, real-time Knowledge-Based Expert System (KBES)

Objects with attributes

Class hierarchy for objects, with inheritance of properties and behavior - allowing "differential modelling"

Associative knowledge, relating objects in the form of connections and relations

Structural knowledge (e.g., "part-of" relation)

Representation and manipulation of objects and connections graphically

Generic rules and associated inference engine

Concurrent procedures

Analytic knowledge, such as functions, formulas, differential equation simulation

Real-time task scheduler, supporting concurrency, priorities, time stamping, validity intervals, timed actions, event-driven activity, reasoning within a fixed deadline, history-keeping, data interfaces

Interactive development environment and run-time environment

Graphics

External interfaces for systems integration

An option: "Build yourself a graphical language"

Match tool to domain - reduce semantic gap between tool and problem

Build library of classes & methods (procedures), rules, etc.

Build "configurer" GUI based on cloning objects from a palette, connecting them, filling out tables of attributes

Fairly common in many domains

Common graphical elements

Containment hierarchy/"part-of" for physical areas, common-modes, physical equipment, hierarchy

Objects in a class hierarchy with specialization & inheritance.

Workstation is a-kind-of computer

Abstract classes such as "hardware"

Objects include attributes and methods (procedures), e.g., test methods

Almost everything, whether physical or abstract, is an object

Graphical connections represent physical connectivity, logical connectivity, or relationships such as cause/effect, hierarchy

RTES-based Petri net example

The language

Petri net represents actions & state transitions

Procedures executed at each node

"Token" passed among nodes, split when parallel operations are launched

Explicit concurrency control

e.g., "Rendezvous" to re-unite concurrent operations

Used in control & other applications, to execute sequential, procedural operations

The RTES/Object-oriented implementation

Objects represent nodes, rendezvous, token

Methods (procedures) called at each node, using underlying implementation language

Connections (objects) for transitions

Rules or procedures watch for state transitions
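
A minimal token-passing sketch of the idea above, including a split and a rendezvous. It uses the textbook place/transition form rather than the object layout described in the slides, and the place and transition names are made up.

```python
class PetriNet:
    def __init__(self, marking):
        self.marking = dict(marking)      # place name -> token count
        self.transitions = []             # (name, inputs, outputs, action)

    def add_transition(self, name, inputs, outputs, action=None):
        self.transitions.append((name, inputs, outputs, action))

    def step(self):
        """Fire the first enabled transition; return its name, or None if stuck."""
        for name, inputs, outputs, action in self.transitions:
            if all(self.marking.get(p, 0) > 0 for p in inputs):
                for p in inputs:
                    self.marking[p] -= 1
                for p in outputs:
                    self.marking[p] = self.marking.get(p, 0) + 1
                if action:
                    action()              # procedure attached to this transition
                return name
        return None


net = PetriNet({"start": 1})
net.add_transition("fork", ["start"], ["ping", "login"])       # token splits
net.add_transition("ping-test", ["ping"], ["ping-done"])
net.add_transition("login-test", ["login"], ["login-done"])
net.add_transition("rendezvous", ["ping-done", "login-done"], ["finished"])

fired = []
while (name := net.step()) is not None:
    fired.append(name)
```

The rendezvous transition cannot fire until both branches have delivered a token, which is the explicit concurrency control mentioned above.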

RTES-based state diagram example

The language

Diagram represents states & state transitions

Procedures executed at each node

"Token" passed among nodes

The RTES/Object-oriented implementation

Objects represent nodes, token

Connections (objects) for transitions

Rules or procedures watch for state transitions

Implementation is similar to the Petri net, but simpler
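
A state diagram needs only one token and a transition table. A small sketch; the class name and the example states and events are hypothetical.

```python
class StateMachine:
    def __init__(self, initial, transitions, on_enter=None):
        self.state = initial                  # the "token" sits on this node
        self.transitions = transitions        # (state, event) -> next state
        self.on_enter = on_enter or {}        # state -> procedure run on entry

    def handle(self, event):
        nxt = self.transitions.get((self.state, event))
        if nxt is None:
            return self.state                 # event not relevant in this state
        self.state = nxt
        action = self.on_enter.get(nxt)
        if action:
            action()
        return self.state
```

For example, a device could move from "up" to "suspect" on an alarm, back to "up" on a clear, and on to "failed" when a test fails, with an entry procedure on "failed" that notifies the operator.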

Other common graphical approaches

Logic networks (AND/OR gates, etc.)

Input symptoms, output causes

Roughly equivalent to specific rules

Fault trees, decision trees, AND/OR trees, hierarchical fault models, with goal-seeking

Similar objects, different program control

Cause/effect diagrams

Procedures to analyze schematic/map

Representation in OPA

Telecom devices and software processes

includes the "managed objects"

Class hierarchy

Workspace ("part-of", "containment") hierarchy

Containers ("sites", "networks")

alarm and acknowledgement status is propagated up the containment hierarchy

Alarms/messages/events

Relations

Connections - topology information

Test and operator actions representation

Common framework shared between message-handling, OPAC graphics language, and schematics
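
Propagation of alarm status up the containment hierarchy can be sketched as a tree where each container shows the worst status found inside it. The `Node` class and the convention that a higher number means a more severe alarm are assumptions, not OPA definitions.

```python
class Node:
    """A device or container ("site", "network") in the containment hierarchy."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.own_priority = 0          # 0 = no alarm; higher = more severe (assumed)
        if parent:
            parent.children.append(self)

    def priority(self):
        """Displayed priority: the worst of this node and everything inside it."""
        return max([self.own_priority] + [c.priority() for c in self.children])
```

This version computes the status on demand. A real-time system would more likely push changes upward as they occur, but the result seen by the operator is the same.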

Example palettes for telecom-devices

 

Example attribute table: telecom-device

 

Container configuration palette

 

Workspace (map) hierarchy I

Workspace (map) hierarchy II

A site

Processing for incoming events

Decode messages as needed, including identification of target, sender, category

Eliminate obvious repetitions by simple message filtering

Create "raw" warning messages

Apply model-based diagnosis, heuristics, procedural reasoning when possible

Acquire additional information & run tests

Select candidate "most likely" failures based on model or other information

Draw conclusions about root causes and sympathy events, prove nodes "good" or "bad"

Cluster remaining alarms into reasonable groups when possible

Automatically fix problems where possible

Notify the operator with summarized alarms and other alarms, guide through repairs

Pass information to trouble ticket system

Recognize recurring problems & notify system administrator

 

Sample filtered message

Filtered messages

The main messages sent would be the ones on the filtered-message handler.

The above message shows up in summary form on the message handler as:

G2-manager-process

 

 

Example filtering scenario: raw messages

Instead of sending all these failure messages to the operator, the following

filtered message would be sent, as shown on the "filtered messages" handler:

 

Filtered version of the previous 22 messages, sent to operator

The details of this message show the original information that went into this

summarized message. The additional-text explanation is assembled automatically.

Decision block with manual input

The endless loop was started for "target" P6S04. The menu above was generated

automatically by the decision block, which had the following table. Note the use of

variables, indicated with the $.

Block pause capability

The menu above was generated automatically by the block-pause-capability,

which had the following table. Note the use of variables, indicated with the $.

 

 

III. OPA Architecture

Overall Architecture/System Integration

Major components

Overall Architecture

OPA Major building blocks

 

Message Management - a message "MIB"

Messages are objects with attributes such as priority, acknowledgement status, time stamp, timeouts for procedure execution such as escalation, etc.

Message handlers store messages

Individual views or message handlers can be set up per Telewindows user

Messages are organized and related to "target", "sender", "ID" (category), window, etc., for analysis or browsing.

Unified framework with OPAC, e.g., same "target", "sender", "ID" (OPAC uses "notify" to designate the message handler)

Messages determine the priority and acknowledgement status of objects in the schematic (map)

Programmatic access, as well as access by users
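
Messages as objects, stored in a handler and indexed by target and sender, might be sketched as follows. Field names follow the slides where possible ("target", "sender", "ID"); the default priority and the method names are invented.

```python
from collections import defaultdict
from dataclasses import dataclass, field
import time


@dataclass
class Message:
    target: str
    sender: str
    msg_id: str                       # category ("ID" in the slides)
    text: str = ""
    priority: int = 5
    acknowledged: bool = False
    timestamp: float = field(default_factory=time.time)


class MessageHandler:
    def __init__(self):
        self.messages = []
        self.by_target = defaultdict(list)
        self.by_sender = defaultdict(list)

    def add(self, msg):
        self.messages.append(msg)
        self.by_target[msg.target].append(msg)
        self.by_sender[msg.sender].append(msg)

    def acknowledge_target(self, target):
        """Acknowledge every message about one target in a single step."""
        for msg in self.by_target[target]:
            msg.acknowledged = True

    def unacknowledged(self):
        return [m for m in self.messages if not m.acknowledged]
```

The same indexes serve both programmatic access (correlation code acknowledging a group) and operator navigation from a device to its messages.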

User interactions with message handlers

Acknowledgement

Deletion

Optional modification (e.g., comments)

Navigation to find sender, target, etc.

Navigation from schematic objects to browse messages at any level (object, or larger unit with a subworkspace hierarchy)

Systems Integration

GSI C-based library to build custom bridges

GSI runs as separate process, across network

Asynchronous communications

Remote procedure calls

Polling or event-driven

SQL-type interfaces to databases (Oracle, Sybase, ...)

OpenView (SNMP/DM) interface

File I/O and process spawning

 

OpenView bridge

Interfaces G2/OPA to OpenView and general network via SNMP

Runs as separate process, on same CPU as OpenView (G2 generally runs on a different machine)

Written in C, using binaries from GSI library and OpenView library

Works with OpenView DM platform, or the SNMP platform (which is a subset of DM)

Supports standard SNMP get, get-next, set, send-trap, receive-trap

With DM platform, can register for events using HP Event Management Services

XMP calls (which support CMIP protocol) available later

"Blocking" and "Non-blocking" modes

Using the OpenView bridge: mechanics

G2/OPA can initiate interactions, or receive unsolicited traps

G2/OPA can poll, or do management by exception

G2/OPA can communicate directly with other managers, agents, or software (e.g., Bridgeway's Eventix) by getting and sending traps

G2/OPA can change colors on OpenView map by sending standard "status" trap

Operators using OpenView Windows can send traps directly to G2 (via "snmptrap" utility, after configuration of executable icons or menu entries)

G2/OPA communications to or from bridge are G2 remote procedure calls

Using the OpenView bridge: strategies

Might configure G2 as an "intelligent operator", or as an "intermediary", intercepting all alarms on the way to OpenView

Might use polling or management by exception

Polling can cause slow response, poor scaling

Pure management by exception works better when messages have guaranteed delivery, but SNMP datagrams don't guarantee delivery

Large distributed system may require some filtering, parsing & tokenization of alarm messages close to the sources

May want "proxy" or other agents

Future OPA versions may directly help generate intelligent agents when needed
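
One way to weigh the trade-off above is to combine the two: act on traps as they arrive, and run a slow audit poll only for devices that have been silent too long, so a lost datagram is eventually noticed. This hybrid is an illustration, not a strategy prescribed by OPA; the class and parameter names are hypothetical and `poll` stands in for an SNMP get.

```python
def devices_to_poll(last_heard, now, max_silence):
    """Pick devices whose last trap or poll reply is older than max_silence seconds."""
    return sorted(d for d, t in last_heard.items() if now - t > max_silence)


class HybridMonitor:
    def __init__(self, devices, max_silence):
        self.max_silence = max_silence
        self.last_heard = {d: 0.0 for d in devices}
        self.status = {d: "unknown" for d in devices}

    def on_trap(self, device, status, now):
        """Management by exception: a trap updates state immediately."""
        self.status[device] = status
        self.last_heard[device] = now

    def audit(self, poll, now):
        """Slow background poll that catches traps lost in transit."""
        for device in devices_to_poll(self.last_heard, now, self.max_silence):
            self.status[device] = poll(device)
            self.last_heard[device] = now
```

Polling load then scales with the number of silent devices rather than with the size of the network.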

 

File I/O and process spawning

Typical need to launch UNIX processes, receive results, read and write files

example - log in via remsh, do a "ps -ef | grep xxxxx" to find if a particular process is running, and interpret the results, possibly kill a process and start a process

example - again via remsh, check if a file exists. If not, start some process. When the file exists, read its first line and take action based on that first line.

OPAC language blocks directly execute spawns, file I/O
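
The first example above (run "ps -ef" remotely and interpret the result) could be sketched as follows. The parsing is kept separate from the spawn so it can be checked without a network. Using `ssh` in place of `remsh` is an assumption, and the parser expects the common eight-column "ps -ef" layout.

```python
import subprocess


def process_is_running(ps_output, name):
    """Return True if any command line in `ps -ef` style output mentions `name`."""
    lines = ps_output.splitlines()[1:]            # skip the header row
    for line in lines:
        fields = line.split(None, 7)              # 8th field is the command line
        if len(fields) == 8 and name in fields[7]:
            return True
    return False


def remote_ps(host):
    """Run `ps -ef` on `host` through ssh (a stand-in for remsh; an assumption)."""
    result = subprocess.run(
        ["ssh", host, "ps", "-ef"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Matching on the command field alone avoids the classic problem of "grep xxxxx" finding its own entry in the process list.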

 

IV. Case studies using real-time expert systems

AT&T EasyLink (Commercial electronic mail service): Using OPA for alarm filtering & diagnostics, procedure automation

Intelsat: network monitoring, satellite telemetry monitoring

Stanford Telecom ATM applications: DoD SSCN & SPANet, ANMA ATM manager

Texaco Trading & Transportation

SWIFT (Belgium) - monitoring bank wire transfers

CRT Banca (Italy), others: Remote bank security monitoring

Telefonica (Spain)
