
Conference azur::mcc

Title: DECmcc user notes file. Does not replace IPMT.
Notice: Use IPMT for problems. Newsletter location in note 6187
Moderator: TAEC::BEROUD
Created: Mon Aug 21 1989
Last Modified: Wed Jun 04 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 6497
Total number of notes: 27359

1781.0. "Alarm Pre-Processing etc." by SUBWAY::REILLY (Mike Reilly - New York Bank District) Thu Nov 07 1991 13:59

    As the trouble ticket note is generating so much heat :-) I decided to
    move my current pet topic out here.

    It is my understanding that the alarms team are adding a number of 
    new features to the v1.2 product.  Could somebody describe exactly
    what extra functionality will be provided? 

    Having made a number of attempts to implement a large rule base for
    DECmcc I now feel that we must plan for some form of alarm
    pre-processing in the near future. Hearing that many of our customers
    are planning to have several hundred polls per hour means that
    customers will be screaming for this functionality in a few months.  At 
    the moment the only reason most of our customers are not polling every
    LAN bridge etc. on their network is that it is too much work to key in all
    these alarm rules and they haven't the time to write automatic load scripts.
     
    It has been implied in another note that a customer who has hundreds of
    alarms triggering at the same time has done something wrong.  I don't agree
    with this.  If I create a rule which will watch every bridge on my LAN
    for a spanning tree change, then it is quite possible that all bridges
    will undergo a change within a few minutes.  The fact that a single bridge
    undergoes a spanning tree change is important and should be reported.  If,
    however, 200 bridges have a spanning tree change then these alarms must be
    treated differently.  
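
    As a rough illustration of treating one change and 200 changes
    differently, here is a minimal Python sketch.  It is not DECmcc code:
    the StormDetector class, the 300 second window and the threshold of 20
    bridges are all assumptions made for the example.  Individual alarms
    pass through as warnings until enough distinct bridges report within
    the window, at which point a single synthetic alarm is raised instead.

```python
from collections import deque


class StormDetector:
    """Collapse a burst of similar alarms into one synthetic alarm.

    window_s and threshold are illustrative values, not DECmcc settings.
    """

    def __init__(self, window_s=300, threshold=20):
        self.window_s = window_s
        self.threshold = threshold
        self.recent = deque()   # (timestamp, entity) pairs inside the window
        self.storm = False      # True once the current storm was reported

    def feed(self, timestamp, entity):
        """Return the alarm to report for this raw alarm, or None."""
        self.recent.append((timestamp, entity))
        # Drop raw alarms that have fallen out of the time window.
        while self.recent and timestamp - self.recent[0][0] > self.window_s:
            self.recent.popleft()
        entities = {e for _, e in self.recent}
        if len(entities) >= self.threshold:
            if self.storm:
                return None     # this storm has already been reported
            self.storm = True
            return ("critical",
                    f"spanning tree change on {len(entities)} bridges")
        self.storm = False
        return ("warning", f"spanning tree change on {entity}")
```

    Timestamps are plain seconds here; feeding alarms in time order is
    assumed.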

    Another cause of large numbers of alarms is the classic network
    management problem of declaring everything on the network broken just
    because the management station cannot see it. In DECmcc lingo we call
    these events 'exceptions'.  If an intermediate device between the
    management station and the device being monitored dies then an
    exception will be generated for the remote device.  If the component
    was a router which links a remote site then all devices being monitored
    at the remote site will generate exceptions.  Because of the volume of
    exceptions that can be generated, my current customer will no longer allow
    mail messages etc. to be sent when exceptions occur. Instead they
    rely on the device that failed to reboot and then alert them because
    the 'seconds operating timer' will have a low value.  This is far from
    a good solution as devices that never return to the network are not
    reported. 

    Given that large numbers of alarms or exceptions will occur, we now need
    a way of pre-processing these alarms.  I don't believe that much of
    this processing should take place within the Alarms FM. The Alarms
    FM should concentrate on providing the ability to collect large quantities 
    of raw alarm data in a very efficient manner and perform minimal data 
    reduction.  A separate FM should be able to take the raw data, 
    perform some pre-processing and then feed a 'synthetic' alarm back into 
    the system.  This FM should be rule based and its rule base exposed to
    the customers.  We already have some of the code to add some real
    analysis to this FM.  The ELM developers have developed a piece of code which
    can automatically generate the LAN's spanning tree, and can add the
    functionality to determine which devices reside on each segment.  If
    that map of the network is kept in memory within the Alarm
    Pre-Processing FM then problems due to bridge failures can be
    detected by noting which devices are causing exception messages. 
    Exception messages which were caused as a result of the device that
    failed can be filtered out. 
    By building similar internal maps for the routing tables of the various 
    protocols, failures of routers and the associated 'exception' messages
    due to nodes becoming unreachable can be correlated to determine which
    component failed.
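
    The filtering idea above can be sketched in a few lines of Python.
    This is an illustration only: the parent map is written by hand here,
    whereas in the proposed FM it would come from the spanning tree code,
    and the function and device names are hypothetical.

```python
def root_causes(parent, unreachable):
    """Return the unreachable devices whose upstream device still answers.

    parent maps each device to the next device on the path toward the
    management station (None for devices on the station's own segment).
    Exceptions from devices behind an unreachable upstream device are
    treated as side effects and left out of the result.
    """
    unreachable = set(unreachable)
    causes = set()
    for device in unreachable:
        upstream = parent.get(device)
        if upstream is None or upstream not in unreachable:
            causes.add(device)
    return causes


# Hypothetical LAN: bridge_b sits behind bridge_a, two hosts behind bridge_b.
parent = {
    "bridge_a": None,
    "bridge_b": "bridge_a",
    "host_1": "bridge_b",
    "host_2": "bridge_b",
    "host_3": "bridge_a",
}
suspects = root_causes(parent, ["bridge_b", "host_1", "host_2"])
```

    With this map, exceptions from bridge_b, host_1 and host_2 reduce to a
    single suspect, bridge_b.  The sketch cannot tell whether the hosts
    behind the bridge have also failed; it only suppresses what it cannot
    observe.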

    If this capability is not considered to be an important component of
    DECmcc then I would suggest that the DECmcc kit be shipped with a piece
    of code or an easy API which will allow customers to intercept Alarms
    on their way to the Notification FM, and allow customers to insert their
    own alarms into the system. 

    With many of our competitors adding intelligence to their network
    monitoring probes, we have a chance to add intelligence at the
    'enterprise' level which will be a real marketing advantage.  In a few
    years we will be able to detect that a TP application is building up
    a large IO queue on a network link and be able to look at the link and
    see if there are transmission problems there.  Systems that can
    perform these functions will have high dollar value. 
    
    - Mike 

T.R  Title  User  Personal Name  Date  Lines
1781.1. "More requirements..." by RIVAGE::SILVA (Carl Silva - Telecom Eng - DTN 828-5339) Fri Nov 08 1991 06:03 (66 lines)

>    As the trouble ticket note is generating so much heat :-) I decided to
>    move my current pet topic out here.

	Well, it's that the trouble ticket functionality seems to be a separate
domain from the alarm correlation/pre-processing.  I know that a lot of RFPs
come in with the two areas lumped together; however, you could have a trouble
ticket management/tracking system without the alarm correlation (tickets
created manually).

>    Having made a number of attempts to implement a large rule base for
>    DECmcc I now feel that we must plan for some form of alarm
>    pre-processing in the near future. Hearing that many of our customers
>    are planning to have several hundred polls per hour means that
>    customers will be screaming for this functionality in a few months.  At 
>    the moment the only reason most of our customers are not polling every
>    LAN bridge etc. on their network is that it is too much work to key in all
>    these alarm rules and they haven't the time to write automatic load scripts.

	Yes, it would be nice to have the alarms rules management functionality
expanded so that alarm templates could be defined and automatically
associated with managed object classes.
     
>    It has been implied in another note that a customer who has hundreds of
>    alarms triggering at the same time has done something wrong.  

	I don't know if it was me but I didn't mean to imply that the customer
was doing something wrong.  A fiber optic cable cut can generate many alarms. 
All I was saying is that the type of alarms/type distribution versus time
should be looked at to detect patterns that could be suppressed or
pre-processed.

>    Given that large numbers of alarms or exceptions will occur, we now need
>    a way of pre-processing these alarms.  I don't believe that much of
>    this processing should take place within the Alarms FM. The Alarms
>    FM should concentrate on providing the ability to collect large quantities 
>    of raw alarm data in a very efficient manner and perform minimal data 
>    reduction.  A separate FM should be able to take the raw data, 
>    perform some pre-processing and then feed a 'synthetic' alarm back into 
>    the system.  This FM should be rule based and its rule base exposed to
>    the customers.  We already have some of the code to add some real
>    analysis to this FM.  The ELM developers have developed a piece of code which
>    can automatically generate the LAN's spanning tree, and can add the
>    functionality to determine which devices reside on each segment.  If
>    that map of the network is kept in memory within the Alarm
>    Pre-Processing FM then problems due to bridge failures can be
>    detected by noting which devices are causing exception messages. 
>    Exception messages which were caused as a result of the device that
>    failed can be filtered out. 
>    By building similar internal maps for the routing tables of the various 
>    protocols, failures of routers and the associated 'exception' messages
>    due to nodes becoming unreachable can be correlated to determine which
>    component failed.

	Sounds like we need an artificial intelligence kind of FM with a nice
user interface.  Any volunteers?  8-)

>    If this capability is not considered to be an important component of
>    DECmcc then I would suggest that the DECmcc kit be shipped with a piece
>    of code or an easy API which will allow customers to intercept Alarms
>    on their way to the Notification FM, and allow customers to insert their
>    own alarms into the system. 

	This can be done now.  You can build an FM that receives the alarm
notifications from the alarms FM.

	Carl

1781.2. "we need at least 2 answers" by TOOK::MATTHEWS Fri Nov 08 1991 10:05 (45 lines)

    I will start by agreeing that customers need much greater functionality
    in the way of event/alarm correlation/filtering than we currently
    are providing. There are many reasons we are not doing more; most
    of them are purely financial. 
    
    Yes, there is a need, but several different needs may well be
    identified here, and I doubt that a single "hack"/"modification"
    will do any more than change the perception of the need. 
    
    First, we need event correlation/filtering based upon topological
    knowledge of the network. I.e., DECmcc needs to know that getting
    to bridge X from the current instantiation of DECmcc requires
    going through bridges Y, Z, and A. This way a DECmcc "function"
    could make rational decisions about the effects of one of these
    bridges going down and whether to correlate these "events".
    
    You can do it with hardwired procedures as is suggested in the
    note or you can do it based upon topological knowledge
    (which currently does not exist in DECmcc). Note that the
    hardwired solution breaks down when customers provide dormant
    paths in their network that are enabled by changing forwarding
    database entries and creating alternate topologies. Yes, I know
    that spanning tree doesn't allow cycles. But dormant links (i.e.
    non-forwarding links which are potential cycles) are allowed and
    can be used. Thus any hardwired procedure based on a static
    topology will fail in this case. I suggest that there are
    2 answers to this. First, there needs to be a way for customers
    to write simple filtering scripts that correlate multiple events
    occurring in some time scope and generate a
    "more meaningful" event to go into the alarms fm. This has
    the advantage that it reduces the load on the alarms fm. It
    has the disadvantage that it increases the delay for receiving
    an event at the alarms fm interface. Second, we need to capture
    topology data about networks including dynamic views and static
    views so that an event filtering mechanism can provide event
    filtering based upon topology. 
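
    To illustrate why a static path table breaks down, here is a minimal
    Python sketch that recomputes reachability from the current forwarding
    state of each link.  The link list format, the function name and the
    device names are assumptions for the example, not DECmcc data
    structures.

```python
from collections import deque


def reachable(links, station, failed=()):
    """Return the devices reachable from station over forwarding links.

    links is a list of (device_a, device_b, forwarding) tuples; links
    with forwarding=False are dormant and are ignored.  Devices listed
    in failed are never traversed.
    """
    failed = set(failed)
    neighbours = {}
    for a, b, forwarding in links:
        if forwarding:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    seen = {station}
    queue = deque([station])
    while queue:                      # breadth-first search from the station
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen and nxt not in failed:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Bridge X is normally reached through Y and Z; the direct link is dormant.
static_view = [("mcc", "Y", True), ("Y", "Z", True), ("Z", "X", True),
               ("mcc", "X", False)]
# The same network after the customer enables the dormant link.
dynamic_view = static_view[:3] + [("mcc", "X", True)]
```

    With Z failed, X is unreachable in the static view but still reachable
    in the dynamic view, so an exception from X would mean something
    different in each case.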
    
    Neither of these will be in V1.2. It is possible for the first
    to be done by the next release after V1.2. The topology part
    of the second is being planned but the actual filtering
    algorithm/mechanism is not yet understood well enough to
    be planned.
    
    wally

1781.3. "configuration management if necessary..." by RIVAGE::SILVA (Carl Silva - Telecom Eng - DTN 828-5339) Fri Nov 08 1991 10:19 (16 lines)

	RE: .2,

>    First, we need event correlation/filtering based upon topological
>    knowledge of the network. I.e., DECmcc needs to know that getting
>    to bridge X from the current instantiation of DECmcc requires
>    going through bridges Y, Z, and A. This way a DECmcc "function"
>    could make rational decisions about the effects of one of these
>    bridges going down and whether to correlate these "events".

	Yes, it is definitely clear that without configuration information in
the system, it is very difficult to do the alarm correlation.  Without the
config info all you will have is the info contained in the alarm reports.

	Are there plans for MCC to do configuration management?

	Carl

1781.4. "Alarm & Notification APIs?" by ANDRIS::putnins (Hands across the Baltics) Fri Nov 08 1991 10:56 (34 lines)

Re: .1

>	This can be done now.  You can build an FM that receives the alarm
>notifications from the alarms FM.

Where can I obtain information on the Alarms and Notification FM
interfaces?  I would like to be able to write an FM that screens events
and writes them into an external database so another, independent,
program can process them.  This may be in addition to the facilities
offered by PNMP, below.

Re: .2

>First, there needs to be a way for customers
>    to write simple filtering scripts that correlate multiple events
>    occurring in some time scope and generate a
>    "more meaningful" event to go into the alarms fm. 

It appears that the PNMP Alarm Handling FM will offer this capability
(see TAEC::PNMP conference, note 3.2):

The PNMP Alarm Handling FM provides the following alarm handling functions:
1. The ability to define an Operation Context for alarm handling.
2. Collecting alarm reports generated by managed objects or generated by
user defined rules.
3. Filtering alarm reports using an ISO conformant discriminator construct.
4. Creating Alarm Objects corresponding to the filtered alarm reports
and maintaining these objects.  Alarm Objects can then be acknowledged,
handled, closed, archived and/or purged.
5. Notifying the PNMP PM when an Alarm Object has been created or when
its status has changed.
6. Escalation (with the creation of a new Alarm Object) when an Alarm
Object has not been acknowledged or handled before a specified time
(change in time per severity).
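
The escalation function in the last item could look roughly like the
Python sketch below.  The class, field and state names and the
per-severity deadlines are assumptions for illustration; they are not
taken from the PNMP specification.

```python
from dataclasses import dataclass

# Hypothetical per-severity deadlines in seconds; not PNMP defaults.
ESCALATE_AFTER = {"warning": 3600, "critical": 300}


@dataclass
class AlarmObject:
    ident: int
    severity: str
    created: float
    # Assumed states: outstanding, acknowledged, handled, closed, escalated.
    state: str = "outstanding"


def escalate(alarms, now, next_ident):
    """Create a new alarm object for each alarm left outstanding too long.

    The overdue alarm is marked 'escalated' so it is not escalated twice.
    Returns the list of newly created alarm objects.
    """
    created = []
    for alarm in alarms:
        overdue = now - alarm.created > ESCALATE_AFTER[alarm.severity]
        if alarm.state == "outstanding" and overdue:
            alarm.state = "escalated"
            created.append(AlarmObject(next_ident, "critical", now))
            next_ident += 1
    return created
```

Whether the new Alarm Object should itself be subject to escalation is a
design question this sketch leaves open.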

1781.5. "PNMP Alarm Handling FM" by RIVAGE::SILVA (Carl Silva - Telecom Eng - DTN 828-5339) Fri Nov 08 1991 12:07 (24 lines)

>Where can I obtain information on the Alarms and Notification FM
>interfaces?  I would like to be able to write an FM that screens events
>and writes them into an external database so another, independent,
>program can process them.  This may be in addition to the facilities
>offered by PNMP, below.

	The API is the normal interface that all modules use (mcc_call_access
or mcc_call_function).

>>First, there needs to be a way for customers
>>    to write simple filtering scripts that correlate multiple events
>>    occurring in some time scope and generate a
>>    "more meaningful" event to go into the alarms fm. 
>
>It appears that the PNMP Alarm Handling FM will offer this capability
>(see TAEC::PNMP conference, note 3.2):

	Yes, it will provide some of this functionality.

	RE: .2,

	Can you expand on your requirements?

	Carl

1781.6. "As usual, sample code would do wonders.." by SUBWAY::REILLY (Mike Reilly - New York Bank District) Fri Nov 08 1991 12:15 (30 lines)
    
    re: .1
    
    >	This can be done now.  You can build an FM that receives the alarm
    >	notifications from the alarms FM.
    
    If you know of somewhere I could get my hands on some code which 
    retrieves alarms and inserts alarms into the system I would
    be very interested in developing the link to a rule based system.
    I would suggest that code such as this should be shipped with the
    next release of DECmcc.  This will allow customers to use their own
    alarm pre-processing algorithms in the near term.
    
    My current customer would like to define DECmcc alarm rules which
    all have a severity of 'warning' or lower.  When the alarms are passed
    thru the pre-processing algorithm, a synthetic alarm would be
    generated (if needed) with a severity of 'critical'.   The network operations
    staff would only have to watch for alarms which have a severity of
    'critical'.  The original alarms would still be available if needed.
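
    A minimal Python sketch of that scheme follows.  The tuple layout,
    the grouping by site and the threshold of three alarms are
    assumptions made for the example; a real rule base would be far
    richer.

```python
from collections import Counter


def synthesize(raw_alarms, min_count=3):
    """Return synthetic 'critical' alarms for sites with many raw alarms.

    raw_alarms is a list of (site, severity, text) tuples, all with a
    severity of 'warning' or lower.  The raw list is left untouched.
    """
    per_site = Counter(site for site, _severity, _text in raw_alarms)
    return [(site, "critical", f"{count} related alarms at {site}")
            for site, count in sorted(per_site.items())
            if count >= min_count]


def operator_view(raw_alarms, synthetic):
    """Return only the alarms the operations staff need to watch."""
    return [a for a in raw_alarms + synthetic if a[1] == "critical"]
```

    The raw alarms stay in their own list, so they remain available for
    later inspection while the operators see only the synthetic ones.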
    
    We also need a fix to the problem of all exceptions being flagged as
    'critical' for this scheme to work.
    
    With regard to the development of an AI system to handle this, I
    have heard that the south of France provides the ideal environment
    for the development of AI systems :-). 
    
    - Mike
    
     
1781.7. by RIVAGE::SILVA (Carl Silva - Telecom Eng - DTN 828-5339) Fri Nov 08 1991 12:17 (5 lines)

>    With regard to the development of an AI system to handle this, I
>    have heard that the south of France provides the ideal environment
>    for the development of AI systems :-). 

	Yes, it does!

1781.8. "1.2 notif has a lot of what you are asking for" by TOOK::CALLANDER (MCC = My Constant Companion) Fri Jan 03 1992 11:30 (23 lines)

    okay, I know this discussion is old but...
    
    The MRMs (module reference manuals) ship with the kit. The alarms and
    notif ones will be updated along with the rest for final v1.2 shipment. 
    These documents explain the functions supported by each module.
    
    Now as to most of the functionality you have listed, you should find
    a lot of it in the notification services in the 1.2 field test kit.
    The logging stuff didn't make it into field test but will be in the
    final product. The log is supposed to allow logging based on the
    filters you have defined. 
    
    As to an open api to do what you are looking for, the data collector
    AM provides an easy to use open interface that allows for information
    to be passed into mcc from any application you want to connect to
    the api. The documentation on this is only being distributed to a
    few specific field test sites so that we can first determine if our
    implementation meets the need. If you want more information on
    the data collector please send mail to Anne Pelagatti or Wally
    Matthews (both are located on TOOK::).
    
    jill