
Conference azur::mcc

Title:	DECmcc user notes file. Does not replace IPMT.
Notice:	Use IPMT for problems. Newsletter location in note 6187
Moderator:	TAEC::BEROUD

Created:	Mon Aug 21 1989
Last Modified:	Wed Jun 04 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	6497
Total number of notes:	27359

2619.0. "alarm exception questions" by COL01::LUNT () Tue Mar 24 1992 10:03

Hi,

	I have the following questions concerning alarm exceptions.
I am using T1.2.4 base level of MCC running on VMS 5.5.

1. When I submit alarms that should run every 15 minutes and an exception
occurs, the next exception occurs 5 minutes later, not 15, if the error
has not cleared, thereby causing a second exception. Is this correct? And
is this "exception timer" changeable?

2. How is the Severity of an exception determined? Is there a way to set 
this so that it is not always Indeterminate? Why does it not take on
the Severity of the alarm that was set?

3. Why is an MCC alarm tied to a certain revision of the
Alarms_mail_exceptions.com file in mcc_common? Is there a way to change
this? Please see the log file below, followed by the rule creation commands.
I changed the command file, thereby creating a new revision number, but did
not notice that my exceptions were not working correctly until I had a
bridge hang and received no exception.

LOG FILE:

$!
$! This command procedure is always run when anybody on the entire system
$! logs in. It is equivalent to LOGIN.COM except that the instructions
$! contained herein are executed everytime anyone on the VMS system
$! logs in to their account.
$!
$! For interactive processes, turn on Control T, and set the terminal type
$!
$ IF (F$MODE() .EQS. "BATCH") THEN GOTO EXIT
$EXIT:
$EXIT
$! FILE: MCC_ALARMS_SECURITY.COM
$!  
$!  **************************************************************
$!  * ********************************************************** *
$!  * *                                                        * *
$!  * *  Copyright (c) Digital Equipment Corporation, 1990     * *
$!  * *  All Rights Reserved.  Unpublished rights reserved     * *
$!  * *  under the copyright laws of the United States.        * *
$!  * *                                                        * *
$!  * *  The software contained on this media is proprietary   * *
$!  * *  to and embodies the confidential technology of        * *
$!  * *  Digital Equipment Corporation.  Possession, use,      * *
$!  * *  duplication or dissemination of the software and      * *
$!  * *  media is authorized only pursuant to a valid written  * *
$!  * *  license from Digital Equipment Corporation.           * *
$!  * *                                                        * *
$!  * *  RESTRICTED RIGHTS LEGEND   Use, duplication, or       * *
$!  * *  disclosure by the U.S. Government is subject to       * *
$!  * *  restrictions as set forth in Subparagraph (c)(1)(ii)  * *
$!  * *  of DFARS 252.227-7013, or in FAR 52.227-19, as        * *
$!  * *  applicable.                                           * *
$!  * *                                                        * *
$!  * ********************************************************** *
$!  **************************************************************
$!  
$! FACILITY:
$!     MCC -- Management Control Center
$! 
$! ABSTRACT:
$! 
$! This purpose of this command file is to normalize DCL symbols, set
$! user's privileges to NOALL, restrict logical trasnlation to searching
$! on the system logical translation table, all as a security measure.
$! 
$! When a RULE fires, this command file will be queued to the SYS$BATCH
$! or user specifed execution queue.  This file will call the user's
$! specified command file after taking the above security precautions. 
$! It will then set the user's environment back to its original state.
$! 
$! The LOGFILE for the execution of the batch job is written to the
$! directory where the user specified command procedure is located.
$!
$! The parameters p1-p8 are displayed if the MCC_ALARMS_FM_LOG debug mask
$! is set to ??
$! 
$! 24-AUG-1990   Aud Orenstein	- P8 has datafile name concatenated
$! 30-AUG-1990   Anil Navkal      Deleted the data file: 
$! 05-SEP-1990   Aud Orenstein    Set ON for error handling
$! 06-DEC-1990   Aud Orenstein    Why doesn't the data file get deleted?
$! ========================================================================
$! ========================================================================
$
$
$!++
$! If the MCC_ALARMS_FM_LOG set to ALARMS$M_LOG_SECURITY_WRITELNS then write
$! the values of P1 through P8 
$!--
$
$ ALARMS$M_LOG_SECURITY_WRITELNS = %X08
$ 
$ ALARMS$STR_MASK = f$trnlnm ("MCC_ALARMS_FM_LOG")
$ ALARMS$INT_MASK = f$integer ("%x"+ ALARMS$STR_MASK)
$ ALARMS$BIT = ALARMS$INT_MASK .and. ALARMS$M_LOG_SECURITY_WRITELNS
$
$    queue = f$getqui ("DISPLAY_JOB", "QUEUE_NAME",, "THIS_JOB")
$    write sys$output "Current queue:	  ",queue
Current queue:	  MCC$ALARMS
$
$ if ALARMS$BIT .gt. 0
$ endif
$
$!++
$! Delete all the users symbols.  This is done to NORMALIZE the DCL symbols
$!--
$
$ D == "DELETE"
$ D/SYMBOL/GLOBAL/ALL                                                
$
$!++                     
$! DELETE user defined logicals in PROCESS and JOB tables
$! Restrict logical translation to the PROCESS, JOB and SYSTEM logical 
$! translation tables.  GROUP is not used.
$!--
$
$ if ALARMS$BIT .gt. 0 then SHOW LOG

$
$ DEASSIGN/PROCESS/ALL
$ DEASSIGN/JOB/ALL
$ DEFINE/TABLE=LNM$PROCESS_DIRECTORY LNM$FILE_DEV LNM$PROCESS,-
               LNM$JOB, LNM$SYSTEM
$
$ if ALARMS$BIT .gt. 0 then SHOW LOG
$
$!++
$! Save all users privileges and turn them off
$!--
$
$ SAVEPRIVS = F$SETPRV ("NOALL,TMPMBX, NETMBX")
$
$!++
$! Split the DATA filename off from the User's Procudure name
$!--                                          
$
$ COMMAND_PROCEDURE = F$ELEMENT(0, "/", P8)
$ DATA_FILE 	    = F$ELEMENT(1, "/", P8)
$
$!++
$! Set ON incase user's command procedure returns error
$!--                                          
$ on sever_error then continue
$ on error then continue
$ on warning then continue
$
$!++
$! Run the User's Command Procedure
$!--                                          
$ @DISK$DATA:[MCC]MCC_ALARMS_MAIL_EXCEPTION.COM;6 "MCC 0 ALARMS RULE BYLN13_UNREACHABLE"-   	!rulename
		      "DESCRIPTION"-   	!category 
		      "BRIDGE UNREACHABLE "-   	!description
		      "(BRIDGE .EU.BY.BYL.BYLN13 DEVICE STATE <> OPERATING, at  every=00:15:00)"-   	!expression
		      "24-MAR-1992 11:20:38.30"-   	!time
		      "Cannot communicate with target"-   	!dtcrtf or error
		      "@MCC_COMMON:ALARMS_DISTRIBUTION_LISTE.DIS"-   	!notification params
        	      "SYS$SCRATCH:MCC_ALARMS_DATA_11203830.DAT
%DCL-E-OPENIN, error opening DISK$DATA:[MCC]MCC_ALARMS_MAIL_EXCEPTION.COM;6 as input
-RMS-E-FNF, file not found
$ !++
$ ! 	Delete the data file 
$ !--
$ !
$ !	DELETE 'DATA_FILE';
$
$       DELETE/LOG DISK$DATA:[MCC_ALARMS]MCC_ALARMS_DATA_11203830.DAT;
%DELETE-I-FILDEL, DISK$DATA:[MCC_ALARMS]MCC_ALARMS_DATA_11203830.DAT;1 deleted (3 blocks)
$!==========================================================================
$ !++
$ ! clean exit point
$ !--
$ clean_exit: 
$ !
$ !--------------------------------------------------------------------------------------
$ 		exit
  MCC$ALARMS   job terminated at 24-MAR-1992 11:20:42.57

  Accounting information:
  Buffered I/O count:              40         Peak working set size:     445
  Direct I/O count:                41         Peak page file size:      2665
  Page faults:                    519         Mounted volumes:             0
  Charged CPU time:           0 00:00:00.98   Elapsed time:     0 00:00:03.34

===========================================================================
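Two steps in the wrapper log are easy to misread: the debug-mask test (the "??" in the header comment is answered by the line ALARMS$M_LOG_SECURITY_WRITELNS = %X08, i.e. bit 8 hex must be set in MCC_ALARMS_FM_LOG) and the split of P8 into the user's procedure and the data file. The Python sketch below mirrors only those two steps. The function names are illustrative, the sample P8 value is hypothetical, and treating an undefined logical as a mask of 0 is an assumption.

```python
from typing import Optional, Tuple

# Value copied from the log: ALARMS$M_LOG_SECURITY_WRITELNS = %X08
ALARMS_M_LOG_SECURITY_WRITELNS = 0x08


def security_writelns_enabled(mcc_alarms_fm_log: Optional[str]) -> bool:
    """Mirror of F$INTEGER("%x" + mask) .AND. %X08 followed by the "> 0" test.

    The logical name's value is interpreted as a hexadecimal string.
    Assumption: an undefined logical (None or "") counts as a mask of 0.
    """
    mask = int(mcc_alarms_fm_log, 16) if mcc_alarms_fm_log else 0
    return (mask & ALARMS_M_LOG_SECURITY_WRITELNS) > 0


def split_p8(p8: str) -> Tuple[str, str]:
    """Mirror of F$ELEMENT(0, "/", P8) and F$ELEMENT(1, "/", P8).

    F$ELEMENT returns the delimiter itself when the requested element does
    not exist, so a P8 without a "/" yields "/" as the data file.
    """
    parts = p8.split("/")
    command_procedure = parts[0]
    data_file = parts[1] if len(parts) > 1 else "/"
    return command_procedure, data_file


if __name__ == "__main__":
    print(security_writelns_enabled("08"))  # bit %X08 set
    print(security_writelns_enabled("07"))  # bit %X08 clear
    # Hypothetical P8 value, shaped as the header comment describes.
    print(split_p8("DISK$DATA:[MCC]MCC_ALARMS_MAIL_EXCEPTION.COM;6"
                   "/SYS$SCRATCH:MCC_ALARMS_DATA_11203830.DAT"))
```

Any mask value with that bit set (08, 0F, 1A, ...) enables the extra logging; other bits are presumably used by other debug options.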

RULE Creation:
==========================================================================
Delete MCC 0 ALARMS RULE BYLN13_UNREACHABLE, in domain = DECNOS_NS:.EU.BY.BYL.K9_PLT
Create MCC 0 ALARMS RULE BYLN13_UNREACHABLE -
  Category           = "BRIDGE UNREACHABLE ", -
  Description        = "DESCRIPTION", -
  Expression         = (BRIDGE .EU.BY.BYL.BYLN13 DEVICE STATE <> OPERATING, at  every=00:15:00), -
  Procedure          = DISK$DATA:[MCC]MCC_ALARMS_MAIL_ALARM.COM, -
  Exception Handler  = DISK$DATA:[MCC]MCC_ALARMS_MAIL_EXCEPTION.COM, -
  Parameter          = "@MCC_COMMON:ALARMS_DISTRIBUTION_LISTE.DIS", -
  Queue              = "MCC$ALARMS", -
  Perceived Severity = CRITICAL, -
  in domain = DECNOS_NS:.EU.BY.BYL.K9_PLT
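The log shows the wrapper calling DISK$DATA:[MCC]MCC_ALARMS_MAIL_EXCEPTION.COM;6 even though the rule was created with an unversioned Exception Handler spec, and the first reply confirms this binding is a deliberate security feature. The Python model below is hypothetical: it assumes the rule resolves the highest existing version when it is created, and that ;6 was removed after the edit (the "file not found" in the log implies ;6 was gone, but the note does not say how). Under this model, re-creating the rule would pick up the new version; the thread does not confirm that as the intended fix.

```python
from typing import Dict, Optional, Set

# Hypothetical model: a directory maps an unversioned file spec to the set of
# version numbers currently on disk.
Directory = Dict[str, Set[int]]


def latest_version(directory: Directory, spec: str) -> Optional[int]:
    """Highest version present for spec, or None when no version exists."""
    versions = directory.get(spec, set())
    return max(versions) if versions else None


def pin_at_creation(directory: Directory, spec: str) -> str:
    """Resolve an unversioned spec to SPEC;N (assumed to happen at Create time)."""
    version = latest_version(directory, spec)
    if version is None:
        raise FileNotFoundError(spec)
    return f"{spec};{version}"


def run_pinned(directory: Directory, pinned: str) -> str:
    """Run exactly the recorded SPEC;N, failing if that version is gone."""
    spec, _, version = pinned.rpartition(";")
    if int(version) not in directory.get(spec, set()):
        return f"file not found: {pinned}"
    return f"running {pinned}"


if __name__ == "__main__":
    handler = "DISK$DATA:[MCC]MCC_ALARMS_MAIL_EXCEPTION.COM"
    disk: Directory = {handler: {6}}
    recorded = pin_at_creation(disk, handler)  # the rule remembers ...COM;6
    disk[handler] = {7}                        # edit creates ;7, old ;6 removed
    print(run_pinned(disk, recorded))          # the old binding fails
    print(run_pinned(disk, pin_at_creation(disk, handler)))  # a fresh binding runs ;7
```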

T.R	Title	User	Personal Name	Date	Lines
2619.1	Some answers	TOOK::MINTZ	Erik Mintz, DECmcc Development, dtn 226-5033	Tue Mar 24 1992 11:21	19

>1. When I submit alarms that should run every 15 minutes, and an exception
>occurs, then the next exception occurs 5 minutes later not 15 when the
>error has not cleared thereby causing a second exception. Is this correct?
>and is this "exception timer" changeable.

Don't know off hand.

>2. How is the Severity of an exception determined? Is there a way to set
>this so that it is not always Indeterminate? Why does it not take on
>the Severity of the alarm that was set?

It is always Indeterminate. We are considering changing it to take on the severity of the corresponding alarm.

>3. Why is an MCC alarm tied to a certain revision of the
>Alarms_mail_exceptions.com file in mcc_common? Is there a way to change

This is a security feature. There is no way to change it.
2619.2	More Needed	BEAGLE::ANDRADE	The sentinel (.)(.)	Fri Mar 27 1992 04:29	40

Some more on alarms:

Clients want DECmcc to change icon colors for every event, but also not to receive too many messages about the same thing.

As an example: using polling to detect when nodes become unreachable and then reachable again.

In order to do this, it would help if the alarm color for the exception and for the normal alarm firing could be set to two different colors, and if DECmcc would always set the icon to the last color alarmed.

Also, in a CHANGE_OF expression, I find it very strange that the EXCEPTION doesn't count in setting the known attribute value. I think the attribute value should be changed in this case to "UNKNOWN".

Thus, with an alarm of the type:

   (CHANGE_OF(NODE4 X state ,))
   normal firing    = clear    = blue
   exception firing = critical = red

I could make the ICON go red (and mail sent, etc.) when the node becomes unreachable (state changes from "on" to "UNKNOWN"). Then, while the node is unreachable, the alarm exception doesn't fire again because the state remains "UNKNOWN". But when the node comes up again, the state changes from "UNKNOWN" to "on" and the alarm fires again normally, changing the ICON to blue (and sending mail, etc.). And lastly, while the node remains reachable, the alarm doesn't fire again either, because the state remains "on" all the while.

Hopefully,
Gil

P.S. Several concerned people have mentioned something similar to me (how with DECmcc you can never see node state = off, etc.), because then the node is unreachable and it's the exception that fires.
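Gil's proposal can be modelled in a few lines of Python. This sketches the proposed behaviour, not DECmcc V1.2: it assumes a failed poll is recorded as the pseudo-value "UNKNOWN" instead of raising an exception, assumes the first poll only establishes a baseline, and reuses the colours from his example rule (normal firing = blue, exception firing = red).

```python
from typing import Iterable, List, Optional, Tuple

UNKNOWN = "UNKNOWN"          # proposed pseudo-value for "could not poll"
COLOUR_NORMAL = "blue"       # normal firing = clear
COLOUR_EXCEPTION = "red"     # exception firing = critical


def change_of(polls: Iterable[Optional[str]]) -> List[Tuple[str, str, str]]:
    """Return (old, new, colour) for each poll whose value differs from the last.

    A poll result of None stands for "target unreachable" and is recorded
    as UNKNOWN, so repeated failures do not fire again.
    """
    firings = []
    last = None  # assumption: no baseline before the first poll
    for result in polls:
        value = UNKNOWN if result is None else result
        if last is not None and value != last:
            colour = COLOUR_EXCEPTION if value == UNKNOWN else COLOUR_NORMAL
            firings.append((last, value, colour))
        last = value
    return firings


if __name__ == "__main__":
    # on, on, three failed polls, on, on
    history = ["on", "on", None, None, None, "on", "on"]
    for old, new, colour in change_of(history):
        print(f"{old} -> {new}: icon {colour}")
```

Seven polls spanning an outage produce exactly two firings: one red when the node drops out, one blue when it returns.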
2619.3	Try IpReachability	TOOK::MINTZ	Erik Mintz, DECmcc Development, dtn 226-5033	Fri Mar 27 1992 07:49	9

For SNMP nodes, I would suggest using the IpReachability attribute, which should return up or down rather than the exception.

Unfortunately, it is too late in the development cycle for us to change functionality in alarms for V1.2. But I will file your note as a suggestion QAR so it can be considered later.

-- Erik
2619.4	Occurs N Times + 'generic' reachability?	TOOK::MCPHERSON	Save a tree: kill an ISO working group.	Fri Mar 27 1992 08:19	22

> Clients want, DECmcc to change icon colors for every event. Also not to
> receive too many messages about the same thing.

Maybe I'm reading this incorrectly, but this statement seems to contradict itself. Would an "OCCURS N TIMES" rule format do anything to help you here?

Re: your questions about reachability et al.: for objects that have 'reachability' defined as an attribute (e.g. IPreachability for SNMP entities), you can do what you want. For all the rest, you will still need fancier processing (which is not in 1.2)...

Maybe one could make a case for a 'generic' attribute called "Reachability" that could be 'inherited' (sorta like reference attributes)? Might that be a more reasonable, predictable way to solve the general problem than trying to infer reachability from alarm rule exceptions firing?

/doug
2619.5	Re. 3 & .4	BEAGLE::ANDRADE	The sentinel (.)(.)	Mon Mar 30 1992 05:29	52

Re.3

Thanks, Erik. Too bad it can't be done for V1.2; I guess I should have mentioned it sooner. But it seemed so obvious that something like this was needed...

Re.4

Doug, a generic Reachability attribute that is inherited certainly seems like something worth doing. A CHANGE_OF rule would fire when the node first becomes reachable or unreachable. However, would the ICON colors change as well with the node state? This would require being able to set ICON colors by the "REACHABILITY" attribute state.

The "UNKNOWN" value for attributes I asked for is in fact the same as your "REACHABILITY" attribute, if you implement it for everything: not just the node's reachability, but every attribute's reachability.

Also, people still don't want to be swamped by exception alarms... Exceptions should fire only when the reason for the exception changes.

I also suggest that it would be very useful to have the exception and the normal alarm firing turn the ICONs to different colors. This is desirable no matter what else is done.

When DECmcc fails to get some data, all involved attribute values should be set to "UNKNOWN" or "UNREACHABLE". For status attributes, this just means an extra valid state. For string attributes, this is just another string. For numeric attributes it may be more difficult: it means associating a state variable with every one of them.

One use of such information is to get node reachability state using CHANGE_OF alarm rules.

Another use is to notify you ONCE (both with the ICON color and your chosen method) when any attribute you are watching becomes unreachable, and to notify you ONCE again when it becomes reachable, for whatever reason.

For example: you have an alarm to inform you when a line's utilization goes over 80%. DECmcc would fire the rule normally as requested, but it would also warn you when the UTILIZATION data is not available for whatever reason. (No need for you to create extra rules, and no fear of being swamped by exception alarms.)

Regards,
Gil

***

Another thing that would be very useful would be to let us, the DECmcc users, set the ICON color from the ALARM user command procedure, using a command like:

   "MCC> set entity X color Y, in domain Z"

We could then choose the ICON color based on whatever data item we choose and according to our specific environment. It would also be useful for demos and the like.
2619.6	smart availability	SKIBUM::GASSMAN		Mon Mar 30 1992 09:04	11

As long as you are talking about reachability: one of the items that MSU had over HP (as of hot staging for last October's Interop) was the attribute it looks at for the reachability map. MSU digs down into the interfaces and checks their status, so during the 'frequent' poll it checks not only the host but also its circuit information. MSU is actually quite good at finding and displaying particular circuits that are down, thanks to this definition of reachability. It is one of MSU's competitive features that MCC needs to emulate as we go after the LAN/internet market.

bill
2619.7	DECmcc does have exception color	TOOK::R_SPENCE	Nets don't fail me now...	Mon Mar 30 1992 13:27	4

By the way, DECmcc DOES have a separate color for exceptions: it is the color associated with the severity INDETERMINATE.

s/rob
2619.8	more discussion	MCDOUG::MCPHERSON	Save a tree: kill an ISO working group.	Mon Mar 30 1992 15:18	56

> A generic Reachability attribute that is inherited certainly seems like
> something worth doing. A CHANGE_OF rule would fire when the node first
> becomes rechable or unreachable. However, would the ICON colors change
> as well with the node state. This would require being able to set ICON
> colors by the "REACHABILITY" attribute state.

The behavior of the product is this: icon color is associated with an event or alarm rule severity. If there is an event or alarm associated with this REACHABILITY attribute, THEN the object's icon will change color.

> Also people still don't want to be swamped by exception alarms...
> Exceptions should fire only when the reason for the exception changes.

Sorry. Your interpretation of EXCEPTIONs firing is in conflict with DECmcc's prescribed behavior. Exceptions fire when an alarm rule CANNOT be evaluated. Period. That is the prescribed behavior.

> I also sugest that it would be very usefull to have the exception
> and the normal alarm firing turn the ICONs to different colors. This
> is desirable no matter what else is done.

There is a default notification severity (hence color) for EXCEPTION: the one associated with severity = Indeterminate.

> When DECmcc fails to get some data then all involved attribute values
> should be set to "UNKNOWN" or "UNREACHABLE". For Status attributes,
> this just means an extra valid state. For String Attributes this is just
> another string. For Numeric attributes it maybe more difficult, it means
> associating a state variable to everyone of them.

This may be true for your particular requirements, but certainly not universally. Might it be that what you're hinting at is some indicator in an entity's description that describes the 'relationship atomicity' (for want of a better description) of a group of attributes? I.e. if any member of a given attribute partition is not returned, then set a flag that indicates "data is suspect".

Also remember: alarm rules gotta do what alarm rules gotta do. They look for the data needed to satisfy rules in a BOOLEAN fashion. I.e. there is no "Maybe. Come back later" state for an alarm rule; they are either TRUE, FALSE or INDETERMINATE. INDETERMINATE means that it couldn't evaluate either TRUE or FALSE, and that is considered an EXCEPTION (and not the rule.... ;^) )

>Another thing that would be very usefull, would be to let us the
> DECmcc users. Set the ICON color from the ALARM user command procedure.
> Using a command like: "MCC> set entity X color Y, in domain Z"
> We could then choose the ICON color, based on whatever data item we
> choose and acording to the our specific enviroment.
>
> It would also be usefull for Demos, and the like.

Use the Data Collector AM & sample code & specify a target entity in the event. That'll do what you want fairly simply.
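Doug's three outcomes can be illustrated for the rule from 2619.0, (BRIDGE ... DEVICE STATE <> OPERATING). In this Python sketch the assumption is that a poll returns None when it fails (as with "Cannot communicate with target" in the log); "SOME_OTHER_STATE" is a placeholder, not a real bridge state name.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    TRUE = "rule fires: the alarm procedure runs"
    FALSE = "nothing happens"
    INDETERMINATE = "exception: the exception handler runs"


def evaluate_state_not_operating(device_state: Optional[str]) -> Outcome:
    """Evaluate (... DEVICE STATE <> OPERATING) for one poll.

    Assumption: device_state is None when the poll fails.
    """
    if device_state is None:
        # Neither TRUE nor FALSE can be decided.
        return Outcome.INDETERMINATE
    return Outcome.TRUE if device_state != "OPERATING" else Outcome.FALSE


if __name__ == "__main__":
    for polled in ["OPERATING", "SOME_OTHER_STATE", None]:
        print(polled, "->", evaluate_state_not_operating(polled).name)
```

Each poll is evaluated on its own, so an unreachable bridge yields INDETERMINATE on every poll; that is why the exceptions repeat for as long as the target stays unreachable.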
2619.9	More	MAYDAY::ANDRADE	The sentinel (.)(.)	Tue Mar 31 1992 12:33	80

Re.6 (Bill)

I agree DECmcc alarms need more functionality. Alarms should be more than a simple YES/NO check of an entity's attribute. Having DECmcc alarms do one thing per alarm means that users have to create a lot of alarms that should not have been needed.

And this is what I am suggesting: making the alarms keep track of the availability of the data they check as well as the data itself, and then passing the availability data to the user in an effective manner. Reducing the number of alarm rules, as well as the number of firings those rules do, is one of my major goals.

What I want is an ICON map that mirrors DECmcc's knowledge of the CURRENT state of the network (this includes an indication of alarmed entities that are unreachable), with ONE mail sent to the appropriate people when something serious happens. A node becoming unreachable is serious; I just don't want to hear about it thousands of times.

Re.7 (Rob)

> By the way, DECmcc DOES have a seperate color for exceptions, it
> is the color associated with the severity INDETERMINATE.

This is news to me; it must be V1.2 functionality. My V1.1 never did this.

Re.8 (MCPHERSON)

>The behavior of the product is this: Icon color is associated with an event or
>alarm rule severity. If there is an event or alarm associated with this
>REACHABILITY attribute, THEN the object's icon will change color.

I know that each alarm rule has ONE color associated with it. What I suggested is that it may be useful to have MANY colors associated with a single rule, avoiding the need to create many rules to do the same thing. Something like:

   "STATE= (ON=green, OFF=red, UNREACHABLE=orange)"
or
   "UTILIZATION= ([<80]=green, [>=80]=yellow, UNREACHABLE=red)"

>Sorry. Your interpretation for EXCEPTIONs firing is in conflict with DECmcc's
>prescribed behavior. Exceptions fire when an alarm rule CANNOT be evaluated.
>Period. That is the prescribed behavior.

I know that this is how they work. But consider this: you are polling 10 systems (with 3 alarms each) in an Ethernet segment every 10 minutes. This means that if that segment becomes unreachable for a day (it has happened), you will receive over FOUR THOUSAND alarm exceptions in a single day informing you that those systems are unreachable (and if you requested mail...).

That is why, to reduce the workload of the users (and of DECmcc), I suggest that those alarm exceptions inform the user ONCE per alarm that the system is unreachable, and ONCE again when it becomes reachable. It would be a lot better: 60 alarm exceptions as opposed to over 4,000.

>Also remember: Alarm rules gotta do what alarm rules gotta do. They look for
>the data needed to satisfy rules in a BOOLEAN fashion. I.e. there is no
>"Maybe. Come back later" state for an alarm rule; they are either TRUE, FALSE
>or INDETERMINATE. INDETERMINATE means that it couldn't evaluate either TRUE or
>FALSE and that is considered an EXCEPTION (and not the rule.... ;^) )

I agree a rule gotta do what it's gotta do; all I am saying is that exceptions should fire TWICE ONLY: ONCE when the rule becomes INDETERMINATE and ONCE AGAIN when the rule becomes DETERMINATE again.

If, as indicated, V1.2 sets the ICON to the INDETERMINATE color "WHILE" the rule evaluation is INDETERMINATE, then these two things together would provide what I requested in my original reply.

>Use the Data Collector AM & sample code & specify a target entity in the event.
>That'll do what you want fairly simply.

Thanks for the information, I will look into it.
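Gil's figures check out. A worked version in Python, using only the numbers from the note (10 systems, 3 rules each, a 10-minute poll, one day unreachable) and assuming every poll of every rule ends in an exception for the whole day:

```python
MINUTES_PER_DAY = 24 * 60


def exceptions_per_day(systems: int, rules_per_system: int, poll_minutes: int) -> int:
    """Current behaviour: every poll of every rule is INDETERMINATE all day."""
    polls_per_rule = MINUTES_PER_DAY // poll_minutes
    return systems * rules_per_system * polls_per_rule


def exceptions_if_edge_triggered(systems: int, rules_per_system: int) -> int:
    """Proposed behaviour: one report on going INDETERMINATE, one on recovery."""
    return systems * rules_per_system * 2


if __name__ == "__main__":
    print(exceptions_per_day(10, 3, 10))        # 30 rules x 144 polls
    print(exceptions_if_edge_triggered(10, 3))  # 30 rules x 2 reports
```

That is 30 rules x 144 polls = 4,320 exceptions per day today, against 30 x 2 = 60 reports under the proposal.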
2619.10	Wish->QAR->Engineering's 'to do' list...	TOOK::MCPHERSON	Save a tree: kill an ISO working group.	Tue Mar 31 1992 13:21	19

I urge you to file a QAR so that your suggestions about changes to the post-1.2 product will get in the queue. Otherwise, they're likely to remain 'nice ideas'.

BTW: I understand your requirements, and they would be nifty enhancements; you just 'cahn't get there from heah' right now.

/doug

P.S. If you're getting tons of mail on your EXCEPTIONS, then you need to change the exception procedure to NOT send mail. At least you wouldn't get flooded with mail when a segment goes unreachable...

P.P.S. You can also create "rules on rules" that look at the associated counters (i.e. if it's unable to evaluate >10 times, then open the pod bay doors... blah blah blah).

/doug
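Doug's "rules on rules" workaround can be pictured as a second check that watches a failure counter and acts once it passes 10. This Python sketch is hypothetical: the counter readings are plain integers because the note does not name the DECmcc counter attributes involved, and the re-arming step (act again only after the count falls back to the threshold or below, e.g. after a reset) is an assumption added so the watcher itself does not flood.

```python
from typing import Iterable, List


def escalations(exception_counts: Iterable[int], threshold: int = 10) -> List[int]:
    """Return the counter readings at which the watcher would act.

    It acts on the first reading above the threshold, then stays quiet
    until a reading at or below the threshold re-arms it.
    """
    fired = []
    armed = True
    for count in exception_counts:
        if armed and count > threshold:
            fired.append(count)
            armed = False
        elif count <= threshold:
            armed = True
    return fired


if __name__ == "__main__":
    # Counter climbs past 10, is reset to 0, then climbs past 10 again.
    print(escalations([0, 3, 9, 11, 12, 15, 0, 5, 11]))
```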
2619.11	ok	MAYDAY::ANDRADE	The sentinel (.)(.)	Thu Apr 09 1992 06:53	8

Re.10 /doug

Certainly, I will open a QAR if it will help, but Erik Mintz (re.3) has already done so for me. These last replies are just to iron out the functionality details.

Gil
2619.12	correlation of events	TOOK::CALLANDER	MCC = My Constant Companion	Tue Apr 21 1992 11:45	18

To make things a bit clearer: what you are asking for has already been defined by the standards on alarming, which call this function event correlation. The profile (the document that explains how to implement a standard) contains a description of the how-tos and when-tos for event correlation.

The idea behind correlation is that an event is reported only once, and another report doesn't occur until the condition has changed (better or worse).

In DECmcc V1.2 we have implemented correlation at the PM (presentation module) level and not at the FM (function module) level. It needs to be handled in the FM, but not until we figure out how to handle notification of these events when user processes come and go between the leading edge and trailing edge of an event.

If you select an entity and ask to see its list of notifications (display notifications), you will see only the correlated list, not EVERY event that has been reported.

We hope this helps, but we are aware of the limitation that the PM implementation places on the user.

jill
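The correlation rule described above (report once, report again only when the condition changes, better or worse) reduces to remembering the last reported severity per entity. This Python sketch illustrates that idea only; it is not the PM's actual implementation, and the entity names and severity strings are placeholders.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str]  # (entity, severity); placeholder representation


def correlate(events: Iterable[Event]) -> List[Event]:
    """Drop events whose severity equals the last reported one for that entity."""
    last_reported: Dict[str, str] = {}
    reported: List[Event] = []
    for entity, severity in events:
        if last_reported.get(entity) != severity:
            reported.append((entity, severity))
            last_reported[entity] = severity
    return reported


if __name__ == "__main__":
    raw = [("BYLN13", "indeterminate")] * 3 + [
        ("BYLN13", "clear"),
        ("NODE4", "critical"),
        ("NODE4", "critical"),
        ("BYLN13", "clear"),
    ]
    for entity, severity in correlate(raw):
        print(entity, severity)
```

Seven raw events collapse to three reports: BYLN13 going indeterminate, BYLN13 clearing, and NODE4 going critical.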
2619.13	We hope to see this functionality also!!!!!	COL01::LUNT		Tue Apr 28 1992 06:01	7

Hi,

This correlation of alarms is really needed. We also hope to see it in the follow-on version to 1.2.

Julie Ann