
Conference azur::mcc

Title:	DECmcc user notes file. Does not replace IPMT.
Notice:	Use IPMT for problems. Newsletter location in note 6187
Moderator:	TAEC::BEROUD

Created:	Mon Aug 21 1989
Last Modified:	Wed Jun 04 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	6497
Total number of notes:	27359

6044.0. "SNMP ALARMS NOT WORKING AS EXPECTED" by VNZV01::GONZALEZ (el mas astuto) Mon Jul 11 1994 20:02


Hi all,

I've opened a call in Colorado about some alarms not working
as expected, but I'm running behind schedule with this problem, so I
decided to try asking the field.

This text is part of a mail I sent to the person working with me, so it may
contain some references to other mails or conversations. I apologize for that.

This is the description of the problem:

I have configured a couple of alarms that poll the IP reachability of some
SNMP entities (actually, they are just IP nodes in general, as they're not
configured to respond to SNMP queries from my network station).

One of the alarms polls the SNMP entities, checking for a CHANGE_OF in
IPReachability from whatever state it is in to UP, clearing (turning the icon
blue) any alarm condition it may have had before.

The second alarm polls the SNMP entities, checking for a CHANGE_OF in
IPReachability from whatever state it is in to DOWN, firing with a critical
condition.

The problem is with the behavior of the alarm.

I must state that the problem is not specific to one node, as it has happened
with all the remote nodes.

It is also true that a few times it behaves as expected.

Because of the way the alarms are configured, I wouldn't expect two consecutive
firings reporting IPREACHABILITY DOWN, but they are happening. The system must
detect that the node became reachable before it can fire an alarm saying
that it became unreachable again.

The system says that IPREACHABILITY is UP several times before saying that
the SNMP node became unreachable again.
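Reply 6044.1 attributes this pattern to the two rules sampling independently. As a rough illustration only, the sketch below models each CHANGE_OF rule as keeping its own previous sample and firing when the value changes to its target state. The timeline, poll offsets and the `ChangeOfRule` class are hypothetical, and treating the first sample as a baseline that never fires is an assumption, not something taken from DECmcc documentation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


def reachability(t: int) -> str:
    """Hypothetical node state: briefly unreachable around t=120 s and t=360 s."""
    if 110 <= t < 130 or 350 <= t < 370:
        return "DOWN"
    return "UP"


@dataclass
class ChangeOfRule:
    """Hypothetical model of one CHANGE_OF(..., *, target) alarm rule."""
    name: str
    target: str   # fire when the polled value changes *to* this state
    offset: int   # time of the first poll, in seconds
    period: int   # polling interval, in seconds
    previous: Optional[str] = None
    fired_at: List[int] = field(default_factory=list)

    def poll(self, t: int, state: str) -> None:
        # Assumption: the first sample only sets a baseline and never fires.
        if self.previous is not None and state != self.previous and state == self.target:
            self.fired_at.append(t)
        self.previous = state  # each rule remembers only its own last sample


def simulate(rules: List[ChangeOfRule], end: int, probe: Callable[[int], str]) -> None:
    """Step through time one second at a time; each rule polls on its own schedule."""
    for t in range(end + 1):
        for rule in rules:
            if t >= rule.offset and (t - rule.offset) % rule.period == 0:
                rule.poll(t, probe(t))


if __name__ == "__main__":
    up = ChangeOfRule("SNMP_REACHABILITY_UP", "UP", offset=60, period=120)
    down = ChangeOfRule("SNMP_REACHABILITY_DOWN", "DOWN", offset=0, period=120)
    simulate([up, down], end=480, probe=reachability)
    print(down.name, down.fired_at)  # DOWN rule fires at 120 and 360
    print(up.name, up.fired_at)      # UP rule never fires
```

With these made-up numbers the DOWN rule fires twice while the UP rule never fires, because each short outage falls between the UP rule's samples. This only shows how two unsynchronized rules can disagree; it does not explain why MCC reports DOWN at all when PING shows the node reachable.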

Besides, and more importantly, no host becomes unreachable according to
UCX> PING <TCPIP_NAME> /ALL.

The customer has remote sites connected through asynchronous connections. The
remote SNMP hosts (SYNOPTICS and NEWBRIDGE hubs) are the ones seeing the
problem.

Originally, they were configured to poll every 30 seconds, which could be
considered too short, so I changed the interval to 2 minutes.

That didn't fix the problem.

I read some notes stating that there's a known problem with DEC TCP/IP SERVICES
FOR VMS V2.0: when you execute more than one "ping" simultaneously, it gets
confused and responds that some hosts are reachable when actually they are not.
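The notes referred to above don't say what exactly went wrong in V2.0. Purely as an illustration of the general issue, the sketch below shows one way a pinger can keep concurrent echo requests apart. Every outstanding request is keyed by destination address plus the ICMP echo identifier and sequence number (the fields RFC 792 defines for Echo and Echo Reply). A reply can then only confirm the host it actually came from, and anything unmatched is left to time out. The `PingTracker` and `EchoReply` names are hypothetical, and no sockets are opened.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class EchoReply:
    source: str       # address the Echo Reply came from
    identifier: int   # ICMP identifier copied back from the request
    sequence: int     # ICMP sequence number copied back from the request


class PingTracker:
    """Hypothetical bookkeeping for several echo requests in flight at once."""

    def __init__(self) -> None:
        # (destination, identifier, sequence) -> time the request was sent
        self._pending: Dict[Tuple[str, int, int], float] = {}
        self._next_seq = 0

    def send(self, dest: str, identifier: int, now: float) -> int:
        """Record an outgoing request and return the sequence number to use."""
        self._next_seq = (self._next_seq + 1) % 65536  # 16-bit field
        self._pending[(dest, identifier, self._next_seq)] = now
        return self._next_seq

    def receive(self, reply: EchoReply) -> Optional[str]:
        """Return the host confirmed reachable, or None for an unmatched reply."""
        key = (reply.source, reply.identifier, reply.sequence)
        if self._pending.pop(key, None) is None:
            return None  # stale, duplicate, or belongs to another request
        return reply.source

    def expired(self, now: float, timeout: float) -> List[str]:
        """Remove and return destinations whose request has waited >= timeout."""
        lost = [key for key, sent in self._pending.items() if now - sent >= timeout]
        for key in lost:
            del self._pending[key]
        return [dest for dest, _, _ in lost]
```

Consider two requests in flight to 10.0.0.1 and 10.0.0.2 (hypothetical addresses). A reply from 10.0.0.1 carrying the second request's sequence number is ignored rather than marking 10.0.0.2 reachable. 10.0.0.2 is reported lost only once the full timeout has elapsed.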

Given this situation, I installed DEC TCP/IP SERVICES FOR VMS V3.1, which is
supposed to fix the problem, but the problem remains the same.

I hope this mail completely clarifies the situation I found.

I hope somebody can help me with this, as I know there are some notes reporting
the same problem.

Has anyone fixed this?

Thanks in advance.


Richard González

These are examples of the rules:

Domain LOCAL_NS:.domain.pto_ordaz Rule SNMP_REACHABILITY_UP
AT 22-JUN-1994 16:31:52 Characteristics

Examination of attributes shows:
	Alarm Fired Procedure = DKA100:[MCC]MCC_ALARMS_MAIL_ALARM.COM;2
	Alarm Exception Procedure = DKA100:[MCC]MCC_ALARMS_MAIL_EXCEPTION.COM;2
	Category = "SNMP"
	Batch Queue = "SYS$BATCH"
	Alarm Fired Parameters = "system"
	Expression = (CHANGE_OF (SNMP * IPREACHABILITY, *,UP), at every 
	00:02:00.0)
	Severity = Clear
	Probable Cause = Unknown

The rule SNMP_REACHABILITY_DOWN is similar except "*,up" is "*,down"


BTW, OpenVMS VAX V5.5-2, Polycenter NM200 v1.3.

T.R	Title	User	Personal Name	Date	Lines
6044.1	Use IP Reachability Poller	BIKINI::KRAUSE	CSC Network Management/Hubs	Tue Jul 12 1994 11:51	26
	>One of the alarms polls the snmp entities checking for CHANGE OF IPReachability
	>changes from whatever state it is to UP, clearing (getting the icon blue) any
	>alarm condition it could have before.
	>
	>The second alarm polls the snmp entities checking for CHANGE OF IPReachability
	>changes from whatever state it is to DOWN, firing with some critical condition.

	Possibly a dumb question, but why on earth don't you use the IP Reachability
	Poller??? It has been designed to do just this and it works well.

	>I wouldn't expect, because of the configuration of the alarm, two consecutive
	>firing staying IPREACHABILITY DOWN, as they are happening. The system must
	>detect that the system became reachable before it can fire an alarm saying
	>that it got unreachable again.
	>
	>The system says that IPREACHABILITY is UP several times before saying that
	>the SNMP node got unreachable again.

	Careful, lad! You are running two independent alarm rules. Those two rules
	are not synchronized in any way. If you have several 'downs' between samples,
	this could explain what you see.

	Again: use the IP Poller. If you need to act on events (mail, etc.), write
	OCCURS rules for the IP Poller events (down/up).

	*Robert
6044.2		CSC32::M_EVANS	skewered shitake	Wed Jul 13 1994 16:00	10
	Having worked with Richard (and a host of customers who don't want to use the
	IP Reachability Poller), I can see what the problem appears to be. It appears
	that the UDP timeouts and retries don't work, and that we are left with the
	default 20 seconds from the IP transport, at least with UCX.

	Out of curiosity, how do we call "ping" from the alarm, if that is what we
	are doing?

	meg
6044.3	IP poller is per domain based	VNZV01::GONZALEZ		Thu Jul 14 1994 11:37	16
	Hi, again

	re: .1

	I realize that there might be a synchronization problem here, but I need to
	change the color of the icons according to the severity of the fault, and I
	don't know of another way to do it besides having two different alarms. If
	you know a way, please let me know.

	The reason for not using the IP Poller is that, as far as I know, I have to
	enable it for each domain. Once I get into IMPM I won't do that and, more
	importantly, the customer won't like it. Again, if you know another way to do
	that, let me know.

	Richard
6044.4	MCC DOESN'T RESPECT MY TIMEOUTS	VNZV01::GONZALEZ		Thu Jul 14 1994 11:54	35
	Hi,

	re: .2

	I went to the customer with my super IRIS (thanks Mitch and Chris for it) to
	observe and collect some information about the pings in the network. I
	noticed some interesting things there.

	First, even though I set the ICMP Timeout to 10 (seconds, according to the
	documentation), MCC doesn't wait that long to declare the packet lost: after
	only about 10 to 15 msecs (milliseconds) it sends the retries, and it doesn't
	wait for the retry timeout either.

	Even more, I set the ICMP Timeout to 59 in MCC FCL and tried
	'SHOW SNMP .SNMP.IPHOSTS ALL STATUS' (in the same session, of course) against
	one node I knew was not reachable, and the answer came back after 5, 6 or
	even 2 seconds saying the IPHOSTS node was not accessible.

	I don't think, and I don't want to believe, that this is the way it is
	supposed to work. We (Digital) say that the timeout is settable. Can someone
	explain this behavior to me and the customer? What could be wrong?

	Thanks in advance for replies to this topic.

	Richard

	PS: I have the Iris data available, if someone wants to look at it.
6044.5		CSC32::M_EVANS	skewered shitake	Thu Jul 14 1994 12:32	6
	Richard,

	Can you send me that information? I will add it to your IPMT case. You are
	confirming something I have suspected for a while.

	meg
6044.6	DATA LOCATION	VNZV01::GONZALEZ		Thu Jul 14 1994 15:16	23
	Hi Meg,

	When I used the Iris I only had my floppy with me, so I saved only 45K of
	information. That information is available at:

		VNZV01::PING.DAT

	If you have problems accessing this file, let me know.

	I'm going to the customer now. I will probably have more data by the end of
	the day, and I will make it available as VNZV01::PING1.DAT by tomorrow
	morning.

	This is IRIS data, which I think can be loaded into other protocol analyzers,
	but I don't know how. Anyway, IRIS is more than enough.

	I hope this is useful.

	Thanks.

	Richard
6044.7	'Known Problem'	BIKINI::KRAUSE	CSC Network Management/Hubs	Fri Jul 15 1994 03:39	17
	I already opened an IPMT for the timeout problem (MGO100555) some time ago.
	Meanwhile a slightly different problem has been fixed (it now ignores
	redirect messages and other 'noise'). I already sent traces, but maybe your
	additional traces could help engineering find the bug.

	The IP Poller works OK; that's why I suggested using it instead. You should
	be able to start the poller from a batch job to have it running all the time.
	But then you won't get the current status on the screen when you later start
	the IM.

	A viable solution (or should I say workaround?) is to run the two alarm rules
	once after starting the IM. This could be done in batch as well, e.g. have a
	command procedure that submits a job to run the rules /after=+00:02 and then
	starts the IM.

	I wish there were a command to tell the IP Poller to send a complete status,
	and that this command could be coupled with IM startup. Am I dreaming?

	*Robert
6044.8	MCC, UCX, IRIS, NEW DATA	VNZV01::GONZALEZ		Fri Jul 15 1994 20:16	39
	Hi fellows,

	As I told you, I went to the customer, and this is what I found.

	If I disable all the alarms and enable just a couple of them that control a
	device on the same LAN, MCC respects the timeout, but when all the alarms are
	enabled and running, it doesn't. I couldn't make any test across the WAN
	because all the hosts were OK.

	In the file VNZV01::PING1.DAT I have put IRIS filtered data that refers to
	just one device in the WAN (name 8230_POZ1=00-80-21-00-94-d1). BTW, I have
	also put there the file dwnodes.ini, where I defined the names with their
	hardware addresses.

	Anyone who looks at that file with Iris will see that the alarms were polling
	every 1 minute as expected (if I remember correctly; anyway it's easy to
	check), and since I have two alarms checking for IP reachability, you will
	see that the pings are sent in groups of two. But there are some places where
	you'll see MCC (I don't have the address, but it is 1.110) sending up to 3
	packets without waiting for the 20-second timeout I had set before. Observe
	that all the pings got their answer, but it seems that when the answer
	arrived MCC was no longer waiting for it, and MCC actually fired the alarm.

	I couldn't reproduce this again, but I would say that seconds after I
	disabled all the alarms, Iris was still seeing ping packets on the network.

	Is it possible that we have a synchronization problem between MCC and UCX?
	Is it possible that MCC respects the timeouts but UCX is very busy doing
	other things (BTW, I increased the number of sockets before)?

	Thank you very much for the support with this problem.

	Richard