Conference azur::mcc

Title: DECmcc user notes file. Does not replace IPMT.
Notice: Use IPMT for problems. Newsletter location in note 6187
Moderator: TAEC::BEROUD
Created: Mon Aug 21 1989
Last Modified: Wed Jun 04 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 6497
Total number of notes: 27359

6044.0. "SNMP ALARMS NOT WORKING AS EXPECTED" by VNZV01::GONZALEZ (el mas astuto) Mon Jul 11 1994 21:02


Hi all,

I've opened a call in Colorado regarding some alarms not working
as expected, but I'm running out of time on this problem, so I
decided to try the field.

This text is part of a mail I sent to the person working with me, so it may
contain references to other mails or conversations.  I apologize for that.

This is the description of the problem:

I have configured a couple of alarms that poll the IP reachability of some 
SNMP entities (actually, they are just IP nodes in general, as they're not 
configured to respond to SNMP queries from my network station).

One of the alarms polls the SNMP entities checking for a CHANGE OF IPReachability
from whatever state it is in to UP, clearing (turning the icon blue) any
alarm condition it could have had before.

The second alarm polls the SNMP entities checking for a CHANGE OF IPReachability 
from whatever state it is in to DOWN, firing with a critical condition.

The problem is with the behavior of the alarms.

I must state that the problem is not specific to one node, as it has happened with 
all the remote nodes.

It is also true that a few times they behave as expected.

Given the configuration of the alarms, I wouldn't expect two consecutive
firings reporting IPREACHABILITY DOWN, yet that is what is happening.  The system must 
detect that the node became reachable before it can fire an alarm saying 
that it became unreachable again.

The system says that IPREACHABILITY is UP several times before saying that
the SNMP node became unreachable again.

Besides, and more importantly, no host actually becomes unreachable 
according to UCX> PING <TCPIP_NAME> /ALL.

The customer has remote sites connected through asynchronous connections.  The 
remote SNMP hosts (SYNOPTICS and NEWBRIDGE hubs) are the ones 
showing the problem. 

Originally, the alarms were configured to poll every 30 seconds, which can be 
considered too short, so I changed the interval to 2 minutes.

That didn't fix the problem.

I read some notes stating that there's a known problem with DEC TCP/IP SERVICES 
FOR VMS V2.0: when you execute more than one "ping" 
simultaneously, it gets confused and reports that some hosts are reachable when 
actually they are not.

Given this situation, I installed DEC TCP/IP SERVICES FOR VMS V3.1, which is 
supposed to fix the problem, but the problem remains the same.

I hope this mail fully clarifies the situation I found.

I hope somebody can help me with this, as I know there are some notes describing
the same problem.

Has anyone fixed this?

Thanks in advance.


Richard González

These are examples of the rules:

Domain LOCAL_NS:.domain.pto_ordaz Rule SNMP_REACHABILITY_UP
AT 22-JUN-1994 16:31:52 Characteristics

Examination of attributes shows:
	Alarm Fired Procedure = DKA100:[MCC]MCC_ALARMS_MAIL_ALARM.COM;2
	Alarm Exception Procedure = DKA100:[MCC]MCC_ALARMS_MAIL_EXCEPTION.COM;2
	Category = "SNMP"
	Batch Queue = "SYS$BATCH"
	Alarm Fired Parameters = "system"
	Expression = (CHANGE_OF (SNMP * IPREACHABILITY, *,UP), at every 
	00:02:00.0)
	Severity = Clear
	Probable Cause = Unknown

The rule SNMP_REACHABILITY_DOWN is similar except "*,up" is "*,down"


BTW, OpenVMS VAX V5.5-2, Polycenter NM200 v1.3.
6044.1. "Use IP Reachability Poller" by BIKINI::KRAUSE (CSC Network Management/Hubs) Tue Jul 12 1994 12:51 (26 lines)

>One of the alarms polls the snmp entities checking for CHANGE OF IPReachability
>changes from whatever state it is to UP, clearing (getting the icon blue) any
>alarm condition it could have before.
>
>The second alarm polls the snmp entities checking for CHANGE OF IPReachability 
>changes from whatever state it is to DOWN, firing with some critical condition.

Possibly a dumb question, but why on earth don't you use the IP 
Reachability Poller??? It has been designed to do just this and it 
works well.

>I wouldn't expect, because of the configuration of the alarm, two consecutive
>firing staying IPREACHABILITY DOWN, as they are happening.  The system must 
>detect that the system became reachable before it can fire an alarm saying 
>that it got unreachable again.
>
>The system says that IPREACHABILITY is UP several times before saying that
>the SNMP node got unreachable again.

Careful, lad! You are running two independent alarm rules. Those two 
rules are not synchronized in any way. If you have several 'downs' 
between samples this could explain what you see. Again: use the IP 
Poller. If you need to act on events (mail, etc.) write occurs rules for 
the IP Poller events (down/up).

*Robert
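
The point about two unsynchronized rules can be illustrated with a minimal Python sketch. This is not MCC code: `change_of_to`, `run_rules`, `state_at`, the sampling offsets and the outage windows are all hypothetical, and the sketch assumes that a CHANGE_OF (attr, *, X) rule fires only when its own previous sample differed from X. Each rule keeps its own previous sample, so a short outage seen by one rule can be invisible to the other.

```python
def change_of_to(target):
    """Return a stateful checker that fires when its own samples change to `target`."""
    previous = None

    def check(sample):
        nonlocal previous
        # Assumption: the first sample never fires; afterwards fire only on a real change.
        fired = previous is not None and previous != target and sample == target
        previous = sample
        return fired

    return check


def state_at(t, outages=((170, 190), (410, 430))):
    """Hypothetical link state: DOWN inside the outage windows (seconds), UP otherwise."""
    return "DOWN" if any(start <= t < end for start, end in outages) else "UP"


def run_rules(state_fn, duration, period, up_offset, down_offset):
    """Poll state_fn with two independent rules; return (time, rule) firings in time order."""
    up_rule, down_rule = change_of_to("UP"), change_of_to("DOWN")
    firings = []
    for start in range(0, duration, period):
        for offset, rule, name in ((up_offset, up_rule, "UP"),
                                   (down_offset, down_rule, "DOWN")):
            t = start + offset
            if rule(state_fn(t)):
                firings.append((t, name))
    return sorted(firings)


# Both rules poll every 120 s, but the DOWN rule samples 60 s after the UP rule.
firings = run_rules(state_at, duration=600, period=120, up_offset=0, down_offset=60)
print(firings)
```

With these made-up values the UP rule samples at 0, 120, 240, 360 and 480 seconds and always sees UP, so it never fires, while the DOWN rule samples at 60, 180, 300, 420 and 540 seconds and catches both short outages. The result is two consecutive DOWN firings, at 180 and 420, with no UP firing in between.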
6044.2. by CSC32::M_EVANS (skewered shitake) Wed Jul 13 1994 17:00 (10 lines)

    Having worked with Richard (and a host of customers who don't want to
    use the IP Reachability Poller), I can see what the problem appears to
    be.  It appears that the UDP timeouts and retries don't work and that
    we are left at the default 20 seconds from the IP transport, at least
    with UCX.  
    
    Out of curiosity, how do we call "ping" from the alarm, if this is what
    we are doing?  
    
    meg
6044.3. "IP poller is per domain based" by VNZV01::GONZALEZ Thu Jul 14 1994 12:37 (16 lines)
    
    Hi, again
    
    re:.1
    
    I realize that there might be a synchronization problem here, but I
    need to change the color of the icons according to the severity of the
    fault, and I don't know if there is another way to do it besides having
    two different alarms.  If you know a way, please let me know.
    
    The reason for not using the IP Poller is that, as far as I know, I have
    to enable it for each domain every time I get into IMPM.  I don't want to
    do that and, more importantly, the customer won't like it.  Again, if you
    know another way to do that, let me know.
    
    Richard
6044.4. "MCC DOESN'T RESPECT MY TIMEOUTS" by VNZV01::GONZALEZ Thu Jul 14 1994 12:54 (35 lines)
    
    Hi,
    
    re: .2
    
    I went to the customer with my super IRIS (thanks to Mitch and Chris for it)
    to observe and collect some information about the pings on the network.
    
    I realized some interesting things there.
    
    First, even though I set the ICMP Timeout to 10 (seconds, according to
    the documentation), MCC doesn't wait that long to declare the packet
    lost.  Actually, some 10 to 15 msecs (milliseconds) later it sends
    the retries, and it doesn't wait for the retry timeout either.
    
    Moreover, I set the ICMP Timeout to 59 in MCC FCL and tried 
    'SHOW SNMP .SNMP.IPHOSTS ALL STATUS' (in the same session, of course) against 
    a node I knew was not reachable, and it came back after 5 or 6 or even
    2 seconds saying the IPHOSTS node was not accessible. 
    
    
    I don't think, and I don't want to believe, that this is the way it is
    supposed to work.
    
    We (Digital) say that we have a settable timeout. 
    
    Can someone explain this behavior to me and to the customers?
    
    What can be wrong ?
    
    Thanks in advance for replies to this topic.
    
    Richard
    
    PS: I have the Iris data available, if someone wants to look at it.
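
For reference, the behavior one would normally expect from a timeout-and-retry probe can be sketched in Python as follows. This is only an illustration of the general technique, not of how MCC or UCX implement it: `min_wait_before_unreachable`, `probe` and `send_and_wait` are hypothetical names, and the retry counts used are assumed values, since the thread does not state how many retries were configured.

```python
def min_wait_before_unreachable(timeout_s, retries):
    """Least time a probe should spend on a silent host before giving up.

    Assumes one initial attempt plus `retries` further attempts, each of which
    waits the full `timeout_s` seconds for an answer.
    """
    if timeout_s < 0 or retries < 0:
        raise ValueError("timeout_s and retries must be non-negative")
    return timeout_s * (retries + 1)


def probe(send_and_wait, timeout_s, retries):
    """Try up to retries + 1 times; return (reachable, attempts_used).

    `send_and_wait(timeout_s)` is a caller-supplied function that sends one
    echo request and returns True if a reply arrives within timeout_s seconds.
    """
    for attempt in range(1, retries + 2):
        if send_and_wait(timeout_s):
            return True, attempt
    return False, retries + 1


# With the 59-second ICMP Timeout from this reply and, as an assumption, no retries:
print(min_wait_before_unreachable(59, 0))
# With the 10-second timeout and a hypothetical 2 retries:
print(min_wait_before_unreachable(10, 2))
```

Under these assumptions a silent host should not be reported as inaccessible in less than 59 seconds in the first case, which is what makes the observed 2 to 6 seconds look wrong.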
6044.5. by CSC32::M_EVANS (skewered shitake) Thu Jul 14 1994 13:32 (6 lines)

    Richard,
    
    Can you send me that information?  I will add it to your IPMT case.
    You are confirming something I have suspected for a while.
    
    meg
6044.6. "DATA LOCATION" by VNZV01::GONZALEZ Thu Jul 14 1994 16:16 (23 lines)
    
    Hi Meg,
    
    When I used Iris I only had a floppy with me, so I saved just 45K of
    information.
    
    That information is available at VNZV01::PING.DAT.
    
    If you have problems accessing this file, let me know.
    
    I'm going to the customer now.  I can probably have more data by the
    end of the day.  I will make that available as VNZV01::PING1.DAT by
    tomorrow morning.
    
    This is IRIS data, which I think you can load with other protocol
    analyzers, but I don't know how.  Anyway, IRIS is more than enough.
    
    I hope this is useful.
    
    Thanks.
    
    Richard
    
6044.7. "'Known Problem'" by BIKINI::KRAUSE (CSC Network Management/Hubs) Fri Jul 15 1994 04:39 (17 lines)

I've already opened an IPMT for the timeout problem (MGO100555) some
time ago. Meanwhile a slightly different problem has been fixed (now
ignores redirect messages and other 'noise'). I already sent traces, but
maybe your additional traces could help engineering to find the bug.

The IP Poller works OK, which is why I suggested using it instead. You 
should be able to start the poller from a batch job to have it running 
all the time. But then you won't get the current status on the screen 
when you later start the IM. A viable solution (or should I say 
workaround?) is to run the two alarm rules once after starting the IM. 
This could be done in batch as well, e.g. have a command procedure that 
submits a job to run the rules /after=+00:02 and then starts the IM.

I wish there were a command to tell the IP Poller to send a complete 
status and this command could be coupled with IM startup. Am I dreaming?

*Robert
6044.8. "MCC, UCX, IRIS, NEW DATA" by VNZV01::GONZALEZ Fri Jul 15 1994 21:16 (39 lines)
    
    Hi fellows,
    
    As I told you, I went to the customer, and this is what I found.
    
    If I disable all the alarms and enable just a couple of them that
    monitor a device on the same LAN, MCC respects the timeout, but when
    all the alarms are enabled and running, it doesn't.
    
    I couldn't run any test on the WAN because all the hosts were OK.
    
    In the file VNZV01::PING1.DAT I have put IRIS filtered data that refers
    to just one device on the WAN (name 8230_POZ1=00-80-21-00-94-d1).
    
    BTW, I have also put there the file dwnodes.ini, where I have defined the
    names with their hardware addresses.
    
    Anyone who looks at that file with Iris will see that the alarms were
    polling every 1 minute as expected (if I remember correctly; anyway, 
    it's easy to see), and as I have two alarms checking for IP reachability,
    you will see that the pings are sent in groups of two.
    
    But there are some places where you'll see MCC (I don't have the address, but
    it is 1.110) sending up to 3 packets without waiting for the 20-second
    timeout I had set before.  Note that all the pings got their answers,
    but it seems that when the answers arrived MCC was no longer waiting for
    them, and MCC actually fired the alarm.
    
    I couldn't reproduce this again, but I would say I saw that, seconds
    after disabling all the alarms, Iris was still seeing ping packets on
    the network.
      Is it possible that we have a synchronization problem between MCC 
    and UCX? 
      Is it possible that MCC respects the timeouts but UCX is very busy
    doing other things (BTW, I increased the number of sockets beforehand)?
    
    Thank you very much for the support with this problem.
    
    Richard
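
One way to check a capture for the pattern described in .8 (retries sent long before the configured timeout has elapsed) is sketched below in Python. It is only a sketch under stated assumptions: `early_retries` is a hypothetical helper, the input is assumed to be the echo requests of a single rule to a single host already exported from the analyzer as (sent, reply) times in milliseconds, and the sample values are invented for illustration and are not taken from PING1.DAT.

```python
def early_retries(requests, timeout_ms):
    """Flag requests that were followed by another request before the timeout.

    `requests` is a list of (sent_ms, reply_ms_or_None) tuples sorted by send
    time. A follow-up request counts as an early retry when it was sent less
    than `timeout_ms` after the previous one and the previous one had not yet
    been answered at that moment.
    """
    flagged = []
    for (sent, reply), (next_sent, _) in zip(requests, requests[1:]):
        gap = next_sent - sent
        answered_first = reply is not None and reply <= next_sent
        if gap < timeout_ms and not answered_first:
            flagged.append((sent, next_sent, gap))
    return flagged


# Invented sample: three requests 15 ms apart, all answered about 250 ms later,
# followed by the next regular poll one minute after the first request.
sample = [(0, 250), (15, 260), (30, 270), (60000, 60200)]
print(early_retries(sample, timeout_ms=20000))
```

For this invented sample the helper flags the first two requests, each followed by another request only 15 ms later, and leaves the regular one-minute poll alone. Pings from the second alarm rule would have to be filtered out first, since the two rules send their pings in groups of two.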