T.R | Title | User | Personal Name | Date | Lines |
6165.1 | IP Poller Value Questions | TAEC::IRIBE | | Mon Nov 14 1994 10:19 | 34 |
| 1) Is there a way to see what polling values are currently in use?
>>>>>> There is no way to monitor the polling values.
2) When one runs the poller enable procedure and does not specify
values, the variables are written to the action file with no
values. If the poller reads an action file with no values
associated with the variables, what effect will this have on the
poller?
- None?
>>>>> You are right, there is no effect.
- Reset the polling values to default?
>>>>> No, but you can modify these values: re-enable the poller with new values.
3) The polling values we have come up with that minimize false
alarms of IP non-reachability are:
Interval - 90
Retry - 5
Timeout - 15
Are these reasonable? Are there recommended values for polling
300+ systems?
>>>> The following explains how to tune the polling period according to the
polling values:
>>>> Polling_period_maxi = Retry * Timeout * number_of_systems
>>>> In this case we need a polling period of 5 * 15 * 300 = 22,500 s, or
about 6 hours 15 minutes.
>>>> The best way is to define, for instance:
>>>> Retries = 2, Timeout = 5 s, machines = 300, giving a polling period of
2 * 5 * 300 = 3,000 s (about 50 minutes).
Ciao, JMI.
|
6165.2 | Customer HAS to poll < every 2 minutes | CUJO::BROWN | Dave Brown | Mon Nov 14 1994 14:16 | 29 |
|
Given the formula:
(Quantity of systems)*(Retries)*(Timeout) = Interval
I see no way to be able to poll each system < every 2 minutes. It
is very time-critical for the customer to know if a system goes down
and they want to know about it no later than 2 minutes after the event.
The poller values I mentioned in the base note seem to be working OK.
Question 1 - What are the implications of sticking with the values:
Interval - 90
Retry - 5
Timeout - 15
Question 2 - Is there a better way to check the IP reachability of
300+ systems every < 2 minutes? The customer bought MCC and a fleet of
VAXstation 4000 model 90s with the intent of performing IP Polling and
they are not going to like it if they are told that they cannot get
their < 2 minute resolution.
Any ideas would be greatly appreciated.
Thank You,
Dave
|
6165.3 | Help! | CUJO::BROWN | Dave Brown | Wed Nov 16 1994 11:19 | 22 |
|
DEC faces a major embarrassment over the poller issue as documented
in this note. The customer has had it with MCC and is seriously
considering going over to HP OpenView. I would implore those who read
this note to consider methods by which we can truly poll ~300 systems
every 2 minutes. Otherwise, we may have to throw in the towel.
Additionally, the customer is complaining that the MCC poller is
slow in reporting an IP unreachability. This makes sense given that
we have the polling interval set to 90 seconds and, according to my
understanding, the IP Poller will only poll 50 machines per polling
interval. So with a 90-second polling interval, every machine gets
polled every 6 * 90 seconds = 540 seconds = 9 minutes. Am I correct?
When we let the polling interval default to 30 and let the retry and
timeout go to default, the IP Poller continuously issues false IP
reachability alarms.
Any help/advice would be greatly appreciated.
Dave
|
6165.4 | Poller Performance | CUJO::BROWN | Dave Brown | Wed Nov 16 1994 13:28 | 24 |
|
The customer's ~300 polled machines are members of two domains, one a
child of the other. Given that situation, the customer has asked more
questions regarding the poller:
1) When the poller is enabled/disabled for the parent domain, is
the poller also enabled/disabled for the child domain?
2) Is it possible to have two poller processes going at the same
time; one enabled for each domain. The reason this would be
considered is to share the load and improve the MCC response
time should a node become unreachable. Currently, they are
getting notification up to 6+ minutes after a node goes down.
3) What would be the effect if multiple polling domains were set up
and the poller/pollers were individually enabled for each
domain? Will the poller poll 50 systems per domain per polling
interval or just 50 per poller process per polling interval?
Thank You,
Dave
|
6165.5 | | TAEC::IRIBE | | Fri Nov 18 1994 11:40 | 48 |
| 1) When the poller is enabled/disabled for the parent domain, is
the poller also enabled/disabled for the child domain?
>>> Yes, it is.
2) Is it possible to have two poller processes going at the same
time; one enabled for each domain. The reason this would be
considered is to share the load and improve the MCC response
time should a node become unreachable. Currently, they are
getting notification up to 6+ minutes after a node goes down.
>>>> No, it is not possible to have two pollers.
3) What would be the effect if multiple polling domains were set up
and the poller/pollers were individually enabled for each
domain? Will the poller poll 50 systems per domain per polling
interval or just 50 per poller process per polling interval?
>>>> When you talk about 50 machines polled per polling period: in fact, we
can poll ((nb_machines)/50 * timeout * retry) per polling period.
I should also mention that I have run tests (with the OSF version). Here in
Valbonne I have ~360 machines, and I set the polling period to 18 s, 1 retry,
and a 1 s timeout. Everything works; there are no incoherent IP reachability
events.
I tried to ping some machines in the US; it took ~300 ms. We can assume the
poller would take at most 2 * 300 ms for the most distant location (an ICMP
ping takes more time than an IP ping).
I don't think you will have any problem polling your 300 machines.
So you can define the following parameters:
polling period = 120 s
retry = 1
timeout = 2 s
Another idea (only if there is no other solution, because I don't think it
would be very clean): you could write a little program which polls the IP
machines (using the ping command) and, if a machine is unreachable, sends an
event to the Collection AM with mcc_evc_send.
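A rough sketch of that fallback idea follows; this is an editor's illustration, not MCC code. The ping flags are the common Linux ones (other platforms differ), and the `report` callback only marks where a real script would invoke mcc_evc_send toward the Collection AM; the actual invocation syntax is not shown here.

```python
import subprocess

def is_reachable(host, timeout_s=2):
    # One ICMP echo request; exit status 0 means the host answered.
    # -c 1 / -W are Linux ping flags; adjust for other platforms.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def poll(hosts, probe=is_reachable, report=print):
    # `report` stands in for the real event hand-off: this is where a
    # production script would call mcc_evc_send to raise an event toward
    # the Collection AM (details assumed; check the MCC documentation).
    for host in hosts:
        if not probe(host):
            report(f"{host} unreachable")
```

Passing the probe and report functions as parameters keeps the network dependency out of the core loop, so the scheduling logic can be exercised without actually pinging anything.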
JMI
|
6165.6 | How many nodes does the Poller really poll? | CUJO::BROWN | Dave Brown | Mon Nov 21 1994 13:28 | 35 |
|
Thank you for the response to my previous questions, there is still
something I don't understand. How many systems in the domain get polled per
polling interval? Do all of them get polled per polling interval or does
only a subset get polled?
The following extract from .5 suggests that only a portion get polled:
>>>> When you talk about 50 machines polled per polling period: in fact, we
>>>> can poll ((nb_machines)/50 * timeout * retry) per polling period.
If we have 300 machines, and timeout was 2 and retry was 1, the
quantity of machines polled each polling interval would be:
300/50*2*1 = 12
Does this mean that only 12 machines get polled per polling interval?!
If this is so and polling interval is 120 and our total machines are 300, we
cycle through the entire list of 300 machines every (25*120) = 3000 seconds
or 50 minutes. This would mean a worst case of a 50-minute latency between a
node becoming unreachable and us getting an MCC notification.
Is this how it works?
*- OR -*
Does the Poller poll *ALL* the machines in one polling interval,
therefore making it possible to get a < 2 minute worst-case notification
latency from *-ANY-* IP unreachability?
Thank you,
Dave
|
6165.7 | | MOLAR::MOLAR::BOSE | | Tue Nov 29 1994 11:47 | 32 |
|
Dave,
First let me tell you that even HP OpenView cannot solve
your polling problem. I worked on the IP Poller originally and now
I am working with NetView (based on HP OpenView), and I can assure
you that NetView will not report on the status of unreachable nodes
any faster.
Now, let's get the math straight. The poller has no limit
on the number of nodes it can poll. But the time taken to poll all
the nodes will vary depending on how many nodes you are trying to
poll. So, in the worst case, when all the nodes are unreachable,
the time taken to poll 300 nodes with 10 sec timeout, and 2 retries
will be
300/50 * 10 * (2+1) = 180 sec = 3 min.
The math is pretty straightforward. In the worst case all
nodes are unreachable, so there will be timeouts and retries for each
node. So determining that a node is unreachable takes 10 seconds
* (2 + 1). A retry value of two means the ping request is sent out
a total of three times. So it takes 30 seconds to know that a node
is unreachable. For 300 nodes, the time taken is 300 * 30 sec. Since the
poller sends out 50 ping requests in one shot, the actual time taken
will be 300 * 30 / 50 = 180 sec.
However if only 20% of the nodes are unreachable, the time
taken to poll 300 would be less than a minute with 10 s timeout and
2 retries.
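The worst-case arithmetic above can be written as a small sketch (an editor's illustration of the formula in this reply, with the 50-at-a-time batch size taken from the discussion):

```python
import math

def worst_case_poll_time(nodes, timeout_s, retries, batch=50):
    # Worst case: every node is unreachable, so each node waits the full
    # timeout for the initial request plus each retry, i.e.
    # timeout * (retries + 1) seconds, and nodes are pinged `batch` at a time.
    batches = math.ceil(nodes / batch)
    return batches * timeout_s * (retries + 1)

print(worst_case_poll_time(300, 10, 2))   # 6 batches * 30 s = 180 seconds
```

With the base note's values (timeout 15, retry 5) the same formula gives 6 * 15 * 6 = 540 seconds, which matches the 9-minute figure Dave computed in .3.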
Rahul.
|
6165.8 | Fine detail poller actions | CUJO::BROWN | Dave Brown | Wed Nov 30 1994 12:33 | 91 |
|
Rahul,
What we're trying to establish is how the polls operate within the
TIMEOUT*(RETRY+1) time period. I understand that systems are polled in
groups of 50 until all the systems in the domain have been polled. The
question is, what is the frequency of the successive 50 system poll
groups?
By your example in .7 -
Systems = 300
RETRY = 2
TIMEOUT = 10
POLL_INTERVAL = 200
Is this how it works?:
Cumulative
Systems Systems Action
Second Polled Polled Taken
------ ------- -------- ------
0 50 50 First 50 polled. Waits TIMEOUT*(RETRY+1)sec.
30 50 100 Second 50 polled. Waits TIMEOUT*(RETRY+1)sec.
60 50 150 Third 50 polled. Waits TIMEOUT*(RETRY+1)sec.
90 50 200 Fourth 50 polled. Waits TIMEOUT*(RETRY+1)sec.
120 50 250 Fifth 50 polled. Waits TIMEOUT*(RETRY+1)sec.
150 50 300 Final 50 polled. Waits for next POLL_INTERVAL
200 50 50 Next POLL_INTERVAL; starts over.
According to my understanding, this is why:
Total Systems
------------- * TIMEOUT * (RETRY+1) must be less than POLL_INTERVAL
50
If it is not, you will have poll overrun: the condition where polls
are still occurring from the last POLL_INTERVAL when another
POLL_INTERVAL starts.
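The overrun condition just described can be expressed as a short check (an editor's sketch of this reading of the formula, with the 50-node batch size assumed):

```python
import math

def has_poll_overrun(total_systems, timeout_s, retry, poll_interval, batch=50):
    # Worst-case time to poll every system in batches of `batch`, with each
    # unanswered ping waiting timeout * (retry + 1) seconds, compared
    # against the polling interval.
    worst_case = math.ceil(total_systems / batch) * timeout_s * (retry + 1)
    return worst_case > poll_interval

# Example values from this note: 300 systems, TIMEOUT=10, RETRY=2,
# POLL_INTERVAL=200 -> worst case 180 s, so no overrun.
```

With the base note's values (TIMEOUT=15, RETRY=5, interval 90) the worst case is 540 s, far beyond the interval, which would overrun on every cycle.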
.7 implies that my example above is correct for the worst case only. If
that IS true, what I don't understand is how it works when some
percentage, up to all, of the nodes are IP reachable.
Does the next group of 50 polls not wait for (TIMEOUT*(RETRY+1)) seconds
but kick off immediately after all of the current group of 50 have
responded?
Or within the (TIMEOUT*(RETRY+1)) time period, if 30 out of 50
immediately respond, are another 30 polls released at this time, thereby
keeping the level of unacknowledged polls at 50?
Or within the (TIMEOUT*(RETRY+1)) time period, if all 50 nodes
respond within TIMEOUT seconds, are the next 50 polls released at the
next TIMEOUT second mark? Example -
All nodes IP reachable -
TIMEOUT = 10
RETRY = 2
POLL_INTERVAL = 200
Total Systems = 300
Total
Second Systems Polled Systems Polled
------ -------------- --------------
0 50 50
10 50 100
20 50 150
30 50 200
40 50 250
50 50 300
60 0 300
. . .
. . .
200 50 50
210 50 100
. . .
. . .
Is this how it works? These questions are all brought up by my
customer, who has a very inquiring mind and who would like to explain
the actions that they have witnessed the poller take given a certain
percentage of IP reachabilities.
Any help would be appreciated.
Thanks!
Dave
|
6165.9 | | MOLAR::MOLAR::BOSE | | Fri Dec 02 1994 10:09 | 34 |
|
>> According to my understanding, this is why:
>> Total Systems
>> ------------- * TIMEOUT * (RETRY+1) must be less than POLL_INTERVAL
>> 50
>> If it is not, you will have poll overrun: the condition where polls
>> are still occurring from the last POLL_INTERVAL when another
>> POLL_INTERVAL starts.
If the time taken to poll all the nodes is greater than the polling
interval, then that time is regarded as the new polling interval. So,
if the polling interval is too small, then there might be continuous
polling of the nodes.
>> Does the next group of 50 polls not wait for (TIMEOUT*(RETRY+1)) seconds
>> but kick off immediately after all of the current group of 50 have
>> responded?
Yes. So if all your nodes respond immediately, polling all the nodes
will take next to no time.
>> Or within the (TIMEOUT*(RETRY+1)) time period, if 30 out of 50
>> immediately respond, are another 30 polls released at this time, thereby
>> keeping the level of unacknowledged polls at 50?
Before the next batch of ICMP requests are sent out, retries are
attempted on the 20 nodes which didn't respond. Only when all the
retries are exhausted do we send out the next batch of 50.
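This batch behaviour suggests a simple timing model; here is a rough sketch (an editor's simplification of the description above, not the actual poller code). Each node either answers within its round-trip time or exhausts timeout * (retries + 1) seconds, and the next batch of 50 starts only when the slowest node in the current batch has resolved:

```python
def time_to_poll(rtts, timeout_s=10.0, retries=2, batch=50):
    # `rtts` lists each node's round-trip time in seconds, or None if the
    # node never answers. A batch finishes when its slowest node resolves:
    # responders at their RTT, non-responders at timeout * (retries + 1).
    total = 0.0
    for i in range(0, len(rtts), batch):
        group = rtts[i:i + batch]
        waits = [timeout_s * (retries + 1) if rtt is None else rtt
                 for rtt in group]
        total += max(waits)
    return total
```

Under this model, 300 nodes all answering in 0.3 s take about 1.8 s in total, while 300 unreachable nodes with a 10 s timeout and 2 retries take 6 * 30 = 180 s, the worst case from .7.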
Rahul.
|
6165.10 | Things are getting weirder... | CUJO::BROWN | Dave Brown | Fri Dec 02 1994 17:06 | 27 |
| Rahul,
Thanks for the update; I'll pass it on to the customer. Meanwhile, the
customer would like me to explain to him why, when he took the polling
interval from 45 seconds to 180, they got some IP unreachabilities
followed by IP reachabilities on a continuous, cyclical basis. When they
raised RETRY from 2 to 3, the complaints about the nodes stopped.
Then they started playing with different values of POLLING_INTERVAL
and noticed that the nodes that would take the IP unreachable/IP
reachable hits were nodes that were physically adjacent in the DNS
namespace extension. Raise the value of POLL_INTERVAL a little bit and
the group of IP unreachable/IP reachable nodes would change to another
group which was physically adjacent in DNS, a bit further down the
extension. They will be providing me with a cause-and-effect matrix
which I will place here.
As you can tell, this customer is quite inquiring. They change the
poller variables, observe the unfavorable result, and then look in the
namespace to try to find out what is happening. The customer would not
be doing this if the Poller actions were stable (in their mind). They
(and I) are having quite a bit of difficulty establishing the root
reasons behind the cause and effect we are seeing from just slightly
changing the Poller variables - we're trying to find variables which
will stabilize the Poller.
Dave
|
6165.11 | | MOLAR::MOLAR::BOSE | | Tue Dec 06 1994 17:09 | 18 |
|
>> interval from 45 seconds to 180, they got some IP unreachabilities
>> followed by IP reachabilities on a continuous, cyclical basis. When they
Can't explain why that would happen.
>> When they
>> raised RETRY from 2 to 3, the complaints of the nodes stopped.
When a network is flaky, packets may be lost. Increasing the number
of retries will cause the behaviour to stabilise. There was also a
problem in the poller where the socket buffer was too small and
packets were being lost. But I fixed that problem and it should
have been available as a patch on V1.3.
Rahul.
|
6165.12 | Stopping the Poller Process | CUJO::BROWN | Dave Brown | Wed Dec 07 1994 17:31 | 26 |
|
How about this question - what is the approved method of stopping
the IP Poller process?
The reason the question is asked is that once the MCC_POLLER_ENABLE
procedure is run, the poller process comes up and remains for as long
as the system stays booted, regardless of whether MCC is shut down or
not. These folks do some interesting MCC maintenance actions, such as
propagating the MCC dictionaries, and like to make sure that all MCC
processes are off the system before doing things like this.
Historically, they have been ridding themselves of the MCC_IP_POLLER
process by doing a STOP/ID on it, but they have recently made the
correlation between doing this and many IP reachability problems once
MCC is restarted and the poller re-enabled. The only way they can clear
this problem, once the poller has been STOP/ID'd, is to reboot.
We tried making an action.dat file with only an "exit" statement in
it, hoping the poller would get the idea that we wanted it to go away;
it did not.
So is there a way to make the poller process exit gracefully?
Thanks,
Dave
|
6165.13 | .12?? | CUJO::BROWN | Dave Brown | Wed Dec 14 1994 13:00 | 11 |
|
No response to .12 would suggest it's a good question. The customer
is still waiting for an answer should anyone have one. Is there anything
we can put in the action.dat file to cause the poller to exit gracefully?
As stated in .12, doing a STOP/ID on the poller causes the next
poller process to work improperly; the only apparent fix is a
reboot.
Thanks!
Dave
|