T.R | Title | User | Personal Name | Date | Lines |
6165.1 | IP Poller Value Questions | TAEC::IRIBE | | Mon Nov 14 1994 10:19 | 34 |
| 1) Is there a way to see what polling values are currently in use?
>>>>>> There is no way to monitor the polling values.
2) When one runs the poller enable procedure and does not specify
values, the variables are written to the action file with no
values. If the poller reads an action file with no values
associated with the variables, what effect will this have on the
poller?
- None?
>>>>> You are right, there is no effect.
- Reset the polling values to default?
>>>>> No, but you can modify these values: re-enable the poller with new values.
3) The polling values we have come up with that minimize false
alarms of IP non-reachability are:
Interval - 90
Retry - 5
Timeout - 15
Are these reasonable? Are there recommended values for polling
300+ systems?
>>>> The following explains how to tune the polling period according to the
polling values:
>>>> Polling_period_maxi = Retry * Timeout * number_of_systems
>>>> In this case we need a polling period of 5 * 15 * 300 = 22,500 s, or
about 6 hours 15 minutes.
>>>> The best way is to define, for instance:
>>>> Retries = 2, Timeout = 5 s, machines = 300, giving a polling period of
2 * 5 * 300 = 3,000 s (about 50 minutes).
Ciao, JMI.
|
6165.2 | Customer HAS to poll < every 2 minutes | CUJO::BROWN | Dave Brown | Mon Nov 14 1994 14:16 | 29 |
|
Given the formula:
(Quantity of systems)*(Retries)*(Timeout) = Interval
I see no way to be able to poll each system < every 2 minutes. It
is very time-critical for the customer to know if a system goes down
and they want to know about it no later than 2 minutes after the event.
The poller values I mentioned in the base note seem to be working OK.
Question 1 - What are the implications of sticking with the values:
Interval - 90
Retry - 5
Timeout - 15
Question 2 - Is there a better way to check the IP reachability of
300+ systems every < 2 minutes? The customer bought MCC and a fleet of
VAXstation 4000 model 90s with the intent of performing IP Polling and
they are not going to like it if they are told that they cannot get
their < 2 minute resolution.
Any ideas would be greatly appreciated.
Thank You,
Dave
|
6165.3 | Help! | CUJO::BROWN | Dave Brown | Wed Nov 16 1994 11:19 | 22 |
|
DEC faces a major embarrassment over the poller issue as documented
in this note. The customer has had it with MCC and is seriously
considering going over to HP OpenView. I would implore those who read
this note to consider methods by which we can truly poll ~300 systems
every 2 minutes. Otherwise, we may have to throw in the towel.
Additionally, the customer is complaining that the MCC poller is
slow in reporting an IP unreachability. This makes sense given that
we have the polling interval set to 90 seconds and, according to my
understanding, the IP Poller will only poll 50 machines per polling
interval. So with a 90-second polling interval, every machine gets
polled every 6 * 90 seconds = 540 seconds = 9 minutes. Am I correct?
When we let the polling interval default to 30 and let the retry and
timeout go to default, the IP Poller continuously issues false IP
reachability alarms.
Any help/advice would be greatly appreciated.
Dave
|
6165.4 | Poller Performance | CUJO::BROWN | Dave Brown | Wed Nov 16 1994 13:28 | 24 |
|
The customer's ~300 polled machines are members of two domains, one a
child of the other. Given that situation, the customer has asked more
questions regarding the poller:
1) When the poller is enabled/disabled for the parent domain, is
the poller also enabled/disabled for the child domain?
2) Is it possible to have two poller processes going at the same
time; one enabled for each domain. The reason this would be
considered is to share the load and improve the MCC response
time should a node become unreachable. Currently, they are
getting notification up to 6+ minutes after a node goes down.
3) What would be the effect if multiple polling domains were set up
and the poller/pollers were individually enabled for each
domain? Will the poller poll 50 systems per domain per polling
interval or just 50 per poller process per polling interval?
Thank You,
Dave
|
6165.5 | | TAEC::IRIBE | | Fri Nov 18 1994 11:40 | 48 |
| 1) When the poller is enabled/disabled for the parent domain, is
the poller also enabled/disabled for the child domain?
>>> Yes, it is.
2) Is it possible to have two poller processes going at the same
time; one enabled for each domain. The reason this would be
considered is to share the load and improve the MCC response
time should a node become unreachable. Currently, they are
getting notification up to 6+ minutes after a node goes down.
>>>> No, it is not possible to have two pollers.
3) What would be the effect if multiple polling domains were set up
and the poller/pollers were individually enabled for each
domain? Will the poller poll 50 systems per domain per polling
interval or just 50 per poller process per polling interval?
>>>> When you talk about 50 machines polled per polling period: in fact, we
can poll ((nb_machines)/50 * timeout * retry) per polling period.
I should also mention that I have run tests (with the OSF version). Here in
Valbonne I have ~360 machines, and I set the polling period to 18 s, 1 retry,
and a 1 s timeout. Everything works; there are no incoherent IP reachability
events.
I tried to ping some machines in the US; it took ~300 ms. We can assume the
poller would take at most 2 * 300 ms for the most distant location (an ICMP
ping takes more time than an IP ping).
I don't think you will have any problem polling your 300 machines.
So you can define the following parameters:
polling period = 120 s
retry = 1
timeout = 2 s
Another idea (only if there is no other solution, because I don't think it
would be very clean): you could write a little program which polls the IP
machines (using the ping command) and, if a machine is unreachable, sends an
event to the Collection AM with mcc_evc_send.
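A rough sketch of that fallback idea follows; this is an editor's illustration, not MCC code. The ping flags are the common Linux ones (other platforms differ), and the `report` callback only marks where a real script would invoke mcc_evc_send toward the Collection AM; the actual invocation syntax is not shown here.

```python
import subprocess

def is_reachable(host, timeout_s=2):
    # One ICMP echo request; exit status 0 means the host answered.
    # -c 1 / -W are Linux ping flags; adjust for other platforms.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def poll(hosts, probe=is_reachable, report=print):
    # `report` stands in for the real event hand-off: this is where a
    # production script would call mcc_evc_send to raise an event toward
    # the Collection AM (details assumed; check the MCC documentation).
    for host in hosts:
        if not probe(host):
            report(f"{host} unreachable")
```

Passing the probe and report functions as parameters keeps the network dependency out of the core loop, so the scheduling logic can be exercised without actually pinging anything.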
JMI
|
6165.6 | How many nodes does the Poller really poll? | CUJO::BROWN | Dave Brown | Mon Nov 21 1994 13:28 | 35 |
|
Thank you for the response to my previous questions, there is still
something I don't understand. How many systems in the domain get polled per
polling interval? Do all of them get polled per polling interval or does
only a subset get polled?
The following extract from .5 suggests that only a portion get polled:
>>>> When you talk about 50 machines polled per polling period: in fact, we
>>>> can poll ((nb_machines)/50 * timeout * retry) per polling period.
If we have 300 machines, and timeout was 2 and retry was 1, the
quantity of machines polled each polling interval would be:
300/50*2*1 = 12
Does this mean that only 12 machines get polled per polling interval?!
If this is so and polling interval is 120 and our total machines are 300, we
cycle through the entire list of 300 machines every (25*120) = 3000 seconds
or 50 minutes. This would mean a worst case of a 50-minute latency between a
node becoming unreachable and us getting an MCC notification.
Is this how it works?
*- OR -*
Does the Poller poll *ALL* the machines in one polling interval,
therefore making it possible to get a < 2 minute worst-case notification
latency from *-ANY-* IP unreachability?
Thank you,
Dave
|
6165.7 | | MOLAR::MOLAR::BOSE | | Tue Nov 29 1994 11:47 | 32 |
|
Dave,
First let me tell you that even HP OpenView cannot solve
your polling problem. I worked on the IP Poller originally and now
I am working with NetView (based on HP OpenView), and I can assure
you that NetView will not report on the status of unreachable nodes
any faster.
Now, let's get the math straight. The poller has no limit
on the number of nodes it can poll. But the time taken to poll all
the nodes will vary depending on how many nodes you are trying to
poll. So, in the worst case, when all the nodes are unreachable,
the time taken to poll 300 nodes with 10 sec timeout, and 2 retries
will be
300/50 * 10 * (2+1) = 180 sec = 3 min.
The math is pretty straightforward. In the worst case all
nodes are unreachable, so there will be timeouts and retries for each
node. So determining that a node is unreachable takes 10 seconds
* (2 + 1). A retry value of two means the ping request is sent out
a total of three times. So it takes 30 seconds to know that a node
is unreachable. For 300 nodes, the time taken is 300 * 30 sec. Since the
poller sends out 50 ping requests in one shot, the actual time taken
will be 300 * 30 / 50 = 180 sec.
However if only 20% of the nodes are unreachable, the time
taken to poll 300 would be less than a minute with 10 s timeout and
2 retries.
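The worst-case arithmetic above can be written as a small sketch (an editor's illustration of the formula in this reply, with the 50-at-a-time batch size taken from the discussion):

```python
import math

def worst_case_poll_time(nodes, timeout_s, retries, batch=50):
    # Worst case: every node is unreachable, so each node waits the full
    # timeout for the initial request plus each retry, i.e.
    # timeout * (retries + 1) seconds, and nodes are pinged `batch` at a time.
    batches = math.ceil(nodes / batch)
    return batches * timeout_s * (retries + 1)

print(worst_case_poll_time(300, 10, 2))   # 6 batches * 30 s = 180 seconds
```

With the base note's values (timeout 15, retry 5) the same formula gives 6 * 15 * 6 = 540 seconds, which matches the 9-minute figure Dave computed in .3.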
Rahul.
|
6165.8 | Fine detail poller actions | CUJO::BROWN | Dave Brown | Wed Nov 30 1994 12:33 | 91 |
|
Rahul,
What we're trying to establish is how the polls operate within the
TIMEOUT*(RETRY+1) time period. I understand that systems are polled in
groups of 50 until all the systems in the domain have been polled. The
question is, what is the frequency of the successive 50 system poll
groups?
By your example in .7 -
Systems = 300
RETRY = 2
TIMEOUT = 10
POLL_INTERVAL = 200
Is this how it works?:
Cumulative
Systems Systems Action
Second Polled Polled Taken
------ ------- -------- ------
0 50 50 First 50 polled. Waits TIMEOUT*(RETRY+1)sec.
30 50 100 Second 50 polled. Waits TIMEOUT*(RETRY+1)sec.
60 50 150 Third 50 polled. Waits TIMEOUT*(RETRY+1)sec.
90 50 200 Fourth 50 polled. Waits TIMEOUT*(RETRY+1)sec.
120 50 250 Fifth 50 polled. Waits TIMEOUT*(RETRY+1)sec.
150 50 300 Final 50 polled. Waits for next POLL_INTERVAL
200 50 50 Next POLL_INTERVAL; starts over.
According to my understanding, this is why:
Total Systems
------------- * TIMEOUT * (RETRY+1) must be less than POLL_INTERVAL
50
If it is not, you will have poll overrun: the condition where polls
are still occurring from the last POLL_INTERVAL when another
POLL_INTERVAL starts.
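The overrun condition just described can be expressed as a short check (an editor's sketch of this reading of the formula, with the 50-node batch size assumed):

```python
import math

def has_poll_overrun(total_systems, timeout_s, retry, poll_interval, batch=50):
    # Worst-case time to poll every system in batches of `batch`, with each
    # unanswered ping waiting timeout * (retry + 1) seconds, compared
    # against the polling interval.
    worst_case = math.ceil(total_systems / batch) * timeout_s * (retry + 1)
    return worst_case > poll_interval

# Example values from this note: 300 systems, TIMEOUT=10, RETRY=2,
# POLL_INTERVAL=200 -> worst case 180 s, so no overrun.
```

With the base note's values (TIMEOUT=15, RETRY=5, interval 90) the worst case is 540 s, far beyond the interval, which would overrun on every cycle.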
.7 implies that my example above is correct for the worst case only. If
that IS true, what I don't understand is how it works when some
percentage, up to all, of the nodes are IP reachable.
Does the next group of 50 polls not wait for (TIMEOUT*(RETRY+1)) seconds
but kick off immediately after all of the current group of 50 have
responded?
Or within the (TIMEOUT*(RETRY+1)) time period, if 30 out of 50
immediately respond, are another 30 polls released at this time, thereby
keeping the level of unacknowledged polls at 50?
Or within the (TIMEOUT*(RETRY+1)) time period, if all 50 nodes
respond within TIMEOUT seconds, are the next 50 polls released at the
next TIMEOUT second mark? Example -
All nodes IP reachable -
TIMEOUT = 10
RETRY = 2
POLL_INTERVAL = 200
Total Systems = 300
Total
Second Systems Polled Systems Polled
------ -------------- --------------
0 50 50
10 50 100
20 50 150
30 50 200
40 50 250
50 50 300
60 0 300
. . .
. . .
200 50 50
210 50 100
. . .
. . .
Is this how it works? These questions are all brought up by my
customer, who has a very inquiring mind and who would like to explain
the actions that they have witnessed the poller take given a certain
percentage of IP reachabilities.
Any help would be appreciated.
Thanks!
Dave
|
6165.9 | | MOLAR::MOLAR::BOSE | | Fri Dec 02 1994 10:09 | 34 |
|
>> According to my understanding, this is why:
>> Total Systems
>> ------------- * TIMEOUT * (RETRY+1) must be less than POLL_INTERVAL
>> 50
>> If it is not, you will have poll overrun: the condition where polls
>> are still occurring from the last POLL_INTERVAL when another
>> POLL_INTERVAL starts.
If the time taken to poll all the nodes is greater than the polling
interval, then that time is regarded as the new polling interval. So,
if the polling interval is too small, then there might be continuous
polling of the nodes.
>> Does the next group of 50 polls not wait for (TIMEOUT*(RETRY+1)) seconds
>> but kick off immediately after all of the current group of 50 have
>> responded?
Yes. So if all your nodes respond immediately, polling all the nodes
will take next to no time.
>> Or within the (TIMEOUT*(RETRY+1)) time period, if 30 out of 50
>> immediately respond, are another 30 polls released at this time, thereby
>> keeping the level of unacknowledged polls at 50?
Before the next batch of ICMP requests are sent out, retries are
attempted on the 20 nodes which didn't respond. Only when all the
retries are exhausted do we send out the next batch of 50.
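This batch behaviour suggests a simple timing model; here is a rough sketch (an editor's simplification of the description above, not the actual poller code). Each node either answers within its round-trip time or exhausts timeout * (retries + 1) seconds, and the next batch of 50 starts only when the slowest node in the current batch has resolved:

```python
def time_to_poll(rtts, timeout_s=10.0, retries=2, batch=50):
    # `rtts` lists each node's round-trip time in seconds, or None if the
    # node never answers. A batch finishes when its slowest node resolves:
    # responders at their RTT, non-responders at timeout * (retries + 1).
    total = 0.0
    for i in range(0, len(rtts), batch):
        group = rtts[i:i + batch]
        waits = [timeout_s * (retries + 1) if rtt is None else rtt
                 for rtt in group]
        total += max(waits)
    return total
```

Under this model, 300 nodes all answering in 0.3 s take about 1.8 s in total, while 300 unreachable nodes with a 10 s timeout and 2 retries take 6 * 30 = 180 s, the worst case from .7.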
Rahul.
|
6165.10 | Things are getting weirder... | CUJO::BROWN | Dave Brown | Fri Dec 02 1994 17:06 | 27 |
| Rahul,
Thanks for the update; I'll pass it on to the customer. Meanwhile, the
customer would like me to explain to him why, when he took the polling
interval from 45 seconds to 180, they got some IP unreachabilities
followed by IP reachabilities on a continuous, cyclical basis. When they
raised RETRY from 2 to 3, the complaints about the nodes stopped.
Then they started playing with different values of POLLING_INTERVAL
and noticed that the nodes that would take the IP unreachable/IP
reachable hits were nodes that were physically adjacent in the DNS
namespace extension. Raise the value of POLL_INTERVAL a little bit and
the group of IP unreachable/IP reachable nodes would change to another
group which was physically adjacent in DNS, a bit further down the
extension. They will be providing me with a cause-and-effect matrix
which I will place here.
As you can tell, this customer is quite inquiring. They change the
poller variables, observe the unfavorable result, and then look in the
namespace to try to find out what is happening. The customer would not
be doing this if the Poller actions were stable (in their mind). They
(and I) are having quite a bit of difficulty establishing the root
reasons behind the cause and effect we are seeing from just slightly
changing the Poller variables - we're trying to find variables which
will stabilize the Poller.
Dave
|
6165.11 | | MOLAR::MOLAR::BOSE | | Tue Dec 06 1994 17:09 | 18 |
|
>> interval from 45 seconds to 180, they got some IP unreachabilities
>> followed by IP reachabilities on a continuous, cyclical basis. When they
Can't explain why that would happen.
>> When they
>> raised RETRY from 2 to 3, the complaints of the nodes stopped.
When a network is flaky, packets may be lost. Increasing the number
of retries will cause the behaviour to stabilise. There was also a
problem in the poller where the socket buffer was too small and
packets were being lost. But I fixed that problem and it should
have been available as a patch on V1.3.
Rahul.
|
6165.12 | Stopping the Poller Process | CUJO::BROWN | Dave Brown | Wed Dec 07 1994 17:31 | 26 |
|
How about this question - what is the approved method of stopping
the IP Poller process?
The reason the question is asked is that once the MCC_POLLER_ENABLE
procedure is run, the poller process comes up and remains for as long
as the system stays booted, regardless of whether MCC is shut down or
not. These folks do some interesting MCC maintenance actions, such as
propagating the MCC dictionaries, and like to make sure that all MCC
processes are off the system before doing things like this.
Historically, they have been ridding themselves of the MCC_IP_POLLER
process by doing a STOP/ID on it, but they have recently made the
correlation between doing this and many IP reachability problems once
MCC is restarted and the poller re-enabled. The only way they can clear
this problem, once the poller has been STOP/ID'd, is to reboot.
We tried making an action.dat file with only an "exit" statement in
it, hoping the poller would get the idea that we wanted it to go away;
it did not.
So is there a way to make the poller process exit gracefully?
Thanks,
Dave
|
6165.13 | .12?? | CUJO::BROWN | Dave Brown | Wed Dec 14 1994 13:00 | 11 |
|
No response to .12 would suggest it's a good question. The customer
is still waiting for an answer should anyone have one. Is there anything
we can put in the action.dat file to cause the poller to exit gracefully?
As stated in .12, doing a STOP/ID on the poller causes the next
poller process to work improperly; the only apparent fix is a
reboot.
Thanks!
Dave
|