T.R | Title | User | Personal Name | Date | Lines |
--------------------------------------------------------------------------------
681.1 | First answer -- see result of last evaluation | TOOK::ORENSTEIN | | Thu Jan 31 1991 12:56 | 15 |
| Brad,
This will answer one of your questions:
>>>The error also indicates the LAST ERROR and not the LAST POLL ERROR
>>> (please verify this thought). Therefore I can NEVER tell if this
>>> problem persists or goes away after the startup because once there
>>> is an error MCC continues to display this error forever.
The "error condition" dispays the last error encountered.
The "result of last evaluation" will be TRUE, FALSE or ERROR.
The latter will be the indicate if the problem has gone away.
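For example, you could check the rule from FCL every so often with something
along these lines (just a sketch -- the rule name is made up, and the exact
attribute spelling may differ slightly on your kit):

    ! Show the rule's status attributes, including Result of Last
    ! Evaluation and Error Condition.
    SHOW MCC 0 ALARMS RULE brad_ckt_rule ALL STATUS

If Result of Last Evaluation has gone back to TRUE or FALSE, the startup
errors have cleared; if it still shows ERROR, the problem is persisting.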
aud...
|
681.2 | thanks, i'll monitor more | JETSAM::WOODCOCK | | Thu Jan 31 1991 13:51 | 9 |
| The "error condition" dispays the last error encountered.
The "result of last evaluation" will be TRUE, FALSE or ERROR.
> Thanks for pointing out the obvious which I had overlooked.
> I will retry this procedure and have the status monitored at
> relatively close intervals and report back as to whether this
> is a startup or a continuous problem as soon as I get a chance.
> brad...
|
681.3 | DNA4_AM changes should help | TOOK::CAREY | | Fri Feb 01 1991 12:06 | 90 |
|
Brad,
You've sent this to me in mail as well, and I replied to your mail this
morning.
For Event Alarms, I expect you will see better behavior with a new AM
we'll make available to you as soon as we've looked it over. Part of
our standard validation is to check out the node instance and verify
that it isn't a cluster alias. We were doing this even for event
alarms, and if we had trouble connecting, we were almost guaranteed
to return an "internal error in DECnet Phase IV."
It doesn't make sense for us to access the specified node AT ALL for
a GETEVENT request. The Node4 need not even be up on the network when
an event request is put out against it, and it certainly isn't
important that the person collecting events be able to access the node.
So, we've modified our validation to skip this node4 verification on
GETEVENT requests. If we can figure out the DECnet address of a
specified Node4, we will allow a request for event information to be
made outstanding against it. That should take care of that set of
problems (assuming that these were event alarms).
Were the "no such entity" errors also on event alarms? I haven't seen
anything like this, and have no idea what is going on. I'd like to
see the alarms rules you were using. We'll see what we can do in this
area.
As soon as possible, I'll get you a new AM with this fix in it so that
we can check it out.
I just reviewed QAR 196 (yep, that's the one) to make sure we were
talking about the same QAR. This QAR is about a change-of rule on
circuit substate.
First, while change-of and occurs rules may be used for the same
purpose by you, they cause significantly different activity in DECmcc.
Change-of polls the node4 in question. Occurs just hangs around
waiting for event delivery.
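In rule-expression terms, the difference looks roughly like this (a sketch
from memory -- the entity names are made up, and the exact expression syntax
may differ from what's in your .com files):

    ! change-of: a polled rule.  MCC opens a link to the node4 at each
    ! interval and compares the attribute with its previous value.
    CHANGE_OF(NODE4 rtr_a CIRCUIT syn-0 SUBSTATE), AT EVERY 00:10:00

    ! occurs: an event rule.  No polling; the rule just waits for the
    ! event to arrive.
    OCCURS(NODE4 rtr_a CIRCUIT syn-0 CIRCUIT DOWN)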
The problem still stands that alarm rules that poll a DECrouter at nearly
the same time might collide with each other and cause one to fail. Where
the IFT kit was prone to return "internal errors", EFT and later kits
are a lot better about returning the connection failure as an
"insufficent resources" error. For V1.1, the only thing you can do is
separate the requests by a few seconds (say 30-60 for a loaded
network). We're considering getting smarter about this for V1.2.
There is some latency in busy DECrouters when cleaning up links that
has caused us some problems, even after the EFT DNA4_AM learned to
manage single link systems. It seems that DECrouters consider routing
their main job, and just do management when there is "spare" time
available. I can't say that I disagree with that philosophy, so we're
doing our best to be accommodating.
Recent changes to the AM that should improve your change-of
capabilities:
- Caching of node information to significantly reduce link
overhead, especially to commonly referenced systems.
- Enhancements to the single link code, especially an ugly little
error in a low level routine that could cause hundreds of
unnecessary links to be created and dropped. That little logic
error significantly increased the odds of hitting the DECrouter
latency problem, and often impacted Circuit Status requests.
Before that, the EFT kit included single-link management logic that
should significantly increase your chances of being successful with the
change-of function anyway.
I'll get you a new AM as soon as I can, and we'll see if it alleviates
these errors. I think you will find that it does.
We'll also be able to look at the no such entity error with you, to see
where it is originating. Because we manage GETEVENT differently with
this AM, we may not encounter this problem (whatever it was) with the
revised logic.
So, here's the deal:
- you get me your alarm rules.
- I'll get you the latest and greatest DNA4_AM to try out.
Thanks for your time,
-Jim Carey
|
681.4 | ready when you are | JETSAM::WOODCOCK | | Fri Feb 01 1991 15:02 | 10 |
| Hi Jim,
I'm ready for further testing whenever you are. I will send you a couple
of com files which create the alarms for your scrutiny (both change_of and
occurs). I'm glad to see attention on this one. Let me know when you have
a build available. This is on the top of the list.
thanks,
brad...
|
681.5 | One down, one deferred.... | TOOK::CAREY | | Mon Feb 04 1991 18:34 | 48 |
|
Brad,
The "occurs" problems are all cleaned up. Thanks for checking that
out for us. You can collect events from those DECrouters to your
heart's content, and I think that is the monitoring method of choice
for two reasons:
- Less CPU and network overhead watching for network trouble so
the network gets to do more of what it was made for, moving data.
- You won't have a large window between polling cycles where
you can entirely miss critical problems because DECnet is
working so hard to get around them.
For the "change-of" function, it looks like we'll have to settle for
a workaround. When enabling those alarms rules to poll every ten
minutes, start them up about thirty seconds apart. You indicated that
you'll actually be starting them up about one and a half minutes apart
to avoid race conditions as much as possible.
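From a single command file, the staggering is just a matter of sprinkling
waits between the enables, along these lines (a sketch only -- the
ENABLE_RTR_*.COM wrappers and the MCC_RULES logical are made up; they stand
in for whatever procedures you use to issue the actual ENABLE directives):

    $! Stagger the enables so a single-link DECrouter only ever sees
    $! one connect attempt at a time.
    $ @MCC_RULES:ENABLE_RTR_A.COM     ! change-of rules for the first router
    $ WAIT 00:01:30                   ! ~90 seconds of breathing room
    $ @MCC_RULES:ENABLE_RTR_B.COM
    $ WAIT 00:01:30
    $ @MCC_RULES:ENABLE_RTR_C.COM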
If there is an existing link to the DECrouter, we are good at
recognizing and reporting that. However, you are setting up a
condition wherein up to four requests are being initiated to the poor
old single link DECrouter at the same time.
It looks like the DECrouter gets pretty well overwhelmed under these
conditions, and the DECnet AM isn't very pretty handling the errors
when they are reported. Debugging race conditions like this is
challenging, and the impact on critical code is high.
For V1.1, you'll see a restriction in the release notes that you should
be careful polling single link systems (like the whole DECrouter
family), and to space out polled requests just as you must single-link any
other activity to these limited systems.
For V1.2, we'd like to enhance the cache we've already built to allow
us to perform some link-queueing so that we can synchronize multiple
threads requesting service from the same limited system, and make this
a little more transparent to you.
Thanks for jumping right into this one, and turning around your testing
so quickly for us.
-Jim Carey
This will help you get even further away from the hardware.
|
681.6 | a bit more | JETSAM::WOODCOCK | | Mon Feb 04 1991 19:15 | 36 |
| Jim,
> - Less CPU and network overhead watching for network trouble so
> the network gets to do more of what it was made for, moving data.
Definitely preferred...
> - You won't have a large window between polling cycles where
> you can entirely miss critical problems because DECnet is
> working so hard to get around them.
There are cases where this method works out well, during off hours for
example, when "we" can't actually watch events in real time anyway. The
change_of is convenient because it lets us know when a circuit went down
and when it came back at reasonable intervals. If we use occurs
we'd receive mail every time a circuit bounced. Although I'll probably also
look into having MCC look at the events only at intervals, and
see how that works out.
> It looks like the DECrouter gets pretty well overwhelmed under these
> conditions, and the DECnet AM isn't very pretty handling the errors
> when they are reported. Debugging race conditions like this is
> challenging, and the impact on critical code is high.
If I remember right there were instances where the "FIRST" rule to a router
also failed. This makes me wonder whether both were a bit overwhelmed,
router and MCC (especially with that nasty "unknown entity" error). By
sticking in the delays we not only have allowed the router to catch up, but
MCC is also able to kick back. With one procedure all is well, but I'm curious
to see what happens when I fire up 10 of these procedures at once for
across-the-network monitoring.
brad...
ps. Your new AM "WAS" noticeably quicker when using IMPM child lookups and
commands [thanks, :-)]
|
681.7 | further testing | JETSAM::WOODCOCK | | Tue Feb 05 1991 15:34 | 15 |
| Hi again,
I have set up 10 procedures similar to the one we just worked thru. The enable
commands for any particular router (multiple circuits) were delayed by 1
minute each. I submitted all 10 of these as separate jobs thru a procedure
which starts them one after another. There are probably a total of 70 or so
rules. Checking the log files after startup shows a failure on 7 alarms (10%)
including one "no such entity" and the rest are "internal decnet am errors".
It seems MCC couldn't keep up. Do you think I have reached the limitations
of my platform (3520 w32M) or MCC? Would putting small delays into the jobs
being submitted help? Seems to me MCC should be able to handle this load.
Although I'm not sure how many polls the system itself can handle.
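For reference, the master procedure that kicks them off is nothing fancy,
roughly like this (file and queue names made up):

    $! Submit the ten alarm procedures one after another.  Each
    $! ALARMS_n.COM enables its own rules with one-minute delays
    $! between routers, as described above.
    $ count = 1
    $ loop:
    $   SUBMIT/QUEUE=SYS$BATCH ALARMS_'count'.COM
    $   count = count + 1
    $   IF count .LE. 10 THEN GOTO loop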
any ideas?
brad
|
681.8 | Maybe DNS is the bottleneck? | TOOK::GUERTIN | E = mcc | Tue Feb 05 1991 20:27 | 20 |
| Brad,
Seems to me that MCC per se should be able to handle this load without
"breaking sweat". I'm guessing that the bottle neck is somewhere else.
Perhaps DNS?
Is it possible to do testing on something that doesn't use DNS so much?
For example, use a bridge name or address of a bridge which is
registered (the Bridge AM caches the information after the initial DNS
call). I'm not sure if the DNA4 AM calls DNS when you use just DNA4
addresses, but I seem to recall Jim Carey saying that it does not. So
you may want to try using DNA4 addresses as well. If MCC suddenly
"perks up", then the problem is probably overloading the DNS server.
By the way, is the DNS server on the same node?
We expect to do more thorough performance and endurance testing (and
tuning) in the near future (I hope).
-Matt.
|
681.9 | dns is local | JETSAM::WOODCOCK | | Wed Feb 06 1991 09:49 | 23 |
| Matt,
DNS is on the same system. Also, I don't have the environment to do
this testing with bridges. As far as whether this is a DNS or MCC
problem, they are one and the same from my standpoint.
I have conversed via mail with Jim recently and at this point I hope
to at least identify where the breakdown may be so I can better understand
where the limitations are. I think we may be breaking some new ground here
on performance of pollers (but not certain). We have tools now which have
in the past polled up to 350 nodes. These polls are broken down into 8
batches but I'm not sure how the timing is done at startup. Once started
they poll at an unrestricted rate asynchronously. Whereas what I have set
up in MCC is to start these polls with 10 batches, synchronously and with
a restricted rate of 10 minutes each.
Hopefully at some point, we can dedicate some time to get a better
understanding of the performance limitations and *what* is the limiting
factor. For now I'll tweak a bit more so that it works satisfactorily
until we go to an event solution. Addresses may be a good place to start.
brad...
|
681.10 | DNS hasn't been a suspect. Everything else though.... | TOOK::CAREY | | Wed Feb 06 1991 11:19 | 38 |
|
Brad and I have been trying to work the issues surrounding this problem
for awhile. The DECnet AM will only try the namespace if it can't
resolve the node name using the local database. Using the address
directly guarantees that we don't use the namespace, but I don't expect
that DNS is involved in this problem at all, even using the regular
node name.
Brad, you might ensure that that is the case by verifying that the
nodes you are using are defined in your local DECnet database. Check
for them as Remote Node entities on your local MCC node.
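A quick way to do that from DCL (the node name is made up):

    $ MCR NCP
    NCP> LIST NODE BRTR01            ! permanent database entry, if any
    NCP> SHOW NODE BRTR01 SUMMARY    ! volatile database and address
    NCP> EXIT

If NCP already knows the name, the AM should never need to go near the
namespace for it.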
I think DNS is probably not involved.
Some of our prime concerns:
- Four nearly simultaneous link requests from a single process to
DECnet. Likely, all four are doing a connect initiate at the same
time, not something that DECnet sees every day.
- Four nearly simultaneous link requests for a DECrouter. These
systems only allow a single management link, and are good at rejecting
attempts once a link is up. Do they get in trouble if bombarded with
many requests while trying to set up context?
- Four threads in the DECnet AM all doing the same thing at the same
time. We think we have reentrant code. It LOOKS reentrant.
Mitigating any or all of these possibilities for V1.1 will be difficult.
We know we can alleviate the strain by being prudent in setting up the
rules. This may be our best answer for V1.1 given the risk of major
changes at this stage of development.
-Jim Carey
|
681.11 | FYI: DECnet-VAX probably not the problem | MARVIN::COBB | Graham R. Cobb (Wide Area Comms.), REO2-G/H9, 830-3917 | Mon Feb 11 1991 05:54 | 10 |
| > - Four nearly simultaneous link requests from a single process to
> DECnet. Likely, all four are doing a connect initiate at the same
> time, not something that DECnet sees everyday.
VAX P.S.I. does this all the time. Particularly during our stress testing:
we connect and disconnect a *lot* of links (at least a couple of hundred)
from the same process very frequently. We have never seen any problems with
DECnet-VAX when doing this (lots of problems in our code, however!).
Graham
|
681.12 | | SUBURB::SMYTHI | Ian Smyth 830-3869 | Tue Feb 12 1991 08:11 | 13 |
|
> - Four nearly simultaneous link requests for a DECrouter. These
> systems only allow a single management link, and are good at rejecting
> attempts once a link is up. Do they get in trouble if bombarded with
> many requests while trying to set up context?
I don't understand this. I can get multiple NML links
up to any DECrouter 2000. I thought that the single management link
was only a restriction on the DECSA box.
Ian
|
681.13 | Thanks for your ideas, we're making some headway too.... | TOOK::CAREY | | Tue Feb 12 1991 09:52 | 50 |
|
Re: -2 -- thanks for the information about the DEMSA box. In that
case, "Sure, the DECmcc DNA4 AM supports 'em all."
Re: -1 -- my mistake. Of all of the DECrouter boxes, the DECrouter
2000 is the only one I've encountered that does support multiple links.
The others very happily reject additional requests. In my aching head,
the DECrouters all tend to get lumped together. I'll try to do better.
Specifically, the DECrouter 200, 250 and DECSA's all support only a
single management link. The DECrouter 2000 supports many.
FYI - it was testing against a DECrouter 250 that showed us the router
might need up to a second to recover its context after the management
link is dropped.
I'm also glad to hear that I should discount the possibility that we're
having a problem interacting with DECnet.
Finally, we've made some headway toward finding what we're doing wrong. We use
the phase4address for our transactions whenever possible, instead of
leaving DECnet to perform the translation. This allows us to fall back
by attempting to resolve the address from the DNS namespace if the
DECnet database doesn't know it, and gets the address into our
operations early, to maximize our ability to compare response entities
and create consistency through MCC. On VMS, we try to resolve the
address at the local database using NMLSHR to save time, and simplify
the process and user requirements of DECmcc.
Some interaction of these requests during our Local Processing is where
the damage seems to occur. We haven't successfully reproduced it
anywhere except on one 3520 dual processor machine, but on that machine
we have more trouble if the rules are polling locally than we do if we
attempt to go remote. Shutting off the NMLSHR back door gives us
consistently good results. It could be that our NMLSHR queueing isn't
completely reentrant, or it could be more directly related to the
hardware platform we're on. Anyway, we're pretty confident that we've
eliminated about 90% of the possibilities.
I'll post it here when we have some real results.
Thanks for your inputs.
-Jim Carey
|