T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
2620.1 | | EEMELI::MOSER | Orienteers do it in the bush... | Mon Jun 05 1995 03:26 | 11 |
2620.2 | LIB error right after set host command. | BWTEAL::W_MCGAW | | Mon Jun 05 1995 13:33 | 9 |
2620.3 | | EEMELI::MOSER | Orienteers do it in the bush... | Tue Jun 06 1995 02:48 | 33 |
2620.4 | test.mar (quick and dirty example prog) | EEMELI::MOSER | Orienteers do it in the bush... | Tue Jun 06 1995 02:49 | 69 |
2620.5 | I had this pb... | MOSCOW::JOUVIN | Michel Jouvin - Digital Moscow | Tue Jun 06 1995 07:37 | 9 |
2620.6 | Got the file for testing. | BWTEAL::W_MCGAW | | Tue Jun 06 1995 11:21 | 11 |
2620.7 | | TFOSS1::HEISER | maranatha! | Sat Sep 21 1996 15:52 | 5 |
2620.8 | Transport connections at max ? | COMICS::WEIR | John Weir, UK Country Support | Mon Sep 23 1996 05:42 | 20 |
2620.9 | More than one possible cause... | CSC32::D_WILDER | There's coffee in that nebula! | Mon Sep 23 1996 15:00 | 105 |
2620.10 | what's typical? | TFOS02::HEISER | Maranatha! | Tue May 06 1997 16:33 | 5 |
| What's a typical value for VMS 6.2, DECnet/OSI 6.3 ECO6 on a 6540 with
256Mb RAM? I've set a node like this to 75000 and it still ran out.
thanks,
Mike
|
2620.11 | | CANTH::WATTUM | Scott Wattum - FTAM/VT/OSAK Engineering | Tue May 06 1997 16:40 | 5 |
| You should first take a look at note 3762 and verify that you aren't
having that problem, which has a different fix.
--Scott
|
2620.12 | | TFOS01::HEISER | Maranatha! | Fri May 09 1997 17:42 | 4 |
| No, that's not it. These are SEPS97 machines, which all have the correct
values for CTLPAGES and CTLIMGLIM.
Mike
|
2620.13 | Register all your nodes ? | COMICS::WEIR | John Weir, UK Country Support | Tue May 13 1997 04:45 | 35 |
|
Mike,
> What's a typical value for VMS 6.2, DECnet/OSI 6.3 ECO6 on a 6540 with
> 256Mb RAM? I've set a node like this to 75000 and it still ran out.
Please do not assume that all problems which produce the INVARG error are the
same problem. Generally, the problem occurs when NET$ACP runs out of
VA (virtual address space), and there are a number of reasons (and/or bugs)
which may result in NET$ACP running out of VA. Over time these bugs are
fixed, and good fixes are known for some of them. For example, the CTLPAGES
problem is fixed for VAX VMS V6.1 and V6.2 by VAXSYS08_062.
Typical (ie non-buggy) usage of pagefile quota should be under 10k. If the node
is used for very large numbers of connections, where the connections are made
in bursts, or where the node is a very busy DNS Server then values over 20k
are possible, but if you get values over 25k then look for bugs. In other
words, I believe that your system is suffering from one of the bugs...
There is a problem in V6.3 ECO-6 which shows up particularly often if you
do not register all of your nodes in the namespace. The frequency of onset of
this problem is drastically reduced if you ensure:
a) that all nodes in your network are registered correctly in the
naming service
b) that you increase the "Sess Control Naming Cache Timeout" to
something which greatly exceeds the anticipated fix time for the
bug ;-)
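For point b), a minimal sketch of the NCL commands involved (the 1000-day
value is just an example and its delta-time format is an assumption -- check
NCL help on your system, and put the equivalent command in your NCL startup
scripts if you want it to survive a reboot):
    $ MCR NCL
    NCL> ! Raise the timeout well past the expected fix date
    NCL> SHOW SESSION CONTROL NAMING CACHE TIMEOUT
    NCL> SET SESSION CONTROL NAMING CACHE TIMEOUT 1000-00:00:00
    NCL> EXIT
Registering the missing nodes themselves (point a) is typically done with the
DECNET_REGISTER utility.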
Regards,
John
|
2620.14 | net$acp exhausted pgflquota (6.3 eco 6) | PRSSOS::MAGENC | | Wed May 21 1997 10:59 | 31 |
|
Hello!
John, in your previous reply, you say:
<<There is a problem in V6.3 ECO-6 which shows up particularly often
if you do not register all of your nodes in the namespace.>>
Could you please provide more info about this problem?
(IPMT case etc.?)
Here in Easynet France, this problem has been experienced twice in
2 weeks (cluster OpenVMS VAX 6.1, DNVOSI 6.3 ECO-6, directory
services DECDNS, local).
Having all the nodes registered in the namespace (DEC:) is nearly
impossible.
We checked that it's not a "CTLPAGES" problem.
When this problem occurred for the second time (20 May 97), we
changed "Session Control Naming Cache Timeout" to 1000 days,
then rebooted. It's a "production" cluster called EVTISA.
This problem occurred once on EVTV10 and once on EVTIS6, with
a pgflquota value = 75000 for NET$ACP!
Under "normal" circumstances, the pgflquota used is between 10000
and 15000; NSP and OSI TRANSPORT both have Maximum Transport Connections = 500.
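A rough sketch of how that usage can be watched, assuming a suitably
privileged account (WORLD privilege) and that NET$ACP can be addressed by
process name; SHOW PROCESS /QUOTAS reports what remains of each quota, so a
steadily shrinking "Paging file quota" figure suggests the leak is active:
    $ ! Remaining quotas of the NET$ACP process
    $ SHOW PROCESS NET$ACP /QUOTAS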
What else could be done?
Thanks in advance, and best regards, Michele.
|
2620.15 | could it be max transport connections?? | CSC32::J_RYER | MCI Mission Critical Support Team | Wed May 21 1997 12:37 | 11 |
| I just escalated a case for MCI (sorry, don't have a cfs number yet,
as CHAMP/CSC seems to be slow passing things to IPMT) on a similar
problem on a system running OSI V6.3 ECO-6. In their case, we think
the memory leak was triggered by bumping up against OSI Transport
Maximum Transport Connections (due to a bug in application code written
by the user). See note 2990.1 in this conference; John Weir
escalated the problem as IPMT case CFS.27302, but it's not evident
that a fix had been issued as of ECO-6.
Jane Ryer
MCI Mission Critical Support Team
|
2620.16 | | TFOS02::HEISER | Maranatha! | Wed May 21 1997 14:52 | 6 |
| The node that forced me to bring this issue up just exhausted a 100K
pgflquota in 2 weeks. It only took a few days to do 75K. The strange
thing is that the other node in the same production cluster is fine
with 75K (has been for the 2 months since the upgrade to ECO6).
Mike
|
2620.17 | | TFOS02::HEISER | Maranatha! | Wed May 21 1997 14:58 | 12 |
| |Typical (ie non-buggy) usage of pagefile quota should be under 10k. If the node
|is used for very large numbers of connections, where the connections are made
|in bursts, or where the node is a very busy DNS Server then values over 20k
|are possible, but if you get values over 25k then look for bugs. In other
|words, I believe that your system is suffering from one of the bugs...
John, I find this interesting. Did you know that on CCS production
clusters 50K is a "standard" value? These are usually heavily
loaded clusters (i.e., several hundred users).
later,
Mike
|
2620.18 | | TFOS02::HEISER | Maranatha! | Wed May 21 1997 15:13 | 11 |
| I just adjusted max cache timeout to 1000. I got this vague error when
trying to adjust max connections. This is with 100k pgflquota exhausted.
$ mcr ncl set osi transport maximum transport connect 250
Node 0 OSI Transport
at 1997-05-21-11:13:20.466-07:00I1.576
command failed due to:
process failure
|
2620.19 | V6.3 ECO-6 CDI bug with "lost" lookups | COMICS::WEIR | John Weir, UK Country Support | Thu May 22 1997 06:51 | 132 |
|
Hi,
There are well-known and long-standing problems if you reach the "Maximum
Transport Connections" limits for either NSP or OSI Transport. These are NOT
the issues that I referred to earlier.
Briefly, the "Maximum Transport Connections" problem is well known, and
you avoid it by either a) fixing your application so it does not beat on the
limit, or b) increase the limit. Before you increase the limit, you have to
increase "Maximum Remote NSAPs" to be at least one greater than the intended
new value for "Maximum Transport Connections". This problem is annoying, but
appears unlikely to be fixed. It's rather like beating your head on a brick
wall -- if it hurts, then don't do it!
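If you do decide to raise the limit, here is a sketch of the order of
operations (the values are examples only; depending on your configuration you
may need to disable the transport first, and the changes also belong in the
transport NCL startup script if they are to survive a reboot):
    $ MCR NCL
    NCL> ! Raise Maximum Remote NSAPs first, then the connection limit
    NCL> SET NSP MAXIMUM REMOTE NSAPS 1010
    NCL> SET NSP MAXIMUM TRANSPORT CONNECTIONS 1000
    NCL> EXIT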
The problem that I referred to exists in V6.3 ECO-6, and presumably in V7.1.
You will not see it in any later versions, because Engineering will fix it
before the next ECO and/or version (sic ;-)). You are unlikely to see it in
earlier versions or ECOs. I believe (although I am not sure whether Engineering
agree) that the underlying bug may have existed in DECnet/OSI since V6.0 SSB,
but that it did not show up until the implementation of the ECO-6 version of
the dynamic CDI cache. There was an earlier "dynamic CDI cache" kit, which
some people installed as an optional addition to their systems. I believe
that this earlier kit did not include the "CDI meltdown" fix, which was
bundled into the V6.3 ECO-6 CDI and which exposed the earlier bugs...
Do you follow me so far?
Just to summarise the terminology:
Dynamic CDI cache: The original CDI cache design was a fixed size file.
Unfortunately, the original size was too small for busy systems, so it was
increased. As every member of a cluster has the same sized cache file, this
meant that several hundred thousand blocks of system disk could be consumed
in a large cluster, even though most systems were satellites and only required
small cache files. The solution was dynamic CDI cache, which dynamically
increased the cache based on demand. This was implemented as an "early release"
kit, and in V6.3 ECO-6.
"CDI meltdown": A phrase coined by Bob Watson -- but he coins so many that he
can probably no longer remember ;-) A feature of the original CDI design
was that if several lookups for the same nodename (or backtranslation) occur
at about the same time, and if the name/backtranslation is not in the CDI
cache, then CDI will do several DNS lookups in parallel instead of optimising
and doing just one DNS lookup to satisfy all requests. The enhancement
(included in the V6.3 ECO-6 CDI) was to detect this condition. If several
CDI lookups are done for the same name/backtranslation which is not in the
cache, then the first lookup triggers a real DNS lookup, while the others
are queued up to await completion of the first lookup. You can see this
on a CDI trace under V6.3 ECO-6, where you will see the first lookup
recorded as "parent" and queued lookups recorded as "child". BTW: Just
for completeness of the description, this change is a nice optimisation in
most cases, but it actually solved a very serious problem on DNS Servers.
Specifically, all nodes from time to time lose their own CDI cache entry.
(The default is 30 days, or hardcoded at 7 days on reboot...) Whenever a
DNS Server loses its own CDI cache entry, there is a severe risk that
disaster will strike! When the DNS Server loses its CDI entry, CDI will
use the DNS Clerk to do a lookup on its name. This involves a DECnet link
back to itself (i.e. Clerk and Server are on the same node), and the incoming
connect must be backtranslated, requiring a lookup on its name, requiring
another logical link from Clerk to Server, requiring another backtranslation
lookup of its name, and so on in a loop until something runs out of resources
and fails. Maybe the DNS Server runs out of memory or some other resource.
Maybe NSP or OSI Transport runs out of "Maximum Transport Connections". Maybe
you like that last one in particular?? It links this problem with
the otherwise totally unrelated "Maximum Transport Connections" problem that
I dismissed at the start of this reply.
CDI "sticky" bit: Given the severity of problems which might occur when a
node loses its own CDI cache entry (particularly DNS Server nodes) Engineering
have enhanced the CDI design, yet again, so that the CDI cache entry for
a node's own name and that of its Cluster Alias are not timed out and
therefore are not periodically removed from the CDI cache. This enhancement
has been implemented subsequent to V6.3 ECO-6 and will appear in V6.3 ECO-7.
CDI 7-day hardcoded timeout: CDI up to and including V6.3 ECO-6 has a hardcoded
timeout of 7 days (which can only be overridden by the logical name
CDI_CACHE_TTL). When the SEARCHPATH .NCL is executed, this 7-day timeout is
overridden by the value specified in the .NCL. But, during boot there is
a 20 second period between the startup of NET$ACP and the execution of
the .NCL when the timeout is not easily controllable by the System Manager.
Subsequent to V6.3 ECO-6 Engineering have "fixed" this so that the timeout
is "infinite" during this timing window, and is then controlled by the .NCL.
Also, if you use the .NCL to set the timeout to 0, it sets it to "infinite".
(Previously, setting the timeout to 0 would set it to 7 days, again!!)
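As a rough illustration only: CDI_CACHE_TTL is an ordinary logical name, so
covering that 20-second boot window means defining it system-wide before the
network starts (e.g. from SYS$MANAGER:SYLOGICALS.COM). The value units and the
executive-mode requirement shown here are assumptions which I have not
verified:
    $ ! Hypothetical example -- value shown as 30 days expressed in seconds
    $ DEFINE /SYSTEM /EXECUTIVE_MODE CDI_CACHE_TTL 2592000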
That's the preamble completed -- who's still with me?
The long-standing bug, which has only shown up with V6.3 ECO-6, is that from
time to time backtranslation operations may get "lost" in CDI. (At least,
the current theory is that the bug is in CDI, although it might be elsewhere.)
I have only seen these problems for incoming NSP connections. The problems
may well occur for incoming OSI Transport connections, but I have just not
seen them. Also, I thought I heard that a variation of the problem may occur
for outgoing connections, although I have no idea what the symptoms might be.
For an incoming NSP connection, if the backtranslation is not in the CDI cache
then CDI has to do a DNS lookup. Sometimes, the CDI/DNS lookup just gets
"lost", and in this case the incoming NSP connection just "hangs". At this
point in time there is no timer on the incoming NSP port, so the port just
remains on the system "for-ever" and consumes one of the "NSP Maximum
Transport Connections". (Of course, if the CDI/DNS lookup completes
successfully then everything is OK. Also, if the CDI/DNS lookup fails, then
the failure status is used when continuing to process the incoming connection,
and the incoming connection appears to come from node 12345:: instead of
DEC:.XYZ.FRED:: -- ie you get a backtranslation failure, but a successful
connection.) "Losing" a CDI/DNS lookup is a rare event -- on a very busy system
it might occur once a week, and at that rate (prior to V6.3 ECO-6) it would
take 4 years without reboot to consume all of your 200 (default) "NSP Maximum
Transport Connections".
The problem is that V6.3 ECO-6 includes the "CDI meltdown" fix. (Remember,
with this fix, CDI lookups for the same name are queued until the first
lookup completes?) The problem with this fix is that if the CDI/DNS lookup
at the head of the queue (ie the "parent") gets "lost" then it does not
complete, and none of the queued "child" lookups will complete. Furthermore,
all subsequent lookups of the same name/backtranslation will find that
there is a lookup in progress (ie the outstanding "parent") and they will
also be queued with no chance of ever completing. Thus, every incoming
connection from that name/backtranslation will be queued in the same way,
and will consume NSP ports until you run out. Each outstanding connection
on the queue consumes a significant amount of NET$ACP VA, so you will run out
of both transport connections and NET$ACP VA; it is just a race to see which
happens first. The only solution is a reboot.
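A quick way to get a feel for whether ports are piling up on a suspect system
is simply to dump the transport port entities and look for large numbers of
connections stuck against the same remote address (a sketch only; the exact
attributes displayed vary between versions):
    $ MCR NCL
    NCL> SHOW NSP PORT * ALL STATUS
    NCL> SHOW OSI TRANSPORT PORT * ALL STATUS
    NCL> EXIT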
The problem of "lost" CDI/DNS connections is expected to be fixed in V6.3
ECO-7 and in V7.1 ECO-1.
Regards,
John
|
2620.20 | THANKS | PRSSOS::MAGENC | | Fri May 23 1997 11:32 | 12 |
|
Wow!!!!
What a WONDERFUL answer!
Thanks a lot for such details: John, you're a REAL GURU!
Your explanations are very clear and useful.
That's great!
Best regards, Michele.
|
2620.21 | | TFOS02::HEISER | Maranatha! | Fri May 23 1997 13:08 | 1 |
| John, do you have an estimated date yet for ECO7?
|
2620.22 | soft restart? | PHXSS1::HEISER | Maranatha! | Tue May 27 1997 17:58 | 7 |
| Is there any way to shut down the network and recreate NET$ACP without
rebooting? NET$SHUTDOWN doesn't recreate the process. This is starting
to impact business production clusters (especially since we are
approaching fiscal year end).
thanks,
Mike
|
2620.23 | Wait days, or else use IPMT | COMICS::WEIR | John Weir, UK Country Support | Wed May 28 1997 04:39 | 20 |
|
No, I do not know of any way to stop and restart NET$ACP.
I suspect that even if you did something devious to get rid of
NET$ACP you would not be able to restart it, as there is almost
certainly some initialisation of the NET$ACP/NET$DRIVER interface
which would not survive any such tampering ;-)
Engineering have produced a fix -- at this stage it survives
lab tests (and previously I could reproduce the problem in under
30 seconds) -- although none of my Customers have installed it yet.
So, it looks as though the fix will be on general distribution within
days, but you know the rules -- if you have a business-critical issue,
you use the IPMT system, not notesfiles.
Regards,
John
|
2620.24 | | PHXSS1::HEISER | Maranatha! | Wed May 28 1997 12:16 | 1 |
| Well, I've downgraded to ECO5 in the meantime.
|
2620.25 | CSC patch kits | PHXSS1::HEISER | Maranatha! | Fri May 30 1997 18:51 | 5 |
| Have patch kits VAXSHAD09_061 and VAXSYS08_062 been proven to fix the
pool expansion problem?
thanks,
Mike
|
2620.26 | ECO kits fix problems they were intended to fix | COMICS::WEIR | John Weir, UK Country Support | Mon Jun 02 1997 09:30 | 24 |
|
Mike,
> Have patch kits VAXSHAD09_061 and VAXSYS08_062 been proven to fix the
> pool expansion problem?
These kits have been proven to fix the problems that they fix -- period.
VAXSYS08_062 fixes a leak of process alloc region, which only shows up if
you set CTLPAGES higher than the SYSGEN default. If CTLPAGES is 128 or less,
then there is no way that you could suffer the problem. Thus, if CTLPAGES
is 128 or less, and you have a problem, VAXSYS08_062 will not fix that problem.
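For what it's worth, checking where you stand against that default is easy
(a sketch; SYSGEN needs a suitably privileged account to read the live
values):
    $ MCR SYSGEN
    SYSGEN> SHOW CTLPAGES
    SYSGEN> EXIT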
VAXSHAD09_061 fixes whatever NPAGEDYN leaks it is documented to fix... I
can't remember.
The DECnet/OSI V6.3 ECO-6 CDI problems are not resolved by either of these, but
will be resolved by ECO-7. Engineering have proved that they have a good fix.
I can confirm the fix is good.
Regards,
John
|