T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
5314.1 | | UTRTSC::utoras-198-48-113.uto.dec.com::JurVanDerBurg | Change mode to Panic! | Tue May 20 1997 12:42 | 9 |
| First of all, the DECnet pipeline quota has nothing to do with cluster
communications, and raising it may make things worse if there's also
heavy DECnet traffic.
I would suggest investigating the network load over the DEChub.
It's purely a network connection problem.
Jur.
|
5314.2 | Suggestions... | XDELTA::HOFFMAN | Steve, OpenVMS Engineering | Tue May 20 1997 14:12 | 19 |
|
I'd simplify the various LAN segments involved, and I'd start looking
for cabling faults. You'll need to check the latency of those LAN
widgets, as well. Do you have access to a LAN monitor?
You'll also want to reAUTOGEN all nodes with FEEDBACK, and reboot.
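(A sketch of the usual invocation, assuming the nodes have been up long
enough to collect useful feedback data:

$ @SYS$UPDATE:AUTOGEN SAVPARAMS REBOOT FEEDBACK

That runs all AUTOGEN phases from SAVPARAMS through REBOOT in FEEDBACK
mode; adjust the start and end phases to taste.)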
Are there patterns to the messages? (e.g., is node FAMV37 regularly
involved?) If so, concentrate on the patterns.
What does DECamds have to say about the configuration?
Seriously consider an upgrade from V5.5-2, as V7.1 is current. (And
we have rewritten shadowing, mount, and a number of other areas...)
(NCP and PIPELINE settings are entirely unrelated to the VMScluster
communications -- DECnet is involved only during the satellite node
download operation, and is not involved thereafter.)
|
5314.3 | More info ... | MLNCSC::CAREMISE | and then they were ...four ! | Wed May 21 1997 05:01 | 38 |
|
Thank you guys for your feedback!
Just a couple of things to clarify the situation:
Could you be more specific on how to check latency?
NETWORK LOAD: We have put a sniffer on the computer room's backbone,
and the load on Ethernet is under 25%; the error rate (CRC, short,
runt, etc.) is very low.
What is DECamds: is it an 'available on the net' tool?
Upgrading to VMS 7.1 is impossible right now: the customer's applications
are related to telephony devices that are strictly linked to VMS 5.5-2,
and we see no way out on this matter until the 3rd-party application
is rewritten for a higher VMS version...
If you suspect shadowing, mount, or other areas are involved, please
provide patch info.
PATTERNS: the only thing I noted is that 2 stations have an error rate
on PEA0 double that of the remaining stations; I will check
whether it is possible to remove them from the cluster.
Anyway, I'll suggest to the system manager to re-run AUTOGEN on all nodes
with feedback enabled, even if I'm a bit skeptical about this. I tend to
believe it's something involving the network too.
To this end I'll try to get more info on LAVC$FAILURE_ANALYSIS.
What do you think about it?
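(For reference, the setup for it seems to be roughly the following -- this
is from memory, so treat the file names and steps as assumptions and check
the VMScluster manual:

$ SET DEFAULT SYS$MANAGER
$ COPY SYS$EXAMPLES:LAVC$FAILURE_ANALYSIS.MAR []
$ ! Edit the copy so its network description matches your LAN topology
$ @SYS$EXAMPLES:LAVC$BUILD LAVC$FAILURE_ANALYSIS.MAR
$ RUN LAVC$FAILURE_ANALYSIS.EXE

Once it's running on each node, path failures should be reported via OPCOM.)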
Do you have any advice on the meaning of the errorlog entry?
Thanks again and Ciao!
Sergio (MCS Milano, Italia)
|
5314.4 | More Info | XDELTA::HOFFMAN | Steve, OpenVMS Engineering | Wed May 21 1997 10:48 | 60 |
| : Could you be more specific on how to check latency?
Confirm the path. Confirm the number of devices. Confirm that
the devices are suited for this application. More than a few
customers will tell you that their network configuration is `X',
and when you actually look, you find `Y'. Also confirm that the
configuration of the network is valid -- more than a few sites
have seen a misused "T" connector or an unterminated LAN segment,
as specific examples.
Also see what DTS/DTR show for throughput on the link -- these
are DECnet tools, but these can load up a network nicely.
I'd expect that specific round-trip measurements would require
a LAN monitor.
: NETWORK LOAD: We have put a sniffer on the computer room's backbone,
: and the load on Ethernet is under 25%; the error rate (CRC, short,
: runt, etc.) is very low.
What counters are increasing in DECnet, if any? (Zero the counters,
run the DTS/DTR tools to generate a load, and see what happens to the
DECnet line and circuit counters.)
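(Something along these lines -- a sketch; SVA-0 is just a placeholder,
substitute your actual line and circuit names:

$ MCR NCP
NCP> ZERO LINE SVA-0 COUNTERS
NCP> ZERO CIRCUIT SVA-0 COUNTERS
NCP> EXIT
$ ! ... run DTS/DTR, or wait for a busy period ...
$ MCR NCP SHOW LINE SVA-0 COUNTERS
$ MCR NCP SHOW CIRCUIT SVA-0 COUNTERS

Watch in particular for "system buffer unavailable" and "user buffer
unavailable" counts.)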
: What is DECamds: is it an 'available on the net' tool?
It's part of OpenVMS. A very valuable part for managing a network
of nodes, or a VMScluster, too.
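(If I remember right, the data-collector side is started with something
like

$ @SYS$STARTUP:AMDS$STARTUP START

but treat that file name as an assumption and check the DECamds
installation guide for your version.)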
: Upgrading to VMS 7.1 is impossible right now: the customer's applications
: are related to telephony devices that are strictly linked to VMS 5.5-2,
: and we see no way out on this matter until the 3rd-party application
: is rewritten for a higher VMS version...
I will assume the customer has a "prior version support" contract.
: If you suspect shadowing, mount, or other areas are involved, please
: provide patch info.
Check http://www.service.digital.com -- I'd look, but the link is
down right now.
: PATTERNS: the only thing I noted is that 2 stations have an error rate
: on PEA0 double that of the remaining stations; I will check
: whether it is possible to remove them from the cluster.
Make sure there are not overlapping cluster groups or a bad cluster
password involved here -- and what are the other errors that are in
the error log? (The entry listed in .0 is rather nondescript...)
: Anyway, I'll suggest to the system manager to re-run AUTOGEN on all nodes
: with feedback enabled, even if I'm a bit skeptical about this. I tend to
: believe it's something involving the network too.
: To this end I'll try to get more info on LAVC$FAILURE_ANALYSIS.
: What do you think about it?
When one node has out-of-whack SYSGEN parameters, the whole VMScluster
can encounter problems when that node gets "backed up". (This is why
I asked you to check for any common patterns in the error messages.)
|
5314.5 | TIMVCFAIL is a step forward | MLNCSC::CAREMISE | and then they were ...four ! | Mon May 26 1997 12:24 | 18 |
| The problem on the cluster with VAX 6000s and satellites has been
solved by downsizing the TIMVCFAIL parameter from 1600 to 800.
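(For the record, the change is made along these lines -- a sketch;
TIMVCFAIL is in hundredths of a second, so 800 means 8 seconds:

$ MCR SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SET TIMVCFAIL 800
SYSGEN> WRITE CURRENT
SYSGEN> EXIT

plus a TIMVCFAIL = 800 line in SYS$SYSTEM:MODPARAMS.DAT so the next
AUTOGEN run doesn't undo it; the new value takes effect at the next
reboot.)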
Unfortunately, the same had no effect on the other cluster (pure NI).
We still have a lot of "system buffer unavailable" errors on the DECnet
lines, and also some receive errors (frame too long). We tried to enlarge
the lines' receive buffers (from 10 to 20) with no appreciable results,
except for 3 stations that now run OK.
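(The buffer change was done roughly like this -- a sketch; SVA-0 is a
placeholder for the actual line name, and the SET may require the line
to be turned off and on again to take effect:

$ MCR NCP
NCP> DEFINE LINE SVA-0 RECEIVE BUFFERS 20
NCP> SET LINE SVA-0 RECEIVE BUFFERS 20
NCP> EXIT

DEFINE updates the permanent database, SET the running volatile one.)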
Can it be a problem of NPAGEDYN being oversized? (We noted on many stations
a value over 8 million, with an effective usage around 1 million.)
The other pool parameters look OK.
Any comments?
|
5314.6 | | UTRTSC::jgoras-197-2-3.jgo.dec.com::JurVanDerBurg | Change mode to Panic! | Tue May 27 1997 02:08 | 17 |
| Lowering TIMVCFAIL is not a solution but a workaround for your network problems.
A lot of "system buffer unavailable" errors means that the network is so busy
that the system has a hard time keeping up, and drops packets.
> Can it be a problem of NPAGEDYN being oversized? (We noted on many stations
> a value over 8 million, with an effective usage around 1 million.)
That means that there has been a peak usage of NPP, which may be attributed
to network broadcast storms.
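(A quick way to see how far nonpaged pool has expanded beyond its initial
allocation:

$ SHOW MEMORY/POOL/FULL

Compare the current size against the initial size and the free bytes.)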
I would seriously take a good look at the network load and check if you
can do something about that, like adding bridges etc. Or check for other
bad things. A network trace can do wonders.
Jur.
|
5314.7 | LAVc Troubleshooting, Key patches?? | STAR::BOAEN | LANclusters/VMScluster Tech. Office | Tue May 27 1997 16:01 | 28 |
| VERIFY YOU HAVE THE PEDRIVER PATCHES:
Before doing anything else, make certain that the nodes have the following TIMA kit:
PEDRIVER V5.5-2 VOID/TIMA kit: VAXLAVC03_U2055;
or CSC patch kit CSCPAT_1081
It's been out for several years, but if you don't have it installed, it's the first thing to do.
We significantly improved PEdriver's ability to deal with network congestion & delay
variations somewhere around V6.0. This kit back-ports those changes to V5.5-2.
READ THE MANUAL:
The "Troubleshooting the NISCA Protocol" appendix to the
V6.1 & higher versions of the VMScluster Systems manual shows how
to use SDA to get & interpret counters & delay information from
PEdriver's port, VC, & Channel data structures. This should help
identify why PEdriver is closing VCs. I suspect that a channel is
getting listen timeouts because packets are being lost due to network
congestion or (less likely) faulty network HW.
00x = UNRECOGNIZED OPCODE
The errorlog analyzer doesn't understand that some errorlog entries don't
have a message buffer attached. It always assumes that the message buffer fields
are there. In this case there isn't any message & these fields are all 0s.
The opcode value of 00x is undefined for PEdriver, so this part of the
errorlog report is misleading...
'Gards, Verell
|
5314.8 | update | MLNCSC::CAREMISE | and then they were ...four ! | Wed May 28 1997 05:50 | 28 |
| Probably I wasn't clear in my .0.
It cannot be a problem of a loose connection, because the cabling is TP,
which means a direct connection to the repeater. No thinwire involved.
All ports are switched (which means 'bridged', in my opinion...) and every
repeater is switched again onto the backbone through a 3COM Port Switch.
The boot node is directly connected to the other DECswitch (which IS a
bridge), so everything is ALREADY bridged.
Traffic percentages on the two networks are below 25%, so there's no
Ethernet congestion.
The VAXLAVC patch has already been installed (read my .0).
The only thing I can agree with is the possibility of a lot of
broadcast storms, and now I will investigate this.
It seems that these storms come from SUN stations.
Can someone tell me something about how to deal with storms, especially
on non-DEC systems?
And how can storms be the cause of these disconnections?
Thanks again. Sergio.
|
5314.9 | | UTRTSC::jgoras-197-2-3.jgo.dec.com::JurVanDerBurg | Change mode to Panic! | Wed May 28 1997 07:29 | 14 |
| >Can someone tell me something about how to deal with storms, especially
>on non-DEC systems?
Start measuring with a sniffer, and if non-DEC systems are causing
storms, contact the system managers for those systems and let them
find out what's wrong.
>And how can storms be the cause of these disconnections?
Heavy broadcast storms can cause severe packet loss, and if that happens
frequently enough, SCS will time out.
Jur.
|
5314.10 | Look for Listen Timeouts | STAR::BOAEN | LANclusters/VMScluster Tech. Office | Thu May 29 1997 12:16 | 37 |
| To determine if connections are being lost due to NISCA multicast packets
being lost, use SDA to examine the PEdriver channels between the
two nodes. Do the following on each node from a privileged account:
$ ANALYZE/SYSTEM
SDA> SHOW PORT
SDA> SHOW PORT/CH/VC=VC_nodename
This will get you the PEdriver internal counters.
Look at the channel errors section of each channel to see if
listen timeouts are occurring:
SDA>
VMScluster data structures
--------------------------
-- Active Channel (CH:812F00C0) for Virtual Circuit (VC:8126ABC0) ZAPNOT --
State: 0004 open Status: 0B path,open,rmt_hwa_valid
BUS: 8123D100 (FXA) Lcl Device: FX_DEMFA Lcl LAN Address: 08-00-2B-3B-15-85
Rmt Name: FXA Rmt Device: FX_DEMFA Rmt LAN Address: 08-00-2B-29-E1-
Rmt Seq #: 0001 Open:21-MAY-1997 07:33:44.70 Closed:21-MAY-1997 07:31:05.77
------- Transmit ------ ------- Receive ------- ----- Channel Selection ----
Lcl CH Seq # 0008 Msg Rcv 3161273 Average Xmt Time 00314521
Msg Xmt 19 Mcast Msgs 3161263 Remote Buffer Size 4382
Ctrl Msgs 14 Mcast Bytes 309803774 Max Buffer Size 4382
Ctrl Bytes 1372 Ctrl Msgs 10 Best Channel 8
Bytes Xmt 1822 Ctrl Bytes 980 Preferred Channel 5
Rmt Ring Size 31 Bytes Rcv 309804754 Retransmit Penalty 2
--------------- Channel Errors --------------- Xmt Error Penalty 0
Handshake TMO 0 Short CC Msgs 0 ------- Channel Timer ------
Listen TMO 7 Incompat Chan 0 Timer Entry Flink 81204D40
Bad Authorize 0 No MSCP Srvr 0 Blink 8124F540
Bad ECO 0 Disk Not Srvd 0 Last Ring Index 10
Bad Multicast 0 Old TR Msgs 0 Protocol 1.4.0
Topology Change 0 Supported Services 00000000
|