
Conference lassie::ucx

Title:DEC TCP/IP Services for OpenVMS
Notice:Note 2-SSB Kits, 3-FT Kits, 4-Patch Info, 7-QAR System
Moderator:ucxaxp.ucx.lkg.dec.com::TIBBERT
Created:Thu Nov 17 1994
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:5568
Total number of notes:21492

5524.0. "SYSTEM-F-CONNECFAIL and PATHWORKS" by TLE::MICHAUD (Lisa Michaud, DTN 381-0879) Mon May 19 1997 16:10

I'm attempting to solve a problem that started a few months ago, before I
arrived here.  As far as I know, nothing on the system changed.  I have since
upgraded UCX to V4.1 ECO4 (OpenVMS/VAX is version 7.0), but the problem still 
exists.  I also changed the send/receive TCP quotas to 250000, because that was
a suggestion given in one of the Notes files I've been searching.  Other notes
seem to suggest that it's either a UCX resource problem or a physical network
problem.

People are randomly losing connections to the PATHWORKS file server when doing
builds on their PCs.  It's completely random as to when it happens, and it's
happening from multiple PCs running different versions of NT and W95.  They
all use TCP/IP to connect to the file server.

Here is what's in the PATHWORKS file server log file:

15-May-97 11:28:21 (netio) Device bg, unit 51 (bg51): Network read AST returned
error status 8412 (%SYSTEM-F-CONNECFAIL, connect to network object timed-out or
failed)

And sometimes the errors are more extensive:

ssn_rqst_trans: IO$_WRITEVBLK channel 480 error 8428(%SYSTEM-F-LINKDISCON, network partner disconnected logical link)
shutdown_socket: IO$M_SHUTDOWN channel 480 error 20(%SYSTEM-F-BADPARAM, bad parameter value)
shutdown_socket: IO$_DEACCESS channel 480 error 8412(%SYSTEM-F-CONNECFAIL, connect to network object timed-out or failed)
ssn_rqst_trans: IO$_WRITEVBLK channel 576 error 8428(%SYSTEM-F-LINKDISCON, network partner disconnected logical link)
shutdown_socket: IO$M_SHUTDOWN channel 576 error 20(%SYSTEM-F-BADPARAM, bad parameter value)
shutdown_socket: IO$_DEACCESS channel 576 error 8412(%SYSTEM-F-CONNECFAIL, connect to network object timed-out or failed)


Is there a way to tell if this is some sort of resource problem, or a physical
network problem?  It's possible that someone changed something on the system
before I arrived, either with a SYSGEN or UCX parameter.  Here's some UCX info
from the system:


Communication Parameters

Local host:      murtl                  Domain:   zko.dec.com

Cluster timer:             5
                                 Maximum     Current        Peak
Interfaces                            20           2           2
Device_sockets                       300          21          21
Routes                             65535          13          13
Services                             200           0           1
Proxies                               58

Type:        Ethernet   Free     Maximum   Max Bytes     Minimum   Min Bytes
Large buffers             20         200      377600          10       18880
Small buffers            150        1000      256000          50       12800
IRPs                      20         200
Non UCX buffers           10

Remote Terminal
  Large buffers:          10
  UCBs:                    4
  Virtual term:     disabled


                                   MBUF Summary
                      Small_static  Large_static  Small_dynamic  Large_dynamic
 Total buffers                  50            10             50              0
 Free                            1             8             33              0
 Busy
  Data                           0             2              0              0
  Header                         5             0              6              0
  Socket                        16             0              5              0
  Prot. control                 11             0              6              0
  Route                         13             0              0              0
  Socket name                    0             0              0              0
  Socket options                 0             0              0              0
  Fragment reassembly            0             0              0              0
  IP address                     2             0              0              0
 Size of cluster             13056         19136          13120              0

                Free       Current          Peak          Waits          Drops
 Small Buffers                  66            67              0              0
 Large Buffers                   2            10              0              0
 IRPs              3             0             3              0              0

                    Small clusters  Large clusters    Non UCX buffers
 Free                            0             0              0


 TCP
   Connect initiated:             0        Connect accepted:              12
   Connect established:          12        Connect closed:                 1
   Connect dropped:               0        Embry connect drop:             0
   Attempt rtt:               87422        Succeeded rtt:              84412
   XMT Delayed ACKs:          35238        Connect timeout:                0
   ReXMT timeout:              3159        Persist timeout:                0
   Keepalive timeout:             0        Keepalive probes:               0
   Keepalive drops:               0        Total XMT segments:        127744
   XMT segments:              87409        XMT bytes:               12872705
   XMT packet reXMT:           3159        XMT bytes reXMT:           483374
   XMT ACK only:              37175        XMT window probes:              0
   XMT URG only:                  0        XMT wind update pack:           0
   XMT CTRL segments:             1        Total RCV segments:         95246
   RCV segments:              87424        RCV bytes:                5821967
   RCV chksum error:             26        RCV bad offset:                 0
   RCV too short:                 0        RCV dup only pack:           1924
   RCV dup only bytes:       123939        RCV part dup pack:              0
   RCV part dup bytes:            0        RCV bad order pack:             0
   RCV bad order bytes:           0        RCV pack after wind:            0
   RCV bytes after wind:          0        RCV pack after close:           0
   RCV window probes:             0        RCV dup ACKs:                   1
   RCV ACK for unXMT:             0        RCV ACK segments:           87422
   RCV ACK bytes:          12872718        RCV wind update pack:           0


TCP
  MTU size segment:      disabled
  Delay ACK:              enabled
  Loopback:              disabled
  Window scale:           enabled
  Drop timer:                 600
  Probe timer:                 75

                          Receive                Send

  Checksum:               enabled             enabled
  Push:                  disabled            disabled
  Quota:                   250000              250000



If a TCPIPTRACE to one of the PCs would help, how should I set it up?  It 
would have to run overnight during a build, and I would think the output would 
be quite huge.  Is there some way I can limit it to only show the necessary 
things?

Any hints would be appreciated...

Lisa

5524.1. "oops, misleading..." by TLE::MICHAUD (Lisa Michaud, DTN 381-0879) Mon May 19 1997 16:33

    The "UCX SHOW PROTOCOL TCP" output in .0 is misleading: it shows
    "Connect dropped" as 0 only because I had recently restarted UCX.  The
    "Connect dropped" count normally matches the number of SYSTEM-F-CONNECFAIL
    messages in the PATHWORKS server log.  Here's another TCP snapshot,
    taken after some errors had occurred:
    
    Connect initiated:             0        Connect accepted:             20
    Connect established:          20        Connect closed:                8
    Connect dropped:               7        Embry connect drop:            0
    Attempt rtt:              143492        Succeeded rtt:            138621
    XMT Delayed ACKs:          62935        Connect timeout:               0
    ReXMT timeout:              5131        Persist timeout:               0
    Keepalive timeout:             2        Keepalive probes:              2
    Keepalive drops:               0        Total XMT segments:       214573
    XMT segments:             143440        XMT bytes:              21150670
    XMT packet reXMT:           5127        XMT bytes reXMT:          822845
    XMT ACK only:              65994        XMT window probes:             0
    XMT URG only:                  0        XMT wind update pack:          0
    XMT CTRL segments:            12        Total RCV segments:       156509
    RCV segments:             142533        RCV bytes:               9478894
    RCV chksum error:             42        RCV bad offset:                0
    RCV too short:                 0        RCV dup only pack:          3041
    RCV dup only bytes:       195834        RCV part dup pack:             0
    RCV part dup bytes:            0        RCV bad order pack:            0
    RCV bad order bytes:           0        RCV pack after wind:           0
    RCV bytes after wind:          0        RCV pack after close:          0
    RCV window probes:             0        RCV dup ACKs:                  7
    RCV ACK for unXMT:             0        RCV ACK segments:         143447
    RCV ACK bytes:          21173736        RCV wind update pack:          3
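
The observation that "Connect dropped" tracks the number of SYSTEM-F-CONNECFAIL messages can be checked mechanically. This is a minimal sketch, assuming the counter display and the server log have been saved as text; `parse_counters` and `count_connecfail` are hypothetical helper names, and the sample strings are short excerpts from this thread rather than a full log.

```python
import re

# Matches "Name: 123" pairs in the two-column counter display.
FIELD = re.compile(r"([A-Za-z][A-Za-z. ]*?):\s*(\d+)")


def parse_counters(text):
    """Turn 'Name:   value' pairs into a {name: int} dictionary."""
    return {name.strip(): int(value) for name, value in FIELD.findall(text)}


def count_connecfail(log_text):
    """Count SYSTEM-F-CONNECFAIL messages in a saved server log."""
    return log_text.count("%SYSTEM-F-CONNECFAIL")


# Excerpt of the TCP snapshot above.
snapshot = """\
    Connect established:          20        Connect closed:                8
    Connect dropped:               7        Embry connect drop:            0
"""

# Excerpt of the PATHWORKS file server log quoted in .0.
log_excerpt = """\
15-May-97 11:28:21 (netio) Device bg, unit 51 (bg51): Network read AST returned
error status 8412 (%SYSTEM-F-CONNECFAIL, connect to network object timed-out or
failed)
"""

dropped = parse_counters(snapshot)["Connect dropped"]
failures = count_connecfail(log_excerpt)
print(f"Connect dropped: {dropped}, CONNECFAIL messages in excerpt: {failures}")
```

Run against the complete log for the period since UCX was last restarted, the two numbers would be expected to agree if the observation holds.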
    
5524.2. "fragmented disk" by TLE::MICHAUD (Lisa Michaud) Wed May 28 1997 09:52

    FYI, for anyone who might be encountering the same types of errors:
    
    The problem *in this case* turned out to be the disk that they were
    accessing.  It was very badly fragmented.  A complete image backup and
    restore fixed the problem.  The system must have had a hard time keeping
    the connection alive while trying to piece together the fragmented files
    (the worst file had *26,500* extents!).  Also, the system that they
    were connecting to wasn't the system where the disk actually resides, so
    that probably added to the problems.
    
    The problem probably didn't start as "suddenly" as they thought; it
    just became unbearable a couple of months ago, when the disk became so
    fragmented that it was unusable.
    
    Lisa