
Conference lassie::ucx

Title:DEC TCP/IP Services for OpenVMS
Notice:Note 2-SSB Kits, 3-FT Kits, 4-Patch Info, 7-QAR System
Moderator:ucxaxp.ucx.lkg.dec.com::TIBBERT
Created:Thu Nov 17 1994
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:5568
Total number of notes:21492

5259.0. "Real cause of SS$_UNREACHABLE from $QIO?" by OSEC::BURKEP (Pete, OSEC, UK SI) Fri Feb 21 1997 06:14

Hello,

I am working on a very high-availability system; one that cannot go down. It
has a number of mechanisms to ensure the availability. One is to voluntarily
shut down processes that hit unusual errors, assuming that the process is
internally screwed. Any missing process is then automatically restarted. If it
disappears too many times in a five-minute period then it is no longer
re-started; this is to avoid continuous re-starts causing more problems than
the disappearing process.
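
The restart policy described above can be sketched as a sliding-window counter. This is only an illustration in Python, not the system's actual code: the class name `RestartBudget`, the default limit of 3 restarts and the caller-supplied clock value are assumptions made for the example (the text only says "too many times in a five-minute period").

```python
from collections import deque


class RestartBudget:
    """Allow at most max_restarts restarts within any window_seconds span."""

    def __init__(self, max_restarts=3, window_seconds=300.0):
        # Both defaults are assumed values for illustration.
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._stamps = deque()  # times of restarts still inside the window

    def allow_restart(self, now):
        """Return True and record the restart if the budget permits it."""
        # Forget restarts that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False
        self._stamps.append(now)
        return True
```

Passing the time in as an argument (for example the value of `time.monotonic()`) keeps the policy easy to test without waiting five minutes.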

A critical process, using UCX $QIO IO$_WRITEVBLKs, returned SS$_UNREACHABLE
because of transient external network errors. This caused the critical
process to shut itself down too many times, which led to a loss of service.

This was politically very bad for us and the customer is insisting we improve
our error handling so that this won't happen in the future. My problem is that
I can't see how I can tell, from just a $QIO return status, whether the problem
is an external network error (when the process shouldn't shut down) or another
internal (programming?) error (when shutting down the process is correct).
Examining the 'System Services and C Socket Programming Manual' shows my
problem. The particular condition value returned from the $QIO IO$_WRITEVBLK
was:

SS$_UNREACHABLE         Programming error: Either the
                        network address is invalid or the
                        network is unreachable.
                        Hardware error: The data link
                        adapter detected an error and
                        shut itself off. The UCX software
                        is waiting for the adapter to come
                        back on line.

but this may not be the only problematic one. The problem is that I can't tell
whether the problem was 'network address is invalid' which is likely to be an
internal programming error, or 'network is unreachable' which is likely to be
an external network error. It would also be nice to know if the problem is a
hardware error. 

So, I have a couple of questions:

1)  Can I find out any more about what the error is in this situation? Can I
distinguish between the different causes of SS$_UNREACHABLE?

2)  Is the list of Condition Values Returned in the manual complete? I may have
to make some arbitrary decisions on each of the condition values and say this
is 99% likely to be an internal error, or external error. So what I don't want
are any other condition values, other than those documented, creeping out of
the woodwork.
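
One way to record those per-status decisions is a lookup table with an explicit default for anything that is not in it. The following is a rough sketch in Python under stated assumptions: the names `STATUS_CLASS`, `classify` and `should_shut_down` are hypothetical, and the assignments in the table are illustrative judgment calls, not a list of what IO$_WRITEVBLK actually returns (that has to come from the manual).

```python
EXTERNAL = "external"  # likely a network problem: do not shut down
INTERNAL = "internal"  # likely a programming error: shut down and restart
UNKNOWN = "unknown"    # not in the table: fall back to a default policy

# Illustrative assignments only; each entry is an arbitrary decision.
STATUS_CLASS = {
    "SS$_UNREACHABLE": EXTERNAL,
    "SS$_ACCVIO": INTERNAL,
    "SS$_BADPARAM": INTERNAL,
}


def classify(status_name):
    """Return the class recorded for a status name, or UNKNOWN."""
    return STATUS_CLASS.get(status_name, UNKNOWN)


def should_shut_down(status_name, unknown_is_fatal=True):
    """Decide whether the process should shut itself down."""
    kind = classify(status_name)
    if kind == UNKNOWN:
        # Undocumented statuses get one explicit, configurable default.
        return unknown_is_fatal
    return kind == INTERNAL
```

With a table like this, an undocumented status cannot slip through unnoticed: it always lands in the `UNKNOWN` branch, where it can be logged and handled by a single policy.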

I know this may seem a little extreme, but this is politically very bad for us.

Thanks,
Pete.

5259.1. "Gee, if anyone can solve this problem, they should get a Nobel prize by 2001" by twick.nio.dec.com::PETTENGILL (mulp) Mon Feb 24 1997 22:17

The problem you have is that someone read the error message documentation
and assumed that it literally meant what it says.  The preface should be
"If the network is working ok, then"

The second problem is that this is a high availability network based on IP.
Bad move.  You really have to know what you are doing to build a highly
available application on top of an existing protocol stack.  DECnet-PLUS
is built on OSI and we got lots of availability options included in OSI
so if you are smart and lucky, you might be able to get DECnet-PLUS to
do the work for you, but it won't be easy.  VMSclusters would do ok as
well, but that wouldn't be cheap and you really don't want to do realtime
in a cluster of any sort.

Basically, every network protocol launches a packet into the ether and
waits for a reply.  If there is no reply, then it tries again.  After a
while it reports IE.NFW, or equivalent.  (IE.NFW is a classic RSX error:
path lost to partner.)
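
The send, wait, retry and give up cycle can be sketched in a few lines of Python. This is a generic illustration, not how UCX or any particular stack implements it; `send_once` is a hypothetical callable that returns True when a reply arrived, and the limit of 4 attempts is an assumed value.

```python
def send_with_retries(send_once, max_attempts=4):
    """Call send_once() until it reports a reply or attempts run out.

    Returns the number of the attempt that succeeded. Raises
    ConnectionError if every attempt went unanswered.
    """
    for attempt in range(1, max_attempts + 1):
        if send_once():
            return attempt
    # The caller gets one generic failure, whatever the real cause was.
    raise ConnectionError("no reply after %d attempts" % max_attempts)
```

Note that the error raised at the end carries no information about why the replies never came, which is the situation the sender is always in.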

The reasons that the packet didn't get to the destination are
	a hardware error
	the intervening network is broken
	the destination isn't working
	the destination isn't working because it was never intended to exist

It would be nice if a node that didn't exist would send back an error message
saying, "I don't exist", but no one has quite figured out how to implement it.

Likewise, it would be nice if a node that wasn't up would send back a message
saying "I can't talk with you right now because I'm not ready to send messages"
but no one has figured that one out either.

Now, it is possible for an intervening router to send back a message that
says "I don't know how to get to the network"; the ICMP message exists and
it is possible for the router to figure out that it doesn't have a route,
but this depends on what the routing protocols are, how the router is
configured, whether there are other network problems, and whether whoever
implemented the network stack thought that it was worth giving
this error back to the user.  The reason that the router can't find a route
may be that the destination never was intended to exist in the first place,
or there is a hardware problem, or another router is down.

In the end, the end system isn't in a position to diagnose network problems.

The solution is to take your basic network management code whether it is
the HP version of the HP code, the IBM version of the HP code, or the DEC
version of the HP code, or the Cabletron or Sun code, and have it monitor
the network and when something looks screwy, it pages the network manager.

The solution for availability is to configure multiple paths to the destination
including multiple interfaces on the end systems.  However, neither TCP nor IP
can make use of these multiple paths.

What the network management code does is poll the network devices and if
it can't talk to something or some box shows increasing counts of dropped
packets, etc, then it calls for a human to find and fix the problem.  The
network management station might not report anything better than SS$_UNREACHABLE
but at least it will be placing that pager call.
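
The polling idea can be sketched as follows, in Python. This is only a sketch under stated assumptions and not any vendor's management code: the function `check_devices`, the shape of `samples` (device name mapped to a dropped-packet count, or None when the device did not answer) and the `page` callback are all hypothetical.

```python
def check_devices(samples, previous_drops, page):
    """Compare one polling round against the previous one.

    samples:        device name -> dropped-packet count, or None if the
                    device did not answer the poll
    previous_drops: device name -> count seen in the previous round
    page:           callable taking a message; stands in for the pager

    Returns the counts to pass in as previous_drops next round.
    Devices that did not answer are left out of the returned counts.
    """
    current = {}
    for name, drops in samples.items():
        if drops is None:
            page(name + ": no response to poll")
            continue
        last = previous_drops.get(name)
        if last is not None and drops > last:
            page("%s: dropped packets rose from %d to %d" % (name, last, drops))
        current[name] = drops
    return current
```

A caller would run this on a timer, feeding each round's return value into the next, with `page` wired to whatever actually alerts the network manager.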

5259.2. "Thanks" by OSEC::BURKEP (Pete, OSEC, UK SI) Thu Feb 27 1997 05:24

Mike,

thanks for your comments. As I said in the base note, this is more of a
political problem than a technical one. With your additional comments I can
hopefully go back and slug-this-one-out-again; I'm hoping for a points win in
the end. 

Pete.