Conference lassie::ucx

Title:	DEC TCP/IP Services for OpenVMS
Notice:	Note 2-SSB Kits, 3-FT Kits, 4-Patch Info, 7-QAR System
Moderator:	ucxaxp.ucx.lkg.dec.com::TIBBERT

Created:	Thu Nov 17 1994
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	5568
Total number of notes:	21492

5259.0. "Real cause of SS$_UNREACHABLE from $QIO?" by OSEC::BURKEP (Pete, OSEC, UK SI) Fri Feb 21 1997 06:14

Hello,

I am working on a very high-availability system; one that cannot go down. It
has a number of mechanisms to ensure the availability. One is to voluntarily
shut down processes that hit unusual errors, assuming that the process is
internally screwed. Any missing process is then automatically restarted. If it
disappears too many times in a five-minute period then it is no longer
re-started; this is to avoid continuous re-starts causing more problems than
the disappearing process.
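
For illustration only (this is not the poster's actual code), the restart
policy described above can be sketched in Python. The class name, the method
name and the threshold of three restarts are assumptions; the note says only
"too many times in a five-minute period".

```python
from collections import deque


class RestartLimiter:
    """Allow restarts until too many exits fall inside a sliding time window."""

    def __init__(self, max_restarts=3, window_seconds=300.0):
        # max_restarts is a hypothetical value; the five-minute window
        # (300 seconds) is the only figure given in the note.
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._exits = deque()  # timestamps (seconds) of recent process exits

    def should_restart(self, now):
        """Record an exit at time `now` and say whether a restart is allowed."""
        # Forget exits that have aged out of the window.
        while self._exits and now - self._exits[0] >= self.window_seconds:
            self._exits.popleft()
        self._exits.append(now)
        return len(self._exits) <= self.max_restarts
```

With these assumed settings, three exits within five minutes are each followed
by a restart and the fourth is not, which is the behaviour that led to the loss
of service described below.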

A critical process, which uses UCX $QIO IO$_WRITEVBLK calls, received
SS$_UNREACHABLE because of transient external network errors. As a result the
process shut itself down too many times, was no longer restarted, and service
was lost.

This was politically very bad for us and the customer is insisting we improve
our error handling so that this won't happen in the future. My problem is that
I can't see how I can tell, from just a $QIO return status, whether the problem
is an external network error (when the process shouldn't shut down) or another
internal (programming?) error (when shutting down the process is correct).
Examining the 'System Services and C Socket Programming Manual' shows my
problem. The particular condition value returned from the $QIO IO$_WRITEVBLK
was:

SS$_UNREACHABLE         Programming error: Either the
                        network address is invalid or the
                        network is unreachable.
                        Hardware error: The data link
                        adapter detected an error and
                        shut itself off. The UCX software
                        is waiting for the adapter to come
                        back on line.

but this may not be the only problematic one. The problem is that I can't tell
whether the problem was 'network address is invalid' which is likely to be an
internal programming error, or 'network is unreachable' which is likely to be
an external network error. It would also be nice to know if the problem is a
hardware error. 

So, I have a couple of questions:

1)  Can I find out any more about what the error is in this situation? Can I
distinguish between the different causes of SS$_UNREACHABLE?

2)  Is the list of Condition Values Returned in the manual complete? I may have
to make some arbitrary decisions on each of the condition values and say this
is 99% likely to be an internal error, or external error. So what I don't want
are any other condition values, other than those documented, creeping out of
the woodwork.
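
The "arbitrary decision per condition value" idea in question 2 could look
something like the Python sketch below. It is a hedged illustration, not a
statement of what UCX returns: only SS$_UNREACHABLE comes from this note. The
other names are standard OpenVMS condition names, but whether IO$_WRITEVBLK
returns them, and which side of the line each belongs on, are assumptions.
Names are kept as strings so that no numeric status values are invented.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"  # success, carry on
    RETRY = "retry"        # probably external or transient: keep the process up
    SHUTDOWN = "shutdown"  # probably internal: exit and let the supervisor restart


# Hypothetical policy table. Every entry except SS$_UNREACHABLE is an
# assumption made for illustration and would need checking against the manual.
POLICY = {
    "SS$_NORMAL": Action.CONTINUE,
    "SS$_UNREACHABLE": Action.RETRY,
    "SS$_TIMEOUT": Action.RETRY,
    "SS$_LINKDISCON": Action.RETRY,
    "SS$_ACCVIO": Action.SHUTDOWN,
    "SS$_BADPARAM": Action.SHUTDOWN,
}


def classify(status_name, default=Action.SHUTDOWN):
    """Map a condition name to an action; undocumented values get the default."""
    return POLICY.get(status_name, default)
```

The explicit default is the point of the sketch: a condition value that is
not in the table still gets a deliberate, documented treatment rather than
whatever the code happens to do.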

I know this may seem a little extreme, but this is politically very bad for us.

Thanks,
Pete.

T.R	Title	User	Personal Name	Date	Lines
5259.1	Gee, if anyone can solve this problem, they should get a Nobel prize by 2001	twick.nio.dec.com::PETTENGILL	mulp	Mon Feb 24 1997 22:17	57
The problem you have is that someone read the error message documentation and
assumed that it literally meant what it says. The preface should be "If the
network is working OK, then ...".

The second problem is that this is a high-availability network based on IP.
Bad move. You really have to know what you are doing to build a highly
available application on top of an existing protocol stack. DECnet-PLUS is
built on OSI and we got lots of availability options included in OSI, so if
you are smart and lucky you might be able to get DECnet-PLUS to do the work
for you, but it won't be easy. VMSclusters would do OK as well, but that
wouldn't be cheap, and you really don't want to do realtime in a cluster of
any sort.

Basically, every network protocol launches a packet into the ether and waits
for a reply. If there is no reply, then it tries again. After a while it
reports IE.NFW, or equivalent. (IE.NFW is a classic RSX error: path lost to
partner.)

The reasons that the packet didn't get to the destination are:

  - a hardware error
  - the intervening network is broken
  - the destination isn't working
  - the destination isn't working because it was never intended to exist

It would be nice if a node that didn't exist would send back an error message
saying "I don't exist", but no one has quite figured out how to implement it.
Likewise, it would be nice if a node that wasn't up would send back a message
saying "I can't talk with you right now because I'm not ready to send
messages", but no one has figured that one out either.

Now, it is possible for an intervening router to send back a message that
says "I don't know how to get to the network"; the ICMP message exists and it
is possible for the router to figure out that it doesn't have a route. But
this depends on what the routing protocols are, how the router is configured,
whether there are other network problems, and whether whoever implemented the
network stack thought that it was worth giving this error back to the user.

The reason that the router can't find a route is that the destination never
was intended to exist in the first place, or there is a hardware problem, or
another router is down.

In the end, the end system isn't in a position to diagnose network problems.
The solution is to take your basic network management code, whether it is the
HP version of the HP code, the IBM version of the HP code, the DEC version of
the HP code, or the Cabletron or Sun code, and have it monitor the network;
when something looks screwy, it pages the network manager.

The solution for availability is to configure multiple paths to the
destination, including multiple interfaces on the end systems. However,
neither TCP nor IP can make use of these multiple paths.

What the network management code does is poll the network devices, and if it
can't talk to something or some box shows increasing counts of dropped
packets, etc., then it calls for a human to find and fix the problem. The
network management station might not report anything better than
SS$_UNREACHABLE, but at least it will be placing that pager call.
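
As a rough illustration of the polling described in the reply above, the
comparison step might be sketched in Python as follows. This is not the code
of any real network management product: the function name, the device names
and the data layout (device name mapped to a dropped-packet counter, or to
None when the device did not answer) are all assumptions. How the counters
are collected and how the page is sent are left out.

```python
def check_devices(previous, current):
    """Compare two polls and return the alerts a human should see.

    Each poll maps a device name to its dropped-packet counter, or to None
    when the device did not answer the poll. (Hypothetical data layout.)
    """
    alerts = []
    for name in sorted(current):
        dropped = current[name]
        if dropped is None:
            # "It can't talk to something."
            alerts.append(f"{name}: no response to poll")
            continue
        before = previous.get(name)
        if before is not None and dropped > before:
            # "Some box shows increasing counts of dropped packets."
            alerts.append(
                f"{name}: dropped packets rose from {before} to {dropped}"
            )
    return alerts
```

A caller would run this after each polling cycle and page the network manager
whenever the returned list is not empty.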
5259.2	Thanks	OSEC::BURKEP	Pete, OSEC, UK SI	Thu Feb 27 1997 05:24	8
Mike,

thanks for your comments. As I said in the base note, this is more of a
political problem than a technical one. With your additional comments I can
hopefully go back and slug-this-one-out-again; I'm hoping for a points win in
the end.

Pete.