Title: | DEC TCP/IP Services for OpenVMS |
Notice: | Note 2-SSB Kits, 3-FT Kits, 4-Patch Info, 7-QAR System |
Moderator: | ucxaxp.ucx.lkg.dec.com::TIBBERT |
Created: | Thu Nov 17 1994 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 5568 |
Total number of notes: | 21492 |
Hello,

I am working on a very high-availability system, one that cannot go down. It has a number of mechanisms to ensure availability. One is to voluntarily shut down any process that hits an unusual error, on the assumption that the process is internally screwed. Any missing process is then automatically restarted. If a process disappears too many times in a five-minute period, it is no longer restarted; this avoids continuous restarts causing more problems than the disappearing process.

A critical process, using UCX $QIO IO$_WRITEVBLKs, got back SS$_UNREACHABLE because of transient external network errors. As a result the critical process shut itself down too many times, and we lost service. This was politically very bad for us, and the customer is insisting we improve our error handling so that this won't happen in the future.

My problem is that I can't see how I can tell, from just a $QIO return status, whether the problem is an external network error (when the process shouldn't shut down) or an internal (programming?) error (when shutting down the process is correct).

Examining the 'System Services and C Socket Programming Manual' shows my problem. The particular condition value returned from the $QIO IO$_WRITEVBLK was:

SS$_UNREACHABLE
Programming error: Either the network address is invalid or the network is unreachable.
Hardware error: The data link adapter detected an error and shut itself off. The UCX software is waiting for the adapter to come back on line.

but this may not be the only problematic one. The trouble is that I can't tell whether the cause was 'network address is invalid', which is likely to be an internal programming error, or 'network is unreachable', which is likely to be an external network error. It would also be nice to know if the problem is a hardware error.

So, I have a couple of questions:

1) Can I find out any more about what the error is in this situation? Can I distinguish between the different causes of SS$_UNREACHABLE?

2) Is the list of Condition Values Returned in the manual complete? I may have to make some arbitrary decisions on each of the condition values and say this one is 99% likely to be an internal error, or an external error. So what I don't want are any other condition values, other than those documented, creeping out of the woodwork.

I know this may seem a little extreme, but this is politically very bad for us.

Thanks, Pete.
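The policy described above (a per-status judgment of "internal" versus "external", plus a cap on restarts within five minutes) can be sketched as follows. This is only an illustrative Python sketch, not VMS code: the POSIX `errno` values stand in for the UCX condition values, and the names `classify`, `RestartLimiter`, and `handle_write_error` are hypothetical. Treating an ambiguous "unreachable" status as external is an assumed site policy, not something the network stack guarantees.

```python
import errno
from collections import deque

# Assumption: POSIX errno values stand in for condition values such as
# SS$_UNREACHABLE. Which statuses count as "external" is a policy choice.
EXTERNAL_TRANSIENT = {
    errno.ENETUNREACH,
    errno.EHOSTUNREACH,
    errno.ENETDOWN,
    errno.ETIMEDOUT,
}


def classify(err):
    """Return 'external' for statuses we choose to blame on the network."""
    return "external" if err in EXTERNAL_TRANSIENT else "internal"


class RestartLimiter:
    """Allow at most max_restarts restarts within a sliding time window."""

    def __init__(self, max_restarts=3, window_seconds=300.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._times = deque()

    def allow_restart(self, now):
        # Forget restarts that have aged out of the window.
        while self._times and now - self._times[0] >= self.window_seconds:
            self._times.popleft()
        if len(self._times) >= self.max_restarts:
            return False
        self._times.append(now)
        return True


def handle_write_error(err, limiter, now):
    """Decide what the process should do after a failed write."""
    if classify(err) == "external":
        return "retry"  # keep the process alive; do not count a restart
    return "restart" if limiter.allow_restart(now) else "give_up"
```

With this split, a burst of network-unreachable statuses never consumes the restart budget, so only errors judged internal can cause the process to stop being restarted.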
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
5259.1 | Gee, if anyone can solve this problem, they should get a Nobel Prize by 2001 | twick.nio.dec.com::PETTENGILL | mulp | Mon Feb 24 1997 22:17 | 57 |
The problem you have is that someone read the error message documentation and assumed that it literally meant what it says. The preface should be "If the network is working OK, then ..."

The second problem is that this is a high availability network based on IP. Bad move. You really have to know what you are doing to build a highly available application on top of an existing protocol stack. DECnet-PLUS is built on OSI and we got lots of availability options included in OSI, so if you are smart and lucky, you might be able to get DECnet-PLUS to do the work for you, but it won't be easy. VMSclusters would do OK as well, but that wouldn't be cheap, and you really don't want to do realtime in a cluster of any sort.

Basically, every network protocol launches a packet into the ether and waits for a reply. If there is no reply, then it tries again. After a while it reports IE.NFW, or equivalent. (IE.NFW is a classic RSX error: path lost to partner.) The reasons that the packet didn't get to the destination are: a hardware error; the intervening network is broken; the destination isn't working; or the destination isn't working because it was never intended to exist.

It would be nice if a node that didn't exist would send back an error message saying "I don't exist", but no one has quite figured out how to implement it. Likewise, it would be nice if a node that wasn't up would send back a message saying "I can't talk with you right now because I'm not ready to send messages", but no one has figured that one out either.

Now, it is possible for an intervening router to send back a message that says "I don't know how to get to the network". The ICMP message exists, and it is possible for the router to figure out that it doesn't have a route, but this depends on what the routing protocols are, how the router is configured, whether there are other network problems, and whether whoever implemented the network stack thought that it was worth giving this error back to the user. And the reason that the router can't find a route may be that the destination was never intended to exist in the first place, or there is a hardware problem, or another router is down.

In the end, the end system isn't in a position to diagnose network problems. The solution is to take your basic network management code, whether it is the HP version of the HP code, the IBM version of the HP code, the DEC version of the HP code, or the Cabletron or Sun code, and have it monitor the network; when something looks screwy, it pages the network manager.

The solution for availability is to configure multiple paths to the destination, including multiple interfaces on the end systems. However, neither TCP nor IP can make use of these multiple paths. What the network management code does is poll the network devices, and if it can't talk to something, or some box shows increasing counts of dropped packets, etc., then it calls for a human to find and fix the problem. The network management station might not report anything better than SS$_UNREACHABLE, but at least it will be placing that pager call. | |||||
5259.2 | Thanks | OSEC::BURKEP | Pete, OSEC, UK SI | Thu Feb 27 1997 05:24 | 8 |
Mike, thanks for your comments. As I said in the base note, this is more of a political problem than a technical one. With your additional comments I can hopefully go back and slug-this-one-out-again; I'm hoping for a points win in the end. Pete. |
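The monitoring approach described in reply 5259.1 (poll the network devices, and call a human when a device stops answering or its dropped-packet count climbs) might look roughly like the sketch below. It is a minimal Python illustration under stated assumptions: `poll_once`, `read_drop_count`, and `page` are hypothetical names, and how the counters are actually read (SNMP or otherwise) is left out entirely.

```python
def poll_once(devices, read_drop_count, last_counts, page):
    """Poll each device once and page a human when something looks wrong.

    read_drop_count(name) is a hypothetical callback that returns the
    device's dropped-packet counter, or None if the device did not answer.
    page(message) is a hypothetical callback that raises the alarm.
    last_counts maps device name to the counter seen on the previous poll.
    """
    for name in devices:
        count = read_drop_count(name)
        if count is None:
            page(f"{name}: no response")
            continue
        previous = last_counts.get(name)
        if previous is not None and count > previous:
            page(f"{name}: dropped packets rose from {previous} to {count}")
        last_counts[name] = count
```

A caller would run `poll_once` on a timer, keeping `last_counts` between calls. The application itself still only sees the unreachable status; the point of the sketch is that diagnosis and the pager call happen in a separate monitor.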