Re: .3  If you could find this information, I would be very grateful.
I have another query along the same lines, which is way over my head.
Again, if anyone has any ideas, I'd appreciate them.
Thanks,
Dave.
-------------------------------------------------------------------------
I am having some problems with a programmed LAT connection.
Essentially I am trying to introduce some error recovery into my code
in the event of terminal server failures.
The basic code establishes a connection to a terminal server port which
is connected to a watchdog timer. The timer sits between two systems,
via their respective terminal servers, and controls which system should
be live and which the standby. To do this it regularly sends messages
to each system and expects a response. As well as the watchdog monitoring
the systems, the systems need to know how to react if the watchdog
itself fails.
In terms of programming the connections, the LAT ports are configured
through LATCP commands, and the code connects to the port using
IO$M_LT_CONNECT as described in the I/O User's Reference Manual. After
a successful connection, messages can be exchanged between watchdog and
system with no problems.
So that the system recognises the terminal server connection breaking,
a CTRL/Y HANGUP AST is set. This AST is set up before the connection is
established (to ensure there is no window in which a hangup could be
missed), and whenever it fires it is re-enabled in the AST routine itself.
When a HANGUP is detected, the HANGUP AST sets an event flag, which the
mainline code acts on by initiating a connection using IO$M_LT_CONNECT.
The routine which does this specifies a completion AST so that the
connection is handled asynchronously. This AST allows for the possibility
of a timeout, in which case it posts another connection attempt.
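Purely as an illustration, the recovery design just described can be modelled as a small state machine. This is a portable C sketch, not VMS code: the status values, state names and functions (`on_hangup`, `post_connect`, `on_connect_done`, `mainline_poll`) are hypothetical stand-ins, and the real QIO and AST calls appear only as comments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for SS$_NORMAL, SS$_TIMEOUT and "any other
   failure"; these are NOT the real VMS condition values. */
typedef enum { ST_NORMAL, ST_TIMEOUT, ST_FAILED } status_t;

typedef enum { LINK_UP, LINK_DOWN, CONNECTING } link_state_t;

typedef struct {
    link_state_t state;
    bool hangup_flag;     /* models the event flag set by the HANGUP AST */
    int connect_attempts; /* connect requests "posted" so far */
} watchdog_link_t;

/* Models the CTRL/Y HANGUP AST: flag the event for the mainline.
   (The real AST would also re-enable itself at this point.) */
void on_hangup(watchdog_link_t *l) {
    l->hangup_flag = true;
}

/* Models posting an asynchronous IO$M_LT_CONNECT request. */
void post_connect(watchdog_link_t *l) {
    l->state = CONNECTING;
    l->connect_attempts++;
}

/* Models the connect completion AST: retry on timeout. */
void on_connect_done(watchdog_link_t *l, status_t st) {
    assert(l->state == CONNECTING); /* only delivered while a connect is outstanding */
    if (st == ST_NORMAL)
        l->state = LINK_UP;
    else if (st == ST_TIMEOUT)
        post_connect(l);            /* server not back yet: try again */
    else
        l->state = LINK_DOWN;
}

/* Models the mainline noticing the event flag and reconnecting. */
void mainline_poll(watchdog_link_t *l) {
    if (l->hangup_flag) {
        l->hangup_flag = false;
        post_connect(l);
    }
}
```

Note that nothing in this model stops a second connect from being posted if a HANGUP AST arrives while one is already outstanding, which is the situation observations 3 and 4 below give rise to.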
Irrespective of the status of the LAT connection, the mainline code
continues to post QIO reads to the watchdog port at regular intervals.
Their completion AST reports SS$_TIMEOUT whilst the terminal server
is down.
If I force the terminal server to reboot whilst everything is working,
I see the following behaviour, which the code must cope with:
1) The HANGUP AST can take some time to fire. Typically, since
the messages are exchanged regularly, it is the QIO read which fails
first, with a timeout reported in its AST, not the HANGUP AST.
2) Very occasionally the normal read AST can complete with SS$_HANGUP
rather than SS$_TIMEOUT.
3) Connection attempts that fail with a timeout, because the terminal
server has yet to restart, also trigger the HANGUP AST (even though no
connection was established).
4) When a connection attempt is ultimately successful, in some instances
a spurious HANGUP AST fires immediately afterwards.
I know this to be spurious because the code responds by attempting
reconnection (again), which fails immediately with SS$_DEVACTIVE reported
in the connect completion AST.
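One way the observations above might be absorbed is to funnel every "link lost" report (a HANGUP AST, or a read completing with a timeout or hangup status) through a single routine that does nothing if a connect is already outstanding. This is again a portable C model with hypothetical names (`link_event`, `link_lost`, `start_connect`), not VMS code. It rests on one loudly stated assumption taken from observation 4: that SS$_DEVACTIVE on a connect means the port is in fact still connected, so it is treated as success. That is an interpretation, not documented behaviour.

```c
#include <assert.h>

/* Hypothetical stand-ins for SS$_NORMAL, SS$_TIMEOUT, SS$_HANGUP,
   SS$_DEVACTIVE and "any other failure"; NOT the real VMS values. */
typedef enum { ST_NORMAL, ST_TIMEOUT, ST_HANGUP, ST_DEVACTIVE, ST_FAILED } status_t;

/* The three kinds of completion the code has to react to. */
typedef enum { EV_HANGUP_AST, EV_READ_DONE, EV_CONNECT_DONE } event_t;

typedef enum { LINK_UP, LINK_DOWN, CONNECTING } link_state_t;

typedef struct {
    link_state_t state;
    int connects_posted;  /* connect requests "issued" so far */
    int losses_coalesced; /* loss reports absorbed by an outstanding connect */
} link_t;

/* Stands in for posting IO$M_LT_CONNECT with a completion AST. */
static void start_connect(link_t *l) {
    l->state = CONNECTING;
    l->connects_posted++;
}

/* Every "link lost" report ends up here, whichever completion saw it first. */
static void link_lost(link_t *l) {
    if (l->state == CONNECTING) {
        l->losses_coalesced++; /* obs. 3: don't stack another connect */
        return;
    }
    start_connect(l);
}

void link_event(link_t *l, event_t ev, status_t st) {
    switch (ev) {
    case EV_HANGUP_AST:           /* may fire late (obs. 1) or spuriously (obs. 3, 4) */
        link_lost(l);
        break;
    case EV_READ_DONE:            /* obs. 1 and 2: a read may report either status */
        if (st == ST_TIMEOUT || st == ST_HANGUP)
            link_lost(l);
        break;
    case EV_CONNECT_DONE:
        assert(l->state == CONNECTING);
        if (st == ST_NORMAL || st == ST_DEVACTIVE)
            l->state = LINK_UP;   /* ASSUMPTION: DEVACTIVE means already connected */
        else if (st == ST_TIMEOUT)
            start_connect(l);     /* server not back yet: retry */
        else
            l->state = LINK_DOWN; /* the next read timeout will trigger a retry */
        break;
    }
}
```

Walking the reboot scenario through it (the read times out first, a late HANGUP follows, two connect timeouts each bring their own HANGUP, then a spurious HANGUP after success) ends with the link up after three connect requests, rather than with overlapping connects.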
My attempts to make the code cope with failures have had to deal with
the observations above, which together make for rather non-deterministic
behaviour. Are there any known problems or quirks with IO$M_LT_CONNECT
and/or CTRL/Y HANGUP ASTs in the context of LAT connections?