Conference ucrow::desktop_acms

Title:DECtp Desktop for ACMS
Moderator:UCROW::GIBSON
Created:Mon Sep 24 1990
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:859
Total number of notes:3034

853.0. "problems with failover" by CSC32::J_HENSON (Don't get even, get ahead!) Fri May 09 1997 13:18

acmsdi v2.2 (link date on di server 19-oct-1995), ovms v6.1, vax,
distributed

Healthnet is reporting a problem with failover by the DI server.  Their
configuration is as follows.

 - 3 FE vaxes running  acmsdi.  Also have regular acms users logged
   in.
 - clients are running on PCs using Visual Basic.  The customer contact
   didn't know which network transport is in use, but will find out and
   let me know.
 - there is a BE/application node that is used by all three DI
   front end Vaxes, as well as the normal acms users.
 - there is a backup BE node, and they have a system logical defined
   similar to APPLICATION = NODE1::APPLICATION,NODE2::APPLICATION.
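As a rough illustration of what the customer expects from that logical, the sketch below models a search list as an ordered preference: use the first node that is reachable, and return to the primary once it is back. This is only a toy model in Python, not ACMS or DI server code; the function names and the `nodes_up` set are hypothetical, and the real selection logic inside ACMS may differ.

```python
# Toy model of search-list failover. NOT ACMS code: all names here are
# hypothetical and exist only to illustrate the expected ordering behavior.

def parse_search_list(value):
    """Split 'NODE1::APP,NODE2::APP' into ordered (node, application) pairs."""
    entries = []
    for item in value.split(","):
        node, _, app = item.strip().partition("::")
        entries.append((node, app))
    return entries


def select_target(entries, nodes_up):
    """Return the first entry whose node is reachable, or None if none are."""
    for node, app in entries:
        if node in nodes_up:
            return node, app
    return None


search_list = parse_search_list("NODE1::APPLICATION,NODE2::APPLICATION")

# Normal operation: both nodes up, so the primary is chosen.
normal = select_target(search_list, {"NODE1", "NODE2"})

# Primary back end crashes: the backup should be chosen (failover).
after_crash = select_target(search_list, {"NODE2"})

# Primary restarted and the list re-evaluated: back to the primary.
after_restart = select_target(search_list, {"NODE1", "NODE2"})
```

In this model, a server that never re-runs the selection step after a crash or restart would stay stuck on a dead target, which is roughly the symptom reported for the affected DI server.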

Recently, the backend application node crashed.  When this occurred,
all regular acms users and 2 of the 3 DI servers performed failover
as expected.  One of the DI servers began looping (I think).  When
I pressed for more detail, I was told that show proc revealed that
no I/Os were being done, and that the process was in CUR state.  This
doesn't seem right to me, so the customer may be a bit confused on this.

When the main applications node was restarted, they executed acms/reprocess
application.  Regular acms users were revectored to the primary
application node, and 2 of the 3 di servers did the same.  The 3rd
di server remained in a loop, and would not allow new user logins,
nor would it do any work for users currently logged in.  They had
to stop and restart this server in order to continue processing.

This happened a second time, and all 3 di servers exhibited the
same behavior as the one problem server from the previous backend
crash.  They had to stop and restart all 3 di servers in order to
get any work done.

That is all I know about this problem.  The customer will be sending
swlup information, and I will post it when I get it, or at least provide
a pointer to it.

Any of this sound familiar to anyone?  Is there any other information
I should be getting?

Thanks,

Jerry

853.1. "more info" by CSC32::J_HENSON (Don't get even, get ahead!) Fri May 09 1997 14:29

I have some additional information, but am not sure if it will help.

I did get the swlup log of the events that occurred while this
was happening.  I can make that available, but don't think it
will help.  The only event logged by the acmsdi server was an
invalid login attempt with an invalid password.  According to
the customer, this was logged AFTER the di server was stopped and
restarted.

While this was happening, the clients were logging error -3020, which
indicates that the di server has died.  However, the customer assures me
that the server was running the entire time, but in a CUR state
and not performing any i/o.  So, I'm confused (nothing unusual).

I have asked them to enable client logging on both the PCs and
the vax running the di server, but don't know what else to do.

Also, they're using tcp/ip (don't know whose) as their communications
layer.

Jerry

853.2. by UCROW::GIBSON Fri May 16 1997 10:40

    Hello Jerry,
    
    - Can you find out the version of ACMS that is running on all the
      systems?
      How many nodes are involved - 5?
    - Is the TCP/IP stack UCX or Multinet, and which version?
    - Are they running DECnet/OSI over TCP/IP, or are they using straight
      DECnet between the ACMS F/E and B/E machines? 
    - In what way did the original application node fail (so we might be
      able to try something here)?
    - It sounds like all the F/E nodes are running CP agents and that they
      all fail over and back OK; only the Acmsdi agents fail to do so - right?
    
    /Tom