
Conference iosg::all-in-1_v30

Title:*OLD* ALL-IN-1 (tm) Support Conference
Notice:Closed - See Note 4331.1 to move to IOSG::ALL-IN-1
Moderator:IOSG::PYE
Created:Thu Jan 30 1992
Last Modified:Tue Jan 23 1996
Last Successful Update:Fri Jun 06 1997
Number of topics:4343
Total number of notes:18308

2563.0. "FCS and %MCC-E-FATAL_FW..." by VNABRW::EHRLICH_K (With the Power & the Glory) Wed Apr 14 1993 10:45

Hi,

	I've done an upgrade from ALL-IN-1 V2.4 German with SFCP to ALL-IN-1 
V3.0-1 German, the weekend before Easter, together with a colleague at a customer
site having approx. 300 users.
It's a two-node cluster running VMS V5.5-2 and a lot of SW.

Node BM01                                  Node BM02

VAX 4300                                   VAX 4300 
MR,MRG,MRX                                 MR,MRG,MRX
ALL-IN-1 V3.0-1                            ALL-IN-1 V3.0-1
This is the Network Node running           Only PATHWORKS runs here!
PSI
SNA

Yesterday they had problems with one shared drawer, type Regular Shared.
The customer wanted to change the ACEs via MGT MFC MD E. It seemed to work,
but the ACEs were not updated. Suddenly no one in the ACL had access to this 
drawer. Strange. I looked for the problem and found that the FCS had locked
the ACCESS.DAT for this drawer. So I stopped the FCS via MGT MFC MD MS STO,
but the process never stopped. After a while I did a $ STOP/ID and restarted
the server. Now it works, and the drawer is also OK.

But I found the following:

BM01:     

SYS$SYSROOT:<SYSMGR>OAFC$SERVER.log  only 6 blocks 
                               _error.log 0 blocks

BM02:

SYS$SYSROOT:<SYSMGR>OAFC$SERVER.log  only 124464 blocks !!!!!!!!
                               _error.log 0 blocks

Remember, the log files were both created on 5-Apr-1993.

Looking in the large log file, you'll see thousands of entries like:


13-APR-1993 07:39:46.32  Server: BM02::"73="  Error: %MCC-E-FATAL_FW, 
fatal framework condition:  !AS  Message: SrvTimeoutSysMan; 
receive alert to terminate thread

13-APR-1993 07:39:46.49  Server: BM02::"73="  Error: %MCC-E-FATAL_FW, 
fatal framework condition:  !AS  Message: SrvTimeoutSysMan; 
receive alert to terminate thread
...

This comes up every second or more often.
I've deleted the file at the customer's site, because the disk space ran out.

I've looked in TIMA, in this conference, and in the document from Terry, but
I had no luck.

So, please, can you explain to me the reason for this message,
and what I can do to prevent it?

Best regards

Charly_from_CSC_Vienna
2563.1. "Half an answer" by IOSG::STANDAGE (It's a Burgh kind of thing) Wed Apr 14 1993 12:18
    
    
    Charly,
    
    It looks as though there was a problem deleting a system management
    thread which had exceeded its timeout period. 
    
    When a normal user performs an operation which invokes the server,
    the task is undertaken by a server 'thread'. These threads exist 
    at server startup time and are 'woken up' when required for use, or
    put to 'sleep' when they are not needed. 
    
    System Management threads are slightly different, as for performance
    reasons they remain 'awake' for a defined period of time after 
    becoming idle. This length of time is determined by the Session Timeout
    value in the server configuration file (SM MFC MS R to look at this
    value).
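    The timeout rule described above can be sketched roughly like this. This
    is a hypothetical illustration only - the class and method names are
    invented, and only the 1200-second Session Timeout figure comes from the
    server attributes posted later in reply .14:

```python
# Hypothetical sketch: a system management session stays 'awake' until it
# has been idle for longer than the Session Timeout value, at which point
# the timeout thread would reap it. Names and numbers are illustrative.
SESSION_TIMEOUT = 1200  # seconds, per the Session Timeout shown in .14

class SysManSession:
    def __init__(self, now):
        self.last_activity = now

    def touch(self, now):
        # Any system management call through this session counts as activity.
        self.last_activity = now

    def expired(self, now):
        # The timeout thread would delete the session once this returns True.
        return (now - self.last_activity) > SESSION_TIMEOUT

session = SysManSession(now=0)
session.touch(now=100)
print(session.expired(now=500))    # idle 400 s: still awake
print(session.expired(now=2000))   # idle 1900 s: eligible for timeout
```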
    
    It seems that a request was made to delete the thread, but it was not
    successful, and so you got an endless loop of repeated attempts which
    were being logged.
    
    
    I haven't seen this happen before, but as it occurred shortly after
    upgrading to V3.0-1 it may well happen again soon. Alternatively, it
    may never happen again - so please keep an eye on the size of the
    server log files and let me know if you see this happening again - it's
    one of those situations which is difficult to reproduce.
    
    I'm sure if I've got any of this wrong I'll be corrected shortly ! :-) 
    
    
    Cheers,
    Kevin.
    
    
2563.2. "Do you mean by --Bob?" by VNABRW::EHRLICH_K (With the Power & the Glory) Wed Apr 14 1993 12:31
    Kevin,
    
    	thanx a lot for the explanation. You're right, I'll keep an eye on
    the log file on this node. It's my customer, so I have no problem
    logging in whenever I (or you) need something 'special'.
    
    Let's have a cup of tea and wait ....
    
    Best regards & have a nice day
    Charly
    
2563.3. "Sounds like a nasty problem" by CHRLIE::HUSTON Wed Apr 14 1993 20:08
    
    re .1
    
    
    >I'm sure if I've got any of this wrong I'll be corrected shortly ! :-) 
    
    Is this soon enough? Actually it is mostly a technicality, but I feel
    bad since I tried to explain this; obviously one of us missed 
    something :-)
    
    >When a normal user performs an operation which invokes the server,
    >the task is undertaken by a server 'thread'. These threads exist 
    >at server startup time and are 'woken up' when required for use, or
    >put to 'sleep' when they are not needed. 
    >
    >System Management threads are slightly different as for performance
    >reasons they remain 'awake' for a defined period of time after 
    >becoming idle. This length of time is determined by the Session Timeout
    >vakue in the server configuration file (SM MFC MS R to look at this
    >value).
    
    Close, very close, but not all correct. "Normal users", by which I mean
    people coming in from IOS doing an OafcOpenCabinet (ie non-system
    managers), will not use one of the background threads. Each request they
    make has a thread created specifically for it; when the task is done, 
    the thread is killed. Background threads are not related to users at 
    all. 
    
    System management requests are different, but only slightly. Since
    there is no explicit session from the user's point of view, there is
    nothing for the user to terminate. What happens is that behind the scenes
    the FCS does an open cabinet on behalf of the system manager and
    assigns it a session. Then when they make a call to the FCS, a thread is
    created and killed just as before. What is timed out is the system 
    management session. 
    
    >It seems that a request was made to delete the thread, but it was not
    >successful and so you got an endless loop of repeated attempts which
    >where being logged.
    
    This part is right. What I THINK it is trying to do is to kill
    the background thread that periodically wakes up and times out 
    system management sessions. This would also explain why the shutdown
    never finished. The thread is trying to kill itself but can't, so
    it is looping. The server shutdown will not finish until all 
    background threads have shut down. Can't think of why this wouldn't
    work though. One of those things that someone needs to find a way to 
    reproduce and then go in and poke around and see.
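    That shutdown behaviour can be sketched as follows. This is a Python
    toy, not the real FCS - all names are hypothetical - but it shows how a
    shutdown that waits on every background thread hangs as soon as one
    thread ignores the request to die, consistent with the $ STOP/ID that
    was needed in .0:

```python
import threading

# Sketch (names hypothetical): server shutdown joins every background
# thread, so one thread stuck in a loop blocks the shutdown forever.
stop_requested = threading.Event()

def well_behaved():
    stop_requested.wait()        # exits as soon as shutdown is requested

def ignores_termination():
    while True:                  # never checks whether it was told to die
        stop_requested.wait(timeout=0.05)

def shutdown(background_threads, grace_seconds):
    for t in background_threads:
        t.join(timeout=grace_seconds)
    # Any thread still alive here would block a real shutdown indefinitely.
    return [t.name for t in background_threads if t.is_alive()]

threads = [
    threading.Thread(target=well_behaved, name="SrvBufferProcess", daemon=True),
    threading.Thread(target=ignores_termination, name="SrvTimeoutSysMan", daemon=True),
]
for t in threads:
    t.start()
stop_requested.set()
print(shutdown(threads, grace_seconds=0.5))   # ['SrvTimeoutSysMan']
```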
    
    --Bob
    
    
2563.4. "New debugging aid, FYI" by IOSG::TALLETT (Gimmee an Alpha colour notebook...) Fri Apr 16 1993 09:49
    
    	Just FYI, there's a great looking hack in the HACKERS conference
    	that allows you to force any process in the system to enter the
    	debugger. You run this privileged program, point it at the process
    	and you end up with a debug session on your terminal for the
    	process in question.
    
    	Might be useful for this type of problem, as you could jump in
    	when the problem occurs. Might be useful to build the server with
    	debug so you have a symbol table...
    
    Regards,
    Paul
2563.5. "and hand it over to you (:==:)!" by VNABRW::EHRLICH_K (With the Power & the Glory) Tue Apr 20 1993 09:28
    Good morning,
    
    	sorry for the delay, but I've caught a terrible cold and my voice
    is still lost today.
    
    So, I'm going to check in HACKERS for this tool. And if the problem
    occurs again, I'll do it. Then I'm going to hand over the results to
    you, won't I?
    
    FYI: The loop has never happened again up to today.
    
    Best Regards and thank y'all
    
    Charly_from_CSC_Vienna 
2563.6. "Observed details on %MCC-E-FATAL_FW, fatal framework" by GIDDAY::LEH Wed May 26 1993 09:10
"Fatal framework..." errors have been occurring at a number of sites running 
3.0-1, although the effects were either unknown or not recorded. One of them 
was shared drawers losing their share status, with the sharers unable to access 
these drawers. 

Attempts to restore REGULAR share status via MFC MD MDT were not successful, 
despite the visual impression given when working in the involved form, 
FC$MDT. ACL setups didn't seem to be affected by these changes.

The log file OAFC$SERVER.LOG expanded very quickly with %MCC-E-FATAL_FW 
events, and the sequence was very much like:

11-MAR-1993 23:02:47.85  Server: ADL01V::"73="
Message: Startup for File Cabinet Server V1.0-2 complete

followed by, but not always on the same day,

11-MAR-1993 23:07:41.62  Server: ADL01V::"73="
Error: %DSL-W-SHUT, Network shutdown
Message: Shutting Down server, network failure.

followed by 3 or 4 %MCC-E-ALERT_TERMREQ events

11-MAR-1993 23:07:42.28  Server: ADL01V::"73="
Error: %MCC-E-ALERT_TERMREQ, thread termination requested
Message: SrvBufferProcess; receive alert to terminate thread

then followed by some 4,000+ MCC-E-FATAL_FW events happening in
around 10 minutes

11-MAR-1993 23:07:44.11  Server: ADL01V::"73="
Error: %MCC-E-FATAL_FW, fatal framework condition:  !AS
Message: SrvTimeoutSysMan; receive alert to terminate thread

Sometimes the FC server startup encountered

15-MAR-1993 23:05:30.12  Server: ADL01V::"73="  Error: %OAFC-W-NETLOST,
The network connection to the File Cabinet Server was lost
Message: Network lost, server config record not updated.

but seemed to recover with %DSL-W-SHUT followed by another FC server startup, 
which occurred immediately after.

On the system where the above details were collected, %MCC-E-FATAL_FW events
happened on the same day 3.0-1 was put on; in the 4 previous months running 
3.0, there was not a single event of the same type.

Thanks for any comments

Hong
CSC Sydney                                            
2563.7. "Seen it, but not easy to reproduce..." by IOSG::STANDAGE Wed May 26 1993 11:52
    
    
    Hong,
    
    Yes, I've seen this too, but only in the following instance.
    
    One of our machines was being shut down, and for some reason the
    %MCC-E-FATAL_FW error was logged multiple times by the server just
    before everything went dead. Usually you just get the DSL shutdown
    message and the FCS dies quietly. I've never seen this message produced
    during the normal day-to-day life of the server, so I cannot determine
    its impact for sure. Reproducing such things is a very
    difficult task - but we have bugged the problem and we're monitoring
    the frequency at which it happens...
    
    I'll append your comments to the bug report.
    
    Thanks,
    Kevin.
    
    
2563.8. "Not sure if this is it or not" by CHRLIE::HUSTON Wed May 26 1993 14:10
    
    In a normal, by-the-book shutdown you won't get the severe MCC errors
    that you are seeing - just the ones about requests to terminate 
    threads. (Those aren't really errors; that is the FCS killing all
    the background threads. They are logged as errors since the internal
    MCC workings notify a thread to commit suicide by waking it up with
    an error code.)
    
    I THINK the reason that you are getting the severe errors is this:
    the network is shutting down (for whatever reason - DASL problem, 
    system shutdown, DECnet shutdown, whatever). When DASL is told
    to shut down, it notifies all its current applications, the FCS being one
    of them, to stop immediately. If you have a task currently executing, 
    the task will be very un-nicely shot, potentially in mid-task; this
    may be what is causing the MCC errors you are seeing. You can
    potentially have several threads per task, and each could have 
    multiple mutexes locked. When the FCS is told to stop by DASL, the
    FCS does not have the chance to nicely run down tasks. It can't: the
    network is about to go down, and there is no mechanism for the FCS
    to ask DASL to wait a few minutes before it dies.
    
    Not sure if that is the cause, but it would explain it.
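    The failure mode sketched above - a thread shot in mid-task with a
    mutex still locked - can be illustrated like this. All names are
    hypothetical, and since Python threads can't really be shot from
    outside, the abrupt kill is simulated with an exception and no
    try/finally:

```python
import threading

# Sketch: a thread "shot" in mid-task dies with a mutex still held, and
# nothing ever releases it. The lock name is invented for illustration.
threading.excepthook = lambda args: None    # silence the deliberate crash

access_dat_lock = threading.Lock()

def task_shot_mid_task():
    access_dat_lock.acquire()               # working on the drawer...
    raise RuntimeError("network gone")      # ...shot before it can release

t = threading.Thread(target=task_shot_mid_task)
t.start()
t.join()
print(access_dat_lock.locked())   # True: later acquirers block forever
```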
    
    --Bob
    
2563.9. "what about loss of regular sharing mode ?" by GIDDAY::LEH Wed May 26 1993 14:38
    Kevin and Bob in .7 and .8
                                                   
    Thanks for shedding light on the possible cause. Will keep an eye on
    this error.
    
    Today I saw similar incidents at another site which went to 3.0-1
    about 2 weeks ago, where regular shared drawers went back to non-share
    mode and the only known fix was to restart the FCS.
    
    We're trying to collect from the customers any unusual tasks they'd
    done or any observations they may have had, but if this keeps happening,
    extreme hardship will be felt and pressure will mount.
    
    Apart from tracing, any other advice?
    
    Thanks
    
    Hong
2563.10. "How's the network? Is it stable?" by CHRLIE::HUSTON Wed May 26 1993 18:49
    
    re .9
    
    >Today saw similar incidents at another site just going to 3.0-1 for
    >about 2 weeks, where regular sha drawerswent back to non-share mode and
    >and the only known fix was to restart FCS.
    
    It sounds like the internal drawer cache is getting corrupt, especially
    if you fix it by restarting the FCS. The FCS maintains a cache of
    security information for open drawers. How long did you wait before
    restarting the FCS once you noticed the drawer sharing type had
    changed? The reason I ask is that this security information will time
    out fairly quickly (the default is 10 minutes, I think). If the problem
    is the cache, then the timeout will fix it.
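    The cache behaviour Bob describes can be sketched as a simple
    time-to-live cache. This is a hypothetical illustration - the class and
    method names are invented, and the 600-second figure is taken from the
    Drawer Timeout value shown in reply .14:

```python
# Sketch of a drawer security cache with ageing entries: a stale (or
# corrupt) entry should normally disappear once its TTL expires, forcing
# a re-read of the real security info. Names are illustrative only.
DRAWER_TIMEOUT = 600  # seconds, per the Drawer Timeout shown in .14

class DrawerSecurityCache:
    def __init__(self):
        self._entries = {}   # drawer name -> (security info, time cached)

    def put(self, drawer, info, now):
        self._entries[drawer] = (info, now)

    def get(self, drawer, now):
        entry = self._entries.get(drawer)
        if entry is None:
            return None                    # caller must re-read ACCESS.DAT
        info, cached_at = entry
        if now - cached_at > DRAWER_TIMEOUT:
            del self._entries[drawer]      # aged out: force a re-read
            return None
        return info

cache = DrawerSecurityCache()
cache.put("SHARED", {"share": "REGULAR"}, now=0)
print(cache.get("SHARED", now=300))   # {'share': 'REGULAR'}
print(cache.get("SHARED", now=700))   # None - a corrupt entry would be gone
```

    If a wrong sharing mode persists for hours, as reported in .11, the
    implication is that entries like these are not ageing out as intended.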
    
    Other advice? Nope. I'm also not sure if tracing will help; all it
    will tell you is what actions the FCS was told to do. It won't
    tell you who mangled the internal structures.
    
    --Bob
    
    
2563.11. "Cache is not being flushed after 10 mins" by BUSHIE::SETHI (Ahhhh (-: an upside down smile from OZ) Fri May 28 1993 02:51
    Hi Bob,
    
    >It sounds like the internal drawer cache is getting corrupt, especially
    >if you fix it by restarting the FCS. The FCS maintains a cache of
    >security information for  open drawers. How long did you wait before
    >re-starting the FCS once you noticed the drawer sharing type was
    >changed? The reason I ask is that this security informaion will time
    >out fairly quickly (default is 10 minutes I think). If the problem is
    >the cache, then the time out will fix it.
    
    I am working on the same problem at another site, and I can confirm that
    the customer waited a few hours before stopping the FCS.  So it looks
    like the cache is not getting flushed.
    
    I have a copy of the logfile on
    RIPPER::USER$TSC:[SETHI]OAFC$SERVER.DIGITAL_COPY and the file
    protection is set to w:r.  This logfile has a number of problems
    reported and I will reply to notes that are already discussing the
    problem.
    
    The customer has tuned his FCS as per the manual and still has the
    problem.  I would be grateful for any feedback you may have.  Is there
    anything you want me to do to get you extra information ?
    
    Regards,
    
    Sunil
2563.12. "That's really strange" by CHRLIE::HUSTON Fri May 28 1993 15:53
    
    Can you post the log file, and also do a SM MFC MS R and see 
    what the authorization timeout is.
    
    It is possible that the fatal thread error is the authorization 
    timeout thread. 
    
    --Bob
    
2563.13. "Preferably a *pointer* to the log file :-)" by IOSG::PYE (Graham - ALL-IN-1 Sorcerer's Apprentice) Fri May 28 1993 18:15
2563.14. "All the info you required" by ROMA::TEP Tue Jun 01 1993 03:17
    Hi Bob,
    
    The server attributes are as follows for VAX1, VAX6 and VAX8 in the
    cluster:
    
    Authorized Timeout: 600    Max Client Connects: 512
    Distribution:       OFF    Max # of Drawers:    140
    Drawer Cache:       50     Object Number:       73
    Drawer Timeout:     600    Session Timeout:     1200
    Drawer Collisions:  0
      
    >Can you post the log file
    
    As per my previous note the logfile can be found on
    RIPPER::USER$TSC:[SETHI]OAFC$SERVER.DIGITAL_COPY;1, the protection is
    set to w:r.
    
    Regards,
    
    Sunil
    
    NB - It's nice working in Nashua taking calls from customers (:==:)!!!
    I am sure Andrew will know why ???!!!!
2563.15. "ANOTHER OCCURRENCE OF MCC-E-FATAL_FW" by AIMTEC::GRENIER_J Mon Jun 07 1993 15:50
    One of our customers received the MCC-E-FATAL_FW
    "Fatal Framework Condition" error in his log files when checking to
    see why the FCS did not start after running EW???  He shuts
    down ALL-IN-1 for EW, and for the last two weeks the FCS did not
    start on any of his three nodes.  Prior to this it was working.
    That was the only "error type" message he could find in his
    log files.
    
    Hints???
    
    Jean
2563.16. "Hhhmmm, at our cluster, too" by VNABRW::EHRLICH_K (Health & a long life with you) Mon Jun 07 1993 15:58
    Hi,
    	last week I detected the same at our CSC cluster here in Vienna.
    BUT the customer mentioned in .0 has never run into this behaviour
    again. AND I'm really at his site one day every week. 
    
    But our cluster isn't very stable now. We have great problems with
    the heat in the room, so one or several nodes are 'leaving' the cluster
    every day. This could be the problem with the network!
    
    I don't have any ideas anymore. Sorry.
    Regards
    Charly_from_CSC_Vienna
2563.17. "Some more infos for y'all!" by VNABRW::EHRLICH_K (Health & a long life with you) Mon Jun 21 1993 12:36
Hi Kevin&Bob

	I've done some investigation of %MCC-E-FATAL_FW.
Maybe this can help you a little bit more.

sys$manager:oafc$server.log
---------------------------

...
18-MAY-1993 20:24:50.92  Server: VNACO1::"73="  Error: %DSL-W-SHUT, Network shutdown  Message: Shutting Down server, network failure.
18-MAY-1993 20:24:52.43  Server: VNACO1::"73="  Error: %MCC-E-ALERT_TERMREQ, thread termination requested  Message: SrvBufferProcess; receive alert to terminate thread
18-MAY-1993 20:24:52.95  Server: VNACO1::"73="  Error: %MCC-E-ALERT_TERMREQ, thread termination requested  Message: SrvMemoryScavenge; receive alert to terminate thread
18-MAY-1993 20:24:54.80  Server: VNACO1::"73="  Error: %MCC-E-ALERT_TERMREQ, thread termination requested  Message: CsiCacheBlockAstService; Error from mcc_astevent_receive
18-MAY-1993 20:24:55.08  Server: VNACO1::"73="  Error: %MCC-E-ALERT_TERMREQ, thread termination requested  Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:24:55.17  Server: VNACO1::"73="  Error: %MCC-E-FATAL_FW, fatal framework condition:  !AS  Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:24:55.26  Server: VNACO1::"73="  Error: %MCC-E-FATAL_FW, fatal framework condition:  !AS  Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:24:55.36  Server: VNACO1::"73="  Error: %MCC-E-FATAL_FW, fatal framework condition:  !AS  Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:24:55.53  Server: VNACO1::"73="  Error: %MCC-E-FATAL_FW, fatal framework condition:  !AS  Message: SrvTimeoutSysMan; receive alert to terminate thread
...
18-MAY-1993 20:30:13.71  Server: VNACO1::"73="  Error: %MCC-E-FATAL_FW, fatal framework condition:  !AS  Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:30:13.84  Server: VNACO1::"73="  Error: %MCC-E-FATAL_FW, fatal framework condition:  !AS  Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:30:14.01  Server: VNACO1::"73="  Error: %MCC-E-FATAL_FW, fatal framework condition:  !AS  Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:30:14.15  Server: VNACO1::"73="  Error: %MCC-E-FATAL_FW, fatal framework condition:  !AS  Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:30:14.33  Server: VNACO1::"73="  Error: %MCC-E-FATAL_FW, fatal framework condition:  !AS  Message: SrvTimeoutSysMan; receive alert to terminate thread
19-MAY-1993 08:37:01.41  Server: VNACO1::"73="  Message: Startup for File Cabinet Server V1.0-2 complete

sys$manager:operator.log
------------------------

%%%%%%%%%%%  OPCOM  18-MAY-1993 20:24:51.03  %%%%%%%%%%%    (from node VNACO1 at 18-MAY-1993 20:24:50.89)
Message from user DFS$COM_ACP on VNACO1
%DFS-I-STOPPING,  DFS/COM Stopping

%%%%%%%%%%%  OPCOM  18-MAY-1993 20:24:51.04  %%%%%%%%%%%    (from node VNACO1 at 18-MAY-1993 20:24:50.91)
Message from user DFS$00010001 on VNACO1
%DFS-I-SRVEXIT, Server exiting (DFS$00010001_1) 18-MAY-1993 20:24:50.87

%%%%%%%%%%%  OPCOM  18-MAY-1993 20:29:40.68  %%%%%%%%%%%    (from node VNACO1 at 18-MAY-1993 20:29:40.67)
Message from user UCX$NFS on VNACO1
%UCX-I-NFS_DISERR, Disable writing into NFS errlog file

%%%%%%%%%%%  OPCOM  18-MAY-1993 20:29:52.70  %%%%%%%%%%%    (from node VNACO1 at 18-MAY-1993 20:29:52.69)
Message from user INTERnet on VNACO1
INTERnet ACP Remote Terminal Services STOP - RLOGIN

%%%%%%%%%%%  OPCOM  18-MAY-1993 20:29:52.72  %%%%%%%%%%%    (from node VNACO1 at 18-MAY-1993 20:29:52.71)
Message from user INTERnet on VNACO1
INTERnet ACP Remote Terminal Services STOP - TELNET

%%%%%%%%%%%  OPCOM  18-MAY-1993 20:29:56.65  %%%%%%%%%%%    (from node VNACO1 at 18-MAY-1993 20:29:56.64)
Message from user INTERnet on VNACO1
INTERnet Shutdown

%%%%%%%%%%%  OPCOM  18-MAY-1993 20:30:33.81  %%%%%%%%%%%    (from node VNACO1 at 18-MAY-1993 20:30:33.79)
$1$DUA0: (VNA03) has been removed from shadow set.

%%%%%%%%%%%  OPCOM  18-MAY-1993 20:30:33.83  %%%%%%%%%%%    (from node VNACO1 at 18-MAY-1993 20:30:33.80)
$1$DUA2: (VNA03) has been removed from shadow set.

%%%%%%%%%%%  OPCOM  18-MAY-1993 20:30:51.55  %%%%%%%%%%%    (from node VNACO1 at 18-MAY-1993 20:30:51.54)
Message from user AUDIT$SERVER on VNACO1
Security alarm (SECURITY) and security audit (SECURITY) on VNACO1, system id: 49312
Auditable event:        Audit server shutting down
Event time:             18-MAY-1993 20:30:51.52
PID:                    2640027C

%%%%%%%%%%%  OPCOM  18-MAY-1993 20:30:51.74  %%%%%%%%%%%    (from node VNACO1 at 18-MAY-1993 20:30:51.71)
Message from user  on VNACO1
_VNACO1$VTA2:, VNACO1 shutdown was requested by the operator.

%%%%%%%%%%%  OPCOM  18-MAY-1993 20:30:52.03  %%%%%%%%%%%
Operator _VNACO1$VTA2: has been disabled, username
....
%%%%%%%%%%%  OPCOM  18-MAY-1993 20:31:56.86  %%%%%%%%%%%
20:31:09.79 Node VNACO1 (csid 00010032) has been removed from the VAXcluster

Hhmmm, guess what?

Regards 
Charly
    
2563.18. "Simple problem - must wait though" by IOSG::CHINNICK (gone walkabout) Fri Jul 09 1993 11:51
    
    OK... I think that this is a really simple problem...
    
    When the network is shutting down, or the FCS is trying to shut down,
    all of its active threads are signalled to terminate. [By MCC returning
    the ALERT_TERMREQ status.]
    
    The SrvSysManTimeOut thread simply doesn't understand this status and
    tries to continue execution after it has been told to die. It gets
    further errors - notably the FATAL_FW - and loops interminably.
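    A sketch of that bug and its obvious fix, in Python. The control flow
    here is my guess at what Paul describes, and the function names are
    hypothetical; the real thread loops forever, so this sketch replays a
    bounded list of received statuses instead:

```python
# Sketch of the bug: the timeout thread treats ALERT_TERMREQ like any
# other error, logs FATAL_FW, and goes round again instead of exiting.
ALERT_TERMREQ = "%MCC-E-ALERT_TERMREQ"

def srv_sysman_timeout_buggy(statuses, log):
    for status in statuses:
        if status is not None:
            # Doesn't recognise the request to die: logs and keeps going.
            log.append("%MCC-E-FATAL_FW ... SrvTimeoutSysMan")

def srv_sysman_timeout_fixed(statuses, log):
    for status in statuses:
        if status == ALERT_TERMREQ:
            return                         # honour the termination request
        if status is not None:
            log.append("%MCC-E-FATAL_FW ... SrvTimeoutSysMan")

statuses = [None, ALERT_TERMREQ, ALERT_TERMREQ, ALERT_TERMREQ]
buggy_log, fixed_log = [], []
srv_sysman_timeout_buggy(statuses, buggy_log)
srv_sysman_timeout_fixed(statuses, fixed_log)
print(len(buggy_log), len(fixed_log))   # 3 0
```

    This matches the log excerpts in .17, where each %MCC-E-ALERT_TERMREQ
    delivered to SrvTimeoutSysMan is followed by an unbroken stream of
    %MCC-E-FATAL_FW entries.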
    
    We are planning to fix this, but it will be some small time before we
    have a new FCS to ship. More details when they come to hand.
    
    
    Paul.
2563.19. "Oh! Looks pretty!" by VNABRW::EHRLICH_K (Ronnie James DIO, vocals!) Fri Jul 09 1993 15:07