[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference iosg::all-in-1_v30

Title:	OLD ALL-IN-1 (tm) Support Conference
Notice:	Closed - See Note 4331.l to move to IOSG::ALL-IN-1
Moderator:	IOSG::PYE

Created:	Thu Jan 30 1992
Last Modified:	Tue Jan 23 1996
Last Successful Update:	Fri Jun 06 1997
Number of topics:	4343
Total number of notes:	18308

2563.0. "FCS and %MCC-E-FATAL_FW..." by VNABRW::EHRLICH_K (With the Power & the Glory) Wed Apr 14 1993 09:45

Hi,

	I've done an upgrade from ALL-IN-1 V2.4 German with SFCP to ALL-IN-1 
V3.0-1 German the weekend before easter together with an colleague at a customer
site having approx. 300 User.
It's a two node Cluster running VMS V5.5-2 and a lot of SW.

Node BM01                                  Node BM02

VAX 4300                                   VAX 4300 
MR,MRG,MRX                                 MR,MRG,MRX
ALL-IN-1 V3.0-1                            ALL-IN-1 V3.0-1
This is the Network Node running           Here's running Pathworks only!
PSI
SNA

Yesterday they had problems with one Shared Drawer, Typ Regular Shared.
Customer wanted to change the ACE's via MGT MFC MD E. It seemed to work
but the ACE's were not updated. Suddenly noone in the ACL had acces to this 
drawer. Strange. I've looked for the problem and found that the FCS has locked
the ACCESS.DAT for this Drawer. So I did stop the FCS via MGT MFC MD MS STO
but the process never gets stopped. After a while I did a $ STOP/ID. and restart
ed the Server. Now it works, The drawer is also ok.

But I found following:

BM01:     

SYS$SYSROOT:<SYSMGR>OAFC$SERVER.log  only 6 blocks 
                               _error.log 0 blocks

BM02:

SYS$SYSROOT:<SYSMGR>OAFC$SERVER.log  only 124464 blocks !!!!!!!!
                               _error.log 0 blocks

Remember, the log files are both created on 5-Apr-1993.

Having a look in the large logfile you'll see thousands of 


13-APR-1993 07:39:46.32  Server: BM02::"73="  Error: %MCC-E-FATAL_FW, 
fatal framework condition:  !AS  Message: SrvTimeoutSysMan; 
receive alert to terminate thread

13-APR-1993 07:39:46.49  Server: BM02::"73="  Error: %MCC-E-FATAL_FW, 
fatal framework condition:  !AS  Message: SrvTimeoutSysMan; 
receive alert to terminate thread
.

This comes up every second or less.
I've deleted the file at customer, because the Diskspace ran out.

I've looked in TIMA, in this conference and in the document from Terry, but
I had no luck.

So, please, can you explain me why or what is the reason for this message
and what can I do to prevent this.

Best regards

Charly_from_CSC_Vienna

T.R	Title	User	Personal Name	Date	Lines
2563.1	Half an answer	IOSG::STANDAGE	It's a Burgh kind of thing	`Wed Apr 14 1993 11:18`	36
	Charly, It looks as though there was a problem deleting a system management thread which had exceeded it's timeout period. When a normal user performs an operation which invokes the server, the task is undertaken by a server 'thread'. These threads exist at server startup time and are 'woken up' when required for use, or put to 'sleep' when they are not needed. System Management threads are slightly different as for performance reasons they remain 'awake' for a defined period of time after becoming idle. This length of time is determined by the Session Timeout vakue in the server configuration file (SM MFC MS R to look at this value). It seems that a request was made to delete the thread, but it was not successful and so you got an endless loop of repeated attempts which where being logged. I haven't seen this happen before, but as it occurred shortly after upgrading to V3.0-1 it may well happen again soon. Alternatively, it may never happen again - so please keep an eye on the size of the server log files and let me know if you see this happening again - it's one of those situations which is difficult to reproduce. I'm sure if I've got any of this wrong I'll be corrected shortly ! :-) Cheers, Kevin.
2563.2	Do you mean by --Bob?	VNABRW::EHRLICH_K	With the Power & the Glory	`Wed Apr 14 1993 11:31`	11
	Kevin, thanx a lot for explanation. You're right, I'll keep an eye on the logfile on this node. It's my customer, therefore I've no problem to login whenever I (or you) need something 'special'. Let's have a cup of tea and wait .... Best regards & have a nice day Charly
2563.3	Sounds like a nasty problem	CHRLIE::HUSTON		`Wed Apr 14 1993 19:08`	52
	re .1 >I'm sure if I've got any of this wrong I'll be corrected shortly ! :-) Is this soon enough? Actually it is mostly a techinicality, but I feel bad since I tried to explain this, obviously one of us missed something :-) >When a normal user performs an operation which invokes the server, >the task is undertaken by a server 'thread'. These threads exist >at server startup time and are 'woken up' when required for use, or >put to 'sleep' when they are not needed. > >System Management threads are slightly different as for performance >reasons they remain 'awake' for a defined period of time after >becoming idle. This length of time is determined by the Session Timeout >vakue in the server configuration file (SM MFC MS R to look at this >value). Close, very close, but not all correct. "Normal users" by which I mean people coming in from IOS doing an OafcOpenCabinet (ie non-system managers) will not use one of the background threads. Each request they make has a thread created specifically for it, when the task is done, the thread is killed. Background threads are not related to users at all. System management requests are different, but only slightly. Since there is no explicit session from the users point of view, there is nothing for the user to terminate. What happens is behind the scenese the FCS does a open cabinet on behalf of the system manager and assigns it a session. Then when they make a call to the FCS a thread is created and killed just as before. What is timed out is the system management session. >It seems that a request was made to delete the thread, but it was not >successful and so you got an endless loop of repeated attempts which >where being logged. This part is right, what I THINK it is trying to do is to kill the background thread that periodically wakes up and timesout system management sessions. This would also explain why the shutdown never finished. The thread is trying to kill itself but can't so it is looping. The server shutdown will not finish until all background threads have shutdown. Cant' think of why this wouldn't work though. One of those things that someone needs to find a way to reproduce and then go in and poke around and see. --Bob
2563.4	New debugging aid, FYI	IOSG::TALLETT	Gimmee an Alpha colour notebook...	`Fri Apr 16 1993 08:49`	13
	Just FYI, there's a great looking hack in the HACKERS conference that allows you to force any process in the system to enter the debugger. You run this priveleged program, point it at the process and you end up with a debug session on your terminal for the process in question. Might be useful for this type of problem as you could jump in when the problem occurs. Might be useful to build the server debug so you have a symbol table... Regards, Paul
2563.5	and hand it over to you (:==:)!	VNABRW::EHRLICH_K	With the Power & the Glory	`Tue Apr 20 1993 08:28`	14
	Good morning, sorry for the delay, but I've caught a terrible cold and my voice is lost even today. So, I'm going to check in HACKERS for this tool. And when the problem occours anymore I'll do it. Then I'm going to hand over the results to you, won't I? FYI: The loop never happened til' today. Best Regards and thank y'all Charly_from_CSC_Vienna
2563.6	Observed details on %MCC-E-FATAL_FW, fatal framework	GIDDAY::LEH		`Wed May 26 1993 08:10`	51
	"Fatal framework..." errors have been occurring in a number of sites running 3.0-1 although the effects were either unknown or not recorded. One of them was shared drawers losing their share status and the sharers unable to access these drawers. Attempt to restore REGULAR share status via MFC MD MDT was not successful despite of the visual impression given when working in the involved form FC$MDT. ACL setups didn't seem to be affected by these changes The log file OAFC$SERVER.LOG expanded very quickly with %MCC-E-FATAL_FW events, and the sequence was very much like: 11-MAR-1993 23:02:47.85 Server: ADL01V::"73=" Message: Startup for File Cabinet Server V1.0-2 complete followed by , but not always in same day 11-MAR-1993 23:07:41.62 Server: ADL01V::"73=" Error: %DSL-W-SHUT, Network shutdown Message: Shutting Down server, network failure. followed by 3 or 4 %MCC-E-ALERT_TERMREQ events 11-MAR-1993 23:07:42.28 Server: ADL01V::"73=" Error: %MCC-E-ALERT_TERMREQ, thread termination requested Message: SrvBufferProcess; receive alert to terminate thread then followed by some 4,000+ MCC-E-FATAL_FW events happening in around 10 minutes 11-MAR-1993 23:07:44.11 Server: ADL01V::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread Sometimes the FC server startup encountered 15-MAR-1993 23:05:30.12 Server: ADL01V::"73=" Error: %OAFC-W-NETLOST, The network connection to the File Cabinet Server was lost Message: Network lost, server config record not updated. but seemed to recover with %DSL-W-SHUT followed by another FC server startup, which ocurred immediately after. On the system where the above details were collected, %MCC-E-FATAL_FW events happened on the same day 3.0-1 was put on; and for 4 previous months running 3.0, there was no single event of same type Thanks for any comments Hong CSC Sydney
2563.7	Seen it, but not easy to reproduce...	IOSG::STANDAGE		`Wed May 26 1993 10:52`	21
	Hong, Yes, I've seen this too, but only in the following instance. One of our machines was being shutdown, and for some reason the %MCC-E-FATAL_FW error was logged multiple times by the server just before everything went dead. Usually you do just get the DSL shutdown message and the FCS dies quietly. I've never seen this message produced during the normal day-to-day life of the server, so the impact of this I cannot determine for sure. Reproducing such things is a very difficult task - but we have bugged the problem and we're monitoring the frequency at which it happens... I'll append your comments to the bug report. Thanks, Kevin.
2563.8	Not sure if this is it or not	CHRLIE::HUSTON		`Wed May 26 1993 13:10`	25
	In a normal, by the book shutdown you won't get the severe mcc errors that you are seeing. Just the ones about requests to terminate threads (which aren't really errors, that is the FCS killing all the background threads, they are logged as errors since the internal mcc workings notify a thread to commit suicide by waking it up with an error code.) I THINK the reason that you are getting the severe errors is this: The network is shutting down (for whatever reason-- DASL problem, system shutdown, DECnet shutdown, whatever), when DASL is told to shut down, it notifies all it's current applications, FCS being one of them, to immediately stop. IF you have a task currently executing, the task will be very un-nicely shot, potentially in mid-task, this may be what is causing the MCC errors you are seeing. You can potentially have several threads per task, and each could have multiple mutexes locked. When the FCS is told to stop, by DASL, the FCS does not have the chance to nicely run down tasks, it can't, the network is about to go down, and there is no mechanism for the FCS to ask DASL to wait a few minutes before it dies. Not sure if that is the cause, but it would explain it. --Bob
2563.9	what about loss of regular sharing mode ?	GIDDAY::LEH		`Wed May 26 1993 13:38`	18
	Kevin and Bob in .7 and .8 Thanks for shedding lights on the possible cause. Will keep an eye on this error. Today saw similar incidents at another site just going to 3.0-1 for about 2 weeks, where regular sha drawerswent back to non-share mode and and the only known fix was to restart FCS. We're trying to collect from the customers any unusual tasks they'd done or any observations they'd have had, but if this keeps happening, extreme hardships would be felt and pressures mounted. Apart from tracing, any other advices ? Thanks Hong
2563.10	How's the network? Is it stable?	CHRLIE::HUSTON		`Wed May 26 1993 17:49`	22
	re .9 >Today saw similar incidents at another site just going to 3.0-1 for >about 2 weeks, where regular sha drawerswent back to non-share mode and >and the only known fix was to restart FCS. It sounds like the internal drawer cache is getting corrupt, especially if you fix it by restarting the FCS. The FCS maintains a cache of security information for open drawers. How long did you wait before re-starting the FCS once you noticed the drawer sharing type was changed? The reason I ask is that this security informaion will time out fairly quickly (default is 10 minutes I think). If the problem is the cache, then the time out will fix it. Other advice? Nope, I also am not sure if tracing will help, all it will tell you is what actions the FCS was told to do, it won't tell you who mangled internal structures. --Bob
2563.11	Cache is not being flushed after 10 mins	BUSHIE::SETHI	Ahhhh (-: an upside down smile from OZ	`Fri May 28 1993 01:51`	27
	Hi Bob, >It sounds like the internal drawer cache is getting corrupt, especially >if you fix it by restarting the FCS. The FCS maintains a cache of >security information for open drawers. How long did you wait before >re-starting the FCS once you noticed the drawer sharing type was >changed? The reason I ask is that this security informaion will time >out fairly quickly (default is 10 minutes I think). If the problem is >the cache, then the time out will fix it. I am working on the same problem at another site and I can confirm that the customer waited a few hours before stopping the FCS. So it looks like the cache is not getting flushed. I have a copy of the logfile on RIPPER::USER$TSC:[SETHI]OAFC$SERVER.DIGITAL_COPY and the file protection is set to w:r. This logfile has a number of problems reported and I will reply to notes that are already discussing the problem. The customer has tuned his FCS as per the manual and still has the problem. I would be grateful for an feedback you may have. Is there anything you want me to do to get you extra information ? Regards, Sunil
2563.12	That's really strange	CHRLIE::HUSTON		`Fri May 28 1993 14:53`	9
	Can you post the log file, and also do a SM MFC MS R and see what the authorization timeout is. It is possible that the fatal thread error is the authorization timeout thread. --Bob
2563.13	Preferably a pointer to the log file :-)	IOSG::PYE	Graham - ALL-IN-1 Sorcerer's Apprentice	`Fri May 28 1993 17:15`	0
2563.14	All the info you required	ROMA::TEP		`Tue Jun 01 1993 02:17`	23
	Hi Bob, The server attributes are as follows for VAX1, VAX6 and VAX8 in the cluster: Authorized Timeout: 600 Max Client Connects: 512 Distribution: OFF Max # of Drawers: 140 Drawer Cache: 50 Object Number: 73 Drawer Timeout: 600 Session Timeout: 1200 Drawer Collisions: 0 >Can you post the log file As per my previous note the logfile can be found on RIPPER::USER$TSC:[SETHI]OAFC$SERVER.DIGITAL_COPY;1, the protection is set to w:r. Regards, Sunil NB - It's nice working in Nashua taking calls from customers (:==:)!!! I am sure Andrew will know why ???!!!!
2563.15	ANOTHER OCCURRENCE OF MCC-E-FATAL_FW	AIMTEC::GRENIER_J		`Mon Jun 07 1993 14:50`	11
	One of our customers received the error about MCC-E-FATAL_FW "Fatal Framework Condition" in his log files when checking to see why the FCS did not start after running EW??? He shuts down ALL-IN-1 for EW and the last two weeks the FCS did not start on any of his three nodes. Prior to this it was working. That was the only "error type" message he could find in his log files. Hints??? Jean
2563.16	Hhhmmm, at our cluster, too	VNABRW::EHRLICH_K	Health & a long life with you	`Mon Jun 07 1993 14:58`	12
	Hi, last week I've detected the same at our CSC cluster here in Vienna. BUT the customer as mentioned in .0 has never ran into this behaviour. again. AND I'm really one day/every week at his site. But our cluster isn't very constant now. We have great problems with the heat in the room, so one or several nodes are 'leaving' the cluster every day. This could be the problem with the network! I don't have any ideas anymore here. Sorry. Regards Charly_from_CSC_Vienna
2563.17	Some more infos for y'all!	VNABRW::EHRLICH_K	Health & a long life with you	`Mon Jun 21 1993 11:36`	81
	Hi Kevin&Bob I've done some investigations for %MCC-E-FATAL_FW. Maybe this can help you a little bit more. sys$manager:oafc$server.log --------------------------- ... 18-MAY-1993 20:24:50.92 Server: VNACO1::"73=" Error: %DSL-W-SHUT, Network shutdown Message: Shutting Down server, network failure. 18-MAY-1993 20:24:52.43 Server: VNACO1::"73=" Error: %MCC-E-ALERT_TERMREQ, thread termination requested Message: SrvBufferProcess; receive alert to terminate thread 18-MAY-1993 20:24:52.95 Server: VNACO1::"73=" Error: %MCC-E-ALERT_TERMREQ, thread termination requested Message: SrvMemoryScavenge; receive alert to terminate thread 18-MAY-1993 20:24:54.80 Server: VNACO1::"73=" Error: %MCC-E-ALERT_TERMREQ, thread termination requested Message: CsiCacheBlockAstService; Error from mcc_astevent_receive 18-MAY-1993 20:24:55.08 Server: VNACO1::"73=" Error: %MCC-E-ALERT_TERMREQ, thread termination requested Message: SrvTimeoutSysMan; receive alert to terminate thread 18-MAY-1993 20:24:55.17 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread 18-MAY-1993 20:24:55.26 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread 18-MAY-1993 20:24:55.36 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread 18-MAY-1993 20:24:55.53 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread ... 18-MAY-1993 20:30:13.71 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread 18-MAY-1993 20:30:13.84 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread 18-MAY-1993 20:30:14.01 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread 18-MAY-1993 20:30:14.15 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread 18-MAY-1993 20:30:14.33 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread 19-MAY-1993 08:37:01.41 Server: VNACO1::"73=" Message: Startup for File Cabinet Server V1.0-2 complete sys$manager:operator.log ------------------------ %%%%%%%%%%% OPCOM 18-MAY-1993 20:24:51.03 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:24:50.89) Message from user DFS$COM_ACP on VNACO1 %DFS-I-STOPPING, DFS/COM Stopping %%%%%%%%%%% OPCOM 18-MAY-1993 20:24:51.04 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:24:50.91) Message from user DFS$00010001 on VNACO1 %DFS-I-SRVEXIT, Server exiting (DFS$00010001_1) 18-MAY-1993 20:24:50.87 %%%%%%%%%%% OPCOM 18-MAY-1993 20:29:40.68 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:29:40.67) Message from user UCX$NFS on VNACO1 %UCX-I-NFS_DISERR, Disable writing into NFS errlog file %%%%%%%%%%% OPCOM 18-MAY-1993 20:29:52.70 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:29:52.69) Message from user INTERnet on VNACO1 INTERnet ACP Remote Terminal Services STOP - RLOGIN %%%%%%%%%%% OPCOM 18-MAY-1993 20:29:52.72 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:29:52.71) Message from user INTERnet on VNACO1 INTERnet ACP Remote Terminal Services STOP - TELNET %%%%%%%%%%% OPCOM 18-MAY-1993 20:29:56.65 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:29:56.64) Message from user INTERnet on VNACO1 INTERnet Shutdown %%%%%%%%%%% OPCOM 18-MAY-1993 20:30:33.81 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:30:33.79) $1$DUA0: (VNA03) has been removed from shadow set. %%%%%%%%%%% OPCOM 18-MAY-1993 20:30:33.83 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:30:33.80) $1$DUA2: (VNA03) has been removed from shadow set. %%%%%%%%%%% OPCOM 18-MAY-1993 20:30:51.55 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:30:51.54) Message from user AUDIT$SERVER on VNACO1 Security alarm (SECURITY) and security audit (SECURITY) on VNACO1, system id: 49312 Auditable event: Audit server shutting down Event time: 18-MAY-1993 20:30:51.52 PID: 2640027C %%%%%%%%%%% OPCOM 18-MAY-1993 20:30:51.74 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:30:51.71) Message from user on VNACO1 _VNACO1$VTA2:, VNACO1 shutdown was requested by the operator. %%%%%%%%%%% OPCOM 18-MAY-1993 20:30:52.03 %%%%%%%%%%% Operator _VNACO1$VTA2: has been disabled, username .... %%%%%%%%%%% OPCOM 18-MAY-1993 20:31:56.86 %%%%%%%%%%% 20:31:09.79 Node VNACO1 (csid 00010032) has been removed from the VAXcluster Hhmmm, guess what? Regards Charly
2563.18	Simple problem - must wait though	IOSG::CHINNICK	gone walkabout	`Fri Jul 09 1993 10:51`	16
	OK... I think that this is a really simple problem... When the network is shutting down or FCS is trying to shutdown all of its active threads are signalled to terminate. [By MCC returning the ALERT_TERMREQ status.] The SrvSysManTimeOut thread simply doesn't understand this status and tries to continue execution after it has been told to die. It gets further errors - notably the FATAL_FW and loops interminably. We are planning to fix this but it will be some small time before we have a new FCS to ship. More details when they come to hand. Paul.
2563.19	Oh! Looks pretty!	VNABRW::EHRLICH_K	Ronnie James DIO, vocals!	`Fri Jul 09 1993 14:07`	1

Conference iosg::all-in-1_v30

2563.0. "FCS and %MCC-E-FATAL_FW..." by VNABRW::EHRLICH_K (With the Power &amp; the Glory) Wed Apr 14 1993 09:45

2563.0. "FCS and %MCC-E-FATAL_FW..." by VNABRW::EHRLICH_K (With the Power & the Glory) Wed Apr 14 1993 09:45