T.R | Title | User | Personal Name | Date | Lines |
---|
2563.1 | Half an answer | IOSG::STANDAGE | It's a Burgh kind of thing | Wed Apr 14 1993 12:18 | 36 |
|
Charly,
It looks as though there was a problem deleting a system management
thread which had exceeded its timeout period.
When a normal user performs an operation which invokes the server,
the task is undertaken by a server 'thread'. These threads exist
at server startup time and are 'woken up' when required for use, or
put to 'sleep' when they are not needed.
System Management threads are slightly different: for performance
reasons they remain 'awake' for a defined period of time after
becoming idle. This length of time is determined by the Session Timeout
value in the server configuration file (SM MFC MS R to look at this
value).
It seems that a request was made to delete the thread, but it was not
successful, and so you got an endless loop of repeated attempts which
were being logged.
I haven't seen this happen before, but as it occurred shortly after
upgrading to V3.0-1 it may well happen again soon. Alternatively, it
may never happen again - so please keep an eye on the size of the
server log files and let me know if you see this happening again - it's
one of those situations which is difficult to reproduce.
I'm sure if I've got any of this wrong I'll be corrected shortly ! :-)
Cheers,
Kevin.
|
2563.2 | Do you mean by --Bob? | VNABRW::EHRLICH_K | With the Power & the Glory | Wed Apr 14 1993 12:31 | 11 |
| Kevin,
thanx a lot for the explanation. You're right, I'll keep an eye on
the logfile on this node. It's my customer, so I have no problem
logging in whenever I (or you) need something 'special'.
Let's have a cup of tea and wait ....
Best regards & have a nice day
Charly
|
2563.3 | Sounds like a nasty problem | CHRLIE::HUSTON | | Wed Apr 14 1993 20:08 | 52 |
|
re .1
>I'm sure if I've got any of this wrong I'll be corrected shortly ! :-)
Is this soon enough? Actually it is mostly a technicality, but I feel
bad since I tried to explain this; obviously one of us missed
something :-)
>When a normal user performs an operation which invokes the server,
>the task is undertaken by a server 'thread'. These threads exist
>at server startup time and are 'woken up' when required for use, or
>put to 'sleep' when they are not needed.
>
>System Management threads are slightly different: for performance
>reasons they remain 'awake' for a defined period of time after
>becoming idle. This length of time is determined by the Session Timeout
>value in the server configuration file (SM MFC MS R to look at this
>value).
Close, very close, but not all correct. "Normal users", by which I mean
people coming in from IOS doing an OafcOpenCabinet (i.e. non-system
managers), will not use one of the background threads. Each request they
make has a thread created specifically for it; when the task is done,
the thread is killed. Background threads are not related to users at
all.
System management requests are different, but only slightly. Since
there is no explicit session from the user's point of view, there is
nothing for the user to terminate. What happens is that behind the scenes
the FCS does an open cabinet on behalf of the system manager and
assigns it a session. Then when they make a call to the FCS a thread is
created and killed just as before. What is timed out is the system
management session.
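In rough pseudo-C the model looks something like this - just a sketch with
invented names (worker_main, SysmanSession and so on); the real server is
built on the MCC framework, not pthreads:

    /* Sketch only - NOT the real FCS source. pthreads and every name
     * here are stand-ins, purely to illustrate "one short-lived thread
     * per request" plus a system-management session that is timed out
     * separately rather than closed by the user. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <time.h>

    typedef struct {                /* the sysman "session"                */
        int    cabinet_open;        /* opened by the FCS on his behalf     */
        time_t last_used;           /* refreshed on every call             */
    } SysmanSession;

    static void *worker_main(void *request_ctx)
    {
        /* ... perform the single task (open drawer, file document...) */
        free(request_ctx);          /* request context dies with the thread */
        return NULL;                /* the thread is "killed" by returning  */
    }

    /* Called for every request, normal user or system manager alike:
     * a thread is created for the task and goes away when it is done. */
    static int dispatch_request(void *request_ctx, SysmanSession *s)
    {
        pthread_t t;
        if (s != NULL)                      /* sysman call: keep the session */
            s->last_used = time(NULL);      /* alive, just refresh its clock */
        if (pthread_create(&t, NULL, worker_main, request_ctx) != 0)
            return -1;
        return pthread_detach(t);           /* nobody joins it later         */
    }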
>It seems that a request was made to delete the thread, but it was not
>successful, and so you got an endless loop of repeated attempts which
>were being logged.
This part is right. What I THINK it is trying to do is to kill
the background thread that periodically wakes up and times out
system management sessions. This would also explain why the shutdown
never finished. The thread is trying to kill itself but can't, so
it is looping. The server shutdown will not finish until all
background threads have shut down. Can't think of why this wouldn't
work though. One of those things that someone needs to find a way to
reproduce and then go in and poke around and see.
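For what it's worth, here is a tiny sketch (invented names again, not MCC)
of why a single background thread that never honours its terminate request
is enough to stop the whole shutdown from completing:

    #include <pthread.h>
    #include <unistd.h>

    #define N_BACKGROUND 3           /* buffer, scavenger, sysman timeout... */
    static pthread_t background[N_BACKGROUND];
    static volatile int shutdown_requested;  /* stands in for ALERT_TERMREQ  */

    static void *background_main(void *arg)
    {
        (void)arg;
        while (!shutdown_requested)      /* a well-behaved thread notices    */
            sleep(1);                    /* the request and leaves its loop  */
        return NULL;
    }

    static void server_shutdown(void)
    {
        shutdown_requested = 1;          /* ask every background thread to stop */
        for (int i = 0; i < N_BACKGROUND; i++)
            pthread_join(background[i], NULL);  /* blocks until each one exits;
                                                   a thread stuck looping never
                                                   does, so the shutdown never
                                                   finishes                    */
    }

    int main(void)
    {
        for (int i = 0; i < N_BACKGROUND; i++)
            pthread_create(&background[i], NULL, background_main, NULL);
        server_shutdown();
        return 0;
    }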
--Bob
|
2563.4 | New debugging aid, FYI | IOSG::TALLETT | Gimmee an Alpha colour notebook... | Fri Apr 16 1993 09:49 | 13 |
|
Just FYI, there's a great looking hack in the HACKERS conference
that allows you to force any process in the system to enter the
debugger. You run this privileged program, point it at the process
and you end up with a debug session on your terminal for the
process in question.
Might be useful for this type of problem, as you could jump in
when the problem occurs. Might be useful to build the server
with debug so you have a symbol table...
Regards,
Paul
|
2563.5 | and hand it over to you (:==:)! | VNABRW::EHRLICH_K | With the Power & the Glory | Tue Apr 20 1993 09:28 | 14 |
| Good morning,
sorry for the delay, but I've caught a terrible cold and my voice
is still gone even today.
So, I'm going to check in HACKERS for this tool, and if the problem
occurs again I'll use it. Then I'm going to hand the results over to
you, won't I?
FYI: The loop has never happened again up to today.
Best Regards and thank y'all
Charly_from_CSC_Vienna
|
2563.6 | Observed details on %MCC-E-FATAL_FW, fatal framework | GIDDAY::LEH | | Wed May 26 1993 09:10 | 51 |
| "Fatal framework..." errors have been occurring in a number of sites running
3.0-1 although the effects were either unknown or not recorded. One of them
was shared drawers losing their share status and the sharers unable to access
these drawers.
Attempt to restore REGULAR share status via MFC MD MDT was not successful
despite of the visual impression given when working in the involved form
FC$MDT. ACL setups didn't seem to be affected by these changes
The log file OAFC$SERVER.LOG expanded very quickly with %MCC-E-FATAL_FW
events, and the sequence was very much like:
11-MAR-1993 23:02:47.85 Server: ADL01V::"73="
Message: Startup for File Cabinet Server V1.0-2 complete
followed by (but not always on the same day):
11-MAR-1993 23:07:41.62 Server: ADL01V::"73="
Error: %DSL-W-SHUT, Network shutdown
Message: Shutting Down server, network failure.
followed by 3 or 4 %MCC-E-ALERT_TERMREQ events
11-MAR-1993 23:07:42.28 Server: ADL01V::"73="
Error: %MCC-E-ALERT_TERMREQ, thread termination requested
Message: SrvBufferProcess; receive alert to terminate thread
then followed by some 4,000+ MCC-E-FATAL_FW events happening in
around 10 minutes
11-MAR-1993 23:07:44.11 Server: ADL01V::"73="
Error: %MCC-E-FATAL_FW, fatal framework condition: !AS
Message: SrvTimeoutSysMan; receive alert to terminate thread
Sometimes the FC server startup encountered
15-MAR-1993 23:05:30.12 Server: ADL01V::"73=" Error: %OAFC-W-NETLOST,
The network connection to the File Cabinet Server was lost
Message: Network lost, server config record not updated.
but seemed to recover with %DSL-W-SHUT followed by another FC server startup,
which occurred immediately after.
On the system where the above details were collected, %MCC-E-FATAL_FW events
happened on the same day 3.0-1 was put on; in the 4 previous months running
3.0, there was not a single event of the same type.
Thanks for any comments
Hong
CSC Sydney
|
2563.7 | Seen it, but not easy to reproduce... | IOSG::STANDAGE | | Wed May 26 1993 11:52 | 21 |
|
Hong,
Yes, I've seen this too, but only in the following instance.
One of our machines was being shut down, and for some reason the
%MCC-E-FATAL_FW error was logged multiple times by the server just
before everything went dead. Usually you do just get the DSL shutdown
message and the FCS dies quietly. I've never seen this message produced
during the normal day-to-day life of the server, so I cannot determine
its impact for sure. Reproducing such things is a very
difficult task - but we have bugged the problem and we're monitoring
the frequency at which it happens...
I'll append your comments to the bug report.
Thanks,
Kevin.
|
2563.8 | Not sure if this is it or not | CHRLIE::HUSTON | | Wed May 26 1993 14:10 | 25 |
|
In a normal, by-the-book shutdown you won't get the severe MCC errors
that you are seeing, just the ones about requests to terminate
threads. (Those aren't really errors - that is the FCS killing all
the background threads; they are logged as errors because the internal
MCC workings notify a thread to commit suicide by waking it up with
an error code.)
I THINK the reason that you are getting the severe errors is this:
the network is shutting down (for whatever reason - DASL problem,
system shutdown, DECnet shutdown, whatever). When DASL is told
to shut down, it notifies all its current applications, FCS being one
of them, to stop immediately. If you have a task currently executing,
the task will be very un-nicely shot, potentially in mid-task; this
may be what is causing the MCC errors you are seeing. You can
potentially have several threads per task, and each could have
multiple mutexes locked. When the FCS is told to stop by DASL, the
FCS does not have the chance to nicely run down tasks - it can't, the
network is about to go down, and there is no mechanism for the FCS
to ask DASL to wait a few minutes before it dies.
Not sure if that is the cause, but it would explain it.
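A completely generic illustration of the hazard (nothing below is DASL or
FCS code, every name is made up): a worker shot in mid-task leaves its
mutex locked, and anything that touches that lock afterwards hangs or sees
half-updated state.

    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t drawer_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&drawer_lock);    /* mid-task, structures half updated */
        sleep(60);                           /* ... long-running work ...         */
        pthread_mutex_unlock(&drawer_lock);  /* never reached if we are shot      */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        sleep(1);                    /* "network is going down, stop NOW"   */
        pthread_cancel(t);           /* the worker dies holding drawer_lock */
        pthread_join(t, NULL);
        pthread_mutex_lock(&drawer_lock);    /* any later access hangs here, or */
        return 0;                            /* works on inconsistent state     */
    }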
--Bob
|
2563.9 | what about loss of regular sharing mode ? | GIDDAY::LEH | | Wed May 26 1993 14:38 | 18 |
| Kevin and Bob in .7 and .8
Thanks for shedding light on the possible cause. Will keep an eye on
this error.
Today I saw similar incidents at another site that has been on 3.0-1 for
about 2 weeks, where regular shared drawers went back to non-share mode
and the only known fix was to restart the FCS.
We're trying to collect from the customers any unusual tasks they'd
done or any observations they may have made, but if this keeps happening,
extreme hardship will be felt and pressure will mount.
Apart from tracing, any other advice?
Thanks
Hong
|
2563.10 | How's the network? Is it stable? | CHRLIE::HUSTON | | Wed May 26 1993 18:49 | 22 |
|
re .9
>Today I saw similar incidents at another site that has been on 3.0-1 for
>about 2 weeks, where regular shared drawers went back to non-share mode
>and the only known fix was to restart the FCS.
It sounds like the internal drawer cache is getting corrupted, especially
if you fix it by restarting the FCS. The FCS maintains a cache of
security information for open drawers. How long did you wait before
restarting the FCS once you noticed the drawer sharing type had
changed? The reason I ask is that this security information will time
out fairly quickly (the default is 10 minutes, I think). If the problem is
the cache, then the timeout will fix it.
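Roughly the behaviour I mean, as a sketch with invented names and layout
(this is not the real FCS drawer cache, and the 600-second figure is just
the ~10 minute default I mentioned):

    #include <string.h>
    #include <time.h>

    #define AUTH_TIMEOUT 600            /* seconds, i.e. roughly 10 minutes */

    typedef struct {
        char   drawer[64];
        int    share_mode;              /* e.g. REGULAR vs non-shared       */
        time_t cached_at;
    } DrawerAuthEntry;

    /* Returns 1 and fills *mode if a fresh cache entry exists; returns 0
     * when the entry is missing or has timed out, forcing a re-read from
     * the drawer itself - which would also repair a corrupted entry.
     * A corrupted entry that never times out keeps answering with the
     * wrong share_mode until the server is restarted. */
    static int auth_cache_lookup(const DrawerAuthEntry *cache, int n,
                                 const char *drawer, int *mode)
    {
        time_t now = time(NULL);
        for (int i = 0; i < n; i++) {
            if (strcmp(cache[i].drawer, drawer) == 0 &&
                now - cache[i].cached_at < AUTH_TIMEOUT) {
                *mode = cache[i].share_mode;
                return 1;
            }
        }
        return 0;
    }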
Other advice? Nope. I'm also not sure if tracing will help; all it
will tell you is what actions the FCS was told to do, it won't
tell you who mangled the internal structures.
--Bob
|
2563.11 | Cache is not being flushed after 10 mins | BUSHIE::SETHI | Ahhhh (-: an upside down smile from OZ | Fri May 28 1993 02:51 | 27 |
| Hi Bob,
>It sounds like the internal drawer cache is getting corrupted, especially
>if you fix it by restarting the FCS. The FCS maintains a cache of
>security information for open drawers. How long did you wait before
>restarting the FCS once you noticed the drawer sharing type had
>changed? The reason I ask is that this security information will time
>out fairly quickly (the default is 10 minutes, I think). If the problem is
>the cache, then the timeout will fix it.
I am working on the same problem at another site and I can confirm that
the customer waited a few hours before stopping the FCS. So it looks
like the cache is not getting flushed.
I have a copy of the logfile on
RIPPER::USER$TSC:[SETHI]OAFC$SERVER.DIGITAL_COPY and the file
protection is set to w:r. This logfile has a number of problems
reported and I will reply to notes that are already discussing the
problem.
The customer has tuned his FCS as per the manual and still has the
problem. I would be grateful for any feedback you may have. Is there
anything you want me to do to get you extra information?
Regards,
Sunil
|
2563.12 | That's really strange | CHRLIE::HUSTON | | Fri May 28 1993 15:53 | 9 |
|
Can you post the log file, and also do an SM MFC MS R and see
what the authorization timeout is?
It is possible that the fatal thread error is the authorization
timeout thread.
--Bob
|
2563.13 | Preferably a *pointer* to the log file :-) | IOSG::PYE | Graham - ALL-IN-1 Sorcerer's Apprentice | Fri May 28 1993 18:15 | 0 |
2563.14 | All the info you required | ROMA::TEP | | Tue Jun 01 1993 03:17 | 23 |
| Hi Bob,
The server attributes are as follows for VAX1, VAX6 and VAX8 in the
cluster:
Authorized Timeout: 600 Max Client Connects: 512
Distribution: OFF Max # of Drawers: 140
Drawer Cache: 50 Object Number: 73
Drawer Timeout: 600 Session Timeout: 1200
Drawer Collisions: 0
>Can you post the log file
As per my previous note the logfile can be found on
RIPPER::USER$TSC:[SETHI]OAFC$SERVER.DIGITAL_COPY;1, the protection is
set to w:r.
Regards,
Sunil
NB - It's nice working in Nashua taking calls from customers (:==:)!!!
I am sure Andrew will know why ???!!!!
|
2563.15 | ANOTHER OCCURRENCE OF MCC-E-FATAL_FW | AIMTEC::GRENIER_J | | Mon Jun 07 1993 15:50 | 11 |
| One of our customers received the error about MCC-E-FATAL_FW,
"Fatal Framework Condition", in his log files when checking to
see why the FCS did not start after running EW??? He shuts
down ALL-IN-1 for EW, and for the last two weeks the FCS has not
started on any of his three nodes. Prior to this it was working.
That was the only "error type" message he could find in his
log files.
Hints???
Jean
|
2563.16 | Hhhmmm, at our cluster, too | VNABRW::EHRLICH_K | Health & a long life with you | Mon Jun 07 1993 15:58 | 12 |
| Hi,
last week I detected the same at our CSC cluster here in Vienna.
BUT the customer mentioned in .0 has never run into this behaviour
again. AND I'm really at his site one day every week.
But our cluster isn't very stable right now. We have big problems with
the heat in the room, so one or several nodes are 'leaving' the cluster
every day. This could be the problem with the network!
I don't have any more ideas here. Sorry.
Regards
Charly_from_CSC_Vienna
|
2563.17 | Some more infos for y'all! | VNABRW::EHRLICH_K | Health & a long life with you | Mon Jun 21 1993 12:36 | 81 |
| Hi Kevin&Bob
I've done some investigation into %MCC-E-FATAL_FW.
Maybe this can help you a little bit more.
sys$manager:oafc$server.log
---------------------------
...
18-MAY-1993 20:24:50.92 Server: VNACO1::"73=" Error: %DSL-W-SHUT, Network shutdown Message: Shutting Down server, network failure.
18-MAY-1993 20:24:52.43 Server: VNACO1::"73=" Error: %MCC-E-ALERT_TERMREQ, thread termination requested Message: SrvBufferProcess; receive alert to terminate thread
18-MAY-1993 20:24:52.95 Server: VNACO1::"73=" Error: %MCC-E-ALERT_TERMREQ, thread termination requested Message: SrvMemoryScavenge; receive alert to terminate thread
18-MAY-1993 20:24:54.80 Server: VNACO1::"73=" Error: %MCC-E-ALERT_TERMREQ, thread termination requested Message: CsiCacheBlockAstService; Error from mcc_astevent_receive
18-MAY-1993 20:24:55.08 Server: VNACO1::"73=" Error: %MCC-E-ALERT_TERMREQ, thread termination requested Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:24:55.17 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:24:55.26 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:24:55.36 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:24:55.53 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread
...
18-MAY-1993 20:30:13.71 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:30:13.84 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:30:14.01 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:30:14.15 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread
18-MAY-1993 20:30:14.33 Server: VNACO1::"73=" Error: %MCC-E-FATAL_FW, fatal framework condition: !AS Message: SrvTimeoutSysMan; receive alert to terminate thread
19-MAY-1993 08:37:01.41 Server: VNACO1::"73=" Message: Startup for File Cabinet Server V1.0-2 complete
sys$manager:operator.log
------------------------
%%%%%%%%%%% OPCOM 18-MAY-1993 20:24:51.03 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:24:50.89)
Message from user DFS$COM_ACP on VNACO1
%DFS-I-STOPPING, DFS/COM Stopping
%%%%%%%%%%% OPCOM 18-MAY-1993 20:24:51.04 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:24:50.91)
Message from user DFS$00010001 on VNACO1
%DFS-I-SRVEXIT, Server exiting (DFS$00010001_1) 18-MAY-1993 20:24:50.87
%%%%%%%%%%% OPCOM 18-MAY-1993 20:29:40.68 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:29:40.67)
Message from user UCX$NFS on VNACO1
%UCX-I-NFS_DISERR, Disable writing into NFS errlog file
%%%%%%%%%%% OPCOM 18-MAY-1993 20:29:52.70 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:29:52.69)
Message from user INTERnet on VNACO1
INTERnet ACP Remote Terminal Services STOP - RLOGIN
%%%%%%%%%%% OPCOM 18-MAY-1993 20:29:52.72 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:29:52.71)
Message from user INTERnet on VNACO1
INTERnet ACP Remote Terminal Services STOP - TELNET
%%%%%%%%%%% OPCOM 18-MAY-1993 20:29:56.65 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:29:56.64)
Message from user INTERnet on VNACO1
INTERnet Shutdown
%%%%%%%%%%% OPCOM 18-MAY-1993 20:30:33.81 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:30:33.79)
$1$DUA0: (VNA03) has been removed from shadow set.
%%%%%%%%%%% OPCOM 18-MAY-1993 20:30:33.83 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:30:33.80)
$1$DUA2: (VNA03) has been removed from shadow set.
%%%%%%%%%%% OPCOM 18-MAY-1993 20:30:51.55 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:30:51.54)
Message from user AUDIT$SERVER on VNACO1
Security alarm (SECURITY) and security audit (SECURITY) on VNACO1, system id: 49312
Auditable event: Audit server shutting down
Event time: 18-MAY-1993 20:30:51.52
PID: 2640027C
%%%%%%%%%%% OPCOM 18-MAY-1993 20:30:51.74 %%%%%%%%%%% (from node VNACO1 at 18-MAY-1993 20:30:51.71)
Message from user on VNACO1
_VNACO1$VTA2:, VNACO1 shutdown was requested by the operator.
%%%%%%%%%%% OPCOM 18-MAY-1993 20:30:52.03 %%%%%%%%%%%
Operator _VNACO1$VTA2: has been disabled, username
....
%%%%%%%%%%% OPCOM 18-MAY-1993 20:31:56.86 %%%%%%%%%%%
20:31:09.79 Node VNACO1 (csid 00010032) has been removed from the VAXcluster
Hhmmm, guess what?
Regards
Charly
|
2563.18 | Simple problem - must wait though | IOSG::CHINNICK | gone walkabout | Fri Jul 09 1993 11:51 | 16 |
|
OK... I think that this is a really simple problem...
When the network is shutting down, or the FCS is trying to shut down,
all of its active threads are signalled to terminate. [By MCC returning
the ALERT_TERMREQ status.]
The SrvSysManTimeOut thread simply doesn't understand this status and
tries to continue execution after it has been told to die. It gets
further errors - notably the FATAL_FW - and loops interminably.
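In sketch form (the status names and helper routines below are invented
stand-ins, not real MCC calls), the missing check looks something like this:

    #include <stdio.h>

    enum { STATUS_OK, STATUS_ALERT_TERMREQ };

    static int calls;
    static int wait_for_timer_or_alert(void)   /* stub: pretend the shutdown */
    {                                          /* alert arrives on wait #3   */
        return ++calls < 3 ? STATUS_OK : STATUS_ALERT_TERMREQ;
    }

    static void timeout_idle_sysman_sessions(void)
    {
        puts("sweeping idle system-management sessions");
    }

    static void sysman_timeout_thread(void)
    {
        for (;;) {
            int status = wait_for_timer_or_alert();
            if (status == STATUS_ALERT_TERMREQ) /* THE check: treat "terminate" */
                return;                         /* as an instruction to exit.   */
                                                /* Treat it as just another     */
                                                /* error instead and the thread */
                                                /* logs FATAL_FW and loops      */
                                                /* forever after being told to  */
                                                /* die.                         */
            timeout_idle_sysman_sessions();
        }
    }

    int main(void) { sysman_timeout_thread(); return 0; }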
We are planning to fix this, but it will be a little while before we
have a new FCS to ship. More details when they come to hand.
Paul.
|
2563.19 | Oh! Looks pretty! | VNABRW::EHRLICH_K | Ronnie James DIO, vocals! | Fri Jul 09 1993 15:07 | 1 |
|
|