T.R | Title | User | Personal Name | Date | Lines |
---|
2353.1 | No answers, just suggestions. | IOSG::STANDAGE | Oink...Oink...Mooooooooooooooooooooooooooooooooo | Thu Mar 04 1993 09:49 | 33 |
|
Sunil,
What exactly are the last few messages in OAFC$SERVER.LOG ? These
should indicate if the server process terminated via some 'normal'
reason, or whether something a little more unusual is going on.
For instance, when ALL-IN-1 is shut down (and hence the server), the
following messages are written to the log file prior to the server
process stopping :
3-MAR-1993 16:59:21.37 Server: TRON::"73="
Error: %MCC-E-ALERT_TERMREQ, thread termination requested
Message: CsiCacheBlockAstService; Error from mcc_astevent_receive
3-MAR-1993 16:59:26.13 Server: TRON::"73="
Error: %MCC-E-ALERT_TERMREQ, thread termination requested
Message: SrvTimeoutSysMan; receive alert to terminate thread
Are you running housekeeping procedures which shut down ALL-IN-1, so
that the problem occurs because they are not being restarted properly ?
If there are no hints or clues in the log file, I think you need to find
out when the server dies, and whether there's any consistency. Usually the
log file will indicate if the server is unhappy.
Kevin.
|
2353.2 | multiple object 73's? | CHRLIE::HUSTON | | Thu Mar 04 1993 14:41 | 23 |
|
Sunil,
as Kevin said, the server should not just "die". If it is being
shut down nicely by someone, there will be several log messages
in oafc$server.log about thread termination requested. If these are
there, someone is telling the FCS to shut down.
If there is nothing there, other than startup messages, then my
guess is that someone is doing a stop/id=FCS_PID. Another
possibility (not sure how this would work) is that someone else
is starting something up as DECnet object 73, either another server
or some other application. Not sure what the effects of this would
be, but having multiple applications up with the same object number is
bad.
If you can get some sort of guess as to when the process goes away,
it would help. Turn tracing on just before that and see what happens.
Sorry we can't give you more to go on.
--Bob
|
2353.3 | More info | BUSHIE::SETHI | Man from Downunder | Fri Mar 05 1993 00:09 | 36 |
| Hi Bob and Kevin,
Having looked at the server log and your example, there does seem to be
a difference. The users were unable to access their shared drawers at
13:30 yesterday, and here is part of the log:
3-MAR-1993 06:29:39.30 Server: AUTC01::"73=" Message: Startup for
File Cabinet Server V1.0 complete
3-MAR-1993 22:57:24.14 Server: AUTC01::"73=" Error: %DSL-W-SHUT,
Network shut down Message: Shutting Down server, network failure.
4-MAR-1993 10:04:04.47 Server: AUTC01::"73=" Message: Startup for
File Cabinet Server V1.0 complete
4-MAR-1993 13:29:38.13 Server: AUTC01::"73=" Message: Startup for
File Cabinet Server V1.0 complete
The server was started at 4-MAR-1993 10:04:04.47, died at some point after
that, and the customer restarted it at 4-MAR-1993 13:29:38.13. There are no
error messages in the log file to point to the reason for the failure. Please
note the customer reboots his system every night at 11:00 pm.
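For anyone checking a longer OAFC$SERVER.LOG by hand, the pattern described above (two startup messages with nothing logged between them) can be picked out mechanically. The following is only a minimal Python sketch: it assumes the entry layout seen in the excerpt in this note, and the function names `parse_entries` and `unexplained_restarts` are made up for illustration.

```python
import re
from datetime import datetime

# Assumed entry layout, taken from the excerpt in this note:
#   <dd-MMM-yyyy hh:mm:ss.cc> Server: <node>::"73=" [Error: ...] Message: ...
STAMP = re.compile(r"(\d{1,2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{2})")
STARTUP_TEXT = "Startup for File Cabinet Server"

def parse_entries(log_text):
    """Split raw log text into (timestamp, body) pairs, joining wrapped lines."""
    flat = " ".join(log_text.split())
    parts = STAMP.split(flat)  # [prefix, stamp1, body1, stamp2, body2, ...]
    entries = []
    for stamp, body in zip(parts[1::2], parts[2::2]):
        # Month names are parsed with English abbreviations (default C locale).
        when = datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S.%f")
        entries.append((when, body.strip()))
    return entries

def unexplained_restarts(entries):
    """Return (earlier_startup, later_startup) pairs with nothing logged between."""
    gaps = []
    last_startup = None  # a startup that has had no entry logged after it yet
    for when, body in entries:
        if STARTUP_TEXT in body:
            if last_startup is not None:
                gaps.append((last_startup, when))
            last_startup = when
        else:
            last_startup = None  # a shutdown or error entry explains the stop
    return gaps

# The four entries quoted in this note, wrapped as they were posted.
SAMPLE_LOG = '''
3-MAR-1993 06:29:39.30 Server: AUTC01::"73=" Message: Startup for
File Cabinet Server V1.0 complete
3-MAR-1993 22:57:24.14 Server: AUTC01::"73=" Error: %DSL-W-SHUT,
Network shut down Message: Shutting Down server, network failure.
4-MAR-1993 10:04:04.47 Server: AUTC01::"73=" Message: Startup for
File Cabinet Server V1.0 complete
4-MAR-1993 13:29:38.13 Server: AUTC01::"73=" Message: Startup for
File Cabinet Server V1.0 complete
'''

for earlier, later in unexplained_restarts(parse_entries(SAMPLE_LOG)):
    print("Server vanished between", earlier, "and", later)
```

The sketch treats any entry that is not a startup message as an explanation for the stop, which is a simplification; a real check might look specifically for the shutdown and thread termination messages.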
I have asked the customer to enable accounting so that I can get
extra information. I have copied the log file to RIPPER::Q30178.LOG_2;
it may have something in there that I just did not pick up. Hopefully
either the server trace or the accounting will pick up something.
Finally the customer has assured me that they do not have other
applications running on the system therefore object 73 is not being
used for anything else.
Thanks for your advice, I will keep you posted,
Sunil
|
2353.4 | I'll look at the log | IOSG::STANDAGE | Oink...Oink...Mooooooooooooooooooooooooooooooooo | Fri Mar 05 1993 09:22 | 25 |
|
Sunil,
As you said, the system is rebooted at 11pm, so that explains the
"Error: %DSL-W-SHUT,Network shut down" message. As the system is about
to go away the server shuts itself down.
So it appears that the problem occurs between the last two startup
messages. As nothing else has been logged the server certainly did not
die from natural causes, at least it doesn't seem that way. Even if
someone is doing something to seriously upset the server, some form of
message would appear in the log.
When I get time I'll take a look at the log you have provided. The next
step is probably to see if the server seems to go away around the same
time each day.
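One rough way to check for a time-of-day pattern is to bucket the observed failure times by hour. This is only a sketch in Python: the function name `failures_by_hour` is made up, and the sample times are hypothetical placeholders, since this thread reports just one failure (noticed at about 13:30 on 4-MAR-1993).

```python
from collections import Counter
from datetime import datetime

def failures_by_hour(failure_times):
    """Count observed failures per hour of day (0-23)."""
    return Counter(t.hour for t in failure_times)

# Hypothetical observations, invented for illustration only. Only the first
# one corresponds to anything reported in this thread.
observed = [
    datetime(1993, 3, 4, 13, 30),
    datetime(1993, 3, 5, 13, 25),
    datetime(1993, 3, 8, 9, 10),
]

for hour, count in sorted(failures_by_hour(observed).items()):
    print(f"{hour:02d}:00  {count} failure(s)")
```

A cluster in one hour would point towards something scheduled, such as a housekeeping or batch job, rather than a random crash.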
At the moment, the only way I can see this happening is if someone did
a STOP PROC/ID of the process.
Kevin.
|
2353.5 | STOP/ID writes messages to the log file | SCOTTC::MARSHALL | Spitfire Drivers Do It Topless | Fri Mar 05 1993 10:15 | 7 |
| Re: STOP/ID
When I do that, several "thread termination" messages get written to the log
file. So it doesn't look like anyone's doing that (unless they also lock the
log file first to stop the server writing to it! :-)
Scott
|
2353.6 | You won't always get "thread termination" | IOSG::STANDAGE | Oink...Oink...Mooooooooooooooooooooooooooooooooo | Fri Mar 05 1993 11:24 | 11 |
|
Scott,
This isn't always the case. It very much depends on what is happening
on the system at the time. I just did this on a test machine (server
state "HIB") - and no thread termination messages were produced.
Kevin.
|
2353.7 | run the server in the foreground | CHRLIE::HUSTON | | Fri Mar 05 1993 15:37 | 48 |
|
I can think of 2 ways to have the server go away with no message:
1) stop/id -- I have never seen it log a message, the process is
stopped immediately so it won't have enough time to write a message.
This is usually how we stop servers during our testing.
2) The server itself access violated. The server runs as two layers.
The bottom layer does about 98% of the work, and any access violation
at this level will be written to the log file via a condition handler.
The upper layer does all the DASL and DECnet interaction; it has no
condition handler and runs at AST level. These routines are called
by DASL in response to certain DASL events such as receiving a
DASL message. Unfortunately, since the server runs as a detached
process, if this layer access violates the process will silently go
away. A problem at this layer could be in either the server or
DASL. Do you know what version of DASL they are using? The FCS
ships with V2.0. I know that there is a V2.2; we have not tested
against it, and theoretically it should work due to backwards
compatibility, but who knows, maybe there is a problem.
What can you do next?
Start the server in the foreground, not through ALL-IN-1. Do the
following:
$ A1FCS :== $sys$system:oafc$server.exe
$ A1FCS your_configuration_file.dat
To get your config file name, go to the MS menu and do an R on the
server; it will show you the config file.
Note that when you start the server up like this, the server is running
in the context of the process you issue the command from. Your best choice
for this is to log into the OAFC$SERVER account (made during
installation). You may have to mess around in the UAF record to
allow logins, since the account is installed as DISUSER'd. If this is
not do-able, the next best choice is the ALLIN1 account or SYSTEM;
either should have suitable privs and quotas to run the server.
When you do this, if the server access violates at the top level, you
will see the access violation on the screen, please save it and either
send it to me or post it here.
Thanks
--Bob
|
2353.10 | Changed some sysuaf parameters and monitoring | BUSHIE::SETHI | Man from Downunder | Fri Mar 12 1993 05:45 | 40 |
| Hi All,
The customer had the problem reoccur yet again and we had accounting
enabled but the customer forgot to turn on tracing (makes me feel
grumpy 8*{).
The accounting file did not have a record for the process, nor did the
OAFC$SERVER.LOG file. I also did an analyze/error/include=bugcheck and
found nothing.
I then audited the OAFC$SERVER account and the SYSTEM account and found
the following:
mod OAFC$SERVER/BIOlm=50/DIOlm=50/astlm=100/TQElm=50/enqlm=300. In other
words :-) these quotas had been one fifth of what I changed them to. The
SYSTEM account did not have the OA$MANAGER identifier. I don't know if
it is required, but I granted it to match my own system.
I asked the customer to reboot the system and he did so during the
lunch hour. So far he has not reported any problems and it seems that
this is the first time after a reboot he has not had any minor or major
problems. I will monitor the system and report back any findings.
One thing though: why has the accounting file not got an entry for the
process starting and stopping ? Accounting was enabled before ALL-IN-1
was started.
One last question Bob ;-),
What is DASL ? How do I find out what version the customer has
installed ?
>$ A1FCS :== $sys$system:oafc$server.exe
>$ A1FCS your_configuration_file.dat
I did all of this no stack dumps etc.
Regards,
Sunil
|
2353.11 | DASL = DECnet i/f; Care with Trace file size... | CHRLIE::HUSTON | | Fri Mar 12 1993 13:37 | 28 |
| DASL is Distributed Service Application Layer. It is a protocol that
sits on top of DECnet; the FCS uses it for all its DECnet work, which saves
us from needing to make DECnet calls directly. DASL is not shipped as a
product; if a shipping product needs it (like the FCS) then it is up to that
product to supply DASL. We include V2.0 in the kits, so they have at
least V2.0.
Ok, if this never stack dumped, did it simply go away? You said the
server went away again, was there no message at this terminal?
Running the server in this manner simply runs the server as the
foreground process rather than as a detached process. If you run
the server in this way and it access violates outside the scope of
the condition handler, then you would see the access violation. If
the process simply died, not sure how, then what you would probably
see is the startup message, then a '$' saying you were done and back
at DCL.
Before you do this, please go into ALL-IN-1 and stop the server that
ALL-IN-1 starts, else all kinds of fun things happen.
Also, if you cannot narrow down a time or circumstance that the server
goes away on, I do not recommend turning tracing on. Each trace record
is 1024 bytes and each request to the server takes an ABSOLUTE MINIMUM
of 2 trace records or 2048 bytes. Most events take more than 2 trace
records. So running the FCS with tracing on all the time is rather
disk intensive.
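To put rough numbers on that, here is a small Python sketch of the lower bound on trace file growth. The record size and the two-records-per-request minimum come from the note above; the request count is a purely hypothetical figure, and the 512-byte block size is the standard VMS disk block.

```python
TRACE_RECORD_BYTES = 1024      # size of one trace record, per the note above
MIN_RECORDS_PER_REQUEST = 2    # absolute minimum per request, per the note above
DISK_BLOCK_BYTES = 512         # standard VMS disk block size

def min_trace_bytes(requests):
    """Lower bound on trace file size in bytes for a number of requests."""
    return requests * MIN_RECORDS_PER_REQUEST * TRACE_RECORD_BYTES

def min_trace_blocks(requests):
    """The same lower bound expressed in disk blocks, rounded up."""
    return -(-min_trace_bytes(requests) // DISK_BLOCK_BYTES)

# Hypothetical load, chosen only to illustrate the arithmetic.
requests_per_day = 10_000
print(min_trace_bytes(requests_per_day), "bytes,",
      min_trace_blocks(requests_per_day), "blocks per day at minimum")
```

Since most events take more than two trace records, real usage would be higher than this figure.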
--Bob
|
2353.12 | | BUSHIE::SETHI | Man from Downunder | Tue Mar 16 1993 04:32 | 19 |
| G'day All,
The problem has been solved. Basically it was a bit of this and a bit
of that :-).
The problem was caused by an in-house process-killing job running on a
batch queue. Aaaaahhhhh !!!! I had asked the customer many a time if a
stop/id= was being done on the process and he said "No".
The lesson of this hair pulling story is:
1. Never trust a customer when he says no to the obvious question
2. Show system does not always show process killers, especially when their
process names have not been set.
3. Process killers can run on batch queues
Thanks to all of you for your help,
Sunil
|