Conference iosg::all-in-1_v30

Title:	OLD ALL-IN-1 (tm) Support Conference
Notice:	Closed - See Note 4331.l to move to IOSG::ALL-IN-1
Moderator:	IOSG::PYE

Created:	Thu Jan 30 1992
Last Modified:	Tue Jan 23 1996
Last Successful Update:	Fri Jun 06 1997
Number of topics:	4343
Total number of notes:	18308

2353.0. "The file cabinet server disappears everyday" by BUSHIE::SETHI (Man from Downunder) Thu Mar 04 1993 04:26

    Hi All,
    
    A customer of mine has problems with the file cabinet server in that it
    disappears and the users get the error message:
     
    "%OAFC-I-SRVNOTAVAIL,file cabinet server is not available"
    
    Sure enough the process <node_name>$srv73 is no longer there.  I have
    examined the OAFC$SERVER.LOG and OAFC$SERVER_ERROR.LOG logs and there is
    nothing in there indicating what the problem is.  I asked the customer
    to turn on the server tracing options.  I examined the trace file by
    doing the following:
    
        $ format := $sys$system:oafc$print_trace_log.exe
        $ define sys$output trace.out
        $ format OA$DATA_SHARE:node$SERVER73_TRACE.DAT
        $ deass sys$output
    
    This contained the following:
    
    OAFC FUNCTION: OafcCloseCabinetW
    TRACE EVENT: Task Complete
    EVENT TIME:  4-MAR-1993 14:11:08.78
    FILE CABINET NAME: AUTC01.LINFORTH
    STATUS: 55803913
    STRING1 IS: GJL
    
    
    SESSION ID:  4457328
    TRACE EVENT: Disconnect Done
    EVENT TIME:  4-MAR-1993 14:11:08.89
    FILE CABINET NAME: AUTC01.LINFORTH
    
    These are the last few lines, and they appear to be normal.
    
    What else can I do to get extra information to help me track down this
    problem?  It happens nearly every day and I am running out of ideas,
    so any help would be gratefully accepted.
    
    Thanks in advance,
    
    Sunil

T.R	Title	User	Personal Name	Date	Lines
2353.1	No answers, just suggestions.	IOSG::STANDAGE	Oink...Oink...Mooooooooooooooooooooooooooooooooo	`Thu Mar 04 1993 09:49`	33
    Sunil,

    What exactly are the last few messages in OAFC$SERVER.LOG?  These should
    indicate whether the server process terminated for some 'normal' reason,
    or whether something a little more unusual is going on.

    For instance, when ALL-IN-1 is shut down (and hence the server), the
    following messages are written to the log file before the server
    process stops:

        3-MAR-1993 16:59:21.37  Server: TRON::"73="
            Error: %MCC-E-ALERT_TERMREQ, thread termination requested
            Message: CsiCacheBlockAstService; Error from mcc_astevent_receive

        3-MAR-1993 16:59:26.13  Server: TRON::"73="
            Error: %MCC-E-ALERT_TERMREQ, thread termination requested
            Message: SrvTimeoutSysMan; receive alert to terminate thread

    Are you running housekeeping procedures which shut down ALL-IN-1, with
    the problem occurring because ALL-IN-1 and the server are not being
    started up properly afterwards?

    If there are no hints or clues in the log file, I think you need to find
    out when the server dies, and whether there's any consistency.  Usually
    the log file will indicate if the server is unhappy.

    Kevin.
2353.2	multiple object 73's?	CHRLIE::HUSTON		`Thu Mar 04 1993 14:41`	23
    Sunil,

    As Kevin said, the server should not just "die".  If it is being shut
    down nicely by someone, there will be several log messages in
    oafc$server.log about thread termination requested.  If these are there,
    someone is telling the FCS to shut down.

    If there is nothing there other than startup messages, then my guess is
    that someone is doing a stop/id=FCS_PID.  Another possibility (I am not
    sure how this would work) is that someone else is starting something up
    as DECnet object 73, either another server or some other application.
    I am not sure what the effects of this would be, but having multiple
    applications up with the same object number is bad.

    If you can get some sort of guess as to when the process goes away, it
    would help.  Turn tracing on just before that and see what happens.

    Sorry we can't give you more to go on.

    --Bob
2353.3	More info	BUSHIE::SETHI	Man from Downunder	`Fri Mar 05 1993 00:09`	36
    Hi Bob and Kevin,

    Having looked at the server log and your example, there does seem to be
    a difference.  The users were unable to access their shared drawers at
    13:30 yesterday, and here is part of the log:

        3-MAR-1993 06:29:39.30  Server: AUTC01::"73="
            Message: Startup for File Cabinet Server V1.0 complete

        3-MAR-1993 22:57:24.14  Server: AUTC01::"73="
            Error: %DSL-W-SHUT, Network shut down
            Message: Shutting Down server, network failure.

        4-MAR-1993 10:04:04.47  Server: AUTC01::"73="
            Message: Startup for File Cabinet Server V1.0 complete

        4-MAR-1993 13:29:38.13  Server: AUTC01::"73="
            Message: Startup for File Cabinet Server V1.0 complete

    The server was started at 4-MAR-1993 10:04:04.47, died at some point
    after that, and was restarted by the customer at 4-MAR-1993 13:29:38.13.
    There are no error messages in the logfile to point to the reason for
    the failure.  Please note the customer reboots his system every night at
    11:00 pm.

    I have asked the customer to enable accounting so that I can get extra
    information.  I have copied the logfile to RIPPER::Q30178.LOG_2; it may
    have something in there that I just did not pick up.  Hopefully either
    the server trace or the accounting will pick up something.

    Finally, the customer has assured me that they do not have other
    applications running on the system, therefore object 73 is not being
    used for anything else.

    Thanks for your advice, will keep you posted,

    Sunil
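    The check Sunil did by eye (two startup messages with nothing logged in
    between) can be done offline on a copy of the log.  The following is a
    minimal Python sketch, not part of ALL-IN-1 or the FCS kit.  It assumes
    the entry layout shown in the excerpt above; the names parse_entries,
    silent_gaps and SHUTDOWN_MARKERS are made up for illustration, and the
    marker strings are taken only from the messages quoted in this topic.

```python
import re
from datetime import datetime

# VMS-style absolute time as it appears in the log excerpt,
# e.g. "4-MAR-1993 10:04:04.47".
TS_SPLIT = re.compile(r"(\d{1,2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{2})")
MONTHS = {m: i + 1 for i, m in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split())}

# Assumption: these strings mark a logged (non-silent) shutdown.
# They come from the messages quoted in this topic, nothing more.
SHUTDOWN_MARKERS = ("Shutting Down server", "thread termination requested")
STARTUP_MARKER = "Startup for File Cabinet Server"


def parse_vms_time(stamp):
    """Convert 'D-MON-YYYY HH:MM:SS.CC' into a datetime."""
    date_part, time_part = stamp.split(" ")
    day, mon, year = date_part.split("-")
    hh, mm, rest = time_part.split(":")
    ss, cc = rest.split(".")
    return datetime(int(year), MONTHS[mon], int(day),
                    int(hh), int(mm), int(ss), int(cc) * 10000)


def parse_entries(log_text):
    """Split log text into (timestamp, entry_text) pairs."""
    parts = TS_SPLIT.split(log_text)
    # parts is [preamble, stamp1, body1, stamp2, body2, ...]
    entries = []
    for i in range(1, len(parts) - 1, 2):
        body = " ".join(parts[i + 1].split())  # collapse whitespace
        entries.append((parse_vms_time(parts[i]), body))
    return entries


def silent_gaps(entries):
    """Return (startup, next_startup) pairs with no logged shutdown between."""
    gaps = []
    pending_startup = None
    for when, text in entries:
        if STARTUP_MARKER in text:
            if pending_startup is not None:
                gaps.append((pending_startup, when))
            pending_startup = when
        elif any(marker in text for marker in SHUTDOWN_MARKERS):
            pending_startup = None
    return gaps


SAMPLE_LOG = """
3-MAR-1993 06:29:39.30  Server: AUTC01::"73="
    Message: Startup for File Cabinet Server V1.0 complete
3-MAR-1993 22:57:24.14  Server: AUTC01::"73="
    Error: %DSL-W-SHUT, Network shut down
    Message: Shutting Down server, network failure.
4-MAR-1993 10:04:04.47  Server: AUTC01::"73="
    Message: Startup for File Cabinet Server V1.0 complete
4-MAR-1993 13:29:38.13  Server: AUTC01::"73="
    Message: Startup for File Cabinet Server V1.0 complete
"""

gaps = silent_gaps(parse_entries(SAMPLE_LOG))
for started, restarted in gaps:
    print("server vanished between", started, "and", restarted)
```

    Run over several days of log, the start of each reported window gives a
    lower bound on when the server went away, which is the "does it die at
    the same time each day" question raised in .1 and .2.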
2353.4	I'll look at the log	IOSG::STANDAGE	Oink...Oink...Mooooooooooooooooooooooooooooooooo	`Fri Mar 05 1993 09:22`	25
    Sunil,

    As you said, the system is rebooted at 11pm, so that explains the
    "Error: %DSL-W-SHUT, Network shut down" message.  As the system is about
    to go away, the server shuts itself down.

    So it appears that the problem occurs between the last two startup
    messages.  As nothing else has been logged, the server certainly did not
    die from natural causes; at least it doesn't seem that way.  Even if
    someone is doing something to seriously upset the server, some form of
    message would appear in the log.

    When I get time I'll take a look at the log you have provided.  The next
    step is probably to see whether the server goes away around the same
    time each day.

    At the moment, the only way I can see this happening is if someone did a
    STOP PROC/ID of the process.

    Kevin.
2353.5	STOP/ID writes messages to the log file	SCOTTC::MARSHALL	Spitfire Drivers Do It Topless	`Fri Mar 05 1993 10:15`	7
    Re: STOP/ID

    When I do that, several "thread termination" messages get written to
    the log file.  So it doesn't look like anyone's doing that (unless they
    also lock the log file first to stop the server writing to it! :-)

    Scott
2353.6	You won't always get "thread termination"	IOSG::STANDAGE	Oink...Oink...Mooooooooooooooooooooooooooooooooo	`Fri Mar 05 1993 11:24`	11
    Scott,

    This isn't always the case; it very much depends on what is happening
    on the system at the time.  I just did this on a test machine (server
    state "HIB") and no thread termination messages were produced.

    Kevin.
2353.7	run the server in the foreground	CHRLIE::HUSTON		`Fri Mar 05 1993 15:37`	48
    I can think of 2 ways to have the server go away with no message:

    1) stop/id -- I have never seen it log a message; the process is
       stopped immediately, so it won't have enough time to write a
       message.  This is usually how we stop servers during our testing.

    2) The server itself access violated.  The server runs as two layers.
       The bottom layer does about 98% of the work, and any access
       violation at this level will be written to the log file via a
       condition handler.  The upper layer does all the DASL and DECnet
       interaction; it has no condition handler and runs at AST level.
       These routines are called by DASL in response to certain DASL
       events, such as receiving a DASL message.  Unfortunately, since the
       server runs as a detached process, if this layer access violates the
       process will silently go away.  A problem at this layer could be in
       either the server or DASL.

    Do you know what version of DASL they are using?  The FCS ships with
    V2.0.  I know that there is a V2.2; we have not tested against it, and
    theoretically it should work due to backwards compatibility, but who
    knows, maybe there is a problem.

    What can you do next?  Start the server in the foreground, not through
    ALL-IN-1.  Do the following:

        $ A1FCS :== $sys$system:oafc$server.exe
        $ A1FCS your_configuration_file.dat

    To get your config file name, go to the MS menu and do an R on the
    server; it will show you the config file.

    Note that when you start the server up like this, the server is running
    in the context of the process you issue the command from.  Your best
    choice for this is to log into the OAFC$SERVER account (made during
    installation); you may have to mess around in the UAF record to allow
    logins, since the account is installed as DISUSER'd.  If this is not
    do-able, the next best choice is the ALLIN1 account or SYSTEM; either
    should have suitable privs and quotas to run the server.

    When you do this, if the server access violates at the top level, you
    will see the access violation on the screen.  Please save it and either
    send it to me or post it here.

    Thanks

    --Bob
2353.10	Changed some sysuaf parameters and monitoring	BUSHIE::SETHI	Man from Downunder	`Fri Mar 12 1993 05:45`	40
    Hi All,

    The customer had the problem reoccur yet again.  We had accounting
    enabled, but the customer forgot to turn on tracing (makes me feel
    grumpy 8*{).  The accounting file did not have a record for the process,
    nor did the OAFC$SERVER.LOG file.  I also did an
    analyze/error/include=bugcheck and found nothing.

    I then audited the OAFC$SERVER account and the SYSTEM account and found
    the following:

        mod OAFC$SERVER/BIOlm=50/DIOlm=50/astlm=100/TQElm=50/enqlm=300

    In other words :-) these quotas had been a fifth of what I changed them
    to.  The SYSTEM account did not have the OA$MANAGER identifier.  I don't
    know if it required it, but I granted it as per my system.

    I asked the customer to reboot the system and he did so during the
    lunch hour.  So far he has not reported any problems, and it seems that
    this is the first time after a reboot that he has not had any minor or
    major problems.  I will monitor the system and report back any findings.

    One thing though: why has the accounting file not got an entry for the
    process starting and stopping?  Accounting was enabled before ALL-IN-1
    was started.

    One last question, Bob ;-).  What is DASL?  How do I find out what
    version the customer has installed?

    >$ A1FCS :== $sys$system:oafc$server.exe
    >$ A1FCS your_configuration_file.dat

    I did all of this; no stack dumps etc.

    Regards,

    Sunil
2353.11	DASL = DECNet i/f; Care with Trace file size...	CHRLIE::HUSTON		`Fri Mar 12 1993 13:37`	28
    DASL is Distributed Service Application Layer.  It is a protocol that
    sits on top of DECnet; the FCS uses it for all its DECnet work, which
    removes us from needing to make DECnet calls.  DASL is not shipped as a
    product; if a shipping product needs it (like the FCS), then it is up to
    that product to supply DASL.  We include V2.0 in the kits, so they have
    at least V2.0.

    OK, if this never stack dumped, did it simply go away?  You said the
    server went away again; was there no message at this terminal?  Running
    the server in this manner simply runs the server in the foreground
    process rather than as a detached process.  If you run the server in
    this way and it access violates outside the scope of the condition
    handler, then you would see the access violation.  If the process simply
    died (not sure how), then what you would probably see is the startup
    message, then a '$' saying you were done and back at DCL.

    Before you do this, please go into ALL-IN-1 and stop the server that
    ALL-IN-1 starts, else all kinds of fun things happen.

    Also, if you cannot narrow down a time or circumstance in which the
    server goes away, I do not recommend turning tracing on.  Each trace
    record is 1024 bytes, and each request to the server takes an ABSOLUTE
    MINIMUM of 2 trace records, or 2048 bytes.  Most events take more than 2
    trace records.  So running the FCS with tracing on all the time is
    rather disk intensive.

    --Bob
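    To put rough numbers on the tracing cost, here is a small Python sketch
    of the arithmetic above.  The record size and the minimum of 2 records
    per request are Bob's figures; the 512-byte disk block is the standard
    VMS block size; the request count is a purely hypothetical workload, not
    a measurement from the customer's system.

```python
TRACE_RECORD_BYTES = 1024      # per trace record, from the note above
MIN_RECORDS_PER_REQUEST = 2    # absolute minimum, from the note above
VMS_BLOCK_BYTES = 512          # standard VMS disk block size


def min_trace_bytes(requests, records_per_request=MIN_RECORDS_PER_REQUEST):
    """Lower bound on trace file growth for a number of server requests."""
    return requests * records_per_request * TRACE_RECORD_BYTES


def bytes_to_blocks(size_bytes):
    """Convert bytes to whole VMS disk blocks, rounding up."""
    return -(-size_bytes // VMS_BLOCK_BYTES)


# Hypothetical workload: 50,000 requests between nightly reboots.
hypothetical_requests = 50_000
lower_bound = min_trace_bytes(hypothetical_requests)
print(lower_bound, "bytes, i.e. at least",
      bytes_to_blocks(lower_bound), "blocks")
```

    This is only a floor: since most events take more than 2 trace records,
    the real trace file will be larger than the figure printed here.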
2353.12		BUSHIE::SETHI	Man from Downunder	`Tue Mar 16 1993 04:32`	19
    G'day All,

    The problem has been solved.  Basically it was a bit of this and a bit
    of that :-).  The problem was caused by an in-house process-killing job
    running on a batch queue.  Aaaaahhhhh !!!!  I had asked the customer
    many a time if a stop/id= was being done on the process and he said
    "No".

    The lessons of this hair-pulling story are:

    1. Never trust a customer when he says no to the obvious question.

    2. Show system does not always reveal process killers, especially when
       their process names have not been set.

    3. Process killers can run on batch queues.

    Thanks to all of you for your help,

    Sunil