
Conference iosg::all-in-1_v30

Title:*OLD* ALL-IN-1 (tm) Support Conference
Notice:Closed - See Note 4331.l to move to IOSG::ALL-IN-1
Moderator:IOSG::PYE
Created:Thu Jan 30 1992
Last Modified:Tue Jan 23 1996
Last Successful Update:Fri Jun 06 1997
Number of topics:4343
Total number of notes:18308

2353.0. "The file cabinet server disappears everyday" by BUSHIE::SETHI (Man from Downunder) Thu Mar 04 1993 04:26

    Hi All,
    
    A customer of mine has problems with the file cabinet server in that it
    disappears and the users get the error message:
     
    "%OAFC-I-SRVNOTAVAIL,file cabinet server is not available"
    
    Sure enough the process <node_name>$srv73 is no longer there.  I have
    examined the OAFC$SERVER.LOG and OAFC$SERVER_ERROR.LOG logs and there is
    nothing in there indicating what the problem is.  I asked the customer
    to turn on the server tracing options.  I examined the logfile by doing
    the following:
    
        $ format := $sys$system:oafc$print_trace_log.exe
        $ define sys$output trace.out
        $ format OA$DATA_SHARE:node$SERVER73_TRACE.DAT
        $ deass sys$output
    
    This contained the following:
    
    OAFC FUNCTION: OafcCloseCabinetW
    TRACE EVENT: Task Complete
    EVENT TIME:  4-MAR-1993 14:11:08.78
    FILE CABINET NAME: AUTC01.LINFORTH
    STATUS: 55803913
    STRING1 IS: GJL
    
    
    SESSION ID:  4457328
    TRACE EVENT: Disconnect Done
    EVENT TIME:  4-MAR-1993 14:11:08.89
    FILE CABINET NAME: AUTC01.LINFORTH
    
    These are the last few lines, and they appear to be fine/normal.
    
    Now what else can I do to get extra information to help me track down
    this problem?  It happens nearly every day and I am running out of
    ideas, so any help would be gratefully accepted.
    
    Thanks in advance,
    
    Sunil

T.R  Title  User  Personal Name  Date  Lines
2353.1. "No answers, just suggestions." by IOSG::STANDAGE (Oink...Oink...Mooooooooooooooooooooooooooooooooo) Thu Mar 04 1993 09:49 (33 lines)
    
    
    Sunil,
    
    What exactly are the last few messages in OAFC$SERVER.LOG ?  These
    should indicate if the server process terminated via some 'normal'
    reason, or whether something a little more unusual is going on.
    
    For instance, when ALL-IN-1 is shut down (and hence the server), the
    following messages are written to the log file prior to the server
    process stopping :
    
    3-MAR-1993 16:59:21.37  Server: TRON::"73="  
    Error: %MCC-E-ALERT_TERMREQ, thread termination requested  
    Message: CsiCacheBlockAstService; Error from mcc_astevent_receive
    
    3-MAR-1993 16:59:26.13  Server: TRON::"73="  
    Error: %MCC-E-ALERT_TERMREQ, thread termination requested  
    Message: SrvTimeoutSysMan; receive alert to terminate thread
    
    
    Are you running housekeeping procedures which shut down ALL-IN-1, with
    the problem occurring because it is not being started up again properly?
    
    If there are no hints or clues in the log file, I think you need to find
    out when the server dies, and if there's any consistency. Usually the
    log file will indicate if the server is unhappy.
    
    
    
    Kevin.
    
    
2353.2. "multiple object 73's?" by CHRLIE::HUSTON Thu Mar 04 1993 14:41 (23 lines)
    
    Sunil,
    
    as Kevin said, the server should not just "die", if it is being 
    shut down nicely by someone, there will be several log messages
    in oafc$server.log about thread termination requested. If these are
    there someone is telling the FCS to shutdown. 
    
    If there is nothing there, other than startup messages, then my
    guess is that someone is either doing a stop/id=FCS_PID or,
    another possibility (not sure how this would work), someone else
    is starting something up as DECnet object 73, either another server
    or some other application. Not sure what the effects of this would
    be, but having multiple applications up with the same object number is
    bad.
    
    If you can get some sort of guess as to when the process goes away, 
    it would help; turn tracing on just before that and see what happens.
    
    Sorry we can't give you more to go on.
    
    --Bob
    
2353.3. "More info" by BUSHIE::SETHI (Man from Downunder) Fri Mar 05 1993 00:09 (36 lines)
    Hi Bob and Kevin,

    Having looked at the server log and your example there does seem to be
    a difference.  The users were unable to access their shared drawers at
    13:30 yesterday and here is part of the log:

    3-MAR-1993 06:29:39.30  Server: AUTC01::"73="  Message: Startup for
    File Cabinet Server V1.0 complete

    3-MAR-1993 22:57:24.14  Server: AUTC01::"73="  Error: %DSL-W-SHUT,
    Network shut down  Message: Shutting Down server, network failure.
     
    4-MAR-1993 10:04:04.47  Server: AUTC01::"73="  Message: Startup for
    File Cabinet Server V1.0 complete

    4-MAR-1993 13:29:38.13  Server: AUTC01::"73="  Message: Startup for
    File Cabinet Server V1.0 complete

    The server was started at 4-MAR-1993 10:04:04.47 and in between it died
    and the customer restarted it at 4-MAR-1993 13:29:38.13.  No error
    messages are in the logfile to point to the reason for the failure. Please 
    note the customer reboots his system every night at 11:00 pm.  
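
    The reasoning above (two startup messages in a row with no shutdown
    message between them means the server vanished silently) can be
    checked mechanically once a copy of the log is off the system.  The
    following is only a minimal Python sketch, not anything shipped with
    the FCS.  The function names and the message strings it matches on
    are assumptions taken from the excerpts quoted in this topic, so
    adjust them to whatever a real OAFC$SERVER.LOG contains.

```python
import re
from datetime import datetime

# VMS-style month abbreviations, mapped by hand to avoid locale-dependent parsing.
MONTHS = {m: i for i, m in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# An entry starts with a timestamp such as "4-MAR-1993 10:04:04.47".
STAMP = re.compile(
    r"^\s*(\d{1,2})-([A-Z]{3})-(\d{4}) (\d{2}):(\d{2}):(\d{2})\.(\d{2})\b")


def parse_entries(log_text):
    """Return a list of (timestamp, message) pairs, re-joining wrapped lines."""
    entries = []
    for line in log_text.splitlines():
        m = STAMP.match(line)
        if m:
            day, mon, year, hh, mm, ss, cc = m.groups()
            when = datetime(int(year), MONTHS[mon], int(day),
                            int(hh), int(mm), int(ss), int(cc) * 10000)
            entries.append([when, line[m.end():].strip()])
        elif entries and line.strip():
            # Continuation of a wrapped entry: append to the previous message.
            entries[-1][1] += " " + line.strip()
    return [(when, text) for when, text in entries]


def silent_gaps(entries):
    """Yield (previous_startup, next_startup) pairs with no shutdown logged between."""
    last_startup = None
    for when, text in entries:
        if "Startup for" in text:
            if last_startup is not None:
                yield last_startup, when
            last_startup = when
        elif "Shutting Down" in text or "thread termination requested" in text:
            # Assumed markers of an orderly shutdown, as seen in this topic.
            last_startup = None


# The four entries quoted in this note.
SAMPLE_LOG = """
3-MAR-1993 06:29:39.30  Server: AUTC01::"73="  Message: Startup for
File Cabinet Server V1.0 complete

3-MAR-1993 22:57:24.14  Server: AUTC01::"73="  Error: %DSL-W-SHUT,
Network shut down  Message: Shutting Down server, network failure.

4-MAR-1993 10:04:04.47  Server: AUTC01::"73="  Message: Startup for
File Cabinet Server V1.0 complete

4-MAR-1993 13:29:38.13  Server: AUTC01::"73="  Message: Startup for
File Cabinet Server V1.0 complete
"""

for started, restarted in silent_gaps(parse_entries(SAMPLE_LOG)):
    print("no shutdown logged between", started, "and", restarted)
```

    Run against the excerpt above, this reports a single window, from the
    10:04 startup to the 13:29 restart on 4-MAR-1993.  It only narrows
    down when the server went away; it says nothing about why.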
    
    I have asked the customer to enable accounting so that I can get
    extra information.  I have copied the logfile to RIPPER::Q30178.LOG_2;
    it may have something in there that I just did not pick up.  Hopefully
    either the server trace or the accounting will pick something up.
    
    Finally the customer has assured me that they do not have other
    applications running on the system therefore object 73 is not being
    used for anything else.
    
    Thanks for your advice, will keep you posted,
    
    Sunil
    
2353.4. "I'll look at the log" by IOSG::STANDAGE (Oink...Oink...Mooooooooooooooooooooooooooooooooo) Fri Mar 05 1993 09:22 (25 lines)
    
    Sunil,
    
    As you said, the system is rebooted at 11pm, so that explains the 
    "Error: %DSL-W-SHUT,Network shut down" message. As the system is about
    to go away the server shuts itself down.
    
    So it appears that the problem occurs between the last two startup
    messages. As nothing else has been logged the server certainly did not
    die from natural causes, at least it doesn't seem that way. Even if
    someone is doing something to seriously upset the server, some form of
    message would appear in the log.
    
    When I get time I'll take a look at the log you have provided. The next
    step is probably to see if the server seems to go away around the same
    time each day.
    
    At the moment, the only way I can see this happening is if someone did
    a STOP PROC/ID of the process.
    
    
     
    Kevin.
    
    
2353.5. "STOP/ID writes messages to the log file" by SCOTTC::MARSHALL (Spitfire Drivers Do It Topless) Fri Mar 05 1993 10:15 (7 lines)
Re: STOP/ID

When I do that, several "thread termination" messages get written to the log
file.  So it doesn't look like anyone's doing that (unless they also lock the
log file first to stop the server writing to it! :-)

Scott

2353.6. "You won't always get "thread termination"" by IOSG::STANDAGE (Oink...Oink...Mooooooooooooooooooooooooooooooooo) Fri Mar 05 1993 11:24 (11 lines)
    
    Scott,
    
    This isn't always the case; it very much depends on what is happening
    on the system at the time. I just did this on a test machine (server
    state "HIB") and no thread termination messages were produced.
   
    
    Kevin.
    
    
2353.7. "run the server in the foreground" by CHRLIE::HUSTON Fri Mar 05 1993 15:37 (48 lines)
    
    I can think of 2 ways to have the server go away with no message:
    
    1) stop/id -- I have never seen it log a message, the process is
    stopped immediately so it won't have enough time to write a message.
    This is usually how we stop servers during our testing.
    
    2) The server itself access violated.  The server runs as two layers.
    The bottom layer does about 98% of the work, and any access violation
    at this level will be written to the log file via a condition handler.
    The upper layer does all the DASL and DECnet interaction; it has no
    condition handler and runs at AST level. These routines are called 
    by DASL in response to certain DASL events such as receiving a 
    DASL message. Unfortunately, since the server runs as a detached
    process, if this layer access violates the process will silently go
    away.  A problem at this layer could be in either the server or
    DASL. Do you know what version of DASL they are using? The FCS
    ships with V2.0. I know that there is a V2.2; we have not tested 
    against it, and theoretically it should work due to backwards 
    compatibility, but who knows, maybe there is a problem.
    
    What can you do next?
    
    Start the server in the foreground, not through ALL-IN-1.  Do the
    following:
    
    $ A1FCS :== $sys$system:oafc$server.exe
    $ A1FCS your_configuration_file.dat
    
    To get your config file name, go to the MS menu and do an R on the
    server; it will show you the config file.
    
    Note that when you start the server up like this, the server is running
    in the context of the process you do the command from. Your best choice
    for this is to log into the OAFC$SERVER account (made during
    installation), you may have to mess around in the UAF record to 
    allow logins since the account is installed as DISUSER'd.  If this is
    not do-able, the next best choice is the ALLIN1 account or SYSTEM,
    either should have suitable privs and quotas to run the server.
    
    When you do this, if the server access violates at the top level, you
    will see the access violation on the screen, please save it and either
    send it to me or post it here.
    
    Thanks
    
    --Bob
    
2353.10. "Changed some sysuaf parameters and monitoring" by BUSHIE::SETHI (Man from Downunder) Fri Mar 12 1993 05:45 (40 lines)
    Hi All,

    The customer had the problem reoccur yet again and we had accounting
    enabled but the customer forgot to turn on tracing (makes me feel
    grumpy 8*{).

    The accounting file did not have a record for the process nor did the
    OAFC$SERVER.LOG file, I also did an analyze/error/include=bugcheck and
    found nothing.
    
    I then audited the OAFC$SERVER account and the SYSTEM account and found
    the following:
    
    mod OAFC$SERVER/BIOlm=50/DIOlm=50/astlm=100/TQElm=50/enqlm=300; in other
    words :-) these quotas were 5 times below what I changed them to.  The
    SYSTEM account did not have the OA$MANAGER identifier; I don't know if
    it is required, but I granted it as per my system.
    
    I asked the customer to reboot the system and he did so during the
    lunch hour.  So far he has not reported any problems and it seems that
    this is the first time after a reboot he has not had any minor or major
    problems.  I will monitor the system and report back any findings.
    
    One thing though why has the accounting file not got an entry for the
    process starting and stopping ?  Accounting was enabled before ALL-IN-1
    was started.  
    
    One last question Bob ;-),
    
    What is DASL ? How do I find out what version the customer has
    installed ?
    
    >$ A1FCS :== $sys$system:oafc$server.exe
    >$ A1FCS your_configuration_file.dat
    
    I did all of this no stack dumps etc.
    
    Regards,
         
    Sunil

2353.11. "DASL = DECNet i/f; Care with Trace file size..." by CHRLIE::HUSTON Fri Mar 12 1993 13:37 (28 lines)
    DASL is Distributed Service Application Layer. It is a protocol that
    sits on top of DECnet; the FCS uses it for all its DECnet work, which
    saves us from needing to make DECnet calls directly.  DASL is not shipped
    as a product; if a shipping product needs it (like the FCS), then it is
    up to that product to supply DASL. We include V2.0 in the kits, so they
    have at least V2.0.
    
    Ok, if this never stack dumped, did it simply go away?  You said the 
    server went away again; was there no message at this terminal?
    Running the server in this manner simply runs the server in the
    foreground process rather than as a detached process. If you run
    the server in this way and it access violates outside the scope of
    the condition handler, then you would see the access violation. If
    the process simply died (not sure how), then what you would probably 
    see is the startup message, then a '$' saying you were done and back
    at DCL.
    
    Before you do this, please go into ALL-IN-1 and stop the server that
    ALL-IN-1 starts, else all kinds of fun things happen.
    
    Also, if you cannot narrow down a time or circumstance that the server
    goes away on, I do not recommend turning tracing on. Each trace record
    is 1024 bytes and each request to the server takes an ABSOLUTE MINIMUM
    of 2 trace records or 2048 bytes. Most events take more than 2 trace 
    records. So running the FCS with tracing on all the time is rather
    disk intensive.
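
    To put rough numbers on the trace overhead described above, here is a
    small Python sketch of the arithmetic.  The 1024-byte record size and
    the minimum of 2 records per request come from this note; the function
    names and the figure of 10,000 requests are hypothetical, chosen only
    for illustration, and since most requests write more than 2 records
    the result is a lower bound.

```python
TRACE_RECORD_BYTES = 1024      # size of one trace record, per the note above
MIN_RECORDS_PER_REQUEST = 2    # absolute minimum trace records per server request


def min_trace_bytes(requests):
    """Lower bound on trace file growth for a given number of server requests."""
    return requests * MIN_RECORDS_PER_REQUEST * TRACE_RECORD_BYTES


def min_trace_blocks(requests, block_bytes=512):
    """Same lower bound, expressed in 512-byte disk blocks (rounded up)."""
    return -(-min_trace_bytes(requests) // block_bytes)


# Hypothetical load: 10,000 requests while tracing is switched on.
print(min_trace_bytes(10000), "bytes,", min_trace_blocks(10000), "blocks")
```

    For the hypothetical 10,000 requests this comes to at least 20,480,000
    bytes, or 40,000 disk blocks, which is why tracing is best enabled only
    around the time the server is expected to go away.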
    
    --Bob

2353.12. by BUSHIE::SETHI (Man from Downunder) Tue Mar 16 1993 04:32 (19 lines)
    G'day All,
    
    The problem has been solved.  Basically it was a bit of this and a bit
    of that :-).
    
    The problem was caused by an in-house process-killing job running on a
    batch queue.  Aaaaahhhhh !!!! I had asked the customer many a time if a
    stop/id= was being done on the process and he said "No".
    
    The lesson of this hair pulling story is:
    
    1. Never trust a customer when he says no to the obvious question.
    2. Show system does not always show process killers, especially when their
       process names have not been set.
    3. Process killers can run on batch queues.
    
    Thanks to all of you for your help,
    
    Sunil