
Conference iosg::all-in-1_v30

Title:*OLD* ALL-IN-1 (tm) Support Conference
Notice:Closed - See Note 4331.l to move to IOSG::ALL-IN-1
Moderator:IOSG::PYE
Created:Thu Jan 30 1992
Last Modified:Tue Jan 23 1996
Last Successful Update:Fri Jun 06 1997
Number of topics:4343
Total number of notes:18308

3391.0. "RMS non-fatal bugcheck - corruption?" by GRANPA::BSPANGLER () Wed Oct 13 1993 20:31

    Customer is running a 4 node cluster, each node running ALL-IN-1 3.0, VMS
    v5.5-2.  2100 subscribers, 700-800 concurrent, 2 mail areas.
    
    There have been numerous instances of RMS non-fatal bugchecks over the
    last few months on all 4 nodes, but usually occurring on one specific
    node.  The bugcheck dumps the user's process.  The problem is consistent
    with what you would see with file corruption.
    
    But all system data files, including the SDAFs, are converted and
    optimized nightly, with no corruption reported.
    
    However, a little over a week ago, this problem manifested itself as
    the result of a truly corrupted SDAF (VBN errors).  Attempting to
    access the EM screen aborted the VMS process on one node and hung
    the process on another node.  Users were pointed
    away from that mail area; falling back to a good SDAF and then
    running TRM repaired most of the damage.  No culprit could be found
    for why the file got corrupted in the first place - unless whatever is
    going on here *caused* the corruption rather than resulted from it.
    
    Now that there is no known file-level corruption again, (in fact the
    above-mentioned corrupted SDAF is closed) the problem is happening
    again - rarely, but it is happening - on various nodes.  It happens on
    an attempt to send mail (most of the time, I think - users are not
    always sure).  No info is written to OA$MTI_ERR - but when we had the
    known corruption, VBN errors were seen there.
    
    Any ideas or clues would be greatly appreciated.
    
    Bob Spangler    
    
                                            
    P.S. When this sort of non-fatal bugcheck occurs, RMS places a hex value
    in register R2.  Most of the error logfile  entries report one of the
    following:
    
    FFFFFFFD 	BADIFAB		invalid ifab or irab
    
    FFFFFFF1	DEALLERR	ifab deallocation attempted with block(s)
    				still allocated      
       
3391.1. "Some hints to trace the cause of the problem" by GIDDAY::SETHI (He's wound them up no end :-)) Thu Oct 14 1993 02:04, 25 lines

    Hi,  
    
    To find out what has caused the problem we have to examine the process
    dump and/or the system dump.
    
    To get more details I would do the following:
    
    1. analyze the error log (I guess you have done this)
    
    2. in SYLOGIN.COM add the command $ SET PROCESS/DUMP.  This will
       create a dump file in the user directory, hopefully giving us more
       information (the file will be called <image name that was running
       OA$MAIN>.DMP).  The CSC may well be the best bet to get this dump
       file examined.
    
    3. if there is not enough information there, then you may have to set
       the BUGCHECKFATAL parameter in SYSGEN to 1 to force a system crash;
       this will give much more information.  BUGCHECKFATAL is a dynamic
       parameter, so you will not require a reboot of the cluster to set
       it.  Again the CSC will be of help here when it comes to examining
       the dump.
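
    As a rough sketch of the three steps above, the DCL might look
    something like the following.  The output file name ERRLOG.LIS, the
    /SINCE value and the placement of SYLOGIN.COM in SYS$MANAGER are
    assumptions for illustration only; the dump file name is left as the
    placeholder given in step 2.  Check each command against your own
    system before using it.

```dcl
$! Step 1: report recent error log entries (output file name is hypothetical)
$ ANALYZE/ERROR_LOG/SINCE=YESTERDAY/OUTPUT=ERRLOG.LIS SYS$ERRORLOG:ERRLOG.SYS
$!
$! Step 2: line to add to SYLOGIN.COM (assumed here to be in SYS$MANAGER)
$! so that an image dump is written when a user's image terminates abnormally
$ SET PROCESS/DUMP
$!
$! Afterwards the dump can be examined with ANALYZE/PROCESS_DUMP, e.g.
$!     $ ANALYZE/PROCESS_DUMP <image name that was running OA$MAIN>.DMP
$!
$! Step 3: make non-fatal bugchecks crash the system (dynamic parameter)
$ RUN SYS$SYSTEM:SYSGEN
USE ACTIVE
SET BUGCHECKFATAL 1
WRITE ACTIVE
EXIT
```

    Once a system dump has been captured, the same SYSGEN sequence with
    SET BUGCHECKFATAL 0 should put the parameter back, so that further
    non-fatal bugchecks do not keep crashing the node.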
    
    Regards,
    
    Sunil

3391.2. "Check file backup procedures" by IOSG::MAURICE (Differently hirsute) Thu Oct 14 1993 11:06, 7 lines

    Also examine your backup procedures. Even a closed SDAF has its records
    updated (when messages are deleted). Do the procedures back up the files
    while they are open? If they do, then a subsequent restore is disastrous.
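
    One possible way to check, as a sketch only: see which files are open
    on the disk at the time the backup runs, and look in the backup
    command procedure for the /IGNORE=INTERLOCK qualifier, which lets
    BACKUP copy files that are open for write.  The device name MAIL_DISK:
    and the procedure name NIGHTLY_BACKUP.COM are hypothetical; substitute
    your own.

```dcl
$! List the files currently open on the disk holding the mail area
$! (MAIL_DISK: is a hypothetical device name)
$ SHOW DEVICE/FILES MAIL_DISK:
$!
$! Look for /IGNORE=INTERLOCK in the backup procedure
$! (NIGHTLY_BACKUP.COM is a hypothetical file name)
$ SEARCH NIGHTLY_BACKUP.COM "INTERLOCK"
```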
    
    Cheers
    
    Stuart