| Title: | *OLD* ALL-IN-1 (tm) Support Conference | 
| Notice: | Closed - See Note 4331.l to move to IOSG::ALL-IN-1 | 
| Moderator: | IOSG::PYE | 
|  | 
| Created: | Thu Jan 30 1992 | 
| Last Modified: | Tue Jan 23 1996 | 
| Last Successful Update: | Fri Jun 06 1997 | 
| Number of topics: | 4343 | 
| Total number of notes: | 18308 | 
3391.0. "RMS non-fatal bugcheck - corruption?" by GRANPA::BSPANGLER () Wed Oct 13 1993 19:31
    Customer is running a 4 node cluster, each node running ALL-IN-1 3.0, VMS
    v5.5-2.  2100 subscribers, 700-800 concurrent, 2 mail areas.
    
    There have been numerous instances of RMS non-fatal bugchecks over the
    last few months on all 4 nodes, but usually occurring on one specific
    node.  The bugcheck dumps the user's process.  The problem is consistent
    with what you would see with file corruption.
    
    But all system data files, including the SDAFs, are converted and
    optimized nightly, with no corruption reported.
    
    However, a little over a week ago, this problem manifested itself as
    the result of a truly corrupted SDAF (VBN errors).  Attempting to
    access the EM screen aborted the VMS process on one node and hung
    the process on another node.  Users were pointed
    away from that mail area, and falling back to a good SDAF, then
    running TRM, repaired most of the damage.  No culprit could be found
    for why the file got corrupted in the first place - unless whatever is
    going on here *caused* the corruption rather than resulted from it.
    
    Now that there is again no known file-level corruption (in fact the
    above-mentioned corrupted SDAF is closed), the problem is happening
    again - rarely, but it is happening - on various nodes.  It happens on
    an attempt to send mail (most of the time, I think - users are not
    always sure).  No info is written to OA$MTI_ERR - but when we had the
    known corruption, VBN errors were seen there.
    
    Any ideas or clues would be greatly appreciated.
    
    Bob Spangler    
    
                                            
    P.S. When this sort of non-fatal bugcheck occurs, RMS places a hex value
    in register R2.  Most of the error logfile  entries report one of the
    following:
    
    FFFFFFFD 	BADIFAB		invalid ifab or irab
    
    FFFFFFF1	DEALLERR	ifab deallocation attempted with block(s)
    				still allocated      
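
    To make the two R2 values above easier to spot when scanning error log
    entries, here is a minimal Python sketch that maps them to the names
    given in this note.  The helper names (KNOWN_SUBCODES, to_signed32,
    describe_r2) are hypothetical, the table holds only the two codes
    quoted above, and reading the longword as a signed value (-3 and -15)
    is an assumption made for readability, not something the error log
    states.

```python
# R2 values quoted in this note; no other RMS subcodes are assumed here.
KNOWN_SUBCODES = {
    0xFFFFFFFD: ("BADIFAB", "invalid ifab or irab"),
    0xFFFFFFF1: ("DEALLERR",
                 "ifab deallocation attempted with block(s) still allocated"),
}


def to_signed32(value):
    """Interpret a 32-bit longword as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def describe_r2(hex_text):
    """Return a one-line description for an R2 value given as hex text."""
    value = int(hex_text, 16) & 0xFFFFFFFF
    name, meaning = KNOWN_SUBCODES.get(
        value, ("UNKNOWN", "not listed in this note"))
    return f"{value:08X} ({to_signed32(value)}) {name}: {meaning}"


if __name__ == "__main__":
    for r2 in ("FFFFFFFD", "FFFFFFF1"):
        print(describe_r2(r2))
```

    Any value not in the table is reported as UNKNOWN rather than guessed.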
       
| T.R | Title | User | Personal Name | Date | Lines | 
|---|
| 3391.1 | Some hints to trace the cause of the problem | GIDDAY::SETHI | He's wound them up no end :-) | Thu Oct 14 1993 01:04 | 25 | 
|  |     Hi,  
    
    To find out what has caused the problem we have to examine the process
    dump and/or the system dump.
    
    To get more details I would do the following:
    
    1. analyze the error log (I guess you have done this)
    
    2. in sylogin.com add the following: $set process/dump.  This will
       create a dump file in the user's directory, hopefully giving us more
       information (the file will be called <image name that was running
       OA$MAIN>.DMP).  The CSC may well be the best bet to get this dump
       file examined.
    
    3. if there is not enough information there, then you may have to set
       the BUGCHECKFATAL parameter in SYSGEN to 1 to force a system crash;
       this will give much more information.  BUGCHECKFATAL is a dynamic
       parameter so you will not require a re-boot of the cluster to set
       it.  Again the CSC will be of help here when it comes to examining
       the dump.
    
    Regards,
    
    Sunil
 | 
| 3391.2 | Check file backup procedures | IOSG::MAURICE | Differently hirsute | Thu Oct 14 1993 10:06 | 7 | 
|  |     Also examine your backup procedures. Even a closed SDAF has its records
    updated (when messages are deleted). Do the procedures back up the files
    while they are open? If they do, then a subsequent restore is disastrous.
    
    Cheers
    
    Stuart
 |