
Conference iosg::all-in-1_v30

Title:	OLD ALL-IN-1 (tm) Support Conference
Notice:	Closed - See Note 4331.l to move to IOSG::ALL-IN-1
Moderator:	IOSG::PYE

Created:	Thu Jan 30 1992
Last Modified:	Tue Jan 23 1996
Last Successful Update:	Fri Jun 06 1997
Number of topics:	4343
Total number of notes:	18308

3391.0. "RMS non-fatal bugcheck - corruption?" by GRANPA::BSPANGLER () Wed Oct 13 1993 19:31

    Customer is running a 4 node cluster, each node running ALL-IN-1 3.0, VMS
    v5.5-2.  2100 subscribers, 700-800 concurrent, 2 mail areas.
    
    There have been numerous instances of RMS non-fatal bugchecks over the
    last few months on all 4 nodes, but usually occurring on one specific
    node.  The bugcheck dumps the user's process.  The problem is consistent
    with what you would see with file corruption.
    
    But all system data files, including the SDAFs, are converted and
    optimized nightly, with no corruption reported.
    
    However, a little over a week ago, this problem manifested itself as
    the result of a truly corrupted SDAF (VBN errors).  Attempting to
    access the EM screen aborted the VMS process on one node and hung
    the process on another node.  Users were pointed away from that mail
    area; falling back to a good SDAF and then running TRM repaired most
    of the damage.  No culprit could be found for why the file got
    corrupted in the first place - unless whatever is going on here
    *caused* the corruption rather than resulted from it.
    
    Now that there is no known file-level corruption again (in fact the
    above-mentioned corrupted SDAF is closed), the problem is happening
    again - rarely, but it is happening - on various nodes.  It happens on
    an attempt to send mail (most of the time, I think - users are not
    always sure).  No info is written to OA$MTI_ERR - but when we had the
    known corruption, VBN errors were seen there.
    
    Any ideas or clues would be greatly appreciated.
    
    Bob Spangler    
    
                                            
    P.S. When this sort of non-fatal bugcheck occurs, RMS places a hex value
    in register R2.  Most of the error logfile  entries report one of the
    following:
    
    FFFFFFFD 	BADIFAB		invalid ifab or irab
    
    FFFFFFF1	DEALLERR	ifab deallocation attempted with block(s)
    				still allocated
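
    Read as 32-bit two's-complement numbers, the two R2 values above are
    small negative integers.  The following is a minimal Python sketch for
    decoding such values when scanning error log entries.  It assumes the
    codes are meant to be read as signed 32-bit values; the lookup table
    holds only the two codes quoted above, and the function names are
    hypothetical, not part of any RMS or ALL-IN-1 tool.

```python
# Only the two codes quoted in this note; other RMS codes are deliberately
# not guessed at here.
KNOWN_CODES = {
    0xFFFFFFFD: ("BADIFAB", "invalid ifab or irab"),
    0xFFFFFFF1: ("DEALLERR",
                 "ifab deallocation attempted with block(s) still allocated"),
}


def to_signed32(value):
    """Interpret an unsigned 32-bit value as two's complement (assumption)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def describe_r2(hex_text):
    """Return a one-line description of an R2 value given as hex text."""
    raw = int(hex_text, 16)
    name, meaning = KNOWN_CODES.get(raw, ("UNKNOWN", "not listed in this note"))
    return f"{hex_text.upper()} ({to_signed32(raw)}): {name} - {meaning}"


if __name__ == "__main__":
    for r2 in ("FFFFFFFD", "FFFFFFF1"):
        print(describe_r2(r2))
```

    Any R2 value not in the table is reported as UNKNOWN rather than
    guessed, so new codes seen in the error log stand out.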

T.R	Title	User	Personal Name	Date	Lines
3391.1	Some hints to trace the cause of the problem	GIDDAY::SETHI	He's wound them up no end :-)	Thu Oct 14 1993 01:04	25
	Hi,

	To find out what has caused the problem we have to examine the process
	dump and/or the system dump.  To get more details I would do the
	following:

	1. Analyze the error log (I guess you have done this).

	2. In SYLOGIN.COM add the following:

	       $ SET PROCESS/DUMP

	   This will create a dump file in the user's directory, hopefully
	   giving us more information (the file will be called <image name
	   that was running OA$MAIN>.DMP).  The CSC may well be the best bet
	   to get this dump file examined.

	3. If there is not enough information there, then you may have to set
	   the BUGCHECKFATAL parameter in SYSGEN to 1 to force a system crash;
	   this will give much more information.  BUGCHECKFATAL is a dynamic
	   parameter, so you will not require a re-boot of the cluster to set
	   it.  Again the CSC will be of help here when it comes to examining
	   the dump.

	Regards,

	Sunil
3391.2	Check file backup procedures	IOSG::MAURICE	Differently hirsute	Thu Oct 14 1993 10:06	7
	Also examine your backup procedures.  Even a closed SDAF has its
	records updated (when messages are deleted).  Do the procedures back
	up the files while they are open?  If they do, then a subsequent
	restore is disastrous.

	Cheers

	Stuart