Title: | *OLD* ALL-IN-1 (tm) Support Conference |
Notice: | Closed - See Note 4331.l to move to IOSG::ALL-IN-1 |
Moderator: | IOSG::PYE |
|
Created: | Thu Jan 30 1992 |
Last Modified: | Tue Jan 23 1996 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 4343 |
Total number of notes: | 18308 |
3391.0. "RMS non-fatal bugcheck - corruption?" by GRANPA::BSPANGLER () Wed Oct 13 1993 20:31
Customer is running a 4 node cluster, each node running ALL-IN-1 3.0, VMS
v5.5-2. 2100 subscribers, 700-800 concurrent, 2 mail areas.
There have been numerous instances of RMS non-fatal bugchecks over the
last few months on all 4 nodes, but usually occurring on one specific
node. The bugcheck dumps the user's process. The problem is consistent
with what you would see with file corruption.
But all system data files, including the SDAFs, are converted and
optimized nightly, with no corruption reported.
However, a little over a week ago, this problem manifested itself as
the result of a truly corrupted SDAF (VBN errors). Attempting to
access the EM screen aborted the VMS process on one node and hung
the process on another node. Users were pointed
away from that mail area, and falling back to a good SDAF, then
running TRM, repaired most of the damage. No culprit could be found
for why the file got corrupted in the first place - unless whatever is
going on here *caused* the corruption rather than resulted from it.
Now that there is no known file-level corruption again (in fact the
above-mentioned corrupted SDAF is closed), the problem is happening
again - rarely, but it is happening - on various nodes. It happens on
an attempt to send mail (most of the time, I think - users are not
always sure). No info is written to OA$MTI_ERR - but when we had the
known corruption, VBN errors were seen there.
Any ideas or clues would be greatly appreciated.
Bob Spangler
P.S. When this sort of non-fatal bugcheck occurs, RMS places a hex value
in register R2. Most of the error logfile entries report one of the
following:
FFFFFFFD  BADIFAB   invalid ifab or irab
FFFFFFF1  DEALLERR  ifab deallocation attempted with block(s)
                    still allocated
T.R | Title | User | Personal Name | Date | Lines |
---|
3391.1 | Some hints to trace the cause of the problem | GIDDAY::SETHI | He's wound them up no end :-) | Thu Oct 14 1993 02:04 | 25 |
| Hi,
To find out what has caused the problem we have to examine the process
dump and/or the system dump.
To get more details I would do the following:
1. analyze the error log (I guess you have done this)
2. in SYLOGIN.COM add the command $ SET PROCESS/DUMP; this will
create a dump file in the user's directory, hopefully giving us more
information (the file will be called <image name that was running
OA$MAIN>.DMP). The CSC may well be the best bet to get this dump
file examined.
3. if there is not enough information there, then you may have to set
the BUGCHECKFATAL parameter in SYSGEN to 1 to force a system crash;
this will give much more information. BUGCHECKFATAL is a dynamic
parameter, so you will not require a re-boot of the cluster to set
it. Again the CSC will be of help here when it comes to examining
the dump.
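As a rough illustration of the three steps above, the commands might look something like the sketch below. The output file name ERRLOG.LIS and the dump file name IMAGE_NAME.DMP are placeholders, not names taken from this system, and the /SINCE value is only an example. Note that the SYSGEN parameter is spelled BUGCHECKFATAL, and that with it set to 1 any non-fatal bugcheck will crash the node, so it is probably best set on one node at a time and put back to 0 once a dump has been captured.

```
$! Step 1: report recent error log entries to a listing file
$! (ERRLOG.LIS is a placeholder name)
$ ANALYZE/ERROR_LOG/SINCE=YESTERDAY/OUTPUT=ERRLOG.LIS
$!
$! Step 2: line to add to SYLOGIN.COM so that a process dump is
$! written when an image terminates abnormally
$ SET PROCESS/DUMP
$!
$! Later, to look at the resulting dump (IMAGE_NAME.DMP is a placeholder)
$ ANALYZE/PROCESS_DUMP IMAGE_NAME.DMP
$!
$! Step 3: make non-fatal bugchecks crash the node. The parameter is
$! dynamic, so the change is written to the active system, no re-boot
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SET BUGCHECKFATAL 1
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT
```

To undo step 3, repeat the SYSGEN sequence with SET BUGCHECKFATAL 0.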
Regards,
Sunil
|
3391.2 | Check file backup procedures | IOSG::MAURICE | Differently hirsute | Thu Oct 14 1993 11:06 | 7 |
| Also examine your backup procedures. Even a closed SDAF has its records
updated (when messages are deleted). Do the procedures back up the files
while they are open? If they do, then a subsequent restore is disastrous.
Cheers
Stuart
|