T.R | Title | User | Personal Name | Date | Lines |
---|
2731.1 | FCS problem? | IOSG::MAURICE | Night rolls in, my dark companion | Wed May 19 1993 16:26 | 10 |
| The messages originate from the File Cabinet server, so I recommend
you check the log files to see if there is further information in
there.
The messages are not really informational. The FCS reports them as
errors, but the ALL-IN-1 client downgrades them before display.
HTH
Stuart
|
2731.2 | GONE ! DISAPPEARED ! | BIS1::DESTRIJCKER | Back again to the home town | Mon May 24 1993 17:14 | 9 |
| Well, this was an easy one. The error corrected itself. Maybe File
cabinet reorganize had something to do with it. This housekeeping
procedure was run over the weekend.
I cannot reproduce it any more. All that remains is the trace.
Oh well, thanks for the help anyway.
Wivine.
|
2731.3 | | BUSHIE::SETHI | Ahhhh (-: an upside down smile from OZ | Tue May 25 1993 01:57 | 12 |
| Hi Wivine,
My guess is that DOCDB.DAT or DAF.DAT had some kind of problem
(corruption) with the File Access Block. The File Cabinet
reorganisation did a convert/fdl=oa$data:docdb.fdl (or pdaf.fdl) and
cleared the problem. It would be interesting to know whether an
analyze/rms/check on the previous versions of those .DAT files
reports any errors.
Regards,
Sunil
|
2731.4 | Nothing to go by. | BIS1::DESTRIJCKER | Back again to the home town | Thu Jun 03 1993 15:12 | 17 |
| Yes, that could well be the case for this occurrence. Reorganise filing
cabinets runs every weekend, and it was working OK again on the
following Monday. I'm happy with this one.
But! It doesn't explain why on Tuesday last week the error popped up again
for somebody else who was trying to forward a message from a shared
drawer (not hers). The next day it worked 8-). Only CDQ runs daily,
and EW every other day, but it doesn't reorganise files.
Since then I haven't had any complaints anymore. This FAB error seems
to be very temperamental. I shall start monitoring the cluster. Perhaps
it has to do with system resources! I've been told that there are
periods when it occurs daily and periods when it doesn't happen at all.
I'll keep you all posted if I do find something new.
Wivine.
|
2731.5 | Check the disks first; don't repair them before checking | TINNIE::SETHI | Ahhhh (-: an upside down smile from OZ | Fri Jun 04 1993 01:28 | 24 |
| Hi Wivine,
>Since then I haven't had any complaints anymore. This FAB error seems
>to be very temperamental. I shall start monitoring the cluster.
Since this is happening to others, I would suggest that you do
an $analyze/disk/read_check/norepair on your disks. Ask the customer
whether they have done an $analyze/disk/read_check/REPAIR (note that I
put the repair in uppercase). OpenVMS version 5.5-1 and below had a slight
misfeature in that they actually corrupted disks if the repair was
used. This ONLY happened under certain circumstances, so check the
error with the OpenVMS support group (CSC) before you repair the disk.
By the way, if the customer did repair the disk and it's corrupted, there
is no way of repairing the damage. Please don't say anything to the
customers; let your manager deal with it.
I mention the above because I have dealt with a number of
calls that had these types of problems. The problem has
been fixed in 5.5-2, but again, doing a repair will not undo existing
damage. If you need more help, let me know.
Regards,
Sunil
|
2731.6 | It's worth a try | BIS1::DESTRIJCKER | Back again to the home town | Fri Jun 04 1993 10:52 | 12 |
|
Thanks for the advice, I'll schedule an analyse disk, maybe this
weekend. Sounds definitely worth trying.
BTW, the customer happens to be Digital itself, the IS department in
Brussels. I also support the Luxembourg ALL-IN-1 machine and another
ALL-IN-1 cluster here in Brussels. The FAB error only occurs on the
biggest ALL-IN-1 cluster.
I'll keep you posted.
Wivine.
|
2731.7 | Please do not discuss this with a customer *WARNING* | TINNIE::SETHI | Ahhhh (-: an upside down smile from OZ | Mon Jun 07 1993 02:57 | 37 |
| Hi Wivine,
The type of error that would indicate that the data on your disk may be
corrupted is:
The following error messages MAY be returned on systems
experiencing this problem:
VERIFY-I-MULTALLOC, file ('file-id') 'filename' multiply
allocated blocks VBN 'n' to 'n' LBN 'n'
to 'n', RVN 'n'
VERIFY-I-LOSTEXTHDR, file ('file-id') 'filename' lost
extension file header
VERIFY-I-MAPAREA, file ('file-id') 'filename' invalid map area
NOTE: This problem ONLY occurs when repairing a volume
with lost extension file headers. It does not occur
every time you repair a disk volume using the VERIFY
Utility.
A STARS article called "[OpenVMS] ANALYZE/DISK/REPAIR Causes Mult
Allocated Blocks/Corruption" has all the details.
Again, I must emphasise: please don't discuss this with your customers;
let your manager deal with this. It's a very sensitive issue, as I have
found out at some sites, and you don't want to get involved in the
politics of this. Also, the article warns you not to discuss this with
the customer, but that does not mean that we forget about the problem.
I think I may have a site with this problem; I am crossing my fingers
that it isn't.
Regards,
Sunil
|
2731.8 | Probably a VM corruption in FCS. | IOSG::CHINNICK | gone walkabout | Tue Jun 08 1993 13:56 | 39 |
|
Hi Wivine...
Personally, I doubt that this error results from a disk corruption or
even an RMS file corruption.
The RMS$_FAB and RMS$_RAB statuses reflect that the File Access Block
or Record Access Block is not at a valid address or has been
corrupted in some way.
These blocks are used for access through the RMS services and in no way
relate to disk structures. They will be allocated in memory by the FCS
or the IOS kernel (depending on what you are using and accessing) and
the addresses of these blocks are passed to services such as $OPEN, $GET,
$PUT etc.
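
To illustrate the point that the FAB is an in-memory control block rather than anything on disk, here is a minimal Python sketch. It is a toy model only: the header layout, the block-id and block-length values, and the names `make_fab` and `toy_open` are assumptions made for illustration and are not taken from RMS or FCS sources.

```python
# Toy model of an in-memory control block whose header is validated by the
# service that receives it. All constants and names are illustrative
# assumptions, not values quoted in this thread.
FAB_BID = 3    # assumed "block id" marker stored in byte 0
FAB_BLN = 80   # assumed "block length" stored in byte 1

STATUS_OK = "RMS$_NORMAL"
STATUS_BAD_FAB = "RMS$_FAB"


def make_fab(filename: str) -> bytearray:
    """Build a toy control block: [block id, block length, file name bytes...]."""
    block = bytearray(FAB_BLN)
    block[0] = FAB_BID
    block[1] = FAB_BLN
    name = filename.encode("ascii")[: FAB_BLN - 2]
    block[2 : 2 + len(name)] = name
    return block


def toy_open(block: bytearray) -> str:
    """Check the block header before any file access is attempted."""
    if len(block) < 2 or block[0] != FAB_BID or block[1] != FAB_BLN:
        return STATUS_BAD_FAB
    return STATUS_OK


fab = make_fab("DAF.DAT")
print(toy_open(fab))   # header intact
fab[0] = 0xFF          # simulate another thread stepping on the block
print(toy_open(fab))   # header no longer valid
```

In this model the file on disk is never consulted: overwriting one byte of the block in memory is enough to make the call fail, which is why a perfectly healthy file could still produce this status.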
Most likely, the problem you have results from some form of memory
corruption taking place inside the FCS. This conference is littered
with similar problems where files are being left open or other errors
are occurring.
The problem with FCS is that it is an extremely complex piece of
software which performs the file cabinet manipulation as does ALL-IN-1
but also has to worry about authentication and communication with
clients AND running multiple threads. You might get this error because
of something else which someone completely different has asked the FCS
to perform.
The most probable cause of this error is corruption of the DAF
records in the SDAF, PDAFs in shared drawers or PENDING. I'd suggest
that you try to get these files checked out or at the very least see if
TRU/TRM is getting run on the site and if any problems are being
reported. [CSCs have some tools which can help here.]
You might well check your FCS logs and see if you've been getting
things like thread ACCVIOs or other conditions - these would be
confirmation that the FCS is getting this type of error.
Paul.
|
2731.10 | Probably not the FCS... | CHRLIE::HUSTON | | Tue Jun 08 1993 15:20 | 14 |
|
re .8 and .9
I don't think it is the FCS, simply because the FCS would return a
status of OafcRmsError, not the actual RMS error. You would simply
get an error saying there was an RMS error; if you hit GOLD-W you would
then possibly see the actual RMS error. In the FCS, if ANY RMS
operation returns an error, it is masked to OafcRmsError (same for
DASL errors, they go to OafcDaslError). Why? Simple: the person
to whom the error is returned may be non-VMS, in which case giving
them an RMS error would be meaningless.
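
The masking described above can be sketched roughly as follows in Python. This is an illustration only, not the FCS code: the `FcsResult` type, the `mask_error` function and the pass-through behaviour for other facilities are all hypothetical; only the names OafcRmsError and OafcDaslError come from this note.

```python
from typing import NamedTuple, Optional


class FcsResult(NamedTuple):
    """Hypothetical result pair: generic status plus optional native detail."""
    status: str
    extended: Optional[str]


# Facility -> generic status, as described in the note above.
MASKS = {"RMS": "OafcRmsError", "DASL": "OafcDaslError"}


def mask_error(facility: str, native_status: str) -> FcsResult:
    """Replace a facility-specific error with a portable generic status.

    The native status is kept as 'extended' detail so that a client which
    understands it can still display it on request.
    """
    generic = MASKS.get(facility)
    if generic is None:
        # Assumption for this sketch: unknown facilities pass through as-is.
        return FcsResult(native_status, None)
    return FcsResult(generic, native_status)


result = mask_error("RMS", "RMS$_FAB")
print(result.status)    # generic status any client can handle
print(result.extended)  # native detail for clients that want it
```

The design choice is that a non-VMS client only ever has to handle the small set of generic statuses, while the native detail travels alongside for anyone who can interpret it.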
--Bob
|
2731.11 | Sure looks like FCS | IOSG::CHINNICK | gone walkabout | Tue Jun 08 1993 15:40 | 15 |
|
Not the FCS? I might just beg to differ on that count.
Well, the text quoted is for OafcRmsError status from OAFC$MESSAGES.MSG.
There are no instances of this status or message in the IOS code.
And the FCS does return the 'extended' RMS error status does it not?
(Or so the sources would seem to indicate.)
ALL-IN-1 reports the extended status as well as the Oafc status.
Paul.
|
2731.12 | Ok, so I was wrong... | CHRLIE::HUSTON | | Tue Jun 08 1993 17:17 | 28 |
|
Ok, I was under the impression that OafcRmsError was not being
returned, just the actual RMS error.
If OafcRmsError is being returned then the error is definitely coming
from the FCS. Sorry for the misunderstanding.
Therefore, there are probably 2 ways this can occur:
1) Internal FCS corruption as you pointed out, the FCS built the
FAB and when it later used it, something had stepped on some
portion of it.
2) The information that the FCS is reading to build the FAB is bad.
I would lean towards 2 for the simple fact that if something is
corrupting memory, it would have a tendency to show itself in a lot
of ways (depending on the size of the corruption of course), and would
tend to go away when the memory that is corrupted is freed.
The FCS gets the info from a variety of places, mostly from either
RMS itself, FC files (DOCDB, DAF etc), or from previous functions.
I will go back and re-read this string and see if anything jumps out
at me. I have not had time to keep up with all the possible FCS
problems and this is one that I haven't been reading.
--Bob
|
2731.13 | It's back again ! | BIS1::DESTRIJCKER | Back again to the home town | Thu Jun 10 1993 13:44 | 33 |
| Hi again,
Yes, it's happened again. A user is using RFD to refile from one
personal drawer to another personal drawer and gets the "Invalid FAB
or FAB not accessible" error. He encountered the same problem last week
and as usual it went away, but it came back.
I looked at the system, which isn't very busy at all. The disk has got
over 300000 blocks free and no errors. I can't analyse his .DAT files
since he's got them open. The user has enough disk quota left.
Paul,
on the subject of file cabinet server logs, the oafc$server.log does
have the same error in it. Unfortunately I only have today's log file
left. oafc$server_error.log is empty. The startup file claims the
server was started successfully. Would it be worth running the server
as a foreground job?
TRM is scheduled for this weekend. There seems to be something wrong
here. I've got 2 sm_fcvr_mail_area log files 5 minutes apart. Both have
the SMJACKET error: Internal error in housekeeping procedure,
performing %SMJACKET exit and cleanup processing. It then starts the
servers (3 of them).
Would it be a good idea to schedule TRU as well the day after, since
it happens to be a Sunday?
Any further suggestions or ideas are more than welcome.
Regards,
Wivine.
|
2731.14 | Couple things to do... | CHRLIE::HUSTON | | Thu Jun 10 1993 15:01 | 36 |
|
re .13
>on the subject of file cabinet server logs, the oafc$server.log does
>have the same error in it. Unfortunately I only have today's log file
>left. oafc$server_error.log is empty. The startup file claims the
>server was started successfully. Would it be worth running the server
>as a foreground job?
All this would do for you is that, instead of writing the invalid FAB
error to the log file, you would see it on the screen. Without the
source code, running in the foreground is not very useful.
If you are sure that this is being done by the RFD, then contact
me off line. I have one thing you can try that will give me more
information (like what FCS routines are being called). I would rather
not put it in here since I don't want everyone doing it, and I am
not positive it will work.
Another thing to try is: Get the system to a state where this is
easily reproducible for a user, say user X. Enable server tracing. Have
the user do whatever he needs to do to get the error. Filter out any
session information not related to the user, then post the formatted
log file here. It should not be too big after you filter out all the
other sessions. Leave EVERYTHING that has to do with this user's
session.
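
As a rough idea of the filtering step, here is a small Python sketch. The real layout of the formatted trace log is not described in this thread, so the `session=<id>` field used below is purely a hypothetical format, as is the function name `filter_session`; adjust the pattern to whatever the formatted log actually contains.

```python
import re

# Hypothetical log layout: records belonging to a session carry a
# "session=<id>" field; server-wide records carry no such field.
SESSION_FIELD = re.compile(r"\bsession=(\w+)\b")


def filter_session(lines, session_id):
    """Drop records tagged with another session; keep everything else."""
    kept = []
    for line in lines:
        match = SESSION_FIELD.search(line)
        # Untagged lines are kept because they may still be relevant.
        if match is None or match.group(1) == session_id:
            kept.append(line)
    return kept


trace = [
    "session=12 open DAF",
    "session=7 open DOCDB",
    "server started",
    "session=12 invalid FAB",
]
for record in filter_session(trace, "12"):
    print(record)
```

Only records explicitly tagged with a different session are removed, which errs on the side of leaving in everything that might relate to the user's session.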
Can you also do an $analyze/image sys$system:oafc$server.exe and
sys$share:oafc$client_shr.exe and tell me what the image IDs of them
are?
Thanks
--Bob
|
2731.15 | Not helpful... but... | IOSG::CHINNICK | gone walkabout | Thu Jun 10 1993 15:07 | 24 |
| Hi Wivine...
Well, it's kind of difficult to say what you should do here.
My money would be on there being a problem with one or more DAF
records.
We're investigating the FCS at the moment because I think it doesn't
cope with corrupt DAF records. Unfortunately, certain forms of corrupt
DAF record are not corrected by FCVR either!
Then, it may not be either of the users/drawers involved which is
responsible but another completely separate thread. Fun - huh?!
CSC could probably find any DAF corruption and cure it, but it is beyond
the allowable scope here. The tool they can use is 'restricted use'.
Of course - it might be something completely different, but I think I'd
offer long odds.
I'll have to give this some thought as to how to proceed. In the
meantime, I fully expect it to recur regularly.
Paul.
|
2731.16 | So long and thanks for the fish. | BIS1::DESTRIJCKER | Back again to the home town | Wed Jun 16 1993 16:32 | 12 |
|
I haven't forgotten you all, honest.
I have organised things so that the users who have encountered this nice
Invalid FAB error will contact me, and I can switch FCS tracing on
without delay, let them reproduce the error and hopefully pass
on to you (I'll have a look at it too, so you don't feel lonely)
some valuable information.
Talk to you in the near future, I hope.
Wivine.
|
2731.17 | Stumble .. stumble ... ouch! | BIS1::DESTRIJCKER | Back again to the home town | Fri Jun 18 1993 13:18 | 24 |
|
Hi,
This may or may not be related, but last night I stumbled on 733 new
mail messages in the postmaster's account. The first one, and also the
oldest, dated 17-Nov-1987, had a NOTED status, NO header and NO text.
There was a 0 block .TXT file in one of the shared areas though.
The second message in the list was a bit younger, i.e. 29-Dec-1992, had
a READ status and looked OK for the rest. Both messages were in the
INBOX folder!
I cleared out all these messages, noticed that the mail count was out
by 40 but I could not delete these 2 messages. I removed them manually.
Verifying the DOCDB, it complained that the MAIL_ORIG field contained
invalid characters. I tried to read this field to no avail. Funny thing
is that all subsequent messages received get the same complaint about
these invalid characters in the DOCDB field MAIL_ORIG. Does this mean
the postmaster's DOCDB isn't healthy? Should I give it a new one?
In the meantime FCV is still OK.
Wivine.
|
2731.18 | Don't worry about that... | IOSG::CHINNICK | gone walkabout | Fri Jun 18 1993 13:45 | 23 |
|
Hi Wivine...
Don't worry about the MAIL_ORIG field - it's because DOCDB has changed
layout in V3.0 that you'll get that. In fact - don't worry about DOCDB
at all - it won't cause any problems normally.
Much more to worry about is the DAF files... SDAF and PDAF because
these have a much more complex structure and if they go wrong, nasty
things start happening. If you have problems on your DAF files that is
the most likely thing to cause errors such as those observed.
POSTMASTER is important in the context of MAIL delivery - it's used by
Sender/Fetcher - but it probably isn't too relevant to FCS. I'd
concentrate on the SDAF files and the DAF.DAT files in drawer
directories.
Cleaning out POSTMASTER regularly is a good idea for the health and
performance of your MAIL system, however.
Regards,
Paul.
|
2731.19 | Keep digging . . . . . | BIS1::DESTRIJCKER | Back again to the home town | Fri Jun 18 1993 15:02 | 6 |
|
Hi Paul,
DAFs it'll be then, in whatever format.
Wivine.
|
2731.20 | Shouldn't POSTMASTER be set to NOMAIL anyway? | IOSG::PYE | Graham - ALL-IN-1 Sorcerer's Apprentice | Fri Jun 18 1993 15:50 | 1 |
|
|
2731.21 | But of course. | BIS1::DESTRIJCKER | Back again to the home town | Mon Jun 21 1993 10:05 | 8 |
| Graham,
Yes, indeed. The SENDER and FETCHER accounts used by the sender and
fetcher are set to NO MAIL. The POSTMASTER account isn't; maybe it
should be. The messages it receives are mainly delivery failures from
messages sent through X400 by people who are not authorized to do so.
Wivine.
|
2731.22 | Working... | IOSG::CHINNICK | gone walkabout | Mon Jun 21 1993 11:34 | 23 |
|
Wivine,
OK... you are running Concurrent Sender/Fetcher so my comments about
"keeping clean" apply to SENDER and FETCHER accounts. Even so, these
accounts are still not related to your FCS problems I'd expect.
As for POSTMASTER being set to NOMAIL... I'm not too sure about what
effect this would have. GAP might have a better idea? In any event -
this is a side issue.
I should also mention that we are still looking at the FCS code. Looks
a bit dodgy in a few places! I always seem to arrive 12 months too
late to circumvent these problems!
With luck, any PFR might benefit from this detailed probing of FCS
entrails. Not sure about producing any patches at this stage.
Will keep you posted,
Paul.
|