T.R | Title | User | Personal Name | Date | Lines |
---|
2840.1 | I fear the worst. | IOSG::CHINNICK | gone walkabout | Thu Jun 10 1993 16:51 | 57 |
|
Hi Laura,
well... I don't have to tell you that there is not a lot to go on, but
I'll have a quick stab at explaining what might be happening.
If the user has been getting multiple copies of MAIL, it suggests that
there is something going wrong with attempts to deliver the MAIL. Most
likely, the MAIL Fetcher is crashing out and then when it restarts for
its next run it tries to deliver the MAIL again. This will happen until
the retry limit is hit.
The fact that the file(s) have disappeared is an indication that the
usage count stored in the SDAF was wrong. This is again consistent with
some failure during delivery whereby the count has not been correctly
set - again possibly because the Fetcher has crashed out or because of
some other corruption.
You can check your system retry limits by getting the system symbols:
<GET SYS$OA$MTI_FETCHER_RETRY_LIMIT
<GET SYS$OA$MTI_SENDER_RETRY_LIMIT
This might just match up with the number of copies of the mail the user
is receiving.
I'd also have a quick look at the OAMTIMAIL batch logs although these
were probably purged out of existence if this problem is more than 10
minutes old. Instead, you can look in the OA$MTI_ERR file and
OA$MTI_LOG file to see when the mail message was processed and whether
you have been getting ACCVIOs and the like.
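If it's easier, you can skim those files from DCL with SEARCH. This is only
a sketch - it assumes the OA$MTI_ERR and OA$MTI_LOG names translate to the
actual log files on your system, and the date string is just a placeholder
for the day the message went through:
$ ! Sketch only - OA$MTI_ERR / OA$MTI_LOG assumed to point at the log files
$ SEARCH OA$MTI_ERR "ACCVIO","-F-","-E-"    ! access violations and other errors
$ SEARCH OA$MTI_LOG "dd-mmm-yyyy"           ! entries from the day in question (placeholder date)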
If the OA$MTI_ERR file is full of errors - and it's not uncommon on
V2.4 systems - then it often reflects a lot of corruption in your SDAFs
and sometimes the POSTMASTER account. It's usually a good idea to clean
this account's cabinet out periodically as it can clog up when you get
system failures. The PENDING file is the other file which is very
sensitive to corruption. Finding the corruption is very difficult
without a lot of specialist knowledge.
CSC's have tools which can assist with this type of activity but
unfortunately, they're not generally available internally because they
can easily wipe your system out if used improperly.
One unfortunate consequence of these types of problems is that
sometimes they aren't directly related to the documents at hand. If you
get a bad NBS file off an unpatched system it can sometimes cause a
creeping sort of corruption - a bit like cancer really. And when an
error does appear, it may be reported against a later message in the
same Fetcher run, not the one that caused it. Things can get very complex.
K602 had one major fix to help with this by making the file cabinet
code more robust, but even with that you can still get problems. I'm
not too sure if any further patches help - I've been out of the loop
for too long - but someone else might be able to comment.
Paul.
|
2840.2 | I fear the unknown... | TROOA::PIGGOT | | Thu Jun 10 1993 19:48 | 29 |
| Paul -
Thanks for your quick reply on this. Interestingly enough, the sender
and fetcher retry limits are actually set to 5 - the user only received
two copies of the messages in question.
I had already examined the oa$mti_err file, and although there are plenty
of errors, one type in particular stands out over the two-day period in
question, and it doesn't show up at any of the other Canadian ALL-IN-1
sites:
26-MAY-1993 11:22:20  %MROUTER-F-PROTOFAIL, Protocol violation!AS
26-MAY-1993 11:22:20  -SYSTEM-F-PATHLOST, path to network partner node lost
26-MAY-1993 14:43:34  %MROUTER-F-PROTOFAIL, Protocol violation!AS
26-MAY-1993 14:43:35  -SYSTEM-F-LINKABORT, network partner aborted logical link
I don't know if there is any significance to these errors, but what I
am primarily concerned with right now is determining whether I am
dealing with an isolated incident or with a more serious problem here
(and if so, how it can be analyzed).
Regards,
Laura
|
2840.3 | I don't think it's the fetcher | FORTY2::ASH | Grahame Ash @REO | Fri Jun 11 1993 10:43 | 17 |
| .0 says that the received documents are identical - the user actually has 2
pointers to the same shared message. I'd say that that moves the focus away
from delivery - which only delivers to the pending file, not the File Cab -
and to the user's own FileCab. It appears that get_pending has brought in the
same entry twice.
I've no idea what could cause that to happen! If it's only 1 user, then that
implies it's not a problem with the whole of the pending file - though perhaps
there's a minor corruption in this user's record? You could try deleting it.
The other file involved is the DOCDB - perhaps there was a transient problem
writing the new record, so it was retried? Has the user got enough disc quota?
You could try FC FCO to create a new DOCDB.
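If it helps, the quota question is quick to check from DCL. Just a sketch -
the UIC and disk name below are placeholders for the real account and the
disk holding his file cabinet:
$ ! Sketch only - substitute the user's UIC and the relevant disk
$ SHOW QUOTA/USER=[USERNAME]/DISK=DISK$USERS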
Not much, sorry!
g
|
2840.4 | Thanks for correcting me Grahame. | IOSG::CHINNICK | gone walkabout | Fri Jun 11 1993 12:33 | 37 |
|
Laura - if you fear the unknown - then ALL-IN-1 is going to scare you
to death! :-)
Grahame might well be right - I must admit to having [foolishly]
overlooked the fact that the filespec was the same. It could still be
that Fetcher has created 2 pending entries for the same user but it
does seem less likely than some problem with GET_PENDING. Of course,
this type of problem could also arise from a corrupt NBS file with
actual or apparent duplicate addressees - the possibilities are endless.
I think that you could reasonably assume that if there is only one user
being affected and they only got 2 copies then you have not got too
serious a problem. If they had got a copy on each retry, or if ALL
messages were being duplicated, then it would be more serious.
That is not to say that it won't recur, but we can't really say from
this information what happened. These MR errors (I understand from talking
to my colleagues) usually relate to some attempt to continue an MR
'session' after some other error has caused Fetcher to shut down its
link. What we can't see is what this previous error might have been,
but disk space or other environment problems are usually the best bet.
You might get a hint from any preceding messages. Generally, they are
unlikely to be related to a problem with GET_PENDING though.
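As a rough check on the environment angle, free space is easy to eyeball
from DCL; this is only a sketch and the device name is a placeholder, not
your real data disk:
$ ! Sketch only - substitute the disk(s) holding the ALL-IN-1 data and user areas
$ SHOW DEVICE DISK$ALLIN1     ! look at the Free Blocks column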
In the MTI logs, the most serious errors are those without such nice
meaningful information. MR interface errors are not as likely to cause
multiple delivery as Fetcher errors like VM problems or ACCVIOs. And if
the filespec is the same, then GET_PENDING would be the place to look.
But the place to concentrate would be the user's PENDING record and
DOCDB. Could anything have happened during a GET_PENDING, or while the
user was entering ALL-IN-1 or EM?
Paul.
|
2840.5 | Code looks OK | IOSG::CHINNICK | gone walkabout | Fri Jun 11 1993 12:59 | 15 |
|
A further quick check of the code reveals that you should get an error
message if I/O to the PENDING file fails. Did the user see any errors
flash up at any time in the MAIL subsystem?
Did the user's process get killed or did the system crash while they
were doing MAIL operations?
It has to be some problem like this. It may be worthwhile just running
an $ ANALYZE/RMS/CHECK over PENDING.DAT when ALL-IN-1 is down to ensure
that you haven't got any RMS structure errors.
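In full, the DCL would be something like this (a sketch only - no particular
directory is assumed, so SET DEFAULT to wherever PENDING.DAT lives on your
system first):
$ ! Sketch only - run while ALL-IN-1 is shut down so the file isn't open elsewhere
$ ANALYZE/RMS_FILE/CHECK PENDING.DAT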
Other than that - no ideas. But it probably won't come back.
Paul.
|
2840.6 | We'll drive on then... | TROOA::PIGGOT | | Fri Jun 11 1993 14:37 | 20 |
| Grahame and Paul -
Thanks for all of your help so far. Based on what both of you have
said, this would seem to be a problem at the user level rather than
the system level. I will check with the user to make sure that we have
all of the information (i.e. did they see any other strange behavior,
system crash, disk quota problems etc.) at the time these problems
started occurring.
If your opinions are that this feels like an isolated incident
rather than something more serious, we will concentrate on trying to
retrieve his missing files and continue to monitor for other
problems. I think that I was looking for reassurance that this might
not be the start of something major.
I will post any other information that comes to light on this problem.
Thanks again,
Laura
|
2840.7 | Maybe known, but not fixed | PRSSOS::PROT | | Thu Jun 17 1993 12:19 | 37 |
|
This problem isn't so strange to me, because I have a customer who has
been experiencing it for three years. I know that it also occurred once
in the States (it was SRPed).
The cause is that, for an unknown reason, GET_PENDING fails part-way
through its work.
By that point it has already created some new entries in the DOCDB, but
because it fails it does not remove the corresponding pointers from the
pending record. The next II (or GET_PENDING) will therefore create new
DOCDB records for the same pointers all over again. If the failure
occurs 3 times, you will have 3 DOCDB records for the messages
corresponding to the first pointers in your pending record.
But, and here is the major problem, the usage count in the SDAF has only
been incremented by one, because it is incremented when the pointer is
added to the pending record. If, before the next TRM run, the user
deletes more of the duplicate messages from their DOCDB than the usage
count covers, that can of course cause the file to be deleted from the
shared directory.
This problem has been CLDed, and we have worked a lot on it with Alan
Cottingham without success. A patch is on site which records any problem
during GET_PENDING, but every time the real problem occurs it is
silent!
I suspect a disk quota problem while the DOCDB is being extended, but
that error is normally checked for by the GET_PENDING routine.
We are still working on that.
Regards
Louis
|
2840.8 | Yes - I know the USAGE COUNT problem | IOSG::CHINNICK | gone walkabout | Thu Jun 17 1993 12:36 | 32 |
|
Louis,
This is exactly what I concluded looking at the code.
The usage count is not updated if the GET_PENDING fails to complete. We
know about this.
It explains why the documents disappeared here: there were 2 references
but a usage count of only 1, so when one copy was deleted the usage
count went to zero. The real question is how and why GET_PENDING failed.
I know that this and similar problems have been seen before but no
cause could ever be traced.
As I said in my earlier reply... the most probable cause is failure
of the process or system during the GET_PENDING operation. This problem
is therefore more likely to occur on systems with "process killers"
running or with lots of system crashes (software or hardware). In such
cases, no diagnostics will ever be produced.
Failing that, an error might be being signalled but only be visible in
the message window (GOLD\W), depending upon whether any
other messages (or OA$MSG_PURGE) appear after it. If this is the case,
then only setting up tracing of messages could help detect it.
That's always been the problem with ALL-IN-1 - no journalling or
recovery and barely any error logging. In "abnormal" circumstances,
these are the problems which arise.
Paul.
|
2840.9 | no killer | PRSSOS::PROT | | Thu Jun 17 1993 14:09 | 26 |
| Paul,
No process killer in the cases I know of, because the problem occurs
during an II done by an interactive user. If the process had been killed,
the user would have noticed!
We provided (with Alan) a patch which, as I said, records in OA$MTI_ERR
every error occurring during GET_PENDING, together with the name of the
user doing it. (These errors are named *FMTPROC*.)
What is clear is that each time such a record has been found in
OA$MTI_ERR, no duplicate problem occurred for that user, because the error
was correctly handled by the GET_PENDING routine and the pending pointers
read before the error were cleared from the pending record.
BUT, when a real duplicate problem occurs, no trace is found in
OA$MTI_ERR....
I suspect either the trace code isn't sufficient, or the error occurs
deeper in the sub-routines called by GET_PENDING. That is the direction
in which we have to investigate now. I will contact Alan today.
Louis
|
2840.10 | Variation on a theme | FLEX7::ALLINGHAM_PD | Permenantly Peaking! | Fri Jun 18 1993 18:39 | 18 |
| I had a similar problem to this one last week. A user mailed another
user on a remote system. The remote user had had his account
temporarily set to NO MAIL as this is how they deal with things when
someone goes on extended leave (what's wrong with auto reply I hear you
cry!).
In this case, the user received two identical delivery failures both
dated at the same time, with the diagnostic that the remote user was
set to NO MAIL.
I haven't had a chance to look at the log files yet as there are other
things ahead of it.
Regards,
Peter.
P.S. It's 2.4.
|
2840.11 | RE: .10 | IOSG::SHOVE | Dave Shove -- REO2-G/M6 | Mon Jun 21 1993 12:12 | 9 |
| RE: .10
Is it possible that the user (with NO MAIL set) was on the distribution
twice (appearing in two distribution lists, for example)? Whilst
ALL-IN-1 tries not to deliver two copies of a message in these
circumstances, I don't think there's any logic to avoid generating two
delivery failures.
Dave.
|