Conference iosg::all-in-1_v30

Title:*OLD* ALL-IN-1 (tm) Support Conference
Notice:Closed - See Note 4331.l to move to IOSG::ALL-IN-1
Moderator:IOSG::PYE
Created:Thu Jan 30 1992
Last Modified:Tue Jan 23 1996
Last Successful Update:Fri Jun 06 1997
Number of topics:4343
Total number of notes:18308

2840.0. "Multiple Copies of Mail Received" by TROOA::PIGGOT () Thu Jun 10 1993 14:42

	I have a user with a problem that has me stumped.  Thus far, he 
	is the only one that has reported this type of strange behavior, 
	and the symptoms do not seem easily explained (at least by me!).

	Within a particular 2-day period, this user received at least 
	three mail messages twice.  The Message Router ids are identical 
	(they are all remote mail messages), as are the shared directory 
	file specifications.  In all respects these messages are identical 
	except for document number.  The distribution lists are not 
	particularly large, nor did they originate from the same source.

	The following day (according to him), message bodies that used 
	to exist seemed to vanish for no reason.  Roughly 6 mail messages
	have headers but no bodies (the files do not exist on disk), and 
	the shared directories in which these files should be are scattered 
	across 6 different disks, which seems to rule out disk failure.  
	All of these messages were received within the same 2-day period, and
	we have not run an FCVR during this time. 

	The only thing that I can think of that could possibly cause these 
	types of symptoms would be restoring an older docdb.dat that still 
	had pointers to files that no longer existed on disk (i.e. the 
	user deleted them previously), but this is apparently not the case, 
	nor can I understand how that might cause multiple copies of mail 
	messages to be received. 

	This is an internal Digital site (in Canada), running ALL-IN-1 
	V2.4 patched to K602, with Message Router V3.2.  Can anyone think 
	of what could possibly cause this type of behavior?  I have run 
	out of alternatives.

	Thanks in advance,

	Laura Piggot,
	Canadian MTS Manager

    
2840.1. "I fear the worst." by IOSG::CHINNICK (gone walkabout) Thu Jun 10 1993 16:51 (57 lines)
    
    Hi Laura,
    
    well... I don't have to tell you that there is not a lot to go on, but
    I'll have a quick stab at explaining what might be happening.
    
    If the user has been getting multiple copies of MAIL, it suggests that
    there is something going wrong with attempts to deliver the MAIL. Most
    likely, the MAIL Fetcher is crashing out and then when it restarts for
    its next run it tries to deliver the MAIL again. This will happen until
    the retry limit is hit.
    
    The fact that the file(s) have disappeared is an indication that the
    usage count stored in the SDAF was wrong. This is again consistent with
    some failure during delivery whereby the count has not been correctly
    set - again possibly because the Fetcher has crashed out or because of
    some other corruption.
    
    You can check your system retry limits by getting the system symbols:
    
    	<GET SYS$OA$MTI_FETCHER_RETRY_LIMIT
    	<GET SYS$OA$MTI_SENDER_RETRY_LIMIT
    
    This might just match up with the number of copies of the mail the user
    is receiving.
    
    I'd also have a quick look at the OAMTIMAIL batch logs although these
    were probably purged out of existence if this problem is more than 10
    minutes old. Instead, you can look in the OA$MTI_ERR file and
    OA$MTI_LOG file to see when the mail message was processed and whether
    you have been getting ACCVIOs and the like.
    
    If the OA$MTI_ERR file is full of errors - and it's not uncommon on
    V2.4 systems - then it often reflects a lot of corruption in your SDAFs
    and sometimes the POSTMASTER account. It's usually a good idea to clean
    this account's cabinet out periodically as it can clog up when you get
    system failures. The PENDING file is the other file which is very
    sensitive to corruptions. Finding the corruption is very difficult
    without a lot of specialist knowledge.
    
    CSC's have tools which can assist with this type of activity but
    unfortunately, they're not generally available internally because they
    can easily wipe your system out if used improperly.
    
    One unfortunate consequence of these types of problems is that
    sometimes they aren't directly related to the documents at hand. If you
    get a bad NBS file off an unpatched system it can sometimes cause a
    creeping sort of corruption - a bit like cancer really. And when you
    get an error it might be a later message in the same Fetcher run.
    Things can get very complex.
    
    K602 had one major fix to help with this by making the file cabinet
    code more robust, but even with that you can still get problems. I'm
    not too sure if any further patches help - I've been out of the loop
    for too long - but someone else might be able to comment.
    
    Paul.

2840.2. "I fear the unknown..." by TROOA::PIGGOT () Thu Jun 10 1993 19:48 (29 lines)

    Paul -
    
    Thanks for your quick reply on this.  Interestingly enough, the sender
    and fetcher retry limits are actually set to 5 - the user only received
    two copies of the messages in question.  
    
    I had already examined the oa$mti_err file, and although there are plenty 
    of errors, one type in particular does stand out that took place over the 
    two day period, that doesn't show up on any of the other Canadian ALL-IN-1
    sites:
    
    26-MAY-1993 11:22:20           %MROUTER-F-PROTOFAIL, Protocol violation!AS
    26-MAY-1993 11:22:20           -SYSTEM-F-PATHLOST, path to network partner node lost
    
    26-MAY-1993 14:43:34           %MROUTER-F-PROTOFAIL, Protocol violation!AS
    26-MAY-1993 14:43:35           -SYSTEM-F-LINKABORT, network partner aborted logical link
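
    When comparing one site's error file against the others, it can help
    to tally entries by date and message code. The following is only a
    minimal sketch in Python, assuming every entry has the shape of the
    lines quoted above (date, time, then %FACILITY-S-IDENT or
    -FACILITY-S-IDENT). The names tally_errors, LINE_RE and SAMPLE are
    invented for illustration and are not part of ALL-IN-1 or Message
    Router.

```python
import re
from collections import Counter

# Assumed line shape, taken only from the excerpts quoted in this note:
#   DD-MON-YYYY HH:MM:SS   %FACILITY-S-IDENT, text
#   DD-MON-YYYY HH:MM:SS   -FACILITY-S-IDENT, text
LINE_RE = re.compile(
    r"^\s*(?P<date>\d{1,2}-[A-Z]{3}-\d{4})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"[%-](?P<facility>[A-Z0-9$_]+)-(?P<severity>[A-Z])-(?P<ident>[A-Z0-9_]+),"
)


def tally_errors(lines):
    """Count (date, facility, ident) triples; non-matching lines are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # blank lines and wrapped continuation text do not match
            counts[(m["date"], m["facility"], m["ident"])] += 1
    return counts


# Hypothetical input: the same entries as quoted in this note.
SAMPLE = """\
26-MAY-1993 11:22:20           %MROUTER-F-PROTOFAIL, Protocol violation!AS
26-MAY-1993 11:22:20           -SYSTEM-F-PATHLOST, path to network partner node lost

26-MAY-1993 14:43:34           %MROUTER-F-PROTOFAIL, Protocol violation!AS
26-MAY-1993 14:43:35           -SYSTEM-F-LINKABORT, network partner aborted logical link
""".splitlines()

if __name__ == "__main__":
    for key, n in sorted(tally_errors(SAMPLE).items()):
        print(n, *key)
```

    Running the same tally over the error file from each site would show
    whether a code such as PROTOFAIL is confined to one system and to the
    two days in question.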
    
    I don't know if there is any significance to these errors, but what I
    am primarily concerned with right now is determining whether I am
    dealing with an isolated incident, and/or if I have a serious problem
    here (if so, how can it be analyzed?).
    
    Regards,
    
    Laura
    
    
2840.3. "I don't think it's the fetcher" by FORTY2::ASH (Grahame Ash @REO) Fri Jun 11 1993 10:43 (17 lines)

.0 says that the received documents are identical - the user actually has 2 
pointers to the same shared message. I'd say that that moves the focus away 
from delivery - which only delivers to the pending file, not the File Cab - 
and to the user's own FileCab. It appears that get_pending has brought in the 
same entry twice.

I've no idea what could cause that to happen! If it's only 1 user, then that 
implies it's not a problem with the whole of the pending file - though perhaps 
there's a minor corruption in this user's record? You could try deleting it.

The other file involved is the DOCDB - perhaps there was a transient problem
writing the new record, so it was retried? Has the user got enough disc quota? 
You could try FC FCO to create a new DOCDB.

Not much, sorry!

g

2840.4. "Thanks for correcting me Grahame." by IOSG::CHINNICK (gone walkabout) Fri Jun 11 1993 12:33 (37 lines)
    
    Laura - if you fear the unknown - then ALL-IN-1 is going to scare you
    to death! :-)

    Grahame might well be right - I must admit to having [foolishly]
    overlooked the fact that the filespec was the same. It could still be
    that Fetcher has created 2 pending entries for the same user but it
    does seem less likely than some problem with GET_PENDING. Of course,
    this type of problem could also arise from a corrupt NBS file with
    actual or apparent duplicate addressees - the possibilities are endless.
    
    I think that you could reasonably assume that if there is only one user
    being affected and they only got 2 copies then you have not got too
    serious a problem. If they got a copy on each retry or ALL messages -
    then it would be more serious.

    That is not to say that it won't recur, but we can't really say from
    this information what happened. These MR errors (I understand from talking
    to my colleagues) usually relate to some attempt to continue an MR
    'session' after some other error has caused the Fetcher to shut down its
    link. What we can't see is what this previous error might have been,
    but disk space or other environment problems are usually the best bet.
    You might get a hint from any preceding messages. Generally, they are
    unlikely to be related to a problem with GET_PENDING though.
    
    In the MTI logs, the most serious errors are those without such nice
    meaningful information. MR interface errors are not as likely to cause
    multiple delivery as Fetcher errors like VM problems or ACCVIOs. And if
    the filespec is the same, then GET_PENDING would be the place to look
    at.
    
    But the place to concentrate would be the user's PENDING record and
    DOCDB. Could anything have happened during a GET_PENDING or while the
    user was entering ALL-IN-1 or EM ??
    
    Paul.
    
2840.5. "Code looks OK" by IOSG::CHINNICK (gone walkabout) Fri Jun 11 1993 12:59 (15 lines)
    
    A further quick check of the code reveals that you should get an error
    message if I/O to the PENDING file fails. Did the user see any errors
    flash up at any time in the MAIL subsystem?
    
    Did the user's process get killed or did the system crash while they
    were doing MAIL operations?
    
    It has to be some problem like this. It may be worthwhile just running
    an $ ANALYZE/RMS/CHECK over PENDING.DAT when ALL-IN-1 is down to ensure
    that you haven't got any RMS structure errors.
    
    Other than that - no ideas. But it probably won't come back.
    
    Paul.

2840.6. "We'll drive on then..." by TROOA::PIGGOT () Fri Jun 11 1993 14:37 (20 lines)

    Grahame and Paul -
    
    Thanks for all of your help so far.  Based on what both of you have
    said, this would seem to be a problem at the user level rather than
    the system level.  I will check with the user to make sure that we have
    all of the information (i.e. did they see any other strange behavior,
    system crash, disk quota problems etc.) at the time these problems
    started occurring.
    
    If your opinions are that this feels like an isolated incident
    rather than something more serious, we will concentrate on trying to
    retrieve his missing files, and continuing to monitor for other
    problems.  I think that I was looking for reassurance that this might
    not be the start of something major.
    
    I will post any other information that comes to light on this problem.
    
    Thanks again,
    
    Laura

2840.7. "Maybe known, but not fixed" by PRSSOS::PROT () Thu Jun 17 1993 12:19 (37 lines)
    
    
    This problem isn't so strange to me, because I have a customer who
    has been experiencing it for 3 years. I know that it also occurred
    once in the States (SRPed).
    
    The cause is that, for an unknown reason, GET_PENDING fails partway
    through its work.
    
    By then it has created some new entries in the DOCDB, and because it
    fails, it doesn't remove any pointers from the pending record. The
    next II (or GET_PENDING) will then create new DOCDB records for the
    same pointers.
    
    If the failure occurs 3 times, you will have 3 DOCDB records for the
    messages corresponding to the first pointers in your pending record.
    But, and here is the major problem, the usage count has only been
    incremented by one in the SDAF, because it is incremented when a
    pointer is added to the pending record. If the user deletes more of
    the duplicate messages from their DOCDB than the usage count accounts
    for, before the next TRM, the file can of course be deleted from the
    shared directory.
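
    The sequence described above can be illustrated with a toy model.
    This is only a sketch in Python of the bookkeeping as described in
    this note, not ALL-IN-1 code: the class ToyMailStore, its methods and
    the file name used in the demo are all invented for illustration,
    and it assumes the shared file is removed as soon as the usage count
    drops to zero.

```python
class ToyMailStore:
    """Hypothetical model of one user's pending record, DOCDB and the SDAF."""

    def __init__(self):
        self.usage = {}      # SDAF stand-in: shared file -> usage count
        self.files = set()   # shared files that still exist on disk
        self.pending = []    # this user's pending pointers
        self.docdb = {}      # document number -> shared file
        self._next_doc = 1

    def deliver(self, shared_file):
        # The usage count is incremented when the pending pointer is added.
        self.files.add(shared_file)
        self.usage[shared_file] = self.usage.get(shared_file, 0) + 1
        self.pending.append(shared_file)

    def get_pending(self, fail_after=None):
        """Create DOCDB records for pending pointers.

        If fail_after is given, simulate a silent failure right after that
        many records were created: the pending pointers are NOT cleared.
        Returns True on normal completion, False on simulated failure.
        """
        for created, shared_file in enumerate(self.pending, start=1):
            self.docdb[self._next_doc] = shared_file
            self._next_doc += 1
            if created == fail_after:
                return False
        self.pending.clear()
        return True

    def delete_document(self, doc_number):
        # Assumption: the file goes away once the count reaches zero.
        shared_file = self.docdb.pop(doc_number)
        self.usage[shared_file] -= 1
        if self.usage[shared_file] <= 0:
            self.files.discard(shared_file)


if __name__ == "__main__":
    store = ToyMailStore()
    store.deliver("SHARE1:MSG_A")      # hypothetical file name
    store.get_pending(fail_after=1)    # fails silently after one record
    store.get_pending()                # next II brings in the same pointer
    print("DOCDB records:", store.docdb)
    store.delete_document(1)           # user deletes the "duplicate"
    print("body still on disk:", "SHARE1:MSG_A" in store.files)
```

    In this model the user ends up with two DOCDB records for one shared
    file whose usage count is 1, so deleting either copy leaves the other
    as a header with no body.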
    
    
    This problem has been CLDed, and we worked a lot on it with Alan
    Cottingham without success. A patch is on site which records any
    problem during the get_pending, but whenever the problem occurs it
    is always silent!
    
    I suspect a disk quota problem while extending the DOCDB, but this
    error is normally checked by the GET_pending routine.
   
    
    We are still working on that.
    
    Regards
    Louis
    
2840.8. "Yes - I know the USAGE COUNT problem" by IOSG::CHINNICK (gone walkabout) Thu Jun 17 1993 12:36 (32 lines)
    
    Louis,
    
    This is exactly what I concluded looking at the code. 
    
    The usage count is not updated if the GET_PENDING fails to complete. We
    know about this.
    
    It explains why the documents disappeared here because there were 2
    references and when one was deleted, usage count went to zero. The real
    question is how and why did GET_PENDING fail?
    
    I know that this and similar problems have been seen before but no
    cause could ever be traced.
    
    As I said in my earlier reply... the most probable cause is failure
    of the process or system during the GET_PENDING operation. This problem
    is therefore more likely to occur on systems with "process killers"
    running or with lots of system crashes (software or hardware). In such
    cases, no diagnostics will ever be produced.
    
    Failing that, there might be an error being signalled but it might only
    be visible in the message window with GOLD\W depending upon whether any
    other messages (or OA$MSG_PURGE) appear after it. If this is the case,
    then only setting up tracing of messages could help detect it.
    
    That's always been the problem with ALL-IN-1 - no journalling or
    recovery and barely any error logging. In "abnormal" circumstances,
    these are the problems which arise.
    
    Paul.
    
2840.9. "no killer" by PRSSOS::PROT () Thu Jun 17 1993 14:09 (26 lines)

    Paul,
    
    
     No process killer in the cases I know of, because the problem occurs
    during an II done by an interactive user. If the process were killed,
    the user would know!
    
     As I said, we (with Alan) provided a patch which records in OA$MTI_ERR
    every error occurring during get_pending, with the name of the user
    doing it. (These errors are named *FMTPROC*)
    
     What is understandable is that each time such a record was found in
    OA$MTI_ERR, no duplicate problem occurred for the user, because the
    error is correctly handled by the get_pending routine and the pending
    pointers read before the error are cleared from the pending record.
    
     BUT, when a real duplicate problem occurs, no trace is found in
    OA$MTI_ERR....
    
     I suspect that either the trace code isn't sufficient, or the error
    occurs deeper in the subroutines called by get_pending. That is the
    direction we have to investigate now. I will contact Alan today.
    
    
    Louis 
         
2840.10. "Variation on a theme" by FLEX7::ALLINGHAM_PD (Permenantly Peaking!) Fri Jun 18 1993 18:39 (18 lines)

    I had a similar problem to this one last week.  A user mailed another
    user on a remote system.  The remote user had had his account
    temporarily set to NO MAIL as this is how they deal with things when
    someone goes on extended leave (what's wrong with auto reply I hear you
    cry!).
    
    In this case, the user received two identical delivery failures both
    dated at the same time with the diagnostic that the remote user was
    set to NO MAIL.
    
    I haven't had a chance to look at the log files yet as there are other
    things ahead of it. 
    
    Regards,
    
    Peter.
    
    P.S. It's 2.4.

2840.11. "RE: .10" by IOSG::SHOVE (Dave Shove -- REO2-G/M6) Mon Jun 21 1993 12:12 (9 lines)

    RE: .10
    
    Is it possible that the user (with NO MAIL set) was on the distribution
    twice (appearing in two distribution lists, for example)? Whilst
    ALL-IN-1 tries not to deliver two copies of a message in these
    circumstances, I don't think there's any logic to avoid generating two
    delivery failures.
    
    Dave.