T.R | Title | User | Personal Name | Date | Lines |
---|
2840.1 | I fear the worst. | IOSG::CHINNICK | gone walkabout | Thu Jun 10 1993 16:51 | 57 |
|
Hi Laura,
well... I don't have to tell you that there is not a lot to go on, but
I'll have a quick stab at explaining what might be happening.
If the user has been getting multiple copies of MAIL, it suggests that
there is something going wrong with attempts to deliver the MAIL. Most
likely, the MAIL Fetcher is crashing out and then when it restarts for
its next run it tries to deliver the MAIL again. This will happen until
the retry limit is hit.
The fact that the file(s) have disappeared is an indication that the
usage count stored in the SDAF was wrong. This is again consistent with
some failure during delivery whereby the count has not been correctly
set - again possibly because the Fetcher has crashed out or because of
some other corruption.
You can check your system retry limits by getting the system symbols:
<GET SYS$OA$MTI_FETCHER_RETRY_LIMIT
<GET SYS$OA$MTI_SENDER_RETRY_LIMIT
This might just match up with the number of copies of the mail the user
is receiving.
I'd also have a quick look at the OAMTIMAIL batch logs although these
were probably purged out of existence if this problem is more than 10
minutes old. Instead, you can look in the OA$MTI_ERR file and
OA$MTI_LOG file to see when the mail message was processed and whether
you have been getting ACCVIOs and the like.
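If it's easier, you can skim those files from DCL with SEARCH. This is only
a sketch - it assumes the OA$MTI_ERR and OA$MTI_LOG names translate to the
actual log files on your system, and the date string is just a placeholder
for the day the message went through:
$ ! Sketch only - OA$MTI_ERR / OA$MTI_LOG assumed to point at the log files
$ SEARCH OA$MTI_ERR "ACCVIO","-F-","-E-"    ! access violations and other errors
$ SEARCH OA$MTI_LOG "dd-mmm-yyyy"           ! entries from the day in question (placeholder date)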
If the OA$MTI_ERR file is full of errors - and it's not uncommon on
V2.4 systems - then it often reflects a lot of corruption in your SDAFs
and sometimes the POSTMASTER account. It's usually a good idea to clean
this account's cabinet out periodically as it can clog up when you get
system failures. The PENDING file is the other file which is very
sensitive to corruption. Finding the corruption is very difficult
without a lot of specialist knowledge.
CSC's have tools which can assist with this type of activity but
unfortunately, they're not generally available internally because they
can easily wipe your system out if used improperly.
One unfortunate consequence of these types of problems is that
sometimes they aren't directly related to the documents at hand. If you
get a bad NBS file off an unpatched system it can sometimes cause a
creeping sort of corruption - a bit like cancer really. And when an
error does appear, it may be reported against a later message in the
same Fetcher run, not the one that caused it. Things can get very complex.
K602 had one major fix to help with this by making the file cabinet
code more robust, but even with that you can still get problems. I'm
not too sure if any further patches help - I've been out of the loop
for too long - but someone else might be able to comment.
Paul.
|
2840.2 | I fear the unknown... | TROOA::PIGGOT | | Thu Jun 10 1993 19:48 | 29 |
| Paul -
Thanks for your quick reply on this. Interestingly enough, the sender
and fetcher retry limits are actually set to 5 - the user only received
two copies of the messages in question.
I had already examined the oa$mti_err file, and although there are plenty
of errors, one type in particular stands out over the two-day period in
question, and it doesn't show up at any of the other Canadian ALL-IN-1
sites:
26-MAY-1993 11:22:20  %MROUTER-F-PROTOFAIL, Protocol violation!AS
26-MAY-1993 11:22:20  -SYSTEM-F-PATHLOST, path to network partner node lost
26-MAY-1993 14:43:34  %MROUTER-F-PROTOFAIL, Protocol violation!AS
26-MAY-1993 14:43:35  -SYSTEM-F-LINKABORT, network partner aborted logical link
I don't know if there is any significance to these errors, but what I
am primarily concerned with right now is determining whether I am
dealing with an isolated incident or with a more serious problem here
(and if so, how it can be analyzed).
Regards,
Laura
|
2840.3 | I don't think it's the fetcher | FORTY2::ASH | Grahame Ash @REO | Fri Jun 11 1993 10:43 | 17 |
| .0 says that the received documents are identical - the user actually has 2
pointers to the same shared message. I'd say that that moves the focus away
from delivery - which only delivers to the pending file, not the File Cab -
and to the user's own FileCab. It appears that get_pending has brought in the
same entry twice.
I've no idea what could cause that to happen! If it's only 1 user, then that
implies it's not a problem with the whole of the pending file - though perhaps
there's a minor corruption in this user's record? You could try deleting it.
The other file involved is the DOCDB - perhaps there was a transient problem
writing the new record, so it was retried? Has the user got enough disc quota?
You could try FC FCO to create a new DOCDB.
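If it helps, the quota question is quick to check from DCL. Just a sketch -
the UIC and disk name below are placeholders for the real account and the
disk holding his file cabinet:
$ ! Sketch only - substitute the user's UIC and the relevant disk
$ SHOW QUOTA/USER=[USERNAME]/DISK=DISK$USERS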
Not much, sorry!
g
|
2840.4 | Thanks for correcting me Grahame. | IOSG::CHINNICK | gone walkabout | Fri Jun 11 1993 12:33 | 37 |
|
Laura - if you fear the unknown - then ALL-IN-1 is going to scare you
to death! :-)
Grahame might well be right - I must admit to having [foolishly]
overlooked the fact that the filespec was the same. It could still be
that Fetcher has created 2 pending entries for the same user but it
does seem less likely than some problem with GET_PENDING. Of course,
this type of problem could also arise from a corrupt NBS file with
actual or apparent duplicate addressees - the possibilities are endless.
I think that you could reasonably assume that if there is only one user
being affected and they only got 2 copies then you have not got too
serious a problem. If they had got a copy on each retry, or if ALL
messages were being duplicated, then it would be more serious.
That is not to say that it won't recur, but we can't really say from
this information what happened. These MR errors (I understand from talking
to my colleagues) usually relate to some attempt to continue an MR
'session' after some other error has caused Fetcher to shut down its
link. What we can't see is what this previous error might have been,
but disk space or other environment problems are usually the best bet.
You might get a hint from any preceding messages. Generally, they are
unlikely to be related to a problem with GET_PENDING though.
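As a rough check on the environment angle, free space is easy to eyeball
from DCL; this is only a sketch and the device name is a placeholder, not
your real data disk:
$ ! Sketch only - substitute the disk(s) holding the ALL-IN-1 data and user areas
$ SHOW DEVICE DISK$ALLIN1     ! look at the Free Blocks column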
In the MTI logs, the most serious errors are those without such nice
meaningful information. MR interface errors are not as likely to cause
multiple delivery as Fetcher errors like VM problems or ACCVIOs. And if
the filespec is the same, then GET_PENDING would be the place to look.
But the place to concentrate would be the user's PENDING record and
DOCDB. Could anything have happened during a GET_PENDING, or while the
user was entering ALL-IN-1 or EM?
Paul.
|
2840.5 | Code looks OK | IOSG::CHINNICK | gone walkabout | Fri Jun 11 1993 12:59 | 15 |
|
A further quick check of the code reveals that you should get an error
message if I/O to the PENDING file fails. Did the user see any errors
flash up at any time in the MAIL subsystem?
Did the user's process get killed or did the system crash while they
were doing MAIL operations?
It has to be some problem like this. It may be worthwhile just running
an $ ANALYZE/RMS/CHECK over PENDING.DAT when ALL-IN-1 is down to ensure
that you haven't got any RMS structure errors.
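In full, the DCL would be something like this (a sketch only - no particular
directory is assumed, so SET DEFAULT to wherever PENDING.DAT lives on your
system first):
$ ! Sketch only - run while ALL-IN-1 is shut down so the file isn't open elsewhere
$ ANALYZE/RMS_FILE/CHECK PENDING.DAT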
Other than that - no ideas. But it probably won't come back.
Paul.
|
2840.6 | We'll drive on then... | TROOA::PIGGOT | | Fri Jun 11 1993 14:37 | 20 |
| Grahame and Paul -
Thanks for all of your help so far. Based on what both of you have
said, this would seem to be a problem at the user level rather than
the system level. I will check with the user to make sure that we have
all of the information (i.e. did they see any other strange behavior,
system crash, disk quota problems etc.) at the time these problems
started occurring.
If your opinions are that this feels like an isolated incident
rather than something more serious, we will concentrate on trying to
retrieve his missing files and continue to monitor for other
problems. I think that I was looking for reassurance that this might
not be the start of something major.
I will post any other information that comes to light on this problem.
Thanks again,
Laura
|
2840.7 | Maybe known, but not fixed | PRSSOS::PROT | | Thu Jun 17 1993 12:19 | 37 |
|
This problem isn't so strange to me, because I have a customer who has
been experiencing it for three years. I know that it also occurred once
in the States (it was SRPed).
The cause is that, for an unknown reason, GET_PENDING fails part-way
through its work.
By that point it has already created some new entries in the DOCDB, but
because it fails it does not remove the corresponding pointers from the
pending record. The next II (or GET_PENDING) will therefore create new
DOCDB records for the same pointers all over again. If the failure
occurs 3 times, you will have 3 DOCDB records for the messages
corresponding to the first pointers in your pending record.
But, and here is the major problem, the usage count in the SDAF has only
been incremented by one, because it is incremented when the pointer is
added to the pending record. If, before the next TRM run, the user
deletes more of the duplicate messages from their DOCDB than the usage
count covers, that can of course cause the file to be deleted from the
shared directory.
This problem has been CLDed, and we have worked a lot on it with Alan
Cottingham without success. A patch is on site which records any problem
during GET_PENDING, but every time the real problem occurs it is
silent!
I suspect a disk quota problem while the DOCDB is being extended, but
that error is normally checked for by the GET_PENDING routine.
We are still working on that.
Regards
Louis
|
2840.8 | Yes - I know the USAGE COUNT problem | IOSG::CHINNICK | gone walkabout | Thu Jun 17 1993 12:36 | 32 |
|
Louis,
This is exactly what I concluded looking at the code.
The usage count is not updated if the GET_PENDING fails to complete. We
know about this.
It explains why the documents disappeared here: there were 2 references
but a usage count of only 1, so when one copy was deleted the usage
count went to zero. The real question is how and why GET_PENDING failed.
I know that this and similar problems have been seen before but no
cause could ever be traced.
As I said in my earlier reply... the most probable cause is failure
of the process or system during the GET_PENDING operation. This problem
is therefore more likely to occur on systems with "process killers"
running or with lots of system crashes (software or hardware). In such
cases, no diagnostics will ever be produced.
Failing that, an error might be being signalled but only be visible in
the message window (GOLD\W), depending upon whether any
other messages (or OA$MSG_PURGE) appear after it. If this is the case,
then only setting up tracing of messages could help detect it.
That's always been the problem with ALL-IN-1 - no journalling or
recovery and barely any error logging. In "abnormal" circumstances,
these are the problems which arise.
Paul.
|
2840.9 | no killer | PRSSOS::PROT | | Thu Jun 17 1993 14:09 | 26 |
| Paul,
No process killer in the cases I know of, because the problem occurs
during an II done by an interactive user. If the process had been killed,
the user would have noticed!
We provided (with Alan) a patch which, as I said, records in OA$MTI_ERR
every error occurring during GET_PENDING, together with the name of the
user doing it. (These errors are named *FMTPROC*.)
What is clear is that each time such a record has been found in
OA$MTI_ERR, no duplicate problem occurred for that user, because the error
was correctly handled by the GET_PENDING routine and the pending pointers
read before the error were cleared from the pending record.
BUT, when a real duplicate problem occurs, no trace is found in
OA$MTI_ERR....
I suspect either the trace code isn't sufficient, or the error occurs
deeper in the sub-routines called by GET_PENDING. That is the direction
in which we have to investigate now. I will contact Alan today.
Louis
|
2840.10 | Variation on a theme | FLEX7::ALLINGHAM_PD | Permenantly Peaking! | Fri Jun 18 1993 18:39 | 18 |
| I had a similar problem to this one last week. A user mailed another
user on a remote system. The remote user had had his account
temporarily set to NO MAIL as this is how they deal with things when
someone goes on extended leave (what's wrong with auto reply I hear you
cry!).
In this case, the user received two identical delivery failures both
dated at the same time, with the diagnostic that the remote user was
set to NO MAIL.
I haven't had a chance to look at the log files yet as there are other
things ahead of it.
Regards,
Peter.
P.S. It's 2.4.
|
2840.11 | RE: .10 | IOSG::SHOVE | Dave Shove -- REO2-G/M6 | Mon Jun 21 1993 12:12 | 9 |
| RE: .10
Is it possible that the user (with NO MAIL set) was on the distribution
twice (appearing in two distribution lists, for example)? Whilst
ALL-IN-1 tries not to deliver two copies of a message in these
circumstances, I don't think there's any logic to avoid generating two
delivery failures.
Dave.
|