
Conference pamsrc::decmessageq

Title:NAS Message Queuing Bus
Notice:KITS/DOC, see 4.*; Entering QARs, see 9.1; Register in 10
Moderator:PAMSRC::MARCUSEN
Created:Wed Feb 27 1991
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2898
Total number of notes:12363

2793.0. "help! - dmqld caught signal 11 in Unix v3.0C, need to explain" by WHOS01::ELKIND (Steve Elkind, Digital SI @WHO) Fri Feb 28 1997 16:53

We have experienced a weird problem at my customer's site.
    Setup:
        local group: # 1104, v3.2A ECO 1, on HP-UX 10.20 with ServiceGuard
                     this group is the initiator for all xgroup connects,
                     and has xgroup verification turned ON (the group is
		     using a ServiceGuard "floating IP")
        remote groups:  5 groups running v3.0C on Solaris - 700, 710, ..., 740
			  (these 5 groups have verification turned on)
                     a bunch of other groups running v3.0B on HP
                     one other group running v3.0A on OpenVMS/Alpha

    Problem: All 5 7xx groups have "ld, link receiver for group 710 from
		group 1104 is running" followed immediately by "ld, caught
		signal 11" for that pid.
		After this, we start getting log file entries about a
		duplicate link receiver for group 1104.  However, the
		processes whose pids are shown in the dmqmonc link detail
		screen for any of these links do not exist (obvious, once
		you find the "caught signal 11" entry in the log; see the
		sketch following this problem description).
	     The group 1104 log shows repeated "ld, operation failed to
		complete" entries for ALL link senders EXCEPT the group 7xx
		senders, for which the only log file entries are the initial
		"sender running" entries.  (Note: the other links' failure to
		come up was later diagnosed and worked around temporarily by
		turning xgroup verification off.)
	     The problem can only be cleared by dropping all 5 7xx groups and
		their attached production server applications.  Because of
		this, we will be getting an urgent demand for a root cause
		analysis of how this condition occurred, since it caused an
		unscheduled production outage in a non-stop application.
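
(For anyone wondering how a process can log "ld, caught signal 11" and then
be gone while dmqmonc still shows its pid: the fragment below is NOT dmqld
source - I don't have that - just a minimal sketch of the usual Unix daemon
pattern: a handler is installed for SIGSEGV, it writes the signal number to
the log, and the process exits, so any pid a monitor recorded for it goes
stale.)

    /* Minimal sketch only - not dmqld source.  Shows how a daemon
     * typically turns a SIGSEGV (signal 11) into a "caught signal 11"
     * log line and then exits, leaving a stale pid behind in any
     * monitor display that recorded it. */
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void fatal_signal(int sig)
    {
        /* A real daemon would keep this handler async-signal-safe;
         * fprintf is used here only to keep the sketch short. */
        fprintf(stderr, "ld, caught signal %d\n", sig);
        _exit(1);                       /* process exits, pid goes stale */
    }

    int main(void)
    {
        signal(SIGSEGV, fatal_signal);  /* install handler for signal 11 */
        raise(SIGSEGV);                 /* fake the fault to show the log line */
        return 0;
    }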

Any clues as to why the signal 11 happened on all 5 groups?
Is this possibly due to some known problem with v3.0B/C?  If so, is this
   fixed by v3.2A ECO 1?
I could not find any mention in the v3.1 or v3.2x release notes.
Is there some way to recover from this without bouncing the group?

(log files group1104.log and group710.log are at WHOS01""::)

2793.1. by XHOST::SJZ (Kick Butt In Your Face Messaging !) Sat Mar 08 1997 14:00 (15 lines)
    
    try turning off xgroup verification.  make certain the
    configurations have adequate min and max group numbers.
    try setting the min group number to 1.
    
    as for our "getting an urgent demand for a root cause
    analysis", they can demand whatever they want.  and
    though i am not the problem management person, the
    response should be: upgrade to V3.2A-1 all around.
    V3.0C is not even a real release and is fully two
    versions back (3 if you consider V3.2A as being a major
    release (which it really was even though we called it
    a MUP)).
    
    _sjz.
2793.2. "link driver questions" by WHOS01::ELKIND (Steve Elkind, Digital SI @WHO) Wed Mar 12 1997 18:03 (29 lines)
    We were able to get this same situation to occur regularly on a
    cross-group link.  By turning xgroup verify off, we found that there
    was a name mismatch - "xyz.foo.bar.com" in the DmQ v3.0C
    (non-initiator) side's xgroup table and DNS, and "xyz" in the v3.2A
    (initiator) side's xgroup entry for itself.  On the 3.0C side we were
    getting "host xyz port xxxxx not found in local address database",
    which we were not getting before.  Removing the domain name from the
    v3.0C table entry seemed to fix the problem.
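
    (In case it helps anyone chasing a similar mismatch: the little program
    below is just a generic resolver check, nothing DmQ-specific.  It prints
    the canonical name and aliases the local resolver hands back for a host,
    which is exactly where "xyz" on one box can come back as
    "xyz.foo.bar.com" on another, depending on /etc/hosts, the resolver
    domain/search settings, and the name service order on that system.)

        /* Generic resolver check - nothing DmQ-specific.  Prints the
         * canonical name and aliases the local resolver returns, which
         * is where a short host name on one system can turn into a
         * fully qualified one on another. */
        #include <stdio.h>
        #include <netdb.h>

        int main(int argc, char *argv[])
        {
            struct hostent *he;
            char **alias;

            if (argc != 2) {
                fprintf(stderr, "usage: %s hostname\n", argv[0]);
                return 1;
            }
            he = gethostbyname(argv[1]);
            if (he == NULL) {
                fprintf(stderr, "%s: lookup failed\n", argv[1]);
                return 1;
            }
            printf("canonical name: %s\n", he->h_name);
            for (alias = he->h_aliases; *alias != NULL; alias++)
                printf("alias:          %s\n", *alias);
            return 0;
        }

    Running it with the bare host name on both the group 1104 machine and
    one of the 7xx machines shows whether the two resolvers canonicalize the
    name the same way (compile with cc; on Solaris you will probably need
    -lnsl -lsocket).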
    
    However, other users have said "why now?  we've been working
    successfully up to now as things were".  I went and tried to recreate
    the problem using various combinations of 3.0C, 3.0B, 3.2A (non-eco1
    and eco1), host names unqualified/qualified, floating IPs and
    non-floating IPs, etc. - and could not reproduce it (although I
    am running the non-initiator end on HP-UX instead of Solaris).  Is it
    possible we are the victim of some sort of race condition in the 3.0C
    link driver?
    
    I also tried to recreate the dmqld segfault condition reported as fixed
    in the 3.2A ECO1 release notes, with no success.  I have noticed, with
    the 3.2A non-eco1 non-initiator side, that the symptoms of an
    unconfigured group connection cycled back and forth on every other
    connect attempt: one set just contained "lost connection" and "exiting"
    messages, while the other set also contained "unconfigured connection"
    (or something like that, the log is elsewhere now) and "spi, semaphore
    operation failed" as well.  Was the problem that was fixed dependent on
    particular conditions?
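
    (On the "spi, semaphore operation failed" message: I don't know what
    DmQ's spi layer actually does internally, so the fragment below is only
    a generic SysV semaphore illustration of the class of failure such a
    message usually reflects - the semaphore set is missing, or gets removed
    out from under the caller while it is blocked in semop(), and all the
    caller can do is log the errno and give up.)

        /* Generic SysV semaphore fragment - not DmQ source.  Attaches to
         * an existing semaphore set and does one P operation; if the set
         * does not exist or is removed while we wait, semget()/semop()
         * fail and we log the reason, much like a "semaphore operation
         * failed" entry. */
        #include <stdio.h>
        #include <string.h>
        #include <errno.h>
        #include <sys/types.h>
        #include <sys/ipc.h>
        #include <sys/sem.h>

        int main(void)
        {
            key_t key = ftok("/tmp", 'q');      /* any agreed-on key */
            int   semid = semget(key, 1, 0600); /* attach only, do not create */
            struct sembuf op;

            op.sem_num = 0;
            op.sem_op  = -1;                    /* P operation: take the semaphore */
            op.sem_flg = 0;

            /* ENOENT: set never created; EIDRM: removed while we waited;
             * EINVAL: stale/invalid id. */
            if (semid == -1 || semop(semid, &op, 1) == -1) {
                fprintf(stderr, "spi, semaphore operation failed: %s\n",
                        strerror(errno));
                return 1;
            }
            return 0;
        }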
    
    Thanks.