
Conference pamsrc::decmessageq

Title:NAS Message Queuing Bus
Notice:KITS/DOC, see 4.*; Entering QARs, see 9.1; Register in 10
Moderator:PAMSRC::MARCUSEN
Created:Wed Feb 27 1991
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2898
Total number of notes:12363

2793.0. "help! - dmqld caught signal 11 in Unix v3.0C, need to explain" by WHOS01::ELKIND (Steve Elkind, Digital SI @WHO) Fri Feb 28 1997 16:53

We have experienced a weird problem at my customer's site.
    Setup:
        local group: # 1104, v3.2A ECO 1, on HP-UX 10.20 with ServiceGuard
                     this group is the initiator for all xgroup connects,
                     and has xgroup verification turned ON (the group is
		     using a ServiceGuard "floating IP")
        remote groups:  5 groups running v3.0C on Solaris - 700, 710, ..., 740
			  (these 5 groups have verification turned on)
                     a bunch of other groups running v3.0B on HP
                     one other group running v3.0A on OpenVMS/Alpha

    Problem: All 5 7xx groups have "ld, link receiver for group 710 from
		group 1104 is running" followed immediately by "ld, caught
		signal 11" for that pid.
		After this, we start getting log file entries about a
		duplicate link receiver for group 1104.  However, the
		processes whose pids are shown in the dmqmonc link detail
		screen for any of these links do not exist (obvious, once
		you find the "caught signal 11" entry in the log; see the
		sketch following this problem description).
	     The group 1104 log shows repeated "ld, operation failed to
		complete" entries for ALL link senders EXCEPT the group 7xx
		senders, for which the only log file entries are the initial
		"sender running" entries.  (Note: the other links' failure to
		come up was later diagnosed and worked around temporarily by
		turning xgroup verification off.)
	     The problem can only be cleared by dropping all 5 7xx groups and
		their attached production server applications.  Because of
		this, we will be getting an urgent demand for a root cause
		analysis of how this condition occurred, since it caused an
		unscheduled production outage in a non-stop application.
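
(For anyone wondering how a process can log "ld, caught signal 11" and then
be gone while dmqmonc still shows its pid: the fragment below is NOT dmqld
source - I don't have that - just a minimal sketch of the usual Unix daemon
pattern: a handler is installed for SIGSEGV, it writes the signal number to
the log, and the process exits, so any pid a monitor recorded for it goes
stale.)

    /* Minimal sketch only - not dmqld source.  Shows how a daemon
     * typically turns a SIGSEGV (signal 11) into a "caught signal 11"
     * log line and then exits, leaving a stale pid behind in any
     * monitor display that recorded it. */
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void fatal_signal(int sig)
    {
        /* A real daemon would keep this handler async-signal-safe;
         * fprintf is used here only to keep the sketch short. */
        fprintf(stderr, "ld, caught signal %d\n", sig);
        _exit(1);                       /* process exits, pid goes stale */
    }

    int main(void)
    {
        signal(SIGSEGV, fatal_signal);  /* install handler for signal 11 */
        raise(SIGSEGV);                 /* fake the fault to show the log line */
        return 0;
    }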

Any clues as to why the signal 11 happened on all 5 groups?
Is this possibly due to some known problem with v3.0B/C?  If so, is this
   fixed by v3.2A ECO 1?
I could not find any mention in the v3.1 or v3.2x release notes.
Is there some way to recover from this without bouncing the group?

(log files group1104.log and group710.log are at WHOS01""::)

2793.1. by XHOST::SJZ (Kick Butt In Your Face Messaging !) Sat Mar 08 1997 14:00 (15 lines)
    
    try turning off xgroup verification.  make certain the
    configurations have adequate min and max group numbers.
    try setting the min group number to 1.
    
    as for our "getting an urgent demand for a root cause
    analysis", they can demand whatever they want.  and
    though i am not the problem management person, the
    response should be: upgrade to V3.2A-1 all around.
    V3.0C is not even a real release and is fully two
    versions back (3 if you consider V3.2A as being a major
    release (which it really was even though we called it
    a MUP)).
    
    _sjz.
2793.2. "link driver questions" by WHOS01::ELKIND (Steve Elkind, Digital SI @WHO) Wed Mar 12 1997 18:03 (29 lines)
    We were able to get this same situation to occur regularly on a
    cross-group link.  By turning xgroup verify off, we found that there
    was a name mismatch - "xyz.foo.bar.com" in the DmQ v3.0C
    (non-initiator) side's xgroup table and DNS, and "xyz" in the v3.2A
    (initiator) side's xgroup entry for itself.  On the 3.0C side we were
    getting "host xyz port xxxxx not found in local address database",
    which we were not getting before.  Removing the domain name from the
    v3.0C table entry seemed to fix the problem.
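
    (In case it helps anyone chasing a similar mismatch: the little program
    below is just a generic resolver check, nothing DmQ-specific.  It prints
    the canonical name and aliases the local resolver hands back for a host,
    which is exactly where "xyz" on one box can come back as
    "xyz.foo.bar.com" on another, depending on /etc/hosts, the resolver
    domain/search settings, and the name service order on that system.)

        /* Generic resolver check - nothing DmQ-specific.  Prints the
         * canonical name and aliases the local resolver returns, which
         * is where a short host name on one system can turn into a
         * fully qualified one on another. */
        #include <stdio.h>
        #include <netdb.h>

        int main(int argc, char *argv[])
        {
            struct hostent *he;
            char **alias;

            if (argc != 2) {
                fprintf(stderr, "usage: %s hostname\n", argv[0]);
                return 1;
            }
            he = gethostbyname(argv[1]);
            if (he == NULL) {
                fprintf(stderr, "%s: lookup failed\n", argv[1]);
                return 1;
            }
            printf("canonical name: %s\n", he->h_name);
            for (alias = he->h_aliases; *alias != NULL; alias++)
                printf("alias:          %s\n", *alias);
            return 0;
        }

    Running it with the bare host name on both the group 1104 machine and
    one of the 7xx machines shows whether the two resolvers canonicalize the
    name the same way (compile with cc; on Solaris you will probably need
    -lnsl -lsocket).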
    
    However, other users have said "why now?  we've been working
    successfully up to now as things were".  I went and tried to recreate
    the problem using various combinations of 3.0C, 3.0B, 3.2A (non-eco1
    and eco1), host names unqualified/qualified, floating IPs and
    non-floating IPs, etc. - and could not reproduce it (although I
    am running the non-initiator end on HP-UX instead of Solaris).  Is it
    possible we are the victim of some sort of race condition in the 3.0C
    link driver?
    
    I also tried to recreate the dmqld segfault condition reported as fixed
    in the 3.2A ECO1 release notes, with no success.  I have noticed, with
    the 3.2A non-eco1 non-initiator side, that the symptoms of an
    unconfigured group connection cycled back and forth on every other
    connect attempt: one set just contained "lost connection" and "exiting"
    messages, while the other set also contained "unconfigured connection"
    (or something like that, the log is elsewhere now) and "spi, semaphore
    operation failed" as well.  Was the problem that was fixed dependent on
    particular conditions?
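
    (On the "spi, semaphore operation failed" message: I don't know what
    DmQ's spi layer actually does internally, so the fragment below is only
    a generic SysV semaphore illustration of the class of failure such a
    message usually reflects - the semaphore set is missing, or gets removed
    out from under the caller while it is blocked in semop(), and all the
    caller can do is log the errno and give up.)

        /* Generic SysV semaphore fragment - not DmQ source.  Attaches to
         * an existing semaphore set and does one P operation; if the set
         * does not exist or is removed while we wait, semget()/semop()
         * fail and we log the reason, much like a "semaphore operation
         * failed" entry. */
        #include <stdio.h>
        #include <string.h>
        #include <errno.h>
        #include <sys/types.h>
        #include <sys/ipc.h>
        #include <sys/sem.h>

        int main(void)
        {
            key_t key = ftok("/tmp", 'q');      /* any agreed-on key */
            int   semid = semget(key, 1, 0600); /* attach only, do not create */
            struct sembuf op;

            op.sem_num = 0;
            op.sem_op  = -1;                    /* P operation: take the semaphore */
            op.sem_flg = 0;

            /* ENOENT: set never created; EIDRM: removed while we waited;
             * EINVAL: stale/invalid id. */
            if (semid == -1 || semop(semid, &op, 1) == -1) {
                fprintf(stderr, "spi, semaphore operation failed: %s\n",
                        strerror(errno));
                return 1;
            }
            return 0;
        }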
    
    Thanks.