Title: | NAS Message Queuing Bus |
Notice: | KITS/DOC, see 4.*; Entering QARs, see 9.1; Register in 10 |
Moderator: | PAMSRC::MARCUS EN |
Created: | Wed Feb 27 1991 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 2898 |
Total number of notes: | 12363 |
We have run into a weird problem at my customer's site.

Setup:

Local group: #1104, v3.2A ECO 1, on HP-UX 10.20 with ServiceGuard. This group is the initiator for all xgroup connects and has xgroup verification turned ON (the group uses a ServiceGuard "floating IP").

Remote groups: 5 groups running v3.0C on Solaris (700, 710, ... 740); these 5 groups have verification turned on. There are also a bunch of other groups running v3.0B on HP, and one other group running v3.0A on OpenVMS/Alpha.

Problem:

All 5 of the 7xx groups log "ld, link receiver for group 710 from group 1104 is running", followed immediately by "ld, caught signal 11" for that pid. After this, we start getting log file entries about a duplicate link receiver for group 1104. However, the processes whose pids are shown in the dmqmonc link detail screen for any of these links do not exist (obvious, once you find the "caught signal 11" entry in the log).

The group 1104 log shows repeated "ld, operation failed to complete" entries for ALL link senders EXCEPT the group 7xx senders, for which the only log file entries are the initial "sender running" entries. (Note: the causes of the other links not coming up were later diagnosed and fixed temporarily by turning xgroup verification off.)

The problem can only be cleared by dropping all 5 7xx groups and their attached production server applications. Because of this, we will be getting an urgent demand for a root cause analysis of how this condition occurred, since it caused an unscheduled production outage in a non-stop application.

Any clues as to why the signal 11 happened on all 5 groups? Is this possibly due to some known problem with v3.0B/C? If so, is it fixed by v3.2A ECO 1? I could not find any mention in the v3.1 or v3.2x release notes. Is there some way to recover from this without bouncing the group?

(Log files group1104.log and group710.log are at WHOS01""::)
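For reference, a quick way to confirm the state described above (signal-11 entries in the log, dmqmonc pids that no longer exist) is a small script along these lines. This is a hypothetical sketch, not a DmQ utility; the script name, invocation, and the assumption that you feed it the pids from the dmqmonc link detail screen are all made up for illustration, and the log wording it matches is taken from this note.

```python
#!/usr/bin/env python
# Hypothetical diagnostic sketch (not part of DmQ): given a group log and
# the pids shown on the dmqmonc link detail screen, count the
# "caught signal 11" entries and report which of those pids still exist.
import errno
import os
import sys

def pid_alive(pid):
    """True if a process with this pid currently exists (signal 0 probe)."""
    try:
        os.kill(pid, 0)
    except OSError as e:
        return e.errno == errno.EPERM   # EPERM: exists, owned by someone else
    return True

def scan(logfile, pids):
    segv = 0
    with open(logfile) as f:
        for line in f:
            if "caught signal 11" in line:
                segv += 1
                print("segv entry: %s" % line.rstrip())
    print("%d 'caught signal 11' entries in %s" % (segv, logfile))
    for pid in pids:
        print("pid %6d: %s" % (pid, "alive" if pid_alive(pid) else "GONE"))

if __name__ == "__main__":
    # usage: check_links.py group710.log 1234 5678 ...
    scan(sys.argv[1], [int(a) for a in sys.argv[2:]])
```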
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
2793.1 | | XHOST::SJZ | Kick Butt In Your Face Messaging ! | Sat Mar 08 1997 14:00 | 15 |
try turning off xgroup verification. make certain the configurations have adequate min and max group numbers. try setting the min group number to 1.

as for "getting an urgent demand for a root cause analysis": they can demand whatever they want. and though i am not the problem management person, the response should be to upgrade to V3.2A-1 all around. V3.0C is not even a real release and is fully two versions back (3 if you consider V3.2A as being a major release (which it really was even though we called it a MUP)).

_sjz.
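As a rough illustration of the min/max group number check suggested above, the following hypothetical sketch assumes the cross-group entries can be dumped into a plain text file of group-number/hostname pairs; the file name, layout, and bounds are placeholders, not a documented DmQ format, and the real check belongs in the group's own configuration.

```python
# Hypothetical sketch of the min/max group number check suggested in .1.
# Assumes a plain text dump of "group_number hostname" pairs; the file
# name, layout, and bounds below are all placeholder assumptions.
GROUP_MIN = 1       # per the suggestion above: set the minimum to 1
GROUP_MAX = 32000   # placeholder; use the value from the group's config

def out_of_range(path):
    bad = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or not fields[0].isdigit():
                continue                      # skip blanks and comments
            group = int(fields[0])
            if not GROUP_MIN <= group <= GROUP_MAX:
                bad.append((group, fields[1] if len(fields) > 1 else "?"))
    return bad

for group, host in out_of_range("xgroup_table.txt"):
    print("group %d (%s) outside [%d, %d]" % (group, host, GROUP_MIN, GROUP_MAX))
```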
2793.2 | link driver questions | WHOS01::ELKIND | Steve Elkind, Digital SI @WHO | Wed Mar 12 1997 18:03 | 29 |
We were able to get this same situation to occur regularly on a cross-group link. By turning xgroup verify off, we found that there was a name mismatch: "xyz.foo.bar.com" in the DmQ v3.0C (non-initiator) side's xgroup table and DNS, and "xyz" in the v3.2A (initiator) side's xgroup entry for itself. On the 3.0C side we were getting "host xyz port xxxxx not found in local address database", which we were not getting before. Removing the domain name from the v3.0C table entry seemed to fix the problem. However, other users have asked "why now? we've been working successfully up to now as things were".

I then tried to recreate the problem using various combinations of 3.0C, 3.0B, and 3.2A (non-ECO1 and ECO1), host names unqualified/qualified, floating IPs and non-floating IPs, etc., and have not recreated it (although I am running the non-initiator end on HP-UX instead of Solaris). Is it possible we are the victim of some sort of race condition in the 3.0C link driver?

I also tried to recreate the dmqld segfault condition reported fixed in the 3.2A ECO1 release notes, with no success.

I have noticed with the 3.2A non-ECO1 non-initiator side that the symptoms of an unconfigured group connection cycled back and forth on every other connect attempt: one set contained just "lost connection" and "exiting" messages, while the other set also contained "unconfigured connection" (or something like that; the log is elsewhere now) and "spi, semaphore operation failed". Was the fix dependent on particular conditions?

Thanks.
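For what it's worth, the qualified/unqualified name mismatch described above can be screened for mechanically. The hypothetical sketch below compares each host name configured in an xgroup table against the local resolver's canonical name and flags entries where one form is fully qualified and the other is not; the function and host names are illustrative only, and the real check would read the actual table.

```python
# Hypothetical screen for the qualified/unqualified host name mismatch
# diagnosed in .2: compare each configured xgroup host name against the
# resolver's canonical name and flag disagreements.
import socket

def flag_mismatches(configured_hosts):
    for host in configured_hosts:
        fqdn = socket.getfqdn(host)   # resolver's canonical (qualified) name
        if "." not in host and "." in fqdn:
            print("%s: unqualified in table, but DNS knows it as %s"
                  % (host, fqdn))
        elif "." in host and fqdn != host:
            print("%s: qualified in table, but resolver canonicalizes to %s"
                  % (host, fqdn))

# Example with the placeholder names from .2 (not real hosts):
flag_mismatches(["xyz", "xyz.foo.bar.com"])
```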