T.R | Title | User | Personal Name | Date | Lines |
---|
2837.1 | forgot to tack this on the end.... | WHOS01::ELKIND | Steve Elkind, Digital SI @WHO | Thu Apr 03 1997 17:26 | 84 |
| Oops - forgot to add it on the end of my note----
*********************
********************* group710.log
*********************
************ dmqld (20997) 02-APR-1997 08:34:22 ************
ld, link receiver for group 710 has lost connection to group 1159
************ dmqld (20997) 02-APR-1997 08:34:22 ************
ld, link receiver for group 710 from group 1159 is exiting
************ dmqld (26040) 02-APR-1997 08:34:28 ************
ld, link sender for group 710 has lost connection to group 1159
************ dmqld (26040) 02-APR-1997 08:34:28 ************
ld, link sender for group 710 to group 1159 is exiting
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 from group 1105 is running
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, Remote node az07ae6s not found in local address data base
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 from group 1165 is running
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, operation failed to complete
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link listener for group 710 is exiting
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 is connected to group 1105
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 from group 1155 is running
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, Remote node tx14ie6s not found in local address data base
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 is connected to group 1165
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 is connected to group 1155
************ dmqld (6232) 02-APR-1997 09:51:08 ************
ld, link sender for group 710 to group 1105 is running
************ dmqld (6232) 02-APR-1997 09:51:08 ************
ld, operation failed to complete
************ dmqld (6232) 02-APR-1997 09:51:08 ************
ld, link sender for group 710 to group 1105 is exiting
************ dmqld (6233) 02-APR-1997 09:51:08 ************
ld, link sender for group 710 to group 1165 is running
************ dmqld (6233) 02-APR-1997 09:51:08 ************
ld, operation failed to complete
************ dmqld (6233) 02-APR-1997 09:51:08 ************
ld, link sender for group 710 to group 1165 is exiting
************ dmqld (6234) 02-APR-1997 09:51:08 ************
ld, link sender for group 710 to group 1155 is running
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 has lost connection to group 1105
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 from group 1105 is exiting
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 has lost connection to group 1165
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 from group 1165 is exiting
************ dmqld (6234) 02-APR-1997 09:52:08 ************
ld, link sender for group 710 is connected to group 1155
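
For anyone trying to follow the sequence, the entries above can be grouped by process id to see which component exited and when. This is only a rough sketch that assumes every entry is a starred header line followed by exactly one message line, as in this excerpt. The function names (parse_entries, messages_by_pid, exits) are made up for illustration and are not part of any product tooling.

```python
import re
from collections import defaultdict

# Header shape copied from the log above, e.g.
# "************ dmqld (20997) 02-APR-1997 09:51:08 ************"
HEADER = re.compile(
    r"^\*+ (\w+) \((\d+)\) (\d{2}-[A-Z]{3}-\d{4}) (\d{2}:\d{2}:\d{2}) \*+$"
)


def parse_entries(text):
    """Return (process, pid, date, time, message) tuples.

    Assumption: each header line is followed by one message line.
    Banner lines such as the file name are skipped.
    """
    entries = []
    pending = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            pending = match.groups()
        elif pending is not None and line:
            entries.append(pending + (line,))
            pending = None
    return entries


def messages_by_pid(entries):
    """Group (time, message) pairs under the pid that logged them."""
    grouped = defaultdict(list)
    for _process, pid, _date, time, message in entries:
        grouped[pid].append((time, message))
    return dict(grouped)


def exits(entries):
    """List (pid, time, message) for every 'is exiting' entry."""
    return [
        (pid, time, message)
        for _process, pid, _date, time, message in entries
        if message.endswith("is exiting")
    ]


# Three entries copied from the log above.
SAMPLE = """\
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, operation failed to complete
************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link listener for group 710 is exiting
************ dmqld (6234) 02-APR-1997 09:52:08 ************
ld, link sender for group 710 is connected to group 1155
"""

if __name__ == "__main__":
    for pid, time, message in exits(parse_entries(SAMPLE)):
        print(pid, time, message)
```

Run against the whole group710.log, the exits() list should make it easier to line up the link listener exit with the receiver and sender exits around it.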
|
2837.2 | | XHOST::SJZ | Kick Butt In Your Face Messaging ! | Thu Apr 03 1997 17:45 | 9 |
|
I noticed you strategically left out the version number.
Given the format of the log entries it is V3.2 or earlier.
Regardless of what version they are running I would suggest
upgrading to V3.2A-1 and seeing if the problem goes away.
_sjz.
|
2837.3 | sorry, the info was accidentally chopped off | WHOS01::ELKIND | Steve Elkind, Digital SI @WHO | Fri Apr 04 1997 00:01 | 31 |
| Actually, leaving out the version number was accidental, not a
stratagem. An earlier draft stated that the back end is using v3.0c
running on Solaris, the front ends a mixture of 3.0c on HP-UX 9.04 and
3.2A-eco1 on HP-UX v10.20.
They cannot upgrade the back ends for another 3 months or so, as the
software they are built with is built with libraries based on 3.0B; the
version based on 3.2x has just entered development (and will deploy in
time for us to tell them "upgrade to v4.0" 8^{ ). It will be at
least a three month development/system test/acceptance test/integration
test cycle before it is allowed into production - possibly as much as
six months.
The front ends are in the middle of upgrading to v3.2A-eco1, as the
front end clients are built with the client library and so somewhat
divorced from the queueing engine version (most of the front end
applications are still built with v3.0B, and will continue to be so
until at least the fall is my guess).
I gather from another source that we cannot start up the link listener
from the command line with v3.x, so we have no workaround. The
customer would like to get some idea of the cause to see if there is
anything he can do to avoid this event happening again (or perhaps to
detect it before it hits him again). He is doing some limited testing
using both 3.2A-eco1 and 3.0C on his test machines to see if he can
re-create the problem (and is also checking for symptoms of a memory
leak in the listener) on either one, but I suspect that he may not be
able to duplicate the conditions in that environment.
The customer is not looking for a fix; they know they won't get one.
All they want is any information that may be available.
|
2837.4 | | XHOST::SJZ | Kick Butt In Your Face Messaging ! | Fri Apr 04 1997 00:21 | 12 |
|
it isn't clear from the description or the logs what is
happening. and we have never had such a report in the
past. if it's reproducible then we have something to
go on, but it is not.
as for starting up the link listener on its own the an-
swer is no. we have special code that explicitly pre-
vents that.
_sjz.
|
2837.5 | possible explanation? | WHOS01::ELKIND | Steve Elkind, Digital SI @WHO | Fri Apr 04 1997 12:35 | 13 |
| The customer has found in testing that his link listener process size
grows at about 4-5 blocks per hour, when being hit repeatedly with
invalid cross-group connect requests from multiple sources. The
customer's current theory is that this is a long-term memory leak
problem, which can be solved by getting the invalid remote groups "off
the air". Neither I nor the customer I work for buys this fully
(the group had been up for only about 4 days), but we will live with
this explanation for a while.
They haven't started their testing of 3.2A yet, other than to discover
that if they kill -9 the link listener all cross-group communication
stops (with 3.0C, currently running receivers and senders continue to
work).
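
Since the customer wants a way to detect this before it hits him again, one option is to sample the listener's process size periodically and estimate the growth rate. The sketch below fits a least-squares slope to (hours, size in blocks) samples. The function names, the threshold, and the sample numbers are all invented for illustration, and how the sizes get collected on Solaris or HP-UX is left out on purpose.

```python
def growth_per_hour(samples):
    """Least-squares slope of process size against time.

    samples: list of (hours_since_start, size_in_blocks) pairs.
    Returns the estimated growth in blocks per hour.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_s = sum(s for _, s in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one point in time")
    cov = sum((t - mean_t) * (s - mean_s) for t, s in samples)
    return cov / var_t


def looks_like_leak(samples, threshold_blocks_per_hour=1.0):
    """Flag steady growth at or above a hypothetical alarm threshold."""
    return growth_per_hour(samples) >= threshold_blocks_per_hour


if __name__ == "__main__":
    # Made-up readings, chosen only to resemble growth of 4-5 blocks per hour.
    readings = [(0, 1000), (1, 1004), (2, 1009), (3, 1013)]
    print(growth_per_hour(readings), looks_like_leak(readings))
```

A slope that stays positive over many hours would support the memory leak theory; a flat one would point elsewhere.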
|
2837.6 | | XHOST::SJZ | Kick Butt In Your Face Messaging ! | Fri Apr 04 1997 15:24 | 11 |
|
The memory leak you describe is a known problem with that
version. Upgrade to V3.2A-1.
>other than to discover that if they kill -9 the link listener all
>cross-group communication stops.
no duh. please tell this customer to refrain from using our prod-
uct. they are soiling it.
_sjz.
|
2837.7 | thank you for the information | WHOS01::ELKIND | Steve Elkind, Digital SI @WHO | Fri Apr 04 1997 23:55 | 17 |
| >>other than to discover that if they kill -9 the link listener all
>>cross-group communication stops.
>
>no duh. please tell this customer to refrain from using our prod-
>uct. they are soiling it.
Actually, they quite reasonably wanted to know how 3.2A would react
(versus 3.0) if it were to lose the link listener - obviously, not as
well. Then again, if it is a memory leak that caused the problem, then
they need not worry as much about losing the link listener. I'll pass
that on to them - thanks.
Maybe they are soiling the product, but at least they're buying it in
large quantities (RICH philistines!), and like it well enough to stake
their core business operations on its reliability. They just want to
squeeze out that last erg of reliability (and they're a pain in my neck
too at times).
|
2837.8 | yeah right - reasonable | XHOST::SJZ | Kick Butt In Your Face Messaging ! | Sun Apr 06 1997 00:51 | 21 |
|
>they quite reasonably wanted to know how 3.2A would react
>if it were to lose the link listener - obviously not as
>well.
actually it works better. in the V3.0 derivative they are
using if you lose your link listener the system is left in
some weird quasi state where non-determinism runs rampant
and the product is anything but reliable. the later versions
detect the component failure and try to shut down the
subsystem associated with that component. this provides a
deterministic behavior with which people can work.
maybe they should find out how their operating system will
react when they do a kill -9 on the init process. make sure
they log in as root then have them execute the following
command.
# kill -9 1
_sjz.
|