
Conference lassie::ucx

Title:	DEC TCP/IP Services for OpenVMS
Notice:	Note 2-SSB Kits, 3-FT Kits, 4-Patch Info, 7-QAR System
Moderator:	ucxaxp.ucx.lkg.dec.com::TIBBERT

Created:	Thu Nov 17 1994
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	5568
Total number of notes:	21492

5468.0. "load balancing troubles" by UTOPIE::FRUEHWIRTH_M () Fri Apr 25 1997 09:33

hello,

i've a customer who has trouble with UCX cluster load balancing.

config:

1x primary bind server (name: akhpns)
	UCX V4.1 ECO4, OpenVMS V6.1, VAXserver 3100
1x secondary bind server (name: akhsns)
	UCX V4.1, OpenVMS V6.1, VAXserver 3100

the bind configuration and logs are available (too big to post here):

CONSUL::DISK$USER3:[MARTIN_FRUEHWIRTH.DUMPS.AKH]
AKHPNS_UCX$BIND_STARTUP.LOG;1 3355  25-APR-1997 15:19:28.38
AKHSNS_UCX$BIND_STARTUP.LOG;1 2016  25-APR-1997 15:19:43.49
BIND_CONFIG.TXT;1               44  25-APR-1997 15:26:07.84

logicals set for logging:
UCX$BIND_CLUSTER_DBG_LEVEL 2
UCX$BIND_LOG_LEVEL 2
UCX$BIND_METRIV_DBG_LEVEL 2


the primary and the secondary bind server are serving resolver-requests
for a lot of UCX-clusters.

the problem we see is that the primary bind server stops offering
the correct (higher value) clustermember calculated by the metricserver.

we have tested with a cluster called akhtc2.arz.akh-wien.at,
whose two members are akht2a.arz.akh-wien.at and akht2b.arz.akh-wien.at.
both members run UCX V4.1 ECO 4.

we repeatedly changed the interactive login value on one member (akht2a)
from 96 to 32, back to 96, and so forth.
the value on the other member (akht2b) stays unchanged at 64.
time between changes: more than 2 minutes.

a periodic $mc ucx$metricview/host=akhtc2 on the primary and the secondary
bind server shows that the metric value is calculated and updated correctly
on both servers for both cluster members.

what we've seen is that the primary bind server stops offering the host
with the highest metric value to the resolvers after we have altered the
interactive login value on the cluster member (akht2a) 3 to 5 times.
reproducible at the customer's site.

also interesting: once we kill the UCX$metric process on akht2a, the
'hanging' primary bind server offers exactly that cluster member
(akht2a, which previously had the lower metric value).
in this state $mc ucx$metricview/host=akhtc2 shows -no response- for
akht2a, but the primary still offers its address.
checked with $ucx ucx$nslookup and $ucx ping/num=1/all akhtc2.
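
to make clear what we expect from the bind server, here is a minimal sketch
in python. it is only an illustration of the expected behaviour, not the UCX
implementation: the function name pick_member and the rating numbers are
made up, and None stands for a member that shows -no response- in metricview.

```python
from typing import Dict, Optional


def pick_member(ratings: Dict[str, Optional[int]]) -> Optional[str]:
    """Return the cluster member we expect the bind server to offer.

    ratings maps a member name to its metric value, or to None when the
    member gives no response (hypothetical representation).
    """
    # A member without a metric response should not be offered at all.
    responding = {host: r for host, r in ratings.items() if r is not None}
    if not responding:
        return None
    # Expected answer: the responding member with the highest metric value.
    return max(responding, key=responding.get)


# Hypothetical ratings, chosen only for illustration.
print(pick_member({"akht2a": 40, "akht2b": 70}))    # akht2b has the higher value
print(pick_member({"akht2a": None, "akht2b": 70}))  # akht2a gives no response
```

what the 'hanging' primary does instead is the opposite of the second case:
it offers akht2a although that member gives no response.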

only restarting the bindserver resolves the situation.

the secondary bind server works correctly the whole time.  

for historical reasons most of the 3000+ clients resolve to the
primary server, and changing that is not what the customer wants.
shutting down the primary server to downgrade it to the no-ECO version
is not an option for the next month (resolver timeout is 75 sec.):
the server has to be up and running for other reasons in this hospital
(24x365), and the downtime slots are very rigorous.

Many thanks in advance for any help.
martin
