T.R | Title | User | Personal Name | Date | Lines |
---|
4158.1 | Dump file analysis | VMSNET::P_NUNEZ | | Fri Feb 14 1997 11:02 | 17 |
| If it helps, the license server dump analysis is returning:
Condition signalled to take dump:
%SYSTEM-F-ABORT, abort
%SYSTEM-F-ABORT, abort
-SYSTEM-S-NOMSG, Message number 0000F941
DBG> show calls
module name routine name line rel PC abs PC
*LOGGING PLog 3998 00000299 0000F941
IC 00000000 0000E848
IC 00000000 0000BF7B
SHARE$PWRK$CSSHR
00000000 00096085
MTS$MAIN 00000000 0000581F
00000000 8962038D
|
4158.2 | Questions | CPEEDY::KENNEDY | Steve Kennedy | Fri Feb 14 1997 13:04 | 24 |
| Paul-
.0> Customer has several license server dump files. The license server
.0> logs report the error:
Is this an occasional occurrence or is this happening all the time?
("all the time" meaning that the license server won't run at all)
Is the customer running PATHWORKS on both nodes of the cluster? If so,
is the license server configured to run on both nodes? If so, does the
license server always fail on both nodes? Or does it fail sometimes?
If sometimes, is it always/usually while trying to start up on one node
as a result of failing over from the other node?
I know it's a lot of work, but has the customer tried to troubleshoot
this by configuring for a single transport to see which one (or more)
fails?
\steve
|
4158.3 | update | VMSNET::P_NUNEZ | | Fri Feb 14 1997 14:53 | 96 |
| Steve,
.0> Customer has several license server dump files. The license server
.0> logs report the error:
>Is this an occasional occurrence or is this happening all the time?
>("all the time" meaning that the license server won't run at all)
It appeared to be happening all the time. But (1) with a file version
limit of 5, I only have files from today, and (2) the customer just
upgraded to v5.0E over the weekend. He did note that he has had problems
with the license server since upgrading that required him to stop and
restart it several times before it worked (but see below for how he was
managing his license server in the cluster). So, based on that, and on
the fact that I'm seeing one strange NetBIOS license server name on a
cluster running v5.0E in our lab, similar to what I saw on the
customer's system, I have to believe this is new to v5.0E.
> Is the customer running PATHWORKS on both nodes of the cluster? If so,
> is the license server configured to run on both nodes? If so, does the
> license server always fail on both nodes? Or does it fail sometimes?
> If sometimes, is it always/usually while trying to start up on one node
> as a result of failing over from the other node?
He has a DSSI cluster of two VAX 4000-500As (hardware model type 453),
BIGBRD and CLONE, and is running PATHWORKS on both. Yes, it was failing
on both nodes.
I think your hunch about starting the license server on one node after
it's failed on the other is a good one. Due to misconceptions on their
part, they thought they should only run pwrk$license_s on node BIGBRD
(because that's the name the license server grabbed initially).
Because they weren't aware of the inhibit logical, they accomplished
this by running pwrk$license_shutdown on CLONE after PATHWORKS was
running on both nodes in the cluster. They could start PATHWORKS on
either node first (they didn't have a policy on this). If this is the
issue, I would think the order most likely to cause the problem is:
1. Start PATHWORKS on CLONE first (it becomes the active license server).
2. Start PATHWORKS on BIGBRD.
3. Stop the license server on CLONE (the license server fails over to
   BIGBRD).
If it were the other way around, stopping the license server on CLONE
wouldn't cause any failover.
>I know it's a lot of work, but has the customer tried to troubleshoot
>this by configuring for a single transport to see which one (or more)
>fails?
By the time we figured out how to get the license server started, the
customer wanted to leave it alone until Monday. I have a cluster I'm
going to try to reproduce it on (as I showed in .0, one strange NetBIOS
name related to the license server already exists on our cluster).
We found our way around it when I noticed that the pwrk$lbigbrd\20,
pwrk$lbigbrd\43, and pwrk$ls\47 NetBIOS names still existed on bigbrd
($ mc pcsa_claim_name /status) after stopping the license server on
bigbrd.
So I:
$ mc pcsa_claim_name /delete pwrk$lbigbrd
$ mc pcsa_claim_name /delete pwrk$lbigbrd\43
$ mc pcsa_claim_name /delete pwrk$ls\47
$ @sys$startup:pwrk$license_startup
and it worked. So it seems the netbios names (possibly just DECnet
netbios names) aren't being deleted when pwrk$license_s is stopped.
This would explain the license server log error "Name 'PWRK$LBIGBRD '
is in use by Another License Server".
Comments?
I still don't understand how those odd netbios names are getting
created? I checked the customer's license server log and state file
and they have the correct name of just BIGBRD. Same on our cluster.
Here's the one odd name that existed on our internal cluster (which
seemed to be running fine - no dumps/etc) that uses the license server
name PWRK$LALFPW1:
NetBIOS name      Last  Numb  Status
PWRK$LALFPW1R01   50    11    04
In all cases where "R0n" is appended to the name, the last byte is 50
(hex). On the customer's system I saw names PWRK$LBIGBRDR01 through
PWRK$LBIGBRDR0M, all with a last byte of 50. When he viewed these names
from DOS with SHOW ASTAT BIGBRD, the names ended with a "P" (for
example, PWRK$LBIGBRDR02P). I don't see any names with a last byte of
50 when things are "normal".
I'm still dialed in if you need more info (but things are "normal" at
this point)...
Paul
|
4158.4 | More weirdness | VMSNET::P_NUNEZ | | Fri Feb 14 1997 15:06 | 5 |
|
Also note in .0 that node CLONE has claimed the name PWRK$LCLONE R01
(and others). But why isn't it PWRK$LBIGBRD R01?
Paul
|
4158.5 | Account issue? | VMSNET::P_NUNEZ | | Fri Feb 14 1997 15:17 | 8 |
| Possibly another factor: the customer noted that it seemed he had to
run pwrk$license_startup from the SYSTEM account even though he has a
fully privileged VMS account. I was using the FIELD account (with all
privileges enabled) to stop and start the license server process.
Could this be a factor?
paul
|
4158.6 | PWRK$L<name>\4c ? | VMSNET::P_NUNEZ | | Fri Feb 14 1997 15:21 | 6 |
|
I'm trying to reproduce this on our cluster. I'm seeing an additional
license server NetBIOS name with a last byte of 4c. Why don't I see
this on the customer's systems?
Paul
|
4158.7 | my thought is names aren't being deleted | CPEEDY::KENNEDY | Steve Kennedy | Fri Feb 14 1997 18:51 | 79 |
| .3> So it seems the netbios names (possibly just DECnet
.3> netbios names) aren't being deleted when pwrk$license_s is stopped.
.3> This would explain the license server log error "Name 'PWRK$LBIGBRD '
.3> is in use by Another License Server".
.3>
.3> Comments?
This was my suspicion. I remembered we ran into a problem like this in
our test lab, but I couldn't remember if it was while testing shipping
software or prototype software. In either case it looks like the
problem is now in the field. FWIW: when we saw it before it was DECnet
only.
.3> I still don't understand how those odd netbios names are getting
.3> created?
The odd names that you can now see were introduced recently as an
optimization to the license components' "PING client" functions.
Essentially, these names are created and kept as a 'cache' of network
names that the LS (or LR) uses to ping clients for license information.
Previously, the license components created new names on the fly, which
turned out to be very expensive in time, especially in the LR case,
where the client is waiting in the middle of establishing a connection
with the file server while this is going on.
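To make the caching idea concrete, here is a minimal Python sketch of a reusable pool of ping names. It is not the PATHWORKS code: `PingNameCache`, `claim` and `delete` are hypothetical stand-ins for the real network name claim and delete operations, and the sketch assumes names are claimed lazily (only when no idle name is available) with a counter starting at 1, as in "R01". The `close` method illustrates why shutdown has to delete every cached name: any name it misses stays registered, which is the kind of leftover name described in .3.

```python
from typing import Callable, List


class PingNameCache:
    """Hypothetical sketch of a cache of claimed NetBIOS names used to ping clients."""

    def __init__(self, claim: Callable[[int], str]) -> None:
        # `claim` stands in for the slow network name claim; it receives a
        # counter value and returns the name it registered.
        self._claim = claim
        self._claimed: List[str] = []  # every name this process has registered
        self._free: List[str] = []     # registered names not currently in use
        self._counter = 1              # next counter value for a new unique name

    def acquire(self) -> str:
        # Fast path: reuse a name that is already registered and idle.
        if self._free:
            return self._free.pop()
        # Slow path: nothing idle, so pay the cost of claiming a new name.
        name = self._claim(self._counter)
        self._counter += 1
        self._claimed.append(name)
        return name

    def release(self, name: str) -> None:
        # Keep the name registered so the next ping can reuse it.
        self._free.append(name)

    def close(self, delete: Callable[[str], None]) -> None:
        # On shutdown every registered name must be deleted; any name skipped
        # here would stay registered after the process exits.
        for name in self._claimed:
            delete(name)
        self._claimed.clear()
        self._free.clear()
```

With this design an acquire/release/acquire sequence calls `claim` only once, so only the first ping pays the cost of claiming a name.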
The "R01P" ("R01" plus hex 50) you see in the names is just a
four-character tag appended to a "PWRK$Lname" name base to create a
unique name (*). The first character of this tag indicates whether the
name is associated with the license registrar ("R") or the license
server ("S"). The next two characters are an alphanumeric counter used
to create multiple unique names, where each character may be "0"-"9" or
"A"-"Z". The last of the four characters is "P" (hex 50), indicating a
"Ping" end-point.
.4> Also note in .0 that we see on node CLONE that it's claimed the name
.4> PWRK$LCLONE R01 (and others). But why isn't PWRK$LBIGBRD????
_^_
(*) This is a registrar name, so the name base is formed from "PWRK$L"
plus the node name (i.e., in this case CLONE) so as not to conflict
with other LRs in the cluster. I believe the LS uses the LS name as its
name base when forming these names.
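The name layout described above can be sketched in Python as a builder and parser for the 16-byte NetBIOS name (15 name characters plus one suffix byte). Two details are assumptions inferred from the names seen in this thread, not confirmed behavior: the name base is space-padded to 12 characters (compare PWRK$LCLONE R01 with PWRK$LBIGBRDR01), and the counter runs "0"-"9" before "A"-"Z". The function names are hypothetical.

```python
import string
from typing import Tuple

# Assumed counter alphabet: "0"-"9" followed by "A"-"Z" (so ..., 09, 0A, ..., 0M, ...).
DIGITS = string.digits + string.ascii_uppercase
# Assumed width of the space-padded name base ("PWRK$L" plus a name of up to 6 chars).
BASE_WIDTH = 12
PING_SUFFIX = 0x50  # ASCII "P": marks a ping end-point


def counter_tag(n: int) -> str:
    """Encode n as a two-character alphanumeric counter, e.g. 1 -> "01", 22 -> "0M"."""
    if not 0 <= n < len(DIGITS) ** 2:
        raise ValueError("counter does not fit in two characters")
    high, low = divmod(n, len(DIGITS))
    return DIGITS[high] + DIGITS[low]


def ping_name(base: str, component: str, n: int) -> bytes:
    """Build a 16-byte NetBIOS name: padded base + "R"/"S" + counter + 0x50."""
    if component not in ("R", "S"):
        raise ValueError('component must be "R" (registrar) or "S" (server)')
    if len(base) > BASE_WIDTH:
        raise ValueError("name base is too long")
    first15 = base.ljust(BASE_WIDTH) + component + counter_tag(n)
    return first15.encode("ascii") + bytes([PING_SUFFIX])


def parse_ping_name(raw: bytes) -> Tuple[str, str, int, int]:
    """Split a 16-byte ping name into (base, component, counter, suffix byte)."""
    if len(raw) != 16:
        raise ValueError("NetBIOS names are 16 bytes long")
    text = raw[:15].decode("ascii")
    base = text[:BASE_WIDTH].rstrip()
    component = text[BASE_WIDTH]
    tag = text[BASE_WIDTH + 1:]
    counter = DIGITS.index(tag[0]) * len(DIGITS) + DIGITS.index(tag[1])
    return base, component, counter, raw[15]
```

Under these assumptions, counter values 1 through 22 give the PWRK$LBIGBRDR01 through PWRK$LBIGBRDR0M range seen on the customer's system, and `parse_ping_name` recovers the name base, which is what distinguishes an LR name (node name) from an LS name (license server name).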
.5> Possibly another factor. The customer noted that it seemed he had to
.5> run the pwrk$license_startup from the SYSTEM account even though he has
.5> fully privileged VMS account. I was using FIELD account (with all
.5> privs enabled) to stop/start the license server process. Could this be
.5> a factor?
I can't think of a reason why this is a factor, but I won't dismiss it
as a possibility.
I did notice on my system that the LS group names, "PWRK$LS...G" and
"PWRK$Lname...L", are not cleaned up when the license server is shut
down using PWRK$LICENSE_SHUTDOWN (though these leftovers shouldn't
cause the conflict the customer is seeing). I wonder if it might be a
timing issue, where the failover happens too quickly and the name on
the other node of the cluster isn't cleaned up in time. That said, I
would only expect this to be possible if there were changes in this
area, since we haven't seen this type of problem with cluster
configurations before.
.6> I'm trying to duplicate on our cluster. I'm seeing an additional
.6> license server netbios name with a last byte of 4c. I don't see this
.6> on the customers systems?
Hex 4C is ASCII "L". The "L" is a tag that the license server uses (in
addition to the other tags listed in Note 2479.4). I can't remember its
exact use offhand, but I think it indicates some sort of listener
thread for the license server.
Let us know the results or any info you glean from your testing.
Also, this looks like a problem that will require a code change, so
you should probably escalate it.
\steve
|
4158.8 | New Features, eh? | VMSNET::P_NUNEZ | | Mon Feb 17 1997 09:39 | 18 |
| Steve,
> Let us know the results or any info you glean from your testing.
Based on your reply, our cluster is working normally, and I was unable
to reproduce the customer's "duplicate name" problem by stopping and
starting the license server many times...
> Also, this seems to be a problem which will require a code change
> solution - probably should escalate.
I had the customer run the gather info procedure and ftp the saveset to
me last Friday, but it didn't arrive intact. I'll have him send it on
tape. Is there anything else I should get?
Appreciate the help,
Paul
|
4158.9 | feature? we think so ;-) | CPEEDY::KENNEDY | Steve Kennedy | Mon Feb 17 1997 12:50 | 37 |
| Paul-
re: "New Features, eh?"
We thought so ;-) Here's why: when server-based licensing is being
used, caching NETBIOS names for use by the license registrar in pinging
the client saves ~3 seconds in the turnaround time back to the client
(the three seconds it takes to claim the new network name that the LR
previously used to ping the client). In V6 things potentially get worse,
because the 3+ second delay will turn into ~5 seconds if WINS is being
used (due to the extra time to go to the name server). Caching network
names for this purpose allows us to eliminate this very long delay in
most cases.
Feature? ;-}
.8> [...] and I was unable to duplicate the customer's "duplicate name"
.8> problems by stopping/starting license server many times...
Someone will try to reproduce this here once we get a CLD.
I'm now wondering if this is just a timing issue during failover in a
cluster, where the conflict is caused by the license server's name not
being cleaned up quickly enough on one node before the other node tries
to claim it. I'm leaning this way because "PWRK$Lname" didn't show up
in the PCSA_CLAIM_NAME list, so it's not as if something simply lost
track of the name and failed to clean it up. Since the name isn't
"hanging around" in the name tables, I'm assuming there must have been
some intermittent conflict.
.8> [...] anything else I should get?
I can't think of any other info that the customer's going to have that
you can ask for.
thanks,
\steve
|