
Conference spezko::cluster

Title:+ OpenVMS Clusters - The best clusters in the world! +
Notice:This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator:PROXY::MOORE
Created:Fri Aug 26 1988
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:5320
Total number of notes:23384

5254.0. "kernel stack no valid halt" by PRSSOS::MENICACCI () Thu Mar 13 1997 11:40

Cluster configuration :
---------------------



		VAX satellites + AXP satellites


         ================ Ethernet ==============
VAX 6510                                          AS2100
V5.5-2   ================ DSSI ================== V6.2-1H3
                    |
                   disk 0                             |
                                                      ======= SCSI 
							  disk1 disk2



VAX 6510 boots from disk0
AS2100 boots from disk1
Several AlphaStation 255s boot from disk2

One AlphaStation 200 fails to boot from disk2.

AS200 starts to boot ... until,

%VMScluster-I-MSCPCONN, Connected to a MSCP server for the system disk, node AS2100

halt cpu 0
halt code = 2
kernel stack no valid halt

PC = FFFFFFFF 80059DE0

At that time, disk2 was mounted only on AS2100.
The customer then also mounted disk2 on the VAX 6510, after which the
satellite booted OK.

Any advice is welcome.

Maria.

5254.1. "What Are The Allocation Class Settings?" by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Thu Mar 13 1997 15:37 (21 lines)
   What are the allocation classes of the AlphaServer 2100, the VAX 6000-510,
   and the DSSI disk ISE?  (This value should be non-zero, and should match
   across all three.)

   Be aware that since the AlphaServer 2100 must be in a non-zero disk
   allocation class to serve the DSSI disk, you will need to make sure that
   the disk unit numbers are unique across all disks in that allocation
   class across all nodes and all HSx controllers and all DSSI ISEs in that
   allocation class -- this includes both the DSSI disks *and* the local
   SCSI disks.  The unit numbers must be unique regardless of the device
   driver prefixes used: DU, DK, etc.

   To alter the disk unit numbers, it is likely easiest to use the DSSI ISE
   parameters to change the unit number on the DSSI spindle, but some
   Alpha systems do include a mechanism to alter SCSI unit numbers via a
   SYSGEN parameter.

   If you alter an allocation class, you will need/want to reboot all nodes
   in order to pick up the change consistently...

5254.2. "$1$dia0 and $1$dka0 and $1$dua0" by PRSSOS::MENICACCI () Fri Mar 21 1997 08:29 (12 lines)
Thanks for your answer.

The customer will change these unit numbers.

> The unit numbers must be unique regardless of the device
>   driver prefixes used: DU, DK, etc.

Could you give some more details about the reason why ?

regards,

Maria.

5254.3. "Allocation Class and Unit Number Identify Devices" by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Fri Mar 21 1997 10:23 (13 lines)
:> The unit numbers must be unique regardless of the device
:>   driver prefixes used: DU, DK, etc.
:
:Could you give some more details about the reason why ?

   Because it is a requirement -- the lower levels of the VMScluster
   protocols look only at the allocation class and the unit number
   when identifying devices.  If there are devices with unit-number
   overlaps within a particular allocation class, some of the low-level
   VMScluster code can potentially become "confused" as to the uniqueness
   of these devices, with, uh, "interesting" results.

5254.4. "" by STAR::CROLL () Mon Mar 24 1997 09:49 (30 lines)
>>> The unit numbers must be unique regardless of the device
>>>   driver prefixes used: DU, DK, etc.
>>
>>Could you give some more details about the reason why ?

Steve's answer in .3 is mostly correct:  MSCP creates a unit identifier out of
the allocation class, unit number, and controller letter.  This is how the MSCP
server identifies a unit, and how DUDRIVER specifies a unit as a target for an
I/O operation.

The other reason this is so is that on the client system, DUDRIVER is the class
driver for the device, regardless of whether it's a DU unit or a DK unit. 
DUDRIVER is the only disk class driver that knows how to speak the MSCP
protocol, so it's the only class driver that can be used on a client system, and
so all disks that want to be distributed have to conform to DUDRIVER's rules.
And one of these rules is that there must be a unique allocation class/unit
number for every device in the cluster.

If there are two devices that have the same allocation class/unit number
combination, DUDRIVER on the various client nodes will pick one, and all I/O
will go to that volume.  Which one is picked cannot be reliably predicted, and
will likely change as the cluster configuration changes due to state
transitions, nodes coming and going, and even transient events on the ethernet
or FDDI.  DUDRIVER on different clients may (probably will) make a different
choice -- client A will access one volume and client B will access the other,
even though they both may believe they're accessing the same one.  This can lead
to serious data corruption; one of the "interesting" results Steve mentioned in
.3.
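
The "pick one" behavior can be illustrated with a toy model. It assumes, purely for illustration, a client that indexes served disks by (allocation class, unit number) and keeps whichever announcement arrives last; this is not DUDRIVER's actual logic, and every name below is hypothetical.

```python
def build_client_view(announcements):
    """Toy client: index served disks by (allocation class, unit number) only.

    A later announcement with the same key silently replaces the earlier one.
    This is an illustrative assumption, not how DUDRIVER is implemented.
    """
    view = {}
    for alloc_class, unit, volume in announcements:
        view[(alloc_class, unit)] = volume
    return view

# Two physically different volumes that share allocation class 2, unit 24.
served = [(2, 24, "volume on node A"), (2, 24, "volume on node B")]

client_a = build_client_view(served)
client_b = build_client_view(reversed(served))  # same disks, other arrival order

print(client_a[(2, 24)])
print(client_b[(2, 24)])
```

In this model both clients believe they are using the same device, yet each resolves it to a different volume, which is the corruption scenario described above.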

John

5254.5. "" by VIVIAN::RANCE (http://vivian.hhl.dec.com/rance/) Tue Mar 25 1997 06:43 (22 lines)
  .4> If there are two devices that have the same allocation class/unit number
  .4> combination, DUDRIVER on the various client nodes will pick one, and all   
  .4> I/O will go to that volume. 

I once came across an interesting variant of this situation.

There were two disks in the VMScluster which had the same unit number and
allocation class.  Neither of these disks was served and the customer believed
that this meant they were not at risk.   System A could only access its local
$2$DUA24 and system B could only access a different local $2$DUA24.  There was
"no possibility" of data corruption because each drive was only accessed by one
system which knew what it was doing.

The fascinating XQP bugcheck was caused by the fact that the two systems were
both working on files with the same FID at the same time.  One system was in the
process of creating a new file using the unused FID; the other system was in the
process of opening an already existing file with the same FID on its disk.  The
system creating the file attempted to gain the appropriate lock for its newly
created file, discovered that another system had a lock on this "not-yet
existing" file, and promptly bugchecked!
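
A toy model of that collision, assuming the file lock's resource name is derived from the cluster-visible device name plus the FID. The naming scheme, the `ToyLockManager` class, and the FID value are all made up for illustration; the real XQP and lock manager are considerably more involved.

```python
def file_lock_name(device_name, fid):
    """Hypothetical resource name: cluster-visible device name plus file ID."""
    return f"{device_name}:{fid}"

class ToyLockManager:
    """Stand-in for a cluster-wide lock manager; tracks one holder per resource."""

    def __init__(self):
        self.holders = {}

    def take(self, resource, node):
        # Record the first node to ask; report whether the caller is that node.
        holder = self.holders.setdefault(resource, node)
        return holder == node

locks = ToyLockManager()
fid = (1234, 1, 0)  # made-up FID

# System B opens an existing file on its local $2$DUA24.
opened = locks.take(file_lock_name("$2$DUA24", fid), "B")
# System A creates a new file with the same FID on its own, different $2$DUA24.
created = locks.take(file_lock_name("$2$DUA24", fid), "A")

print(opened, created)
```

Because both disks carry the same name, the two unrelated files map to one resource, so system A finds a lock already held on a file it has not finished creating.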

StuartR

5254.6. "No requirement for cluster-wide unique unit numbers....." by STAR::CROLL () Fri Mar 28 1997 10:02 (23 lines)
This is my day for editorial corrections, I guess....

My reply in .4 is incorrect. There is no requirement that device unit numbers be
unique in a cluster as long as the allocation class rules are followed.  In
fact, on the STAR cluster, one of the VMS development clusters, there are two
devices named DKA200 as I write this.  Different allocation classes, of course.

The MSCP unit identifier contains, in addition to the allocation class, unit
number, and controller letter, the device type letters, the "dd" part of the
ddcu device name.  These are encoded in the 64-bit MSCP unit ID.  (You will
notice this if you look in the right places in the code, as I did this morning!)
This shows that you can have devices named DUA100 and DKA100 in the same cluster
at the same time, even within the same allocation class.

I also looked at the manual (What?  Not the manual!), and there's no requirement
there for unique unit numbers across the entire cluster.  Just unique names.
Unique SCSI unit numbers are (today, anyway) well-nigh impossible in any largish
cluster, and anyway, if there were unique unit numbers, there'd be no need for
allocation classes.
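
A small sketch of the distinction drawn here, comparing devices by the full set of fields listed above rather than by allocation class and unit number alone. It does not reproduce the 64-bit MSCP unit ID layout, which is not given in this topic; the `DeviceId` type and the allocation class values 1 and 2 are assumptions for illustration.

```python
from collections import Counter
from typing import NamedTuple

class DeviceId(NamedTuple):
    """Fields the reply lists as parts of the MSCP unit identifier."""
    alloc_class: int
    device_type: str  # the "dd" letters of ddcu
    controller: str   # the "c" letter
    unit: int

def repeated(keys):
    """Return the keys that occur more than once, sorted."""
    return sorted(key for key, count in Counter(keys).items() if count > 1)

devices = [
    DeviceId(1, "DU", "A", 100),
    DeviceId(1, "DK", "A", 100),  # same class and unit, different type letters
    DeviceId(1, "DK", "A", 200),
    DeviceId(2, "DK", "A", 200),  # same DKA200, different allocation class
]

duplicate_ids = repeated(devices)
shared_class_unit = repeated((d.alloc_class, d.unit) for d in devices)

print(duplicate_ids)      # full identifiers that repeat
print(shared_class_unit)  # (class, unit) pairs shared by several devices
```

With these sample devices no full identifier repeats, even though one (class, unit) pair is shared, which matches the point that names, not unit numbers, must be unique.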

Sorry for any confusion my earlier reply caused.

John

5254.7. "Only problem in MOunt verification?" by KEIKI::WHITE (MIN(2¢,FWIW)) Fri Mar 28 1997 14:43 (7 lines)
    
    	John,
    
    	Isn't the problem with unique allocation/mscp unit numbers just
    in the mount verification routines? Did you look there?
    
               					Bill

5254.8. "" by VMSSG::FRIEDRICHS (Ask me about Young Eagles) Mon Mar 31 1997 11:32 (20 lines)
    While this has no bearing on the problem in .0, I should mention
    that it appears that V7.1 introduced a "feature"...  It appears that
    there is an error in DUDRIVER that can lead to corruption of the 
    IO database if there are duplicate device/unit numbers (but different
    allocation classes).  As a result, devices will not mount verify
    correctly, nor will they MSCP fail over correctly.  There were also
    problems reported with MOUNTing a device in this state (not
    surprising).
    
    The customer that IPMTed this problem has renumbered all of the disks
    so that they have unique unit numbers, and has not seen the problem
    since.  (CFS.49819)
    
    This problem is in V7.1 and V6.2 COMPAT and CLUSIO kits.  It will not
    be addressed in the upcoming DRIV01_071 kit, which will have other
    fixes for DUdriver and TUdriver.
    
    Cheers,
    jeff
    
5254.9. "" by EEMELI::MOSER (Orienteers do it in the bush...) Mon Mar 31 1997 14:12 (5 lines)
    You mentioned DUDRIVER. Can I assume, then, that it does not affect
    SCSI devices, as this seems to be the more general case with
    overlapping unit numbers?
    
    /cmos

5254.10. "" by EVMS::MORONEY () Mon Mar 31 1997 14:42 (1 line)
DUDRIVER is involved when talking to MSCP served SCSI devices.

5254.11. "Previous Discussions Here (Somewhere)" by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Mon Mar 31 1997 15:42 (9 lines)
    
:    	Isn't the problem with unique allocation/mscp unit numbers just
:    in the mount verification routines? Did you look there?

   There was a long discussion on this a while back somewhere in this
   conference, and the limitation (from memory) was in one of the MSCP
   failure paths.  It may well be mount verification...  It was rather
   obscure, but it had been occasionally seen...

5254.12. "See 5259.5 for a new manifestation of non-unique unit numbers..." by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Wed Apr 02 1997 11:39 (0 lines)