
Conference spezko::cluster

Title:	+ OpenVMS Clusters - The best clusters in the world! +
Notice:	This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator:	PROXY::MOORE

Created:	Fri Aug 26 1988
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	5320
Total number of notes:	23384

5254.0. "kernel stack no valid halt" by PRSSOS::MENICACCI () Thu Mar 13 1997 11:40

Cluster configuration :
---------------------



		VAX satellites + AXP satellites


         ================ Ethernet ==============
VAX 6510                                          AS2100
V5.5-2   ================ DSSI ================== V6.2-1H3
                    |
                   disk 0                             |
                                                      ======= SCSI 
							  disk1 disk2



VAX 6510 boots from disk0
AS2100 boots from disk1
Several AlphaStation 255s boot from disk2

One AlphaStation 200 fails to boot from disk2.

AS200 starts to boot ... until,

%VMScluster-I-MSCPCONN, Connected to a MSCP server for the system disk, node
AS2100

halt cpu 0
halt code = 2
kernel stack no valid halt

PC = FFFFFFFF 80059DE0

At that time, disk2 was mounted only on the AS2100.
The customer then mounted disk2 on the VAX 6510 as well, after which the
satellite booted OK.

Any advice is welcome.

Maria.

T.R	Title	User	Personal Name	Date	Lines
5254.1	What Are The Allocation Class Settings?	XDELTA::HOFFMAN	Steve, OpenVMS Engineering	`Thu Mar 13 1997 15:37`	21
	What are the allocation classes of the AlphaServer 2100, the VAX 6000-510, and the DSSI disk ISE? (This value should be non-zero, and should match across all three.)

	Be aware that since the AlphaServer 2100 must be in a non-zero disk allocation class to serve the DSSI disk, you will need to make sure that the disk unit numbers are unique across all disks in that allocation class across all nodes and all HSx controllers and all DSSI ISEs in that allocation class -- this includes both the DSSI disks and the local SCSI disks. The unit numbers must be unique regardless of the device driver prefixes used: DU, DK, etc.

	To alter the disk unit numbers, it's likely easiest to use the DSSI ISE parameters to alter the disk unit number on the DSSI spindle, but some Alpha systems do include a mechanism to alter SCSI unit numbers via a SYSGEN parameter.

	If you alter an allocation class, you will need/want to reboot all nodes in order to pick up the change consistently...
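The uniqueness rule in .1 can be checked mechanically against a list of device names. The following is a minimal sketch, assuming names of the form `$<allocation class>$<ddcu>` (for example `$1$DUA0`); the regular expression, the function names, and the sample device list are illustrative assumptions, not part of OpenVMS. It groups devices by allocation class and unit number only, ignoring the DU/DK/DI prefix, which is the conservative reading of the advice above.

```python
import re
from collections import defaultdict

# Assumed name shape: $<allocation class>$<dd><c><unit>, e.g. $1$DUA0.
DEVICE_RE = re.compile(r"^\$(\d+)\$([A-Za-z]{2})([A-Za-z])(\d+)$")


def parse_device(name):
    """Split a device name into (allocation_class, dd, controller, unit)."""
    match = DEVICE_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized device name: {name!r}")
    alloc, dd, ctrl, unit = match.groups()
    return int(alloc), dd.upper(), ctrl.upper(), int(unit)


def unit_overlaps(names):
    """Group names by (allocation class, unit number), ignoring the ddc
    prefix, and return only the groups that hold more than one device."""
    groups = defaultdict(list)
    for name in names:
        alloc, _dd, _ctrl, unit = parse_device(name)
        groups[(alloc, unit)].append(name.strip().upper())
    return {key: sorted(devs) for key, devs in groups.items() if len(devs) > 1}


# Hypothetical device list: three names share class 1 / unit 0.
names = ["$1$DIA0", "$1$DKA0", "$1$DUA0", "$1$DKA100", "$2$DKA100"]
for (alloc, unit), devs in unit_overlaps(names).items():
    print(f"allocation class {alloc}, unit {unit}: {', '.join(devs)}")
```

Devices that share a unit number but sit in different allocation classes (the two `DKA100` entries here) are deliberately not reported by this check.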
5254.2	$1$dia0 and $1$dka0 and $1$dua0	PRSSOS::MENICACCI		`Fri Mar 21 1997 08:29`	12
	Thanks for your answer. Customer will change these unit numbers.

	> The unit numbers must be unique regardless of the device
	> driver prefixes used: DU, DK, etc.

	Could you give some more details about the reason why?

	regards,
	Maria.
5254.3	Allocation Class and Unit Number Identify Devices	XDELTA::HOFFMAN	Steve, OpenVMS Engineering	`Fri Mar 21 1997 10:23`	13
	:> The unit numbers must be unique regardless of the device
	:> driver prefixes used: DU, DK, etc.
	:
	:Could you give some more details about the reason why ?

	Because it is a requirement -- the lower levels of the VMScluster protocols look only at the allocation class and the unit number when identifying devices. If there are devices with unit-number overlaps within a particular allocation class, some of the low-level VMScluster code can potentially become "confused" as to the uniqueness of these devices, with, uh, "interesting" results.
5254.4		STAR::CROLL		`Mon Mar 24 1997 09:49`	30
	>>> The unit numbers must be unique regardless of the device
	>>> driver prefixes used: DU, DK, etc.
	>>
	>>Could you give some more details about the reason why ?

	Steve's answer in .3 is mostly correct: MSCP creates a unit identifier out of the allocation class, unit number, and controller letter. This is how the MSCP server identifies a unit, and how DUDRIVER specifies a unit as a target for an I/O operation.

	The other reason this is so is that on the client system, DUDRIVER is the class driver for the device, regardless of whether it's a DU unit or a DK unit. DUDRIVER is the only disk class driver that knows how to speak the MSCP protocol, so it's the only class driver that can be used on a client system, and so all disks that want to be distributed have to conform to DUDRIVER's rules. And one of these rules is that there must be a unique allocation class/unit number for every device in the cluster.

	If there are two devices that have the same allocation class/unit number combination, DUDRIVER on the various client nodes will pick one, and all I/O will go to that volume. Which one is picked cannot be reliably predicted, and will likely change as the cluster configuration changes due to state transitions, nodes coming and going, and even transient events on the Ethernet or FDDI. DUDRIVER on different clients may (probably will) make a different choice -- client A will access one volume and client B will access the other, even though they both may believe they're accessing the same one. This can lead to serious data corruption; one of the "interesting" results Steve mentioned in .3.

	John
5254.5		VIVIAN::RANCE	http://vivian.hhl.dec.com/rance/	`Tue Mar 25 1997 06:43`	22
	.4> If there are two devices that have the same allocation class/unit number
	.4> combination, DUDRIVER on the various client nodes will pick one, and all
	.4> I/O will go to that volume.

	I once came across an interesting variant of this situation. There were two disks in the VMScluster which had the same unit number and allocation class. Neither of these disks was served, and the customer believed that this meant they were not at risk. System A could only access its local $2$DUA24 and system B could only access a different local $2$DUA24. There was "no possibility" of data corruption because each drive was only accessed by one system which knew what it was doing.

	The fascinating XQP bugcheck was caused by the fact that the two systems were both working on files with the same FID at the same time. One system was in the process of creating a new file using the unused FID; the other system was in the process of opening an already existing file with the same FID on its disk. The system creating the file attempted to gain the appropriate lock for its newly created file, discovered that another system had a lock on this "not-yet existing" file, and promptly bugchecked!

	StuartR
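The collision described in .5 can be illustrated with a toy model. This is only a sketch: it assumes, for illustration, that the cluster-wide lock resource for a file is derived from the device name and the FID alone. The real XQP resource-name format is not given in this thread, and `ToyLockManager`, `file_resource`, and the FID value are hypothetical.

```python
class ToyLockManager:
    """Cluster-wide lock table keyed by resource name (a toy, not the real
    distributed lock manager)."""

    def __init__(self):
        self.holders = {}  # resource name -> node currently holding the lock

    def acquire(self, node, resource):
        """Return (granted, holder). The first requester becomes the holder."""
        holder = self.holders.setdefault(resource, node)
        return holder == node, holder


def file_resource(device_name, fid):
    # Assumption: the resource name depends only on device name and FID.
    return f"{device_name.upper()}:{fid}"


locks = ToyLockManager()
fid = (1234, 1, 0)  # hypothetical file ID

# System B opens an existing file with this FID on its own local $2$DUA24.
granted_b, _ = locks.acquire("B", file_resource("$2$DUA24", fid))

# System A creates a new file with the same FID on a different physical
# disk that carries the same device name.
granted_a, holder = locks.acquire("A", file_resource("$2$DUA24", fid))

print(granted_b, granted_a, holder)
```

In this model the two disks never share a single block of data, yet system A finds a lock already held on a file that does not exist on its own disk, because the lock namespace is shared cluster-wide while the device name is not unique.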
5254.6	No requirement for cluster-wide unique unit numbers.....	STAR::CROLL		`Fri Mar 28 1997 10:02`	23
	This is my day for editorial corrections, I guess....

	My reply in .4 is incorrect. There is no requirement that device unit numbers be unique in a cluster as long as the allocation class rules are followed. In fact, on the STAR cluster, one of the VMS development clusters, there are two devices named DKA200 as I write this. Different allocation classes, of course.

	The MSCP unit identifier contains, in addition to the allocation class, unit number, and controller letter, the device type letters, the "dd" part of the ddcu device name. These are encoded in the 64-bit MSCP unit ID. (You will notice this if you look in the right places in the code, as I did this morning!) This shows that you can have devices named DUA100 and DKA100 in the same cluster at the same time, even within the same allocation class.

	I also looked at the manual (What? Not the manual!), and there's no requirement there for unique unit numbers across the entire cluster. Just unique names. Unique SCSI unit numbers are (today, anyway) well-nigh impossible in any largish cluster, and anyway, if there were unique unit numbers, there'd be no need for allocation classes.

	Sorry for any confusion my earlier reply caused.

	John
5254.7	Only problem in Mount verification?	KEIKI::WHITE	MIN(2¢,FWIW)	`Fri Mar 28 1997 14:43`	7
	John,

	Isn't the problem with unique allocation/mscp unit numbers just in the mount verification routines? Did you look there?

	Bill
5254.8		VMSSG::FRIEDRICHS	Ask me about Young Eagles	`Mon Mar 31 1997 11:32`	20
	While this has no bearing on the problem in .0, I should mention that it appears that V7.1 introduced a "feature"...

	It appears that there is an error in DUDRIVER that can lead to corruption of the I/O database if there are duplicate device/unit numbers (but different allocation classes). As a result, devices will not mount verify correctly, nor will they MSCP fail over correctly. There were also problems reported with MOUNTing a device in this state (not surprising).

	The customer that IPMTed this problem has renumbered all of the disks so that they have unique unit numbers, and has not seen the problem since. (CFS.49819)

	This problem is in V7.1 and the V6.2 COMPAT and CLUSIO kits. It will not be addressed in the upcoming DRIV01_071 kit, which will have other fixes for DUdriver and TUdriver.

	Cheers,
	jeff
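The condition reported in .8 (the same device/unit name appearing under different allocation classes) can be spotted with a short scan of the device list. This is a minimal sketch, assuming names of the form `$<allocation class>$<ddcu>`; the function name and the sample names are hypothetical, and the scan says nothing about whether a given system is actually exposed to the DUDRIVER problem.

```python
import re
from collections import defaultdict

# Assumed name shape: $<allocation class>$<ddcu>, e.g. $1$DKA200.
NAME_RE = re.compile(r"^\$(\d+)\$([A-Za-z]{3}\d+)$")


def same_ddcu_across_classes(names):
    """Map each ddcu name to the sorted allocation classes it appears in,
    keeping only names that appear in more than one class."""
    classes_by_ddcu = defaultdict(set)
    for name in names:
        match = NAME_RE.match(name.strip())
        if match is None:
            raise ValueError(f"unrecognized device name: {name!r}")
        classes_by_ddcu[match.group(2).upper()].add(int(match.group(1)))
    return {
        ddcu: sorted(classes)
        for ddcu, classes in classes_by_ddcu.items()
        if len(classes) > 1
    }


# Hypothetical device list: DKA200 exists in allocation classes 1 and 2.
print(same_ddcu_across_classes(["$1$DKA200", "$2$dka200", "$1$DUA100"]))
```

Renumbering the disks so that this scan returns an empty result corresponds to the workaround the customer in .8 applied.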
5254.9		EEMELI::MOSER	Orienteers do it in the bush...	`Mon Mar 31 1997 14:12`	5
	You mentioned DUDRIVER. Can I assume, then, that this does not affect SCSI devices, as this seems to be the more general case with overlapping unit numbers?

	/cmos
5254.10		EVMS::MORONEY		`Mon Mar 31 1997 14:42`	1
	DUDRIVER is involved when talking to MSCP served SCSI devices.
5254.11	Previous Discussions Here (Somewhere)	XDELTA::HOFFMAN	Steve, OpenVMS Engineering	`Mon Mar 31 1997 15:42`	9
	: Isn't the problem with unique allocation/mscp unit numbers just
	: in the mount verification routines? Did you look there?

	There was a long discussion on this a while back somewhere in this conference, and the limitation (from memory) was in one of the MSCP failure paths. It may well be mount verification... It was rather obscure, but it had been occasionally seen...
5254.12	See 5259.5 for a new manifestation of non-unique unit numbers...	XDELTA::HOFFMAN	Steve, OpenVMS Engineering	`Wed Apr 02 1997 11:39`	0