Conference spezko::cluster

Title:	+ OpenVMS Clusters - The best clusters in the world! +
Notice:	This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator:	PROXY::MOORE

Created:	Fri Aug 26 1988
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	5320
Total number of notes:	23384

5313.0. "More than one HSJ disk in a cluster w/same unit?" by CRLRFR::BLUNT () Mon May 19 1997 11:26

    
    I'm aware that this is a rudimentary question, but I can't find a
    reference anywhere for this situation.  For the sake of argument,
    let's say that I'm adding some HSJs to an existing cluster.  During the
    configuration process, what would the expected behavior be if a duplicate 
    device number is added and allowed to go on-line in the cluster (for
    instance, redundant controllers X&Y have a D210, and redundant
    controllers M&N ADD another D210)?
    
    This is a 3-node cluster of TurboLasers with 20 HSJ40 controllers.  The
    cluster has 4 CI paths, and the HSJs are basically evenly spread
    across the CIs.  In the above config, the HSJ pair with the EXISTING 
    D210 are at a HIGHER CI address than the controller pair on which a
    second D210 was added.
    
    Frankly, it IS understood that this is an invalid config.  I'm figuring
    that the results may vary.  However, I'm being asked to provide an
    explanation of WHY the D210 that WAS mounted cluster-wide was preempted
    by the "NEW" D210.  Someone made a mistake, and now we're required to
    provide a reason for the action our system "took."
    
    Thanks
    
    bob

T.R	Title	User	Personal Name	Date	Lines
5313.1		BSS::JILSON	WFH in the Chemung River Valley	`Mon May 19 1997 12:54`	4
	We don't test unsupported operations so how can we describe what behaviours happen when you do something unsupported? Jilly
5313.2	Use a different Allo class	SSDEVO::MARTENS	Bert Martens, CXO Storage Solutions	`Mon May 19 1997 14:41`	14
	The CI address is not part of the issue. The cause of the problem is that MSCP will not support 2 units with the same unit number in the same allocation class. It should spin down the unit(s). Since the only component that will notice the duplicate unit numbers is OpenVMS, that is what took the action. Notice that placing the drive into mount verify is used to prevent data integrity issues. I don't remember the "rules" that OpenVMS uses to manage this condition. Regards, Bert
5313.3	See 5254.* For Another Unit Number Discussion	XDELTA::HOFFMAN	Steve, OpenVMS Engineering	`Mon May 19 1997 17:48`	0
5313.4	Both drives spun down	CSC32::S_DANNEN	Live long and slobber	`Tue May 20 1997 08:54`	3
	Last time I saw this, both drives spun down /steve
5313.5		STAR::CROLL		`Tue May 20 1997 09:33`	18
	What do you mean by "preempt"? Can you give some more details about exactly what happened? I've been poking around in DUDRIVER, and I don't see any place where DUDRIVER spins down a unit when there are duplicate unit numbers; DUDRIVER appears to treat duplicate unit numbers (i.e., new unit attention messages that match an existing UCB) as a new path to the existing unit. From DUDRIVER's perspective, the scenario described in .0 would simply be a third (and fourth) path to the same gizmo. I'll go ask the DUDRIVER maintainer to see if my interpretation is correct.... John
5313.6	It's a feature.....	STAR::CROLL		`Tue May 20 1997 10:00`	30
	I talked with the DUDRIVER maintainer, and he confirmed my interpretation in the prior reply. If by "preempt" in the base note, you meant that I/O operations started going to the new D210 instead of the old one, this probably happened as a result of the static load balancing DUDRIVER performs when it discovers a new path. DUDRIVER gets a snapshot of the current load on the HSJ at the time it forms a connection, and uses this in path assignment when new paths to existing units show up. What probably happened is that when the new path to D210 was discovered, DUDRIVER noticed that the new HSJ had less load than the old one, and therefore switched the path to the new HSJ. DUDRIVER matches the allocation class, device class letters (the "dd" part of the "ddcu" device name), and the unit number against existing units. If there's a match, DUDRIVER assumes it's another path to the same unit. This is why you must always have different allocation classes for different units -- this is a good idea even if you don't have units with the same unit number. Is this enough of an explanation? If you need something more formal, log an IPMT.... John
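The matching and path-switching behavior described in .6 can be illustrated with a small model. This is only a sketch of the logic as John describes it, not DUDRIVER code: the `PathTable` and `Unit` classes, the controller names `HSJ_X` and `HSJ_M`, the load figures, and the "DU" device letters are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# (allocation class, "dd" device letters, unit number); hypothetical model key
UnitKey = Tuple[int, str, int]


@dataclass
class Unit:
    key: UnitKey
    paths: List[str] = field(default_factory=list)  # controllers offering this unit
    active_path: Optional[str] = None               # controller currently used for I/O


class PathTable:
    """Toy model of 'same name means same unit' plus static load balancing."""

    def __init__(self) -> None:
        self.units: Dict[UnitKey, Unit] = {}
        self.load_at_connect: Dict[str, int] = {}

    def connect(self, controller: str, load: int) -> None:
        # Snapshot of the controller's load, taken once when the connection forms.
        self.load_at_connect[controller] = load

    def unit_attention(self, controller: str, alloc_class: int,
                       dev_letters: str, unit_number: int) -> Unit:
        key = (alloc_class, dev_letters, unit_number)
        unit = self.units.get(key)
        if unit is None:
            # No existing unit matches: treat it as a genuinely new unit.
            unit = Unit(key=key, active_path=controller)
            self.units[key] = unit
        # A match is assumed to be another path to the SAME unit.
        unit.paths.append(controller)
        # Static load balancing: prefer the controller with the lower snapshot load.
        if self.load_at_connect[controller] < self.load_at_connect[unit.active_path]:
            unit.active_path = controller
        return unit


table = PathTable()
table.connect("HSJ_X", load=50)   # pair holding the existing D210
table.connect("HSJ_M", load=10)   # lightly loaded pair where a second D210 was added
old = table.unit_attention("HSJ_X", 1, "DU", 210)
new = table.unit_attention("HSJ_M", 1, "DU", 210)
print(new is old, old.active_path)
```

In this model the second D210 is folded into the existing unit and the active path moves to the less loaded controller, which matches the symptom in .0. Giving the second controller pair a different allocation class changes the key, so the two disks are tracked as separate units.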
5313.7	'Bout what I thought	CRLRFR::BLUNT		`Tue May 20 1997 13:36`	12
	Yes, I/O ops started going to the new unit. In their config, this was unfortunately an Oracle index. Bad thing. However, I've received the information that I needed. While I understand that "we" don't test unsupported configurations, I can't imagine that "we" haven't at some time (either planned or not) idiot tested our gear (or KNOW explicitly what would happen). The bottom line was that the customer did, and wanted an explanation. So, John, this level of explanation is fine. Thanks! bob
5313.8	Ancient history	CSC32::S_DANNEN	Live long and slobber	`Tue May 20 1997 14:57`	13
	John, I am referring to old history, of course. I was putting together several stacks of RA82s on HSC70s, and cut the wrong piece off the unit plug on one drive (duplicate unit numbers). The drive would go through self-test at power up, spin up, then set the fault LEDs indicating microprocessor board, HDA, power supply, hybrid board, and spin down. Imagine how upset I was after replacing everything except checking the unit plug! :) The strange thing was that the duplicate unit number drive was in another SAxxx cab, and would only fail when it completed its idle loop self test (these disks were not on-line to VMS at the time). Ah the good old days! /steve
5313.9		STAR::CROLL		`Wed May 21 1997 09:59`	19
	re .8: I believe that if an HSC or HSJ sees duplicate unit numbers on the drives directly connected, it'll spin one (or both) down. The HSx has more knowledge about the configuration and what's "legal" -- DUDRIVER, on the other hand, has to deal with a lot more configuration complexity..... re .7: As for idiot testing: we do a huge amount of testing, but we concentrate on making sure the stuff we officially support works properly. Stuff that is not supported either doesn't work, or was never designed to work in the unsupported ways. The supported stuff is complicated enough without opening up the test matrix to everything else. Besides, we did know what was going on in this situation; it just took a bit of digging. John