
Conference spezko::cluster

Title: + OpenVMS Clusters - The best clusters in the world! +
Notice: This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator: PROXY::MOORE
Created: Fri Aug 26 1988
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 5320
Total number of notes: 23384

5313.0. "More than one HSJ disk in a cluster w/same unit?" by CRLRFR::BLUNT () Mon May 19 1997 12:26

    
    I'm aware that this is a rudimentary question, but I can't find a
    reference anywhere for this situation.  For the sake of argument,
    let's say that I'm adding some HSJs to an existing cluster.  During the
    configuration process, what would the expected behavior be if a duplicate 
    device number is added and allowed to go on-line in the cluster (for
    instance, redundant controllers X&Y have a D210, and redundant
    controllers M&N ADD another D210)?
    
    This is a 3-node cluster of TurboLasers with 20 HSJ40 controllers.
    The cluster has 4 CI paths, and the HSJs are basically evenly spread
    across the CIs.  In the above config, the HSJ pair with the EXISTING 
    D210 is at a HIGHER CI address than the controller pair on which a
    second D210 was added.
    
    Frankly, it IS understood that this is an invalid config.  I'm figuring
    that the results may vary.  However, I'm being asked to provide an
    explanation WHY the D210 that WAS mounted cluster-wide was preempted
    by the "NEW" D210.  Someone made a mistake, and now we're required to
    provide a reason for the action our system "took."
    
    Thanks
    
    bob
T.R  Title  User  Personal Name  Date  Lines
5313.1. "" by BSS::JILSON (WFH in the Chemung River Valley) Mon May 19 1997 13:54  4 lines
We don't test unsupported operations so how can we describe what behaviours 
happen when you do something unsupported?

Jilly
5313.2. "Use a different Allo class" by SSDEVO::MARTENS (Bert Martens, CXO Storage Solutions) Mon May 19 1997 15:41  14 lines
    The CI address is not part of the issue. The cause of the problem is 
    that MSCP will not support 2 units with the same unit number in the
    same allocation class. It should spin down the unit(s). Since the only
    component that will notice the duplicate unit numbers is OpenVMS,
    that is what took the action. Notice that the placing of the drive
    into mountverify is used to prevent data integrity issues.
    
    I don't remember what "rules" OpenVMS uses to manage this
    condition.
    
    
    Regards,
    Bert
    
5313.3. "See 5254.* For Another Unit Number Discussion" by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Mon May 19 1997 18:48  0 lines
5313.4. "Both drives spun down" by CSC32::S_DANNEN (Live long and slobber) Tue May 20 1997 09:54  3 lines
    Last time I saw this, both drives spun down
    
    /steve
5313.5. "" by STAR::CROLL () Tue May 20 1997 10:33  18 lines
What do you mean by "preempt"?

Can you give some more details about exactly what
happened?

I've been poking around in DUDRIVER, and I don't see any
place where DUDRIVER spins down a unit when there are
duplicate unit numbers; DUDRIVER appears to treat
duplicate unit numbers (i.e., new unit attention
messages that match an existing UCB) as a new path to
the existing unit.  From DUDRIVER's perspective, the
scenario described in .0 would simply be a third (and
fourth) path to the same gizmo.

I'll go ask the DUDRIVER maintainer to see if my
interpretation is correct....

John
5313.6. "It's a feature....." by STAR::CROLL () Tue May 20 1997 11:00  30 lines
I talked with the DUDRIVER maintainer, and he confirmed
my interpretation in the prior reply.

If by "preempt" in the base note, you meant that I/O
operations started going to the new D210 instead of the
old one, this probably happened as a result of the
static load balancing DUDRIVER performs when it
discovers a new path.

DUDRIVER gets a snapshot of the current load on the HSJ
at the time it forms a connection, and uses this in path
assignment when new paths to existing units show up. 
What probably happened is when the new path to D210 was
discovered, DUDRIVER noticed that the new HSJ had less
load than the old one, and therefore switched the path
to the new HSJ.
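
As a rough illustration of the static balancing described
above, here is a toy Python model.  It is a sketch only,
not DUDRIVER source: the Path class, the pick_path
function, the controller names, and the load figures are
all made up for the example, and the tie-breaking rule is
an assumption.

```python
from dataclasses import dataclass


@dataclass
class Path:
    controller: str     # hypothetical controller name
    load_snapshot: int  # load sampled once, when the connection was formed


def pick_path(current: Path, new: Path) -> Path:
    """Choose the path to use when a new path to an existing unit appears."""
    # Static balancing: only the stored snapshots are compared; nothing is
    # re-sampled.  On a tie the current path is kept (an assumption).
    return new if new.load_snapshot < current.load_snapshot else current


# The existing D210 sits behind a busy controller; the newly added
# "D210" sits behind a lightly loaded one.
old = Path("HSJ_X", load_snapshot=70)
new = Path("HSJ_M", load_snapshot=15)
print(pick_path(old, new).controller)
```

In this model the unit's I/O moves to HSJ_M simply because
its snapshot was lower, which is the kind of switch
described in the base note.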

DUDRIVER matches the allocation class, device class
letters (the "dd" part of the "ddcu" device name), and
the unit number against existing units.  If there's a
match, DUDRIVER assumes it's another path to the same
unit.  This is why you must always have different
allocation classes for different units -- this is a good
idea even if you don't have units with the same unit
number.
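
The matching rule can be sketched the same way.  Again,
this is a toy Python model under stated assumptions, not
the driver's actual data structures: unit_attention, the
units dictionary, and the controller names are
hypothetical, and a real UCB holds far more than a list of
controllers.

```python
from typing import Dict, List, Tuple

# Key used for matching: (allocation class, "dd" letters, unit number)
UnitKey = Tuple[int, str, int]


def unit_attention(units: Dict[UnitKey, List[str]], alloc_class: int,
                   dd: str, unit: int, controller: str) -> str:
    """Record a newly seen unit; report whether it matched an existing one."""
    key = (alloc_class, dd, unit)
    if key in units:
        # Same key as an existing unit: assumed to be another path to it.
        units[key].append(controller)
        return "new path"
    units[key] = [controller]
    return "new unit"


units: Dict[UnitKey, List[str]] = {}
print(unit_attention(units, 1, "DU", 210, "HSJ_X"))  # first sighting
print(unit_attention(units, 1, "DU", 210, "HSJ_Y"))  # redundant partner
print(unit_attention(units, 1, "DU", 210, "HSJ_M"))  # a different disk!
print(unit_attention(units, 2, "DU", 210, "HSJ_M"))  # other alloc class
```

The third call is the case from .0: a physically different
disk with the same key is folded in as one more path.  With
a different allocation class (fourth call) it is kept
apart as its own unit.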

Is this enough of an explanation?  If you need something
more formal, log an IPMT....

John
5313.7. "'Bout what I thought" by CRLRFR::BLUNT () Tue May 20 1997 14:36  12 lines
    
    Yes, I/O ops started going to the new unit.  In their config, this was
    unfortunately an Oracle index.  Bad thing.  However, I've received the
    information that I needed.  While I understand that "we" don't test
    unsupported configurations, I can't imagine that "we" haven't at some
    time (either planned or not) idiot tested our gear (or KNOW explicitly
    what would happen).  The bottom line was that the customer did, and
    wanted an explanation.  
    
    So, John, this level of explanation is fine.  Thanks!
    
    bob
5313.8. "Ancient history" by CSC32::S_DANNEN (Live long and slobber) Tue May 20 1997 15:57  13 lines
    John,
    I am referring to old history, of course. I was putting together
    several stacks of RA82s on HSC70s and cut the wrong piece off
    the unit plug on one drive (duplicate unit numbers). The drive would
    go through self-test at power-up, spin up, then set the fault LEDs
    indicating microprocessor board, HDA, power supply, hybrid board,
    and spin down. Imagine how upset I was after replacing everything
    except checking the unit plug! :) The strange thing was that the
    duplicate unit number drive was in another SAxxx cab, and would only
    fail when it completed its idle loop self-test (these disks were not
    on-line to VMS at the time). Ah, the good old days!
    
    /steve
5313.9. "" by STAR::CROLL () Wed May 21 1997 10:59  19 lines
re .8:
I believe that if an HSC or HSJ sees duplicate unit
numbers on the drives directly connected, it'll spin one
(or both) down.  The HSx has more knowledge about the
configuration and what's "legal" -- DUDRIVER on the
other hand, has to deal with a lot more configuration
complexity.....

re .7:
As for idiot testing:  we do do a huge amount of
testing, but we concentrate on making sure the stuff we
officially support works properly.  Stuff that is not
supported either doesn't work, or was never designed to
work in the unsupported ways.  The supported stuff is
complicated enough without opening up the test matrix to
everything else.  Besides, we *did* know what was going
on in this situation; it just took a bit of digging.

John