
Conference spezko::cluster

Title:+ OpenVMS Clusters - The best clusters in the world! +
Notice:This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator:PROXY::MOORE
Created:Fri Aug 26 1988
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:5320
Total number of notes:23384

5214.0. "Cluster / Disaster Tolerant questions" by COPCLU::BRIAN (Brian Krause @DMA, System Specialist) Thu Jan 23 1997 07:53

5214.1. "I was a BRS client!" by EVMS::PERCIVAL (OpenVMS Cluster Engineering) Thu Jan 23 1997 08:51 (56 lines)
5214.2 by EVMS::MORONEY (UHF Computers) Thu Jan 23 1997 11:43 (7 lines)
5214.3. "I'll get more info" by COPCLU::BRIAN (Brian Krause @DMA, System Specialist) Fri Jan 24 1997 04:53 (1 line)
    
5214.4. "Ooops" by COPCLU::BRIAN (Brian Krause @DMA, System Specialist) Fri Jan 24 1997 05:26 (56 lines)
    Oooopps - what happened here? I'll try again:

    Re .1:

    Hi Ian,

>   1.	Yes :-)  (sorry I couldn't resist!!!)

    Well - I asked for it ;-)

    2.

>   During backups, if opposite lobe access is required by applications to
>   a disk, you may well see a more significant degradation (we actually
>   measured this at 8ms per I/O - though many factors are involved in this
>   delay - it will be different for you!). If this is important, you could
>   add the third member as you suggest in your question #5.  You could
>   also manage your applications such that any given one only runs
>   primarily on one site.  Thus with duplicate site tape devices you will
>   always have a local disk available to that application.
    
    The application - or, to be honest, the applications - all use the same
    data, so there will be remote access during the backup period. But we could
    add an extra shadow set member if it turns out to be a problem.

    3.

>   The effect of the loading will depend on the number of writes, and the
>   current CPU utilisations.  Are your machines very heavily used, what are
>   your average INT STACK And Kernel mode utilisations? With an 80/20
>   utilisation you probably do not have a huge problem - but it all
>   depends on your I/O volume.

    I will try to get some more info and get back.

    Re .2:

    Mike - This doesn't sound very smart - or am I missing the point?

    The application and I/O load is pretty much symmetric between the two
    computer rooms. This means that an algorithm that favours local I/O
    would give almost no remote I/O.

    But if the algorithm goes for the shortest queue, the I/O would be
    split as roughly half local, half remote - based on the assumption that
    the queue length is about the same on both disk sets. The bottom line
    is then poorer I/O performance.

    This applies to reads, of course. For writes there would be no difference
    between the two algorithms, as the data always needs to be written to
    both disk sets. Did I understand that correctly?

    Any pointers to documents I should read?

    Best regards,
    Brian.
5214.5 by EVMS::MORONEY (UHF Computers) Mon Jan 27 1997 16:24 (55 lines)
re .4:

>    Mike - This doesn't sound very smart - or am I missing the point?
>
>    The application and I/O load is pretty much symmetric between the two
>    computer rooms. This means that an algorithm that favours local I/O
>    would give almost no remote I/O.

The new algorithm should be faster overall than the old one as implemented.  It
will favor the faster device over the slower based on the queue length, so it
will tend to select the faster device no matter what the reason for the slower
device being slower. 

>    But if the algorithm goes for the shortest queue, the I/O would be
>    split as roughly half local, half remote - based on the assumption that
>    the queue length is about the same on both disk sets. The bottom line
>    is then poorer I/O performance.

If the queues for devices that are responding at different speeds are equal,
the load is being split in proportion to their speeds.  The fast device is much
"better" at reducing the length of its queue, so it takes more I/Os to keep its
queue the same length as a slower device's.

Consider this thought experiment:  You have two stacks of coins.  Every second
you remove two coins from one stack and only one coin from the other, if the
stacks have coins to remove.  The first stack represents a device that can do
2 I/Os per second; the second can only do one I/O per second.

You also add coins to the stacks all the time. Sometimes you add one coin to
whichever stack is the shortest. This represents a read I/O.  Sometimes you add
one coin to each stack.  This is a write I/O. For a read I/O keep track of how
often you put a coin on the first stack and how often on the second stack.  If
both stacks are the same size, select one at random when doing a read I/O. 

I'll tell you what you'll see.  If the stacks are nearly always empty, you'll
be splitting the coins 50-50.  This is a nearly idle system.  If you are adding
coins fast enough that the stacks never drain to 0, and you're always doing
read I/Os, you'll find yourself adding coins to the first pile twice as often
as to the second "slower" pile.  If you are adding one coin to both piles
fairly often (a write I/O), the bias in favor of the faster disk for reads
becomes even greater.
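
Purely as an illustration (this is not shadowing driver code), a small Python
simulation of that experiment - member 0 drains two coins per tick, member 1
drains one, reads go on the shorter pile, writes go on both:

import random

def simulate(reads_per_tick, writes_per_tick, ticks=100_000):
    q = [50, 50]            # outstanding I/Os ("coins") on each member
    service = [2, 1]        # completions per tick: fast member, slow member
    reads_to = [0, 0]       # where the reads were placed
    for _ in range(ticks):
        for _ in range(writes_per_tick):     # a write adds one coin to each pile
            q[0] += 1
            q[1] += 1
        for _ in range(reads_per_tick):      # a read goes on the shorter pile
            if q[0] == q[1]:
                target = random.randrange(2)
            else:
                target = 0 if q[0] < q[1] else 1
            q[target] += 1
            reads_to[target] += 1
        for m in (0, 1):                     # each member completes its I/Os
            q[m] = max(0, q[m] - service[m])
    return reads_to

print(simulate(reads_per_tick=3, writes_per_tick=0))   # roughly a 2:1 split
print(simulate(reads_per_tick=1, writes_per_tick=1))   # reads go almost only to the fast member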

If the system interlink is a bottleneck, the remote device will not shorten
its queue as quickly as the local device.  The local device will get more
of the reads.  If the system interlink is not a bottleneck, so that the
local and remote devices respond at nearly equal speeds (say, two Alphas
connected via Memory Channel), there is no reason to prefer the local device.

For your application (load and shadowsets split evenly between two sites)
and 100% reads, you'll see some increase in "unnecessary" cross traffic, but
it will be self-throttling.  With more writes (which must be cross-traffic
anyways) the reads will tend to stay local.

-Mike
5214.6. "OK - we talk coins ;-)" by COPCLU::BRIAN (Brian Krause @DMA, System Specialist) Tue Jan 28 1997 03:28 (30 lines)
    Re .5
    
    Well, piles of coins or I/O queues - I don't care.

    My point is: What one machine regards as remote is local to the other.
    As the machines at both sites make about the same number of I/Os, the
    queues will both be about the same length. And then the I/Os will be
    split equally - you say?

    Or - if you want to put it in coins ;-)

    I have a pile of coins, my pal has a pile of coins. When I take a coin,
    I always take a coin from the smallest pile (very unsocial - isn't it).
    It takes a little longer to grab a coin from my pal's pile, as I have to
    reach for it. But I don't care - I ALWAYS take a coin from the smallest
    pile. This way I get more coins, I think. If the piles are of equal size
    I toss a coin ;-) and choose either pile.

    My pal thinks the same way. This way we grab half from our own piles,
    half from each other's.

    Someone is nice enough to keep adding coins to both our piles - the same
    number to me and to my pal. Well, now the point: when we grab half the
    coins from each other's piles, and this takes longer than if we took all
    the coins from our own piles, then we both get fewer coins, and the
    piles of coins get bigger and bigger.

    Or if you will - the disk queues grow, and we make fewer I/Os.

    Am I wrong?
5214.7 by EVMS::MORONEY Tue Jan 28 1997 14:46 (27 lines)
re .6:

>    My point is: What one machine regards as remote is local to the other.
>    As the machines at both sites make about the same number of I/Os, the
>    queues will both be about the same length. And then the I/Os will be
>    split equally - you say?

You must remember that there are 4 'stacks of coins', not 2.  One pair is
System 1's picture of the two members and its (System 1's) I/Os, the other is
System 2's picture of the two members and its (System 2's) I/Os.  As one of
each pair is remote and the other local it is NOT accurate to say they will do
the same number of I/Os.  The underlying disks will (assuming identical setups)
but the I/O to the underlying disks are the TOTAL of that from both systems. 
One system doesn't get to see the size of the other's queue.

Also, I think you misunderstand the picture.  Coins "taken" from the stack
represent completed I/Os, and MSCP-served I/Os don't get lower priority to
complete, as you seem to imply.  There will just be a transfer delay.

>    I have a pile of coins, my pal has a pile of coins. When I take a coin,
>    I always take a coin from the smallest pile (very unsocial - isn't it).

I don't understand what the "take a coin from the smallest pile" is supposed
to represent.  Shadowing _adds_ coins to the smallest pile (the disk with the
shortest queue).

-Mike
5214.8. "queue length is less than optimal" by AMCFAC::RABAHY (dtn 471-5160, outside 1-810-347-5160) Tue Jan 28 1997 15:28 (47 lines)
Each node has a separate queue for each member of a shadow set.

Suppose $1$DUA100: is at site 1 and $2$DUA100: is at site 2.  They are combined
into a shadow set called DSA100:.  An application running on a node at site 1
has a local (i.e. direct) path to $1$DUA100: and a remote (i.e. MSCP-served) path
to $2$DUA100:.  When the application issues a read to DSA100:, the DS driver
simply forwards the read to one of the members of the shadow set.  Which member
is chosen depends on the queue length.  If the queues are the same length then
it is random.  Each node maintains a node-private queue to each member of the
shadow set.  The queues on one node do not affect the queues on another node.
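
To make that picture concrete, here is a minimal Python sketch of the per-node
view (it is not the real DS driver; the member names are just the
$1$DUA100:/$2$DUA100: examples above, and the class and method names are made
up for illustration):

import random

class ShadowSetView:
    """One node's private picture of a two-member shadow set (e.g. DSA100:)."""
    def __init__(self, members=("$1$DUA100:", "$2$DUA100:")):
        self.members = members
        self.queue = {m: 0 for m in members}   # this node's outstanding I/Os only

    def issue_read(self):
        # Forward the read to the member with the shorter node-private queue,
        # picking at random on a tie.
        a, b = self.members
        if self.queue[a] == self.queue[b]:
            target = random.choice(self.members)
        elif self.queue[a] < self.queue[b]:
            target = a
        else:
            target = b
        self.queue[target] += 1
        return target

    def issue_write(self):
        # A write must go to every member.
        for m in self.members:
            self.queue[m] += 1

    def complete(self, member):
        # Called when an I/O to 'member' finishes (local or MSCP-served).
        self.queue[member] -= 1

# Two nodes keep two independent views; site 1's choices never see site 2's queues.
site1 = ShadowSetView()
site2 = ShadowSetView()
print(site1.issue_read(), site2.issue_read())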

The coin thought experiment was twisted between .5 and .6 -- in .5 the taker of
coins is the device driver and the source of coins is the application; in .6
there apparently is some sort of competition to get the most coins.

The object is to get the most work done as quickly as possible.  There is a
tradeoff between bandwidth and latency to be considered.

I think where .6 went wrong is that you are supposed to envision that you have
two piles and your friend has two piles.  Putting a coin onto one of your piles
represents issuing a read; putting one coin on each of your piles represents
issuing a write.  You never touch your friend's piles.  Meanwhile, someone is
taking coins from one of your piles and from the corresponding one of your
friend's piles.

Suppose one member of the shadow set can do 1,000 reads per second (good cache);
if the response time is uniform, then the latency is 1 ms.  Further suppose the
other member of the shadow set can only do 1 read per second; again with a
uniform response time the latency is 1 s.  Now, the first read finds both queues
empty and so is forwarded randomly -- half the time going to the fast disk and
the other half to the slow disk.  This is less than optimal, but who the heck
cares?  Only one bloody read has been issued, and even if it takes 1 s to do,
there's little harm done.  Now, a second read comes later, enough later that the
first read is long gone.  The same logic applies here.  Only when a burst of
reads is issued within a short enough window of time do the queues begin to build.

Since the fast member can get 1,000 reads done in 1 second, only if an
instantaneous load of 1,001 reads is presented should a read go to the slow member.
If 500 reads are presented at t0 and 501 more are presented at t0+.5s, then the
best performance is achieved by sending one of the original reads to the slow
member, but it is kinda hard to know that at time t0.
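
A rough Python illustration of the point in the title - for an instantaneous
burst against members this far apart, balancing queue lengths is less than
optimal.  The 1 ms / 1000 ms figures are the hypothetical ones above; this is
not driver code:

def burst_completion_ms(n_reads, service_ms=(1, 1000)):
    # Queue-length rule on an instantaneous burst: the counts stay balanced,
    # so roughly half the reads land on each member.
    half = n_reads // 2
    queue_rule = max(half * service_ms[0], (n_reads - half) * service_ms[1])
    # Best possible split: try every division of the burst and keep the cheapest.
    best = min(max((n_reads - k) * service_ms[0], k * service_ms[1])
               for k in range(n_reads + 1))
    return queue_rule, best

for n in (2, 100, 1001, 10_000):
    print(n, burst_completion_ms(n))   # queue-length rule vs. best split, in ms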

Btw, a synchronous application is clearly hurt by this algorithm.  Presumably,
either the members are not so far out of balance, modern applications use
asynchronous I/O, middleware intervenes to transform synchronous I/O into
asynchronous I/O, or multiple applications will be executing concurrently,
minimizing the effect by getting over the threshold where a goodly proportion
of reads will find their way to the fast member.
5214.9 by AMCFAC::RABAHY (dtn 471-5160, outside 1-810-347-5160) Tue Jan 28 1997 15:58 (7 lines)
I doubt it is possible for the DS driver to know how quickly each member will
perform a particular read.  Some heuristic is required.  Queue length might be
as good as it gets.

Perhaps reads should be partitioned based on LBN?  Send all those in the lower
range to one member and those in the higher range to the other in an effort to
minimize seek time?  Naturally writes break the symmetry.
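
A toy Python sketch of that idea, purely hypothetical - this is not how the
shadowing driver chooses a member, and the boundary LBN is made up:

LBN_BOUNDARY = 2_000_000          # assumed midpoint of the member devices

def pick_read_member(lbn, members=("$1$DUA100:", "$2$DUA100:")):
    # Reads below the boundary go to one member, the rest to the other,
    # so each member's heads stay in its own half of the disk.
    return members[0] if lbn < LBN_BOUNDARY else members[1]

def pick_write_members(members=("$1$DUA100:", "$2$DUA100:")):
    # Every write still has to go to every member.
    return list(members)

print(pick_read_member(123_456))      # low LBN -> first member
print(pick_read_member(3_000_000))    # high LBN -> second member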
5214.10 by EVMS::MORONEY Tue Jan 28 1997 16:15 (11 lines)
re .9:

No, there really isn't much other info available for determining which member
will give the fastest response time.  Besides, shadowing's purpose in life is
data availability and reliability; any speedup is a "freebie" bonus, not a goal
of the driver.

Partitioning by LBN wouldn't work that well anyway; consider the situation
of a heavily used database file that resided on the first 1/3 of a disk drive,
with the rest of the drive mostly unused.

-Mike
5214.11 by AMCFAC::RABAHY (dtn 471-5160, outside 1-810-347-5160) Tue Jan 28 1997 16:28 (7 lines)
Actually, my long-standing point of view is that shadowing should not give any
performance benefit.  Too many times I've had customers become dependent upon
the level of performance they get while the shadow set is operating normally,
and then be severely disappointed by the degraded performance during a failure
and the even worse performance during the recovery.  They want shadowing to give
the same level of performance at all times, even during a recovery.  Instead
they end up having to carefully monitor utilization and reconfigure artificially
early.
5214.12. "Isn't there another effect?" by WIBBIN::NOYCE (Pulling weeds, pickin' stones) Tue Jan 28 1997 17:51 (5 lines)
When the remote system sends me a read or write, and it
gets executed by my MSCP server on my local disk, does
it get inserted into my local queue for that disk?  Wouldn't
that affect the load-balancing measurement?  Or do these
forwarded I/O's bypass that queue somehow?
5214.13 by UTRTSC::thecat.uto.dec.com::JurVanDerBurg (Change mode to Panic!) Wed Jan 29 1997 02:08 (9 lines)
>When the remote system sends me a read or write, and it
>gets executed by my MSCP server on my local disk, does
>it get inserted into my local queue for that disk?

The MSCP server just talks to the local disk driver, and any I/O request
is simply inserted into the local queue.

Jur.

5214.14 by AMCFAC::RABAHY (dtn 471-5160, outside 1-810-347-5160) Wed Jan 29 1997 13:53 (15 lines)
re .11:

So, sell 'em a third member to add to the shadow set.  Then when there is a
failure the performance will degrade only to the level of two members.  I
suppose they might still be disappointed after being spoiled by the performance
of three members.

It would be nice to have a feature that let us deliberately throttle performance
to avoid degradation during failure/recovery.

re .10:

Hmm, let the boundary float dynamically, attempting to balance the load.  How
much write activity can be tolerated before an LBN-based partition becomes
unacceptable?