
Conference spezko::cluster

Title:+ OpenVMS Clusters - The best clusters in the world! +
Notice:This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator:PROXY::MOORE
Created:Fri Aug 26 1988
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:5320
Total number of notes:23384

5268.0. "read IO's going to MSCP disks after VAXSHAD12_061" by QUARK::SEKULA::BOJOVIC () Tue Mar 25 1997 23:37

Hello,

BRS site with FDDI, 3-member shadow sets (2 members on one side of the FDDI),
VAXSHAD09_061 applied. No problem, all read I/Os are going to the local HSJ disks.

VAXSHAD12_061 applied, cluster rebooted, and now 1/3 of all read I/Os are
going to the MSCP-served disk across the FDDI?! There is no increase in load, and
the I/O Request Queue Length is less than 1 for the individual drives!

Is this a bug in VAXSHAD12_061, or something else?

Thank you for reading this, regards,

Sekula Bojovic
CSC Sydney

P.S. Cross-posted in the VSM and CLUSTER conferences.
5268.1. by UTRTSC::utojvdbu1.uto.dec.com::JurVanDerBurg (Change mode to Panic!) Wed Mar 26 1997 02:09, 66 lines
That's a change in design. The old SHdriver favored local disks for reads;
the new one looks at the queue length on the disks. If the local device gets
overloaded, the MSCP-served disks will be selected.

I QAR'd the old behaviour some time ago for a specific case.

Jur.

QAR:

A customer complained that a backup of a shadow set took 50% more time from
one node than from another node. The config consists of two HS211 storage
servers (each nothing more than an AlphaServer 1000) and two 2100s. The tape
units are connected to the HS211s. VMS is V6.2 with ALPSHAD03_062.

A shadow set consists of two disks, each local to one of the HS211s.
When a backup is started on one of the HS211s (say A) we see that only
the local disk on this system is accessed, and never the remote MSCP-served
member located on the other HS211 (system B). This is the system where the
backup takes the longest time. If we run the backup on HS211 B we see that
both the local (from system B's view) and the remote (on system A) member
are accessed. The backup on system B takes 50% less time. Why the difference?

Well, the algorithm in SHdriver to select the best member for a read may need
some improvement. If we have a shadow set with two equal members (both
local, or both MSCP-served) then we select them simply by alternating.
If the set consists of a local member and an MSCP-served member we will
normally only select the local member. That is, unless there's a queue
on the device. In that case we start looking for another member to
do the read from, but since the only other member is MSCP-served we don't
select it. This search continues until we come to a queue depth of 20,
at which point we say 'well, something's probably wrong, we will just read
from the master member'. And that's what happens. The difference now is that
normally one of the systems mounting the shadow set will have the local disk as
the master member (this depends on the sequence of the disks in the mount
command), and that will be the system with the low throughput. For the other
system the master member is remote, and that's why it will use both members
to read from.
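
To make that concrete, here is a minimal sketch of the old selection logic in
Python; the names, structure and threshold handling are only illustrative,
not the actual SHdriver code:

    # Illustrative sketch of the old behaviour described above, for a
    # two-member set with one local and one MSCP-served member.
    # Not the actual SHdriver source.

    QUEUE_FALLBACK_DEPTH = 20    # depth at which the old code gave up

    def select_member_old(local_queue, master):
        """local_queue: queue length on the local member.
        master: 'local' or 'served', whichever happens to be the master member."""
        if local_queue == 0:
            return "local"       # idle local member is always chosen
        if local_queue < QUEUE_FALLBACK_DEPTH:
            return "local"       # queued, but the served member is still never picked
        return master            # depth >= 20: just read from the master member

    # On the system whose local disk is also the master member, reads never
    # leave that one disk; on the other system the fallback goes remote.
    print(select_member_old(25, "local"))    # -> local  (the low-throughput node)
    print(select_member_old(25, "served"))   # -> served (the faster backup node)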

As long as you limit backup via quotas to no more than 20 concurrent I/Os,
every read will always go to the local disk. As you increase the quotas you
will see that once you hit a queue of 20 the I/Os start to go to the
master member as well, which is the remote disk for one of the systems.

In this way we can explain why it takes much more time to make a backup
on one system than on the other. The 'solution' in this case was
to move the tape units away from the HS211s and onto the 2100s. Those systems
only see MSCP-served devices, so the problem doesn't show up there.

At first glance it seemed logical to connect the tape units to the
systems which have local access to the disks, but that way you may lose
performance.

Now I know that shadowing is not meant to increase performance but to
increase availability, but the reason for this QAR is that I think there's
room for improvement. In the normal case (equal members) we select
the members alternately on every I/O, which makes sure that they are all available.
In our case we don't do that, so we may (if the queue stays below 20) never
go out to read from the remote member, and so we may not detect a problem with it.

Bottom line:

I think there's no problem with reading from the remote member when the local
member has a queue depth greater than 1. This would be simple to implement,
would solve the perceived performance problem, and would make sure that all
members are available.
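
A sketch of that proposed rule, again illustrative only:

    # Proposed behaviour: go to the remote member as soon as the local
    # member already has more than one request queued on it.

    def select_member_proposed(local_queue, remote_queue):
        if local_queue > 1 and remote_queue < local_queue:
            return "remote"
        return "local"

    print(select_member_proposed(0, 0))   # -> local
    print(select_member_proposed(2, 0))   # -> remote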

5268.2. by EVMS::MORONEY Wed Mar 26 1997 12:24, 18 lines
re .0:

The algorithm for selecting which member to read from in shadowing was changed
in SHADOW96.  The old algorithm was, honestly, brain-damaged: it would not
use a remote member even if that member was entirely idle and the
local member had a long queue. The action in response to a queue
length greater than 20 was totally incorrect.  .1 mentions some of the details.

The new algorithm selects by queue length.  This is self-regulating: if, for
example, the remote disk takes twice as long to respond as a local disk, its
queue length will tend to rise and the local disk will be favored. The converse
is also true: if somehow the remote disk responds faster, it will be favored.

Since the queue lengths are equal (apparently nearly always 0), a "round robin"
algorithm is used to spread the load. The small queue lengths tell me there
wouldn't be any speedup even if only the local drives were used.
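
As a rough sketch of queue-length selection with a round-robin tie-break
(illustrative Python only, not the real driver code):

    import itertools

    # Round-robin counter used only to break ties between equal queue lengths.
    _rr = itertools.count()

    def select_member_new(members):
        """members: dict of member name -> current queue length.
        Pick the shortest queue; rotate among members when queues are equal."""
        shortest = min(members.values())
        candidates = sorted(name for name, q in members.items() if q == shortest)
        return candidates[next(_rr) % len(candidates)]

    # With idle disks the load simply alternates across all members:
    print(select_member_new({"local": 0, "served": 0}))   # -> local
    print(select_member_new({"local": 0, "served": 0}))   # -> served
    # A slower member builds a queue and stops being selected as often:
    print(select_member_new({"local": 1, "served": 4}))   # -> local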

-Mike
5268.3. by AMCFAC::RABAHY (dtn 471-5160, outside 1-810-347-5160) Wed Mar 26 1997 13:55, 18 lines
The underlying assumption is that the members of a shadow set perform nearly
identically.  If this is not true, then the load balancing can work against you.

Suppose one member can service 100 I/Os in 1 second while the other can only
service 50 (perhaps a firmware difference is the cause).  This translates into
service times of 10 ms and 20 ms respectively.  If a burst of 4 reads is
distributed evenly, then the total service time will be 40 ms.  If 3 reads had
gone to the fast member and the other one to the slow member, then the total
service time would have been only 30 ms.  The bigger the burst, the more it hurts.
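
A quick check of that arithmetic, using the hypothetical 10 ms / 20 ms service
times from above:

    def burst_completion_ms(assignment, service_ms):
        """assignment[i] = reads sent to member i;
        service_ms[i]  = per-read service time of member i (ms).
        The burst is done when the slowest member finishes its share."""
        return max(n * t for n, t in zip(assignment, service_ms))

    service = [10, 20]                           # fast member, slow member
    print(burst_completion_ms([2, 2], service))  # even split   -> 40 ms
    print(burst_completion_ms([3, 1], service))  # skewed split -> 30 ms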

A superior algorithm would dynamically partition the reads across the available
members in an effort to minimize service time.  Perhaps the best benefit would
come from partitioning based upon LBN.  For two equally capable members and a
random access pattern over the whole LBN range, this would send the low half of
the LBN range to one member and the high half to the other.  For a volume which
is only 50% populated, again with a random access pattern, one member would get
the low quarter of the LBN range and the other the second quarter.  For a volume
with two spikes in the LBN histogram the partition point would end up somewhere
in between.
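
A sketch of the LBN idea, with a made-up split point; nothing like this is in
the shipped driver:

    def select_member_by_lbn(lbn, split_lbn):
        """Route a read by its logical block number: everything below the
        split point goes to one member, everything above to the other.
        A real implementation would tune split_lbn dynamically from the
        observed LBN histogram."""
        return "member_A" if lbn < split_lbn else "member_B"

    # 1,000,000-block volume, only the low half populated: the split point
    # sits near block 250,000, so each member serves one quarter of the range.
    print(select_member_by_lbn(120_000, 250_000))   # -> member_A
    print(select_member_by_lbn(400_000, 250_000))   # -> member_B
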
5268.4. by AMCFAC::RABAHY (dtn 471-5160, outside 1-810-347-5160) Wed Mar 26 1997 14:27, 2 lines
Ultimately, past behavior may not be a good predictor of future activity.  So
an API enhancement to allow the application to suggest a member might be nice.
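
Purely as an illustration of that kind of hint (the parameter is hypothetical;
nothing like it exists in the real shadowing interface):

    def shadow_read(lbn, members, preferred=None):
        """members: dict of member name -> queue length.
        'preferred' is only advisory; fall back to queue-length selection
        when the hinted member is unknown or clearly the worse choice."""
        if preferred in members and members[preferred] <= min(members.values()) + 2:
            return preferred
        return min(members, key=members.get)

    print(shadow_read(0, {"local": 1, "served": 0}, preferred="local"))   # hint honoured
    print(shadow_read(0, {"local": 9, "served": 0}, preferred="local"))   # hint overridden
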
5268.5. "Some times simple is better" by VMSSPT::JENKINS (Kevin M Jenkins VMS Support Engineering) Thu Mar 27 1997 07:58, 14 lines
    
    This change was hashed about quite a bit before it was made.
    You can come up with all kinds of wonderful algorithms, and then
    you can also find a case that won't work well for them. It was
    decided to just simplify the whole thing. The original code cared only
    about local versus remote. It would use local over remote until the
    queue length got to 20. At that point it would use only the master.
    This would cause some undesirable effects under heavy loads.

    The new code simply uses queue depth. If a member is slower than
    another for any reason... it will maintain a higher queue than a faster
    member and hence not be selected as often.


5268.6. "VMS 7.1 feature?" by CHEFS::MCCAUGHAN_S (Shaun Mc Caughan 842-3515 @BSO) Tue Apr 29 1997 11:00, 7 lines
    Am I right in assuming that the latest shadow patches for V6.1 and 6.2 on
    Alpha or VAX would show the new behaviour with the read I/Os, and that this
    would be in VMS 7.1 as shipped?
    
    Regards
    
    Shaun
5268.7. by UTURBO::utoras-198-48-94.uto.dec.com::JurVanDerBurg (Change mode to Panic!) Tue May 06 1997 02:10, 6 lines
Re .-1

Yes.

Jur.