
Conference spezko::cluster

Title:+ OpenVMS Clusters - The best clusters in the world! +
Notice:This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator:PROXY::MOORE
Created:Fri Aug 26 1988
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:5320
Total number of notes:23384

5268.0. "read IO's going to MSCP disks after VAXSHAD12_061" by QUARK::SEKULA::BOJOVIC () Tue Mar 25 1997 23:37

Hello,

BRS site with FDDI, 3-member shadow sets (2 members on one side of the FDDI),
VAXSHAD09_061 applied. No problem, all read I/Os are going to the local HSJ disks.

VAXSHAD12_061 applied, cluster rebooted, and now 1/3 of all read I/Os are
going to the MSCP-served disk across the FDDI?! There is no increase in load, and
the I/O Request Queue Length is less than 1 for the individual drives!

Is this a bug in VAXSHAD12_061, or something else?

Thank you for reading this, regards,

Sekula Bojovic
CSC Sydney

P.S. Cross-posted in the VSM and CLUSTER conferences.
5268.1. by UTRTSC::utojvdbu1.uto.dec.com::JurVanDerBurg (Change mode to Panic!) Wed Mar 26 1997 02:09, 66 lines
That's a change in design. The old SHdriver favored local disks for reads;
the new one looks at the queue length on the disks. If the local device gets
overloaded, the MSCP-served disks will be selected.

I QAR'd the old behaviour some time ago for a specific case.

Jur.

QAR:

A customer complained that a backup of a shadow set took 50% more time from
one node than from another node. The config consists of two HS211 storage
servers (each nothing more than an AlphaServer 1000) and two 2100s. The tape
units are connected to the HS211s. VMS is V6.2 with ALPSHAD03_062.

A shadow set consists of two disks, each local to one of the HS211s.
When a backup is started on one of the HS211s (say A) we see that only
the local disk on this system is accessed, and never the remote MSCP-served
member located on the other HS211 (system B). This is the system where the
backup takes the longest time. If we run the backup on HS211 B we see that
both the local (from system B's view) and the remote (on system A) member
are accessed. The backup on system B takes 50% less time. Why the difference?

Well, the algorithm in SHdriver to select the best member for a read may need
some improvement. If we have a shadow set with two equal members (both
local, or both MSCP-served) then we select them simply by alternating.
If the set consists of a local member and an MSCP-served member we will
normally only select the local member. That is, unless there's a queue
on the device. In that case we start looking for another member to
do the read from, but since the only other member is MSCP-served we don't
select it. This search continues until we come to a queue depth of 20,
at which point we say 'well, something's probably wrong, we will just read
from the master member'. And that's what happens. The difference now is that
normally one of the systems mounting the shadow set will have the local disk as
the master member (this depends on the sequence of the disks in the mount
command), and that will be the system with the low throughput. For the other
system the master member is remote, and that's why it will use both members
to read from.
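
To make that concrete, here is a minimal sketch of the old selection logic in
Python; the names, structure and threshold handling are only illustrative,
not the actual SHdriver code:

    # Illustrative sketch of the old behaviour described above, for a
    # two-member set with one local and one MSCP-served member.
    # Not the actual SHdriver source.

    QUEUE_FALLBACK_DEPTH = 20    # depth at which the old code gave up

    def select_member_old(local_queue, master):
        """local_queue: queue length on the local member.
        master: 'local' or 'served', whichever happens to be the master member."""
        if local_queue == 0:
            return "local"       # idle local member is always chosen
        if local_queue < QUEUE_FALLBACK_DEPTH:
            return "local"       # queued, but the served member is still never picked
        return master            # depth >= 20: just read from the master member

    # On the system whose local disk is also the master member, reads never
    # leave that one disk; on the other system the fallback goes remote.
    print(select_member_old(25, "local"))    # -> local  (the low-throughput node)
    print(select_member_old(25, "served"))   # -> served (the faster backup node)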

As long as you limit backup via quotas to no more than 20 concurrent I/Os,
every read will always go to the local disk. As you increase the quotas you
will see that once you hit a queue of 20 the I/Os start to go to the
master member as well, which is the remote disk for one of the systems.

In this way we can explain why it takes much more time to make a backup
on one system than on the other. The 'solution' in this case was
to move the tape units away from the HS211s and onto the 2100s. Those systems
only see MSCP-served devices, so the problem doesn't show up there.

At first glance it seemed logical to connect the tape units to the
systems which have local access to the disks, but that way you may lose
performance.

Now I know that shadowing is not meant to increase performance but to
increase availability, but the reason for this QAR is that I think there's
room for improvement. In the normal case (equal members) we select
the members alternately on every I/O, which makes sure that they are all available.
In our case we don't do that, so we may (if the queue stays below 20) never
go out to read from the remote member, and so we may not detect a problem with it.

Bottom line:

I think there's no problem with reading from the remote member when the local
member has a queue depth greater than 1. This would be simple to implement,
would solve the perceived performance problem, and would make sure that all
members are available.
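
A sketch of that proposed rule, again illustrative only:

    # Proposed behaviour: go to the remote member as soon as the local
    # member already has more than one request queued on it.

    def select_member_proposed(local_queue, remote_queue):
        if local_queue > 1 and remote_queue < local_queue:
            return "remote"
        return "local"

    print(select_member_proposed(0, 0))   # -> local
    print(select_member_proposed(2, 0))   # -> remote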

5268.2. by EVMS::MORONEY Wed Mar 26 1997 12:24, 18 lines
re .0:

The algorithm for selecting which member to read from in shadowing was changed
in SHADOW96.  The old algorithm was, honestly, brain-damaged: it would not
use a remote member even if that member was entirely idle and the
local member had a long queue. The action in response to a queue
length greater than 20 was totally incorrect.  .1 mentions some of the details.

The new algorithm selects by queue length.  This is self-regulating: if, for
example, the remote disk takes twice as long to respond as a local disk, its
queue length will tend to rise and the local disk will be favored. The converse
is also true: if somehow the remote disk responds faster, it will be favored.

Since the queue lengths are equal (apparently nearly always 0), a "round robin"
algorithm is used to spread the load. The small queue lengths tell me there
wouldn't be any speedup even if only the local drives were used.
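
As a rough sketch of queue-length selection with a round-robin tie-break
(illustrative Python only, not the real driver code):

    import itertools

    # Round-robin counter used only to break ties between equal queue lengths.
    _rr = itertools.count()

    def select_member_new(members):
        """members: dict of member name -> current queue length.
        Pick the shortest queue; rotate among members when queues are equal."""
        shortest = min(members.values())
        candidates = sorted(name for name, q in members.items() if q == shortest)
        return candidates[next(_rr) % len(candidates)]

    # With idle disks the load simply alternates across all members:
    print(select_member_new({"local": 0, "served": 0}))   # -> local
    print(select_member_new({"local": 0, "served": 0}))   # -> served
    # A slower member builds a queue and stops being selected as often:
    print(select_member_new({"local": 1, "served": 4}))   # -> local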

-Mike
5268.3. by AMCFAC::RABAHY (dtn 471-5160, outside 1-810-347-5160) Wed Mar 26 1997 13:55, 18 lines
The underlying assumption is that the members of a shadow set perform nearly
identically.  If this is not true, then the load balancing can work against you.

Suppose one member can service 100 I/Os in 1 second while the other can only
service 50 (perhaps a firmware difference is the cause).  This translates into
service times of 10 ms and 20 ms respectively.  If a burst of 4 reads is
distributed evenly, then the total service time will be 40 ms.  If 3 reads had
gone to the fast member and the other one to the slow member, then the total
service time would have been only 30 ms.  The bigger the burst, the more it hurts.
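
A quick check of that arithmetic, using the hypothetical 10 ms / 20 ms service
times from above:

    def burst_completion_ms(assignment, service_ms):
        """assignment[i] = reads sent to member i;
        service_ms[i]  = per-read service time of member i (ms).
        The burst is done when the slowest member finishes its share."""
        return max(n * t for n, t in zip(assignment, service_ms))

    service = [10, 20]                           # fast member, slow member
    print(burst_completion_ms([2, 2], service))  # even split   -> 40 ms
    print(burst_completion_ms([3, 1], service))  # skewed split -> 30 ms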

A superior algorithm would dynamically partition the reads across the available
members in an effort to minimize service time.  Perhaps the best benefit would
come from partitioning based upon LBN.  For two equally capable members and a
random access pattern over the whole LBN range, this would send the low half of
the LBN range to one member and the high half to the other.  For a volume which
is only 50% populated, again with a random access pattern, one member would get
the low quarter of the LBN range and the other the second quarter.  For a volume
with two spikes in the LBN histogram the partition point would end up somewhere
in between.
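
A sketch of the LBN idea, with a made-up split point; nothing like this is in
the shipped driver:

    def select_member_by_lbn(lbn, split_lbn):
        """Route a read by its logical block number: everything below the
        split point goes to one member, everything above to the other.
        A real implementation would tune split_lbn dynamically from the
        observed LBN histogram."""
        return "member_A" if lbn < split_lbn else "member_B"

    # 1,000,000-block volume, only the low half populated: the split point
    # sits near block 250,000, so each member serves one quarter of the range.
    print(select_member_by_lbn(120_000, 250_000))   # -> member_A
    print(select_member_by_lbn(400_000, 250_000))   # -> member_B
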
5268.4. by AMCFAC::RABAHY (dtn 471-5160, outside 1-810-347-5160) Wed Mar 26 1997 14:27, 2 lines
Ultimately, past behavior may not be a good predictor of future activity.  So
an API enhancement to allow the application to suggest a member might be nice.
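
Purely as an illustration of that kind of hint (the parameter is hypothetical;
nothing like it exists in the real shadowing interface):

    def shadow_read(lbn, members, preferred=None):
        """members: dict of member name -> queue length.
        'preferred' is only advisory; fall back to queue-length selection
        when the hinted member is unknown or clearly the worse choice."""
        if preferred in members and members[preferred] <= min(members.values()) + 2:
            return preferred
        return min(members, key=members.get)

    print(shadow_read(0, {"local": 1, "served": 0}, preferred="local"))   # hint honoured
    print(shadow_read(0, {"local": 9, "served": 0}, preferred="local"))   # hint overridden
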
5268.5. "Some times simple is better" by VMSSPT::JENKINS (Kevin M Jenkins VMS Support Engineering) Thu Mar 27 1997 07:58, 14 lines
    
    This change was hashed about quite a bit before it was made.
    You can come up with all kinds of wonderful algorithms, and then
    you can also find a case that won't work well for them. It was
    decided to just simplify the whole thing. The original code cared only
    about local versus remote. It would use local over remote until the
    queue length got to 20. At that point it would use only the master.
    This would cause some undesirable effects under heavy loads.

    The new code simply uses queue depth. If a member is slower than
    another for any reason... it will maintain a higher queue than a faster
    member and hence not be selected as often.


5268.6. "VMS 7.1 feature?" by CHEFS::MCCAUGHAN_S (Shaun Mc Caughan 842-3515 @BSO) Tue Apr 29 1997 11:00, 7 lines
    Am I right in assuming that the latest shadow patches for V6.1 and 6.2 on
    Alpha or VAX would show the new behaviour with the read I/Os, and that this
    would be in VMS 7.1 as shipped?
    
    Regards
    
    Shaun
5268.7. by UTURBO::utoras-198-48-94.uto.dec.com::JurVanDerBurg (Change mode to Panic!) Tue May 06 1997 02:10, 6 lines
Re .-1

Yes.

Jur.