
Conference spezko::cluster

Title:+ OpenVMS Clusters - The best clusters in the world! +
Notice:This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator:PROXY::MOORE
Created:Fri Aug 26 1988
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:5320
Total number of notes:23384

5214.0. "Cluster / Disaster Tolerant questions" by COPCLU::BRIAN (Brian Krause @DMA, System Specialist) Thu Jan 23 1997 07:53

5214.1. "I was a BRS client!" by EVMS::PERCIVAL (OpenVMS Cluster Engineering) Thu Jan 23 1997 08:51 (56 lines)
5214.2 by EVMS::MORONEY (UHF Computers) Thu Jan 23 1997 11:43 (7 lines)
5214.3. "I'll get more info" by COPCLU::BRIAN (Brian Krause @DMA, System Specialist) Fri Jan 24 1997 04:53 (1 line)
    
5214.4. "Ooops" by COPCLU::BRIAN (Brian Krause @DMA, System Specialist) Fri Jan 24 1997 05:26 (56 lines)
    Oooopps - what happened here? I'll try again:

    Re .1:

    Hi Ian,

>   1.	Yes :-)  (sorry I couldn't resist!!!)

    Well - I asked for it ;-)

    2.

>   During backups, if opposite lobe access is required by applications to
>   a disk, you may well see a more significant degradation (we actually
>   measured this at 8ms per I/O - though many factors are involved in this
>   delay - it will be different for you!). If this is important, you could
>   add the third member as you suggest in your question #5.  You could
>   also manage your applications such that any given one only runs
>   primarily on one site.  Thus with duplicate site tape devices you will
>   always have a local disk available to that application.
    
    The application - or, to be honest, the applications - all use the same
    data, so there will be remote access during the backup period. But we could
    add an extra shadow set member if it turns out to be a problem.

    3.

>   The effect of the loading will depend on the number of writes, and the
>   current CPU utilisations.  Are your machines very heavily used, what are
>   your average INT STACK And Kernel mode utilisations? With an 80/20
>   utilisation you probably do not have a huge problem - but it all
>   depends on your I/O volume.

    I will try to get some more info and get back.

    Re .2:

    Mike - This doesn't sound very smart - or am I missing the point?

    The application and I/O load is pretty much symmetric between the two
    computer rooms. This means that an algorithm that favours local I/O
    would give almost no remote I/O.

    But if the algorithm goes for the shortest queue, the I/O would be
    split as roughly half local, half remote - based on the assumption that
    the queue length is about the same on both disk sets. The bottom line
    is then poorer I/O performance.

    This applies to reads, of course. For writes there would be no difference
    between the two algorithms, as the data always needs to be written to
    both disk sets. Did I understand that correctly?

    Any pointers to documents I should read?

    Best regards,
    Brian.
5214.5 by EVMS::MORONEY (UHF Computers) Mon Jan 27 1997 16:24 (55 lines)
re .4:

>    Mike - This doesn't sound very smart - or am I missing the point?
>
>    The application and I/O load is pretty much symmetric between the two
>    computer rooms. This means that an algorithm that favours local I/O
>    would give almost no remote I/O.

The new algorithm should be faster overall than the old one as implemented.  It
will favor the faster device over the slower based on the queue length, so it
will tend to select the faster device no matter what the reason for the slower
device being slower. 

>    But if the algorithm goes for the shortest queue, the I/O would be
>    split as roughly half local, half remote - based on the assumption that
>    the queue length is about the same on both disk sets. The bottom line
>    is then poorer I/O performance.

If the queues for devices that are responding at different speeds are equal,
the load is being split in proportion to their speeds.  The fast device is much
"better" at reducing the length of its queue, so it takes more I/Os to keep its
queue the same length as a slower device's.

Consider this thought experiment:  You have two stacks of coins.  Every second
you remove two coins from one stack and only one coin from the other, if the
stacks have coins to remove.  The first stack represents a device that can do
2 I/Os per second; the second can only do one I/O per second.

You also add coins to the stacks all the time. Sometimes you add one coin to
whichever stack is the shortest. This represents a read I/O.  Sometimes you add
one coin to each stack.  This is a write I/O. For a read I/O keep track of how
often you put a coin on the first stack and how often on the second stack.  If
both stacks are the same size, select one at random when doing a read I/O. 

I'll tell you what you'll see.  If the stacks are nearly always empty, you'll
be splitting the coins 50-50.  This is a nearly idle system.  If you are adding
coins fast enough that the stacks never drain to 0, and you're always doing
read I/Os, you'll find yourself adding coins to the first pile twice as often
as to the second "slower" pile.  If you are adding one coin to both piles
fairly often (a write I/O), the bias in favor of the faster disk for reads
becomes even greater.
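
Purely as an illustration (this is not shadowing driver code), a small Python
simulation of that experiment - member 0 drains two coins per tick, member 1
drains one, reads go on the shorter pile, writes go on both:

import random

def simulate(reads_per_tick, writes_per_tick, ticks=100_000):
    q = [50, 50]            # outstanding I/Os ("coins") on each member
    service = [2, 1]        # completions per tick: fast member, slow member
    reads_to = [0, 0]       # where the reads were placed
    for _ in range(ticks):
        for _ in range(writes_per_tick):     # a write adds one coin to each pile
            q[0] += 1
            q[1] += 1
        for _ in range(reads_per_tick):      # a read goes on the shorter pile
            if q[0] == q[1]:
                target = random.randrange(2)
            else:
                target = 0 if q[0] < q[1] else 1
            q[target] += 1
            reads_to[target] += 1
        for m in (0, 1):                     # each member completes its I/Os
            q[m] = max(0, q[m] - service[m])
    return reads_to

print(simulate(reads_per_tick=3, writes_per_tick=0))   # roughly a 2:1 split
print(simulate(reads_per_tick=1, writes_per_tick=1))   # reads go almost only to the fast member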

If the system interlink is a bottleneck, the remote device will not shorten
its queue as quickly as the local device.  The local device will get more
of the reads.  If the system interlink is not a bottleneck, so that the
local and remote devices respond at nearly equal speeds (say, two Alphas
connected via Memory Channel), there is no reason to prefer the local device.

For your application (load and shadowsets split evenly between two sites)
and 100% reads, you'll see some increase in "unnecessary" cross traffic, but
it will be self-throttling.  With more writes (which must be cross-traffic
anyways) the reads will tend to stay local.

-Mike
5214.6. "OK - we talk coins ;-)" by COPCLU::BRIAN (Brian Krause @DMA, System Specialist) Tue Jan 28 1997 03:28 (30 lines)
    Re .5
    
    Well, piles of coins or I/O queues - I don't care.

    My point is: What one machine regards as remote is local to the other.
    As the machines at both sites make about the same number of I/Os, the
    queues will both be about the same length. And then the I/Os will be
    split equally - you say?

    Or - if you want to put it in coins ;-)

    I have a pile of coins, my pal has a pile of coins. When I take a coin,
    I always take a coin from the smallest pile (very unsocial - isn't it).
    It takes a little longer to grab a coin from my pal's pile, as I have to
    reach for it. But I don't care - I ALWAYS take a coin from the smallest
    pile. This way I get more coins, I think. If the piles are of equal size
    I toss a coin ;-) and choose either pile.

    My pal thinks the same way. This way we grab half from our own piles,
    half from each other's.

    Someone is nice enough to keep adding coins to both our piles - the same
    number to me and to my pal. Well, now the point: when we grab half the
    coins from each other's piles, and this takes longer than if we took all
    the coins from our own piles, then we both get fewer coins, and the
    piles of coins get bigger and bigger.

    Or if you will - the disk queues grow, and we make fewer I/Os.

    Am I wrong?
5214.7 by EVMS::MORONEY Tue Jan 28 1997 14:46 (27 lines)
re .6:

>    My point is: What one machine regards as remote is local to the other.
>    As the machines at both sites make about the same number of I/Os, the
>    queues will both be about the same length. And then the I/Os will be
>    split equally - you say?

You must remember that there are 4 'stacks of coins', not 2.  One pair is
System 1's picture of the two members and its (System 1's) I/Os, the other is
System 2's picture of the two members and its (System 2's) I/Os.  As one of
each pair is remote and the other local it is NOT accurate to say they will do
the same number of I/Os.  The underlying disks will (assuming identical setups)
but the I/O to the underlying disks are the TOTAL of that from both systems. 
One system doesn't get to see the size of the other's queue.

Also, I think you misunderstand the picture.  Coins "taken" from the stack
represent completed I/Os, and MSCP-served I/Os don't get lower priority to
complete, as you seem to imply.  There will just be a transfer delay.

>    I have a pile of coins, my pal has a pile of coins. When I take a coin,
>    I always take a coin from the smallest pile (very unsocial - isn't it).

I don't understand what the "take a coin from the smallest pile" is supposed
to represent.  Shadowing _adds_ coins to the smallest pile (the disk with the
shortest queue).

-Mike
5214.8. "queue length is less than optimal" by AMCFAC::RABAHY (dtn 471-5160, outside 1-810-347-5160) Tue Jan 28 1997 15:28 (47 lines)
Each node has a separate queue for each member of a shadow set.

Suppose $1$DUA100: is at site 1 and $2$DUA100: is at site 2.  They are combined
into a shadow set called DSA100:.  An application running on a node at site 1
has a local (i.e. direct) path to $1$DUA100: and a remote (i.e. MSCP-served) path
to $2$DUA100:.  When the application issues a read to DSA100:, the DS driver
simply forwards the read to one of the members of the shadow set.  Which member
is chosen depends on the queue length.  If the queues are the same length then
it is random.  Each node maintains a node-private queue to each member of the
shadow set.  The queues on one node do not affect the queues on another node.
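
To make that picture concrete, here is a minimal Python sketch of the per-node
view (it is not the real DS driver; the member names are just the
$1$DUA100:/$2$DUA100: examples above, and the class and method names are made
up for illustration):

import random

class ShadowSetView:
    """One node's private picture of a two-member shadow set (e.g. DSA100:)."""
    def __init__(self, members=("$1$DUA100:", "$2$DUA100:")):
        self.members = members
        self.queue = {m: 0 for m in members}   # this node's outstanding I/Os only

    def issue_read(self):
        # Forward the read to the member with the shorter node-private queue,
        # picking at random on a tie.
        a, b = self.members
        if self.queue[a] == self.queue[b]:
            target = random.choice(self.members)
        elif self.queue[a] < self.queue[b]:
            target = a
        else:
            target = b
        self.queue[target] += 1
        return target

    def issue_write(self):
        # A write must go to every member.
        for m in self.members:
            self.queue[m] += 1

    def complete(self, member):
        # Called when an I/O to 'member' finishes (local or MSCP-served).
        self.queue[member] -= 1

# Two nodes keep two independent views; site 1's choices never see site 2's queues.
site1 = ShadowSetView()
site2 = ShadowSetView()
print(site1.issue_read(), site2.issue_read())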

The coin thought experiment was twisted between .5 and .6 -- in .5 the taker of
coins is the device driver and the source of coins is the application; in .6
there apparently is some sort of competition to get the most coins.

The object is to get the most work done as quickly as possible.  There is a
tradeoff between bandwidth and latency to be considered.

I think where .6 went wrong is that you are supposed to envision that you have
two piles and your friend has two piles.  Putting a coin onto one of your piles
represents issuing a read; putting one coin on each of your piles represents
issuing a write.  You never touch your friend's piles.  Meanwhile, someone is
taking coins from one of your piles and from the corresponding one of your
friend's piles.

Suppose one member of the shadow set can do 1,000 reads per second (good cache);
if the response time is uniform, then the latency is 1 ms.  Further suppose the
other member of the shadow set can only do 1 read per second; again with a
uniform response time the latency is 1 s.  Now, the first read finds both queues
empty and so is forwarded randomly -- half the time going to the fast disk and
the other half to the slow disk.  This is less than optimal, but who the heck
cares?  Only one bloody read has been issued, and even if it takes 1 s to do,
there's little harm done.  Now, a second read comes later, enough later that the
first read is long gone.  The same logic applies here.  Only when a burst of
reads is issued within a short enough window of time do the queues begin to build.

Since the fast member can get 1,000 reads done in 1 second, only if an
instantaneous load of 1,001 reads is presented should a read go to the slow member.
If 500 reads are presented at t0 and 501 more are presented at t0+.5s, then the
best performance is achieved by sending one of the original reads to the slow
member, but it is kinda hard to know that at time t0.
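
A rough Python illustration of the point in the title - for an instantaneous
burst against members this far apart, balancing queue lengths is less than
optimal.  The 1 ms / 1000 ms figures are the hypothetical ones above; this is
not driver code:

def burst_completion_ms(n_reads, service_ms=(1, 1000)):
    # Queue-length rule on an instantaneous burst: the counts stay balanced,
    # so roughly half the reads land on each member.
    half = n_reads // 2
    queue_rule = max(half * service_ms[0], (n_reads - half) * service_ms[1])
    # Best possible split: try every division of the burst and keep the cheapest.
    best = min(max((n_reads - k) * service_ms[0], k * service_ms[1])
               for k in range(n_reads + 1))
    return queue_rule, best

for n in (2, 100, 1001, 10_000):
    print(n, burst_completion_ms(n))   # queue-length rule vs. best split, in ms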

Btw, a synchronous application is clearly hurt by this algorithm.  Presumably,
either the members are not so far out of balance, modern applications use
asynchronous I/O, middleware intervenes to transform synchronous I/O into
asynchronous I/O, or multiple applications will be executing concurrently,
minimizing the effect by getting over the threshold where a goodly proportion
of reads will find their way to the fast member.
5214.9 by AMCFAC::RABAHY (dtn 471-5160, outside 1-810-347-5160) Tue Jan 28 1997 15:58 (7 lines)
I doubt it is possible for the DS driver to know how quickly each member will
perform a particular read.  Some heuristic is required.  Queue length might be
as good as it gets.

Perhaps reads should be partitioned based on LBN?  Send all those in the lower
range to one member and those in the higher range to the other in an effort to
minimize seek time?  Naturally writes break the symmetry.
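
A toy Python sketch of that idea, purely hypothetical - this is not how the
shadowing driver chooses a member, and the boundary LBN is made up:

LBN_BOUNDARY = 2_000_000          # assumed midpoint of the member devices

def pick_read_member(lbn, members=("$1$DUA100:", "$2$DUA100:")):
    # Reads below the boundary go to one member, the rest to the other,
    # so each member's heads stay in its own half of the disk.
    return members[0] if lbn < LBN_BOUNDARY else members[1]

def pick_write_members(members=("$1$DUA100:", "$2$DUA100:")):
    # Every write still has to go to every member.
    return list(members)

print(pick_read_member(123_456))      # low LBN -> first member
print(pick_read_member(3_000_000))    # high LBN -> second member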
5214.10 by EVMS::MORONEY Tue Jan 28 1997 16:15 (11 lines)
re .9:

No, there really isn't much other info available for determining which member
will give the fastest response time.  Besides, shadowing's purpose in life is
data availability and reliability; any speedup is a "freebie" bonus, not a goal
of the driver.

Partitioning by LBN wouldn't work that well anyway; consider the situation
of a heavily used database file that resided on the first 1/3 of a disk drive,
with the rest of the drive mostly unused.

-Mike
5214.11 by AMCFAC::RABAHY (dtn 471-5160, outside 1-810-347-5160) Tue Jan 28 1997 16:28 (7 lines)
Actually, my long-standing point of view is that shadowing should not give any
performance benefit.  Too many times I've had customers become dependent upon
the level of performance they get while the shadow set is operating normally,
and then be severely disappointed by the degraded performance during a failure
and the even worse performance during the recovery.  They want shadowing to give
the same level of performance at all times, even during a recovery.  Instead
they end up having to carefully monitor utilization and reconfigure artificially
early.
5214.12. "Isn't there another effect?" by WIBBIN::NOYCE (Pulling weeds, pickin' stones) Tue Jan 28 1997 17:51 (5 lines)
When the remote system sends me a read or write, and it
gets executed by my MSCP server on my local disk, does
it get inserted into my local queue for that disk?  Wouldn't
that affect the load-balancing measurement?  Or do these
forwarded I/O's bypass that queue somehow?
5214.13 by UTRTSC::thecat.uto.dec.com::JurVanDerBurg (Change mode to Panic!) Wed Jan 29 1997 02:08 (9 lines)
>When the remote system sends me a read or write, and it
>gets executed by my MSCP server on my local disk, does
>it get inserted into my local queue for that disk?

The MSCP server just talks to the local disk driver, and any I/O request
is simply inserted into the local queue.

Jur.

5214.14 by AMCFAC::RABAHY (dtn 471-5160, outside 1-810-347-5160) Wed Jan 29 1997 13:53 (15 lines)
re .11:

So, sell 'em a third member to add to the shadow set.  Then when there is a
failure the performance will degrade only to the level of two members.  I
suppose they might still be disappointed after being spoiled by the performance
of three members.

It would be nice to have a feature that let us deliberately throttle performance
to avoid degradation during failure/recovery.

re .10:

Hmm, let the boundary float dynamically, attempting to balance the load.  How
much write activity can be tolerated before an LBN-based partition becomes
unacceptable?