
Conference vaxaxp::vmsnotes

Title:VAX and Alpha VMS
Notice:This is a new VMSnotes, please read note 2.1
Moderator:VAXAXP::BERNARDO
Created:Wed Jan 22 1997
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:703
Total number of notes:3722

177.0. "DEADLOCKWAIT Granularity Too Coarse" by WHOS01::BOWERS (Dave Bowers, NSIS/IM) Tue Feb 11 1997 15:37

    The customer is running a high-volume, high-concurrency OLTP system and
    is being plagued by Rdb deadlocks. The DBA feels that if he could set
    DEADLOCKWAIT to a subsecond value (the system is a TurboLaser) overall
    throughput could be improved significantly. He feels that this parameter
    (which is, of course, in whole seconds) is no longer scaled correctly
    for newer, faster processors.
    
    Is anyone considering changing the units for this parameter in future
    VMS versions?
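
    (For reference, the current value can also be read programmatically.
    A minimal DEC C sketch, untested, relying on the documented $GETSYI
    convention that a system parameter name is also a valid SYI$_ item
    code:)

        #include <stdio.h>
        #include <ssdef.h>
        #include <syidef.h>
        #include <starlet.h>

        /* Item-list-3 entry as expected by $GETSYI. */
        struct item_list {
            unsigned short buflen, itmcod;
            void *bufadr;
            unsigned short *retlen;
        };

        int main(void)
        {
            int dlckwait = 0;            /* parameter value, whole seconds */
            unsigned short retlen = 0;
            struct item_list items[2];
            int st;

            items[0].buflen = sizeof(dlckwait);
            items[0].itmcod = SYI$_DEADLOCK_WAIT;  /* parameter as item code */
            items[0].bufadr = &dlckwait;
            items[0].retlen = &retlen;
            items[1].buflen = 0;  items[1].itmcod = 0;    /* terminator */
            items[1].bufadr = 0;  items[1].retlen = 0;

            st = sys$getsyiw(0, 0, 0, items, 0, 0, 0);
            if (st & 1)
                printf("DEADLOCK_WAIT = %d second(s)\n", dlckwait);
            return st;
        }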
    
    \dave
177.1. by AUSS::GARSON (DECcharity Program Office) Wed Feb 12 1997 21:06
re .0
    
>    Is anyone considering changing the units for this parameter in future
>    VMS versions?
    
    future => product manager
    
    I wouldn't have thought the implied change was all that desirable.
    Surely the customer can fix the application not to deadlock so much.
    
    You might want to work out how long it takes to complete a deadlock
    search.
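
    A crude way to bound that from user mode: deliberately self-deadlock
    and time how long SS$_DEADLOCK takes to come back. Untested sketch
    (the resource name is arbitrary); the interval should be roughly
    DEADLOCK_WAIT plus the search itself.

        #include <stdio.h>
        #include <time.h>
        #include <descrip.h>
        #include <lckdef.h>
        #include <ssdef.h>
        #include <starlet.h>

        /* Lock status block: status word, reserved, lock id, value block. */
        struct lksb {
            unsigned short status, resvd;
            unsigned int lockid;
            char valblk[16];
        };

        int main(void)
        {
            $DESCRIPTOR(resnam, "DLCK_PROBE");  /* arbitrary resource name */
            struct lksb lksb1, lksb2;
            time_t t0, t1;
            int st;

            /* First EX lock on the resource is granted immediately. */
            st = sys$enqw(0, LCK$K_EXMODE, &lksb1, 0, &resnam,
                          0, 0, 0, 0, 0, 0, 0);
            if (!(st & 1)) return st;

            /* A second EX request on the same resource from the same
               process waits behind our own lock; the next deadlock
               search after DEADLOCK_WAIT expires should flag it. */
            t0 = time(0);
            st = sys$enqw(0, LCK$K_EXMODE, &lksb2, 0, &resnam,
                          0, 0, 0, 0, 0, 0, 0);
            t1 = time(0);

            printf("completion status %04x after ~%ld sec\n",
                   lksb2.status, (long)(t1 - t0)); /* expect SS$_DEADLOCK */
            return SS$_NORMAL;
        }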
177.2. by WHOS01::BOWERS (Dave Bowers, NSIS/IM) Thu Feb 13 1997 09:46
    future => product manager => name or e-mail addr?
    
    The main problem is that the code is generated by TI's IEF case tool
    and we really can't control it in any meaningful way. We know it's
    lousy code in many ways, but we're stuck with it.
    
    The overall system design (the other half of the problem) is likewise
    cast in concrete.
    
    The argument being made by the DBA is essentially that the 1 second
    granularity was imposed when the box ran at .1% of the speed of the
    current generation. On a 780, a second was a fairly brief interval. On
    an 8400, it's forever.
    
    \dave
177.3. by AUSS::GARSON (DECcharity Program Office) Thu Feb 13 1997 20:28
re .2
    
    future => product manager => name or e-mail => note 7.2 in this conference
    
    It sounds as if an IPMT should be raised for the customer.
    
>    The argument being made by the DBA is essentially that the 1 second
>    granularity was imposed when the box ran at .1% of the speed of the
>    current generation. On a 780, a second was a fairly brief interval. On
>    an 8400, it's forever.
    
    The goal of DEADLOCK_WAIT was to define a time limit beyond which a
    queued lock conversion or request was deemed likely to indicate
    deadlock rather than simply that a process holding the lock hadn't
    finished with it, i.e. to control the amount of CPU spent on *wasted*
    deadlock searches.
    
    It is therefore important to identify the limiting factor on
    legitimate lock hold time. If there is *no* user interaction while
    locks are held then CPU is one factor, but I/O may be another. Note
    that this includes cluster communications I/O as well as disk I/O. All
    of these have got faster, but CPU has increased by the greatest factor.
    [If locks are held across user interaction then all bets are off.]
    
    At the same time, since CPUs have got faster, one can afford to perform
    more deadlock searches per unit time at the same CPU cost, whether or
    not that time turns out to be wasted. On the other hand, lock
    populations have become larger, which offsets this effect somewhat.
    
    So while sub-second DEADLOCK_WAIT may be justified, perhaps not to the
    extent implied by the DBA. (I agree that 10 seconds default and 1
    second minimum looks pretty conservative for a system doing perhaps a
    thousand database transactions per second.)
    
    At the very least they should confirm that when deadlock is not
    occurring the locks are not held for a period of time exceeding what
    they would propose for DEADLOCK_WAIT.
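
    If their database software can't report lock wait times directly, a
    wrapper along these lines could log any wait exceeding the proposed
    value (hypothetical sketch; Alpha DEC C assumed for the 64-bit time
    arithmetic, and the helper name is made up):

        #include <stdio.h>
        #include <descrip.h>
        #include <ssdef.h>
        #include <starlet.h>

        struct lksb {
            unsigned short status, resvd;
            unsigned int lockid;
            char valblk[16];
        };

        /* Wrapper around $ENQW that reports lock waits longer than a
           caller-supplied threshold, using $GETTIM's 100ns units. */
        int enq_timed(struct dsc$descriptor_s *resnam, unsigned int mode,
                      struct lksb *lksb, double limit_sec)
        {
            unsigned __int64 t0, t1;
            int st;

            sys$gettim((void *)&t0);
            st = sys$enqw(0, mode, lksb, 0, resnam, 0, 0, 0, 0, 0, 0, 0);
            sys$gettim((void *)&t1);

            if ((t1 - t0) / 1e7 > limit_sec)
                fprintf(stderr, "lock wait %.2f sec exceeds %.2f sec\n",
                        (t1 - t0) / 1e7, limit_sec);
            return st;
        }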
177.4. by EEMELI::MOSER (Orienteers do it in the bush...) Wed Feb 19 1997 14:52
    whoa, I think this customer wants some more CPU cycle competition!
    
    Why does he want to 'decrease' DEADLOCK_WAIT? In order to be notified
    of deadlocks faster, i.e. so that a process issuing a $ENQW gets a
    SS$_DEADLOCK back more quickly?
    
    or
    
    does he just want to burn some more CPU cycles? You will trigger many,
    many more deadlock searches with a low DEADLOCK_WAIT value. And those
    are very expensive, run at high IPL, and no locks can be granted to
    anybody while a search is in progress, etc.
    
    
    Bottom line: I don't see any valid point in lowering DEADLOCK_WAIT
    below 1 sec. I occasionally lower it from the default of 10 sec down
    to maybe 3 or 5 sec and then watch the timeout queue to see on which
    resources I might have contention...
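
    And remember the victim doesn't get its SS$_DEADLOCK for free: it
    still has to drop everything and redo the work. The usual pattern is
    something like this skeleton (the helpers are made-up stand-ins for
    whatever the application's TP layer really does):

        #include <ssdef.h>

        extern int  acquire_locks(void);     /* issues the $ENQWs */
        extern void release_all_locks(void); /* $DEQs everything held */
        extern void do_transaction(void);

        int run_one_transaction(void)
        {
            int st;

            for (;;) {
                st = acquire_locks();
                if (st == SS$_DEADLOCK) {
                    /* chosen as the victim: back out and retry; all the
                       lock acquisition work is repeated, which is why a
                       lower DEADLOCK_WAIT alone buys little */
                    release_all_locks();
                    continue;
                }
                if (!(st & 1)) return st;    /* some other failure */
                do_transaction();
                return SS$_NORMAL;
            }
        }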
    
    /cmos
177.5. by AUSS::GARSON (DECcharity Program Office) Wed Feb 19 1997 17:39
re .4
    
>    does he just want to burn some more CPU cycles? You will trigger many,
>    many more deadlock searches with a low DEADLOCK_WAIT value.
    
    Not unless there are locks timing out. And on the evidence submitted by
    the customer, while there would be more deadlock searches, they would
    not be wasted.
177.6. by EEMELI::MOSER (Orienteers do it in the bush...) Thu Feb 20 1997 01:33
    re: .5
    
    I'm still not convinced. Let's say your system is busy and you have
    contention for a certain resource. On average a lock request has to
    wait 0.7 sec for the lock to be granted. With a DEADLOCK_WAIT of 1 sec
    this means that you normally wouldn't trigger a deadlock search for
    this lock request, because it's removed from the timeout queue before
    the next round of checks.
    
    If DEADLOCK_WAIT were 0.5 sec you would stumble over this lock and
    trigger a search, which comes back and says "no deadlock" and takes the
    lock off the queue; a fraction of a second later the entry would have
    left the queue anyway, because in the meantime the lock has been granted.
    
    I call it waste when you have lots of deadlock searches but no
    deadlocks found.
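
    MONITOR DLOCK shows exactly this: deadlock search rate and deadlock
    find rate side by side. If lowering DEADLOCK_WAIT makes the search
    rate climb while the find rate stays flat, the extra searches are
    pure overhead.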
    
    /cmos
177.7. by AUSS::GARSON (DECcharity Program Office) Thu Feb 20 1997 16:48
re .6

>    I'm still not convinced. Let's say your system is busy and you have
>    contention for a certain resource. On average a lock request has to
>    wait 0.7 sec for the lock to be granted.

>   If DEADLOCK_WAIT were 0.5 sec you would stumble over this lock and
>   trigger a search, which comes back and says "no deadlock" and takes the

    As I wrote in a prior reply...

"At the very least they should confirm that when deadlock is not occurring the
locks are not held for a period of time exceeding what they would propose for
DEADLOCK_WAIT."

    How realistic is 0.7 sec for an average time that a lock is held? A
    system that is supposed to complete 200 transactions per second or
    even 100 transactions per second is going to struggle if that is the
    correct average.

    Here, by transaction I mean from the start of a database transaction to
    the commit or rollback that drops all the locks. This may be less than
    a business transaction but is the relevant definition in this case.

    One needs to take into account parallelism in a system so that 100
    transactions per second does not mean each transaction lasts 0.01
    second on average. The bottom line is that the customer should measure
    the average transaction duration (or their database software can tell
them) before making a case for sub-second DEADLOCK_WAIT, but my gut feeling
is that their claim may have some merit.
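
    (To make the arithmetic concrete: by Little's law the number of
    transactions in flight equals throughput times average duration, so
    at 100 transactions per second with, say, 70 in progress at any
    instant, the average duration is 70/100 = 0.7 sec. It is the
    concurrency, not just the throughput, that needs to be measured
    against any proposed DEADLOCK_WAIT.)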
177.8. by EEMELI::MOSER (Orienteers do it in the bush...) Fri Feb 21 1997 02:37
    the problem is not necessarily the hold/wait time of 'nicely behaving'
    locks, but all the others.
    
    For example if I want to see which directory files are busy, I just
    lower DEADLOCK_WAIT and suddenly lots of F11B$ locks turn up, because
    too many processes try to create/delete files in large directories,
    and I can bet lots of money that those .DIR files are larger than
    127 blocks.
    
    So any of your transaction locks that behave well do not cause any
    problems, but those others can trigger deadlock searches like hell,
    and those searches will also hurt your transaction lock requests.
    
    /cmos
177.9. "More info" by WHOS01::BOWERS (Dave Bowers, NSIS/IM) Fri Feb 21 1997 11:25
    I went back to the DBA for more info. It appears that most of the
    deadlocks are Rdb page locks. The real villain here is the application,
    which was written using TI's IEF product (James Martin methodology):
    
    1.  IEF is remarkably naive regarding transaction control (like it uses
    default transactions).
    
    2.  The design of the application has a primary process which writes
    rows and then passes a key to a secondary process which further
    processes and updates the row. This of course creates instant "hot
    spots" on both data and index pages as both processes contend foir
    access to the same page.
    
    Neither of the above problems is amenable to a direct fix, so we're
    looking to "tune" the system so as to minimize the mess. The good news,
    if any, is that there is only this one (albeit large) application
    running on the system.
    
    \dave
177.10. by EEMELI::MOSER (Orienteers do it in the bush...) Fri Feb 21 1997 13:19
    Do you know which Rdb pages? Always the same? Then it is pretty
    likely to be an application issue.
    
    If you're interested I have a tool which monitors the lock timeout
    queue and logs information for locks hanging around there too long,
    especially who they are blocking and who is waiting.
    
    For Rdb locks it will translate them into something an Rdb expert
    understands; grab and have a look at TUIJA""::LOCK033.A
    (works on VAX and Alpha and understands almost all Rdb lock types)
    
    /cmos
177.11. "You must fix those deadlocks, not hide them" by HERON::GODFRIND (Oracle Rdb Engineering) Fri Feb 28 1997 05:12
If I may join the conversation ...

The real problem that needs fixing is the excessive number of deadlocks that
happen. Lowering DEADLOCK_WAIT would of course make it possible for VMS to
notice those deadlocks faster, but that is at the expense of potentially severe
side effects (as Christian pointed out) in terms of additional CPU usage at
high IPL.

In addition, when an Rdb process receives a deadlock error against one of its
pending page lock requests, it will demote all other page locks it may have,
which may involve writing pages back to disk that would otherwise have been
written lazily at a later stage. Also, there is a very good chance that the
same process will need to reacquire some of those locks right after giving
them up.

So, although page deadlocks in Rdb cause no application failure, and even
though adjusting DEADLOCK_WAIT to a lower value may seem to make the
application more responsive, they are still bad and should be kept to a
minimum.

>           <<< Note 177.9 by WHOS01::BOWERS "Dave Bowers, NSIS/IM" >>>
>    I went back to the DBA for more info. It appears that most of the
>    deadlocks are Rdb page locks. The real villain here is the application,
>    which was written using TI's IEF product (James Martin methodology):
    
I myself have done extensive tuning of databases used by applications built
using IEF (although those applications did not require that level of
performance - more like the 10 to 15 TPS range).

>    1.  IEF is remarkably naive regarding transaction control (like it uses
>    default transactions).

That is not exactly true. Some level of control is available (via logical
names) over isolation levels and transaction mode (read only vs read write)
for the readers.

>    2.  The design of the application has a primary process which writes
>    rows and then passes a key to a secondary process which further
>    processes and updates the row. This of course creates instant "hot
>    spots" on both data and index pages as both processes contend foir
>    access to the same page.
    
>    Neither of the above problems is amenable to a direct fix, so we're
>    looking to "tune" the system so as to minimize the mess. The good news,
>    if any, is that there is only this one (albeit large) application
>    running on the system.

You may consider using alternate indexing techniques (such as hashing) to
better distribute records and index nodes and possibly avoid the contention.
Another point to consider is giving up fast commit (or setting up the primary
process so that it checkpoints after each transaction). That way it will give
up its page locks after each transaction and let the secondary process get at
the pages without lock conflicts.

Also, adapting index node sizes should have an effect on contention (smaller
nodes may help).

All this is of course highly speculative. Accurate recommendations would
require a detailed look at the application and database design. 

I recommend you get assistance from some experienced Rdb consultant. There are
quite a few available from Oracle and external sources.

Where is the customer located, BTW? Just send me mail offline and I may be
able to locate names for you ...

/albert

--
Albert Godfrind                    Oracle Rdb Engineering
Oracle Corporation                 Email:  [email protected]
DEC European Technical Center              [email protected]
950 Route des Colles               Phone:  +33/4/92.95.51.63
06901 Sophia-Antipolis             Mobile: +33/6/09.97.27.23
France                             FAX:    +33/4/92.95.50.50