
Conference vaxaxp::vmsnotes

Title:VAX and Alpha VMS
Notice:This is a new VMSnotes, please read note 2.1
Moderator:VAXAXP::BERNARDO
Created:Wed Jan 22 1997
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:703
Total number of notes:3722

177.0. "DEADLOCKWAIT Granularity Too Coarse" by WHOS01::BOWERS (Dave Bowers, NSIS/IM) Tue Feb 11 1997 15:37

    The customer is running a high-volume, high-concurrency OLTP system and
    is being plagued by Rdb deadlocks. The DBA feels that if he could set
    DEADLOCKWAIT to a subsecond value (the system is a TurboLaser) overall
    throughput could be improved significantly. He feels that this parameter
    (which is, of course, in whole seconds) is no longer scaled correctly
    for newer, faster processors.
    
    Is anyone considering changing the units for this parameter in future
    VMS versions?
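
    (For reference, the current value can also be read programmatically.
    A minimal DEC C sketch, untested, relying on the documented $GETSYI
    convention that a system parameter name is also a valid SYI$_ item
    code:)

        #include <stdio.h>
        #include <ssdef.h>
        #include <syidef.h>
        #include <starlet.h>

        /* Item-list-3 entry as expected by $GETSYI. */
        struct item_list {
            unsigned short buflen, itmcod;
            void *bufadr;
            unsigned short *retlen;
        };

        int main(void)
        {
            int dlckwait = 0;            /* parameter value, whole seconds */
            unsigned short retlen = 0;
            struct item_list items[2];
            int st;

            items[0].buflen = sizeof(dlckwait);
            items[0].itmcod = SYI$_DEADLOCK_WAIT;  /* parameter as item code */
            items[0].bufadr = &dlckwait;
            items[0].retlen = &retlen;
            items[1].buflen = 0;  items[1].itmcod = 0;    /* terminator */
            items[1].bufadr = 0;  items[1].retlen = 0;

            st = sys$getsyiw(0, 0, 0, items, 0, 0, 0);
            if (st & 1)
                printf("DEADLOCK_WAIT = %d second(s)\n", dlckwait);
            return st;
        }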
    
    \dave
177.1. by AUSS::GARSON (DECcharity Program Office) Wed Feb 12 1997 21:06
re .0
    
>    Is anyone considering changing the units for this parameter in future
>    VMS versions?
    
    future => product manager
    
    I wouldn't have thought the implied change was all that desirable.
    Surely the customer can fix the application not to deadlock so much.
    
    You might want to work out how long it takes to complete a deadlock
    search.
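
    A crude way to bound that from user mode: deliberately self-deadlock
    and time how long SS$_DEADLOCK takes to come back. Untested sketch
    (the resource name is arbitrary); the interval should be roughly
    DEADLOCK_WAIT plus the search itself.

        #include <stdio.h>
        #include <time.h>
        #include <descrip.h>
        #include <lckdef.h>
        #include <ssdef.h>
        #include <starlet.h>

        /* Lock status block: status word, reserved, lock id, value block. */
        struct lksb {
            unsigned short status, resvd;
            unsigned int lockid;
            char valblk[16];
        };

        int main(void)
        {
            $DESCRIPTOR(resnam, "DLCK_PROBE");  /* arbitrary resource name */
            struct lksb lksb1, lksb2;
            time_t t0, t1;
            int st;

            /* First EX lock on the resource is granted immediately. */
            st = sys$enqw(0, LCK$K_EXMODE, &lksb1, 0, &resnam,
                          0, 0, 0, 0, 0, 0, 0);
            if (!(st & 1)) return st;

            /* A second EX request on the same resource from the same
               process waits behind our own lock; the next deadlock
               search after DEADLOCK_WAIT expires should flag it. */
            t0 = time(0);
            st = sys$enqw(0, LCK$K_EXMODE, &lksb2, 0, &resnam,
                          0, 0, 0, 0, 0, 0, 0);
            t1 = time(0);

            printf("completion status %04x after ~%ld sec\n",
                   lksb2.status, (long)(t1 - t0)); /* expect SS$_DEADLOCK */
            return SS$_NORMAL;
        }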
177.2. by WHOS01::BOWERS (Dave Bowers, NSIS/IM) Thu Feb 13 1997 09:46
    future => product manager => name or e-mail addr?
    
    The main problem is that the code is generated by TI's IEF case tool
    and we really can't control it in any meaningful way. We know it's
    lousy code in many ways, but we're stuck with it.
    
    The overall system design (the other half of the problem) is likewise
    cast in concrete.
    
    The argument being made by the DBA is essentially that the 1 second
    granularity was imposed when the box ran at .1% of the speed of the
    current generation. On a 780, a second was a fairly brief interval. On
    an 8400, it's forever.
    
    \dave
177.3. by AUSS::GARSON (DECcharity Program Office) Thu Feb 13 1997 20:28
re .2
    
    future => product manager => name or e-mail => note 7.2 in this conference
    
    It sounds as if an IPMT should be raised for the customer.
    
>    The argument being made by the DBA is essentially that the 1 second
>    granularity was imposed when the box ran at .1% of the speed of the
>    current generation. On a 780, a second was a fairly brief interval. On
>    an 8400, it's forever.
    
    The goal of DEADLOCK_WAIT was to define a time limit beyond which a
    queued lock conversion or request was deemed likely to indicate
    deadlock rather than simply that a process holding the lock hadn't
    finished with it, i.e. to control the amount of CPU spent on *wasted*
    deadlock searches.
    
    It is therefore important to identify the limiting factor on
    legitimate lock hold time. If there is *no* user interaction while
    locks are held then CPU is one factor, but I/O may be another. Note
    that this includes cluster communications I/O as well as disk I/O. All
    of these have got faster, but CPU has increased by the greatest factor.
    [If locks are held across user interaction then all bets are off.]
    
    At the same time, since CPUs have got faster, one can afford to perform
    more deadlock searches per unit time at the same CPU cost, whether or
    not that time turns out to be wasted. On the other hand, lock
    populations have become larger, which offsets this effect somewhat.
    
    So while sub-second DEADLOCK_WAIT may be justified, perhaps not to the
    extent implied by the DBA. (I agree that 10 seconds default and 1
    second minimum looks pretty conservative for a system doing perhaps a
    thousand database transactions per second.)
    
    At the very least they should confirm that when deadlock is not
    occurring the locks are not held for a period of time exceeding what
    they would propose for DEADLOCK_WAIT.
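
    If their database software can't report lock wait times directly, a
    wrapper along these lines could log any wait exceeding the proposed
    value (hypothetical sketch; Alpha DEC C assumed for the 64-bit time
    arithmetic, and the helper name is made up):

        #include <stdio.h>
        #include <descrip.h>
        #include <ssdef.h>
        #include <starlet.h>

        struct lksb {
            unsigned short status, resvd;
            unsigned int lockid;
            char valblk[16];
        };

        /* Wrapper around $ENQW that reports lock waits longer than a
           caller-supplied threshold, using $GETTIM's 100ns units. */
        int enq_timed(struct dsc$descriptor_s *resnam, unsigned int mode,
                      struct lksb *lksb, double limit_sec)
        {
            unsigned __int64 t0, t1;
            int st;

            sys$gettim((void *)&t0);
            st = sys$enqw(0, mode, lksb, 0, resnam, 0, 0, 0, 0, 0, 0, 0);
            sys$gettim((void *)&t1);

            if ((t1 - t0) / 1e7 > limit_sec)
                fprintf(stderr, "lock wait %.2f sec exceeds %.2f sec\n",
                        (t1 - t0) / 1e7, limit_sec);
            return st;
        }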
177.4. by EEMELI::MOSER (Orienteers do it in the bush...) Wed Feb 19 1997 14:52
    whoa, I think this customer wants some more CPU cycle competition!
    
    Why does he want to 'decrease' DEADLOCK_WAIT? In order to be notified
    of deadlocks faster, i.e. so that a process issuing a $ENQW gets a
    SS$_DEADLOCK back more quickly?
    
    or
    
    does he just want to burn some more CPU cycles? You will trigger many,
    many more deadlock searches with a low DEADLOCK_WAIT value. And those
    are very expensive, run at high IPL, and no locks can be granted to
    anybody while a search is in progress, etc.
    
    
    Bottom line: I don't see any valid point in lowering DEADLOCK_WAIT
    below 1 sec. I occasionally lower it from the default of 10 sec down
    to maybe 3 or 5 sec and then watch the timeout queue to see on which
    resources I might have contention...
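
    And remember the victim doesn't get its SS$_DEADLOCK for free: it
    still has to drop everything and redo the work. The usual pattern is
    something like this skeleton (the helpers are made-up stand-ins for
    whatever the application's TP layer really does):

        #include <ssdef.h>

        extern int  acquire_locks(void);     /* issues the $ENQWs */
        extern void release_all_locks(void); /* $DEQs everything held */
        extern void do_transaction(void);

        int run_one_transaction(void)
        {
            int st;

            for (;;) {
                st = acquire_locks();
                if (st == SS$_DEADLOCK) {
                    /* chosen as the victim: back out and retry; all the
                       lock acquisition work is repeated, which is why a
                       lower DEADLOCK_WAIT alone buys little */
                    release_all_locks();
                    continue;
                }
                if (!(st & 1)) return st;    /* some other failure */
                do_transaction();
                return SS$_NORMAL;
            }
        }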
    
    /cmos
177.5. by AUSS::GARSON (DECcharity Program Office) Wed Feb 19 1997 17:39
re .4
    
>    does he just want to burn some more CPU cycles? You will trigger many,
>    many more deadlock searches with a low DEADLOCK_WAIT value.
    
    Not unless there are locks timing out. And on the evidence submitted by
    the customer, while there would be more deadlock searches, they would
    not be wasted.
177.6. by EEMELI::MOSER (Orienteers do it in the bush...) Thu Feb 20 1997 01:33
    re: .5
    
    I'm still not convinced. Let's say your system is busy and you have
    contention for a certain resource. On average a lock request has to
    wait 0.7 sec for the lock to be granted. With a DEADLOCK_WAIT of 1 sec
    this means that you normally wouldn't trigger a deadlock search for
    this lock request, because it's removed from the timeout queue before
    the next round of checks.
    
    If DEADLOCK_WAIT were 0.5 sec you would stumble over this lock and
    trigger a search, which comes back and says "no deadlock" and takes the
    lock off the queue; a fraction of a second later the entry would have
    left the queue anyway, because in the meantime the lock has been granted.
    
    I call it waste when you have lots of deadlock searches but no
    deadlocks found.
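
    MONITOR DLOCK shows exactly this: deadlock search rate and deadlock
    find rate side by side. If lowering DEADLOCK_WAIT makes the search
    rate climb while the find rate stays flat, the extra searches are
    pure overhead.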
    
    /cmos
177.7. by AUSS::GARSON (DECcharity Program Office) Thu Feb 20 1997 16:48
re .6

>    I'm still not convinced. Let's say your system is busy and you have
>    contention for a certain resource. On average a lock request has to
>    wait 0.7 sec for the lock to be granted.

>   If DEADLOCK_WAIT were 0.5 sec you would stumble over this lock and
>   trigger a search, which comes back and says "no deadlock" and takes the

    As I wrote in a prior reply...

"At the very least they should confirm that when deadlock is not occurring the
locks are not held for a period of time exceeding what they would propose for
DEADLOCK_WAIT."

    How realistic is 0.7 sec for an average time that a lock is held? A
    system that is supposed to complete 200 transactions per second or
    even 100 transactions per second is going to struggle if that is the
    correct average.

    Here, by transaction I mean from the start of a database transaction to
    the commit or rollback that drops all the locks. This may be less than
    a business transaction but is the relevant definition in this case.

    One needs to take into account parallelism in a system so that 100
    transactions per second does not mean each transaction lasts 0.01
    second on average. The bottom line is that the customer should measure
    the average transaction duration (or their database software can tell
them) before making a case for sub-second DEADLOCK_WAIT, but my gut feeling
is that their claim may have some merit.
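
    (To make the arithmetic concrete: by Little's law the number of
    transactions in flight equals throughput times average duration, so
    at 100 transactions per second with, say, 70 in progress at any
    instant, the average duration is 70/100 = 0.7 sec. It is the
    concurrency, not just the throughput, that needs to be measured
    against any proposed DEADLOCK_WAIT.)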
177.8. by EEMELI::MOSER (Orienteers do it in the bush...) Fri Feb 21 1997 02:37
    the problem is not necessarily the hold/wait time of 'nicely behaving'
    locks, but all the others.
    
    For example if I want to see which directory files are busy, I just
    lower DEADLOCK_WAIT and suddenly lots of F11B$ locks turn up, because
    too many processes try to create/delete files in large directories,
    and I can bet lots of money that those .DIR files are larger than
    127 blocks.
    
    So any of your transaction locks that behave well do not cause any
    problems, but those others can trigger deadlock searches like hell,
    and those searches will also hurt your transaction lock requests.
    
    /cmos
177.9. "More info" by WHOS01::BOWERS (Dave Bowers, NSIS/IM) Fri Feb 21 1997 11:25
    I went back to the DBA for more info. It appears that most of the
    deadlocks are Rdb page locks. The real villain here is the application,
    which was written using TI's IEF product (James Martin methodology):
    
    1.  IEF is remarkably naive regarding transaction control (like it uses
    default transactions).
    
    2.  The design of the application has a primary process which writes
    rows and then passes a key to a secondary process which further
    processes and updates the row. This of course creates instant "hot
    spots" on both data and index pages as both processes contend foir
    access to the same page.
    
    Neither of the above problems is amenable to a direct fix, so we're
    looking to "tune" the system so as to minimize the mess. The good news,
    if any, is that there is only this one (albeit large) application
    running on the system.
    
    \dave
177.10. by EEMELI::MOSER (Orienteers do it in the bush...) Fri Feb 21 1997 13:19
    Do you know which Rdb pages? Always the same? Then it is pretty
    likely to be an application issue.
    
    If you're interested I have a tool which monitors the lock timeout
    queue and logs information for locks hanging around there too long,
    especially who they are blocking and who is waiting.
    
    For Rdb locks it will translate them into something an Rdb expert
    understands; grab and have a look at TUIJA""::LOCK033.A
    (works on VAX and Alpha and understands almost all Rdb lock types)
    
    /cmos
177.11. "You must fix those deadlocks, not hide them" by HERON::GODFRIND (Oracle Rdb Engineering) Fri Feb 28 1997 05:12
If I may join the conversation ...

The real problem that needs fixing is the excessive number of deadlocks that
happen. Lowering DEADLOCK_WAIT would of course make it possible for VMS to
notice those deadlocks faster, but that is at the expense of potentially severe
side effects (as Christian pointed out) in terms of additional CPU usage at
high IPL.

In addition, when an Rdb process receives a deadlock error against one of its
pending page lock requests, it will demote all other page locks it may have,
which may involve writing pages back to disk that would otherwise have been
written lazily at a later stage. Also, there is a very good chance that the
same process will need to reacquire some of those locks right after giving
them up.

So, although page deadlocks in Rdb cause no application failure, and even
though adjusting DEADLOCK_WAIT to a lower value may seem to make the
application more responsive, they are still bad and should be kept to a
minimum.

>           <<< Note 177.9 by WHOS01::BOWERS "Dave Bowers, NSIS/IM" >>>
>    I went back to the DBA for more info. It appears that most of the
>    deadlocks are Rdb page locks. The real villain here is the application,
>    which was written using TI's IEF product (James Martin methodology):
    
I myself have done extensive tuning of databases used by applications built
using IEF (although those applications did not require that level of
performance - more like the 10 to 15 TPS range).

>    1.  IEF is remarkably naive regarding transaction control (like it uses
>    default transactions).

That is not exactly true. Some level of control is available (via logical
names) over isolation levels and transaction mode (read only vs read write)
for the readers.

>    2.  The design of the application has a primary process which writes
>    rows and then passes a key to a secondary process which further
>    processes and updates the row. This of course creates instant "hot
>    spots" on both data and index pages as both processes contend foir
>    access to the same page.
    
>    Neither of the above problems is amenable to a direct fix, so we're
>    looking to "tune" the system so as to minimize the mess. The good news,
>    if any, is that there is only this one (albeit large) application
>    running on the system.

You may consider using alternate indexing techniques (such as hashing) to
better distribute records and index nodes and possibly avoid the contention.
Another point to consider is giving up fast commit (or setting up the primary
process so that it checkpoints after each transaction). That way it will give
up its page locks after each transaction and let the secondary process get at
the pages without lock conflicts.

Also, adapting index node sizes should have an effect on contention (smaller
nodes may help).

All this is of course highly speculative. Accurate recommendations would
require a detailed look at the application and database design. 

I recommend you get assistance from some experienced Rdb consultant. There are
quite a few available from Oracle and external sources.

Where is the customer located, BTW? Just send me mail offline and I may be
able to locate names for you ...

/albert

--
Albert Godfrind                    Oracle Rdb Engineering
Oracle Corporation                 Email:  [email protected]
DEC European Technical Center              [email protected]
950 Route des Colles               Phone:  +33/4/92.95.51.63
06901 Sophia-Antipolis             Mobile: +33/6/09.97.27.23
France                             FAX:    +33/4/92.95.50.50