[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference orarep::nomahs::dec_data_distributor

Title:The Replication Option for Rdb
Notice:Product renamed to Replication Option for Rdb
Moderator:BROKE::PROTEAU
Created:Wed Mar 02 1994
Last Modified:Wed Jun 04 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:287
Total number of notes:1231

270.0. "performance degradation when Replication is used" by 8292::PJACOB (Patrick [email protected]) Tue Mar 11 1997 09:11

An important customer of mine experiences performance problems with Replication
option for Rdb V6.0 with Rdb 6.0-1. Every 2 days, response time becomes 
intolerable due to locks, deadlocks and excessive I/Os caused by DDD.

Replication is organised as follows: an Rdb database on VAX is replicated on
an Rdb database on Alpha. Every 5 minutes on the Alpha ( except 1 time by day
during 1 hour to achieve an non-interruptible task ), a remote process triggers
an SQL START TRANSFER transfer-name NOW NOWAIT on the VAX. The REPLICATION 
transfer is a MOVE TABLES transfer of 74 tables moved by SELECT * .
Without defining DDAL$CHANGES_INDEX to USE, it was totally unusable. After
defining DDAL$CHANGES_INDEX, performance start at a correct level but is
getting worse and worse. DDAL$PURGE is not defined. There is about 1000 
transactions by hours : 640 updates + 120 inserts + 40 deletes + 200 reads
Tables are not compressed. 

The customer has a spy program which retrieves time and resource consumption. He
detected that times and I/Os numbers are getting greater and greater until he
stopped DDD activity by ALTER TRANSFER + DROP TRANSFER. He noted also with
RMU/ANALYZE that number of pages in RDB$CHANGES increased on and on : chart
used/used+free vs # pages shows 0 data records for 52890 pages at 0-10% 

I am not a ROP expert thus any help will be welcomed.

Patrick
T.RTitleUserPersonal
Name
DateLines
270.1HOTRDB::PMEADPaul, [email protected], 719-577-8032Wed Mar 12 1997 09:091
    I suggest upgrading to V7.0 and using Hot Standby.
270.2ORAREP::HERON::GODFRINDOracle Rdb EngineeringThu Mar 13 1997 02:5449
When you say that performance degrades, do you mean the applications that use
the source database ? Or is it the copy process itself?

>getting worse and worse. DDAL$PURGE is not defined. There is about 1000 

What is the default action when DDAL$PURGE is not defined ? The documentation
is not clear (it does not say what the default action is). I would expect it to
be to purge transfered rows after each transfer ... 

If the changes table did not get purge regularly, then that would explain the
fact that it grows a lot. The logs produduced by the copy processes should
indicate wether purging takes place or not.

>An important customer of mine experiences performance problems with Replication
>option for Rdb V6.0 with Rdb 6.0-1. Every 2 days, response time becomes 
>intolerable due to locks, deadlocks and excessive I/Os caused by DDD.

What do they do to return to good performence ?

Where do they get deadlock errors ? In their application code ? Against what
resource ? Or is it the copy processes that get deadlocks ?

>Replication is organised as follows: an Rdb database on VAX is replicated on
>an Rdb database on Alpha. Every 5 minutes on the Alpha ( except 1 time by day
>during 1 hour to achieve an non-interruptible task ), a remote process triggers
>an SQL START TRANSFER transfer-name NOW NOWAIT on the VAX. The REPLICATION 

So you mean they do not have any schedule defined for that transfer (i.e. the
scheduling is not controlled by the transfer monitor, but entirely by the
application ?). 

How long does noe transfer operation take ? Could it be that multiple copy
processes could run at the same time (assuming a transfer lasts more than 5
minutes [not sure if this is allowed at all by the transfer monitor])

Another point: you say there is a one hour period every day when no transfer
activity takes place. This means that the changes for that one hour period will
accumulate in the changes log. That has two consequences:

- the first transfer to run after that hour will have much more to do. Could it
  be that it is still running when the application requests the next transfer,
  5 minutes later ? Not sure what that would cause ...

- the changes table will grow to accomodate all the rows accumulated during
  that hour. Even if they are all purged, the table itself will not shrink
  (i.e. it will still occupy the same number of pages). Subsequent scans of
  the changes table will take longer. The table will never shrink on its own.

/albert
270.3Default purgingBROKE::GREENThu Mar 13 1997 09:5521
    I'll take a stab at the default purging behavior, and if I'm wrong
    someone will let me know I am sure.     8^)
    
    When a replication update transfer completes it somehow (don't know how
    technically) checks to see if another replication update transfer is
    executing.  If there is another transfer in progress then the completed
    transfer does not attempt to purge RDB$CHANGES.  If there is no other
    active transfer then the completed transfer purges RDB$CHANGES.
    
    The process reads from the source table (RDB$TRANSFERS) and gets the
    lowest TSER value in the table.  It will then purge up to that lowest
    TSER value.  So, if you happened to have a transfer which is never run
    anymore you should drop that transfer, or else you'll never purge
    beyond the dormant transfer's TSER value.
    
    There are logicals that can be defined to change the default purging
    behavior (section B-13 of the V7 Rdb Replication Option Handbook).
                                               
    Again, someone please let me know if I'm wrong in my assumptions!
    
    Don
270.4problems occur when RDB$CHANGES is filled8292::PJACOBPatrick [email protected]Fri Mar 14 1997 10:1778
Hi Paul , Albert and Don,

here is some additional information:

>    I suggest upgrading to V7.0 and using Hot Standby.

	If replication does not work, they will investigate this possibility.
	Previously he used a solution based on AIJ application on a clone
	database. But they also experienced [other] problems ( at least bug
	376068 ). Therefore, they abandonned this option to adopt replication 
	option.


>When you say that performance degrades, do you mean the applications that use
>the source database ? Or is it the copy process itself?

	It is the application which is impacted by the degradation. The
	transfer process and the purge process still continue to execute within
	acceptable range of time.


>What is the default action when DDAL$PURGE is not defined ? The documentation
>is not clear (it does not say what the default action is). I would expect it to
>be to purge transfered rows after each transfer ... 

	Replication Option for Rdb Handbook release 6.0 appendix B specify that
	if DDAL$PURGE is not defined or defined to something which does not 
	begin by "N" or "n", the default is to purge. The customer checked in 
	log file that purges are done.
 

>What do they do to return to good performence ?
>
>Where do they get deadlock errors ? In their application code ? Against what
>resource ? Or is it the copy processes that get deadlocks ?

        Maybe I was not clear enough on this points. Let me describe another
	way. They experienced performance problem on the application which 
	accessed the Rdb database they want to replicate. They noted that 
	some transactions which used to take about a hundred I/Os took 16000
	I/Os. They noted also a lot of locks and a lot of deadlocks. RMU/SHOW 
	LOCKS/MODE=BLOCKING revealed that DDD index was blocking. To workaround 
	the performance problem, they simply dropped the transfer( DROP
	TRANSFER ); and performances came back to reasonable level but there
	is no more replication ! 


>So you mean they do not have any schedule defined for that transfer (i.e. the
>scheduling is not controlled by the transfer monitor, but entirely by the
>application ?). 

	Yes, this is exactly what I meant.


>How long does noe transfer operation take ? Could it be that multiple copy
>processes could run at the same time (assuming a transfer lasts more than 5
>minutes [not sure if this is allowed at all by the transfer monitor])

	Transfer + purge takes around 15 seconds . This is << 5 minutes.


>- the first transfer to run after that hour will have much more to do. Could it
>  be that it is still running when the application requests the next transfer,
>  5 minutes later ? Not sure what that would cause ...

	The first transfer after the interruption takes about 3 minutes; that
	is it takes less than the 5 minutes before the next transfer.


>- the changes table will grow to accomodate all the rows accumulated during
>  that hour. Even if they are all purged, the table itself will not shrink
>  (i.e. it will still occupy the same number of pages). Subsequent scans of
>  the changes table will take longer. The table will never shrink on its own.

	But shouldn't the pages be reused once the purge has completed ? 


Patrick
270.5ORAREP::HERON::GODFRINDOracle Rdb EngineeringMon Mar 17 1997 06:5158
>>    I suggest upgrading to V7.0 and using Hot Standby.
>
>        If replication does not work, they will investigate this possibility.
>        Previously he used a solution based on AIJ application on a clone
>        database. But they also experienced [other] problems ( at least bug
>        376068 ). Therefore, they abandonned this option to adopt replication 
>        option.

If they need a full clone of the database (i.e. strictly identical down to the
physical level), then the hot standby is the best option.


>        Maybe I was not clear enough on this points. Let me describe another
>        way. They experienced performance problem on the application which 
>        accessed the Rdb database they want to replicate. They noted that 
>        some transactions which used to take about a hundred I/Os took 16000
>        I/Os. They noted also a lot of locks and a lot of deadlocks. RMU/SHOW 
>        LOCKS/MODE=BLOCKING revealed that DDD index was blocking. To workaround 
>        the performance problem, they simply dropped the transfer( DROP
>        TRANSFER ); and performances came back to reasonable level but there
>        is no more replication ! 

OK. That makes it clearer. Did you/they investigate those extra I/Os ? Are they
against rdb$system ? Is there a high ratio of pages checked vs pages stored ?

>>- the changes table will grow to accomodate all the rows accumulated during
>>  that hour. Even if they are all purged, the table itself will not shrink
>>  (i.e. it will still occupy the same number of pages). Subsequent scans of
>>  the changes table will take longer. The table will never shrink on its own.
>
>        But shouldn't the pages be reused once the purge has completed ? 

Not necessarily immediately. The changes table is just like any other uniform
table. The copy process will delete the rows (via the purge action) and free up
space on the pages, but:

- other connected processes will not notice and keep on storing rows "at the
end" of the logical area, so it will keep expanding. This is the expected
behavior when storing records in a uniform area: the "target page" pointer
moves forward; it will go back only if the process deletes some rows "earlier"
in the table (but that never happens with the changes table, since the copy
processes do all the purging), or if the process disconnects and reconnects.

- even though the pages are empty, they are still allocated to the changes
table, so any operation that requires scanning the whole table will have to
read all those pages (even the empty ones).

I don't know if any of this explains what the customer sees. Looks like there
is not too much data to transfer, so the changes table should never be too
large. Maybe it did grow very large at some point, as a result of some 
reorganisation; You can try physically shrinking it. Check out the RMU/REPAIR
/INITIALIZE=FREE_PAGES command.

How is the application structured in terms of processes ? Do the updating
processes stay attached for very long intervals ? Or do they connect and
disconnect frequently ?

/albert
270.6NOVA::SMITHIDon&#039;t understate or underestimate Rdb!Mon Mar 17 1997 10:169
~- even though the pages are empty, they are still allocated to the changes
~table, so any operation that requires scanning the whole table will have to
~read all those pages (even the empty ones).

To explain a little more.  While we know that the page is strictly empty, the
data structures only record if the page is not-full.  i.e. a not full page may
contain data.

Ian
270.7ObservationsBROKE::PROTEAUJean-Claude ProteauMon Mar 24 1997 12:26103
Some comments on the base note and its replies plus answers to a few of the
questions:

    1.  Transaction Load OK
    2.  Replication Frequency OK
    3.  Beware Long-Running Applications and Db Key Scope
    4.  Copy Process Protected Read Can Cause Stalls
    5.  Beware Indexes on RDB$CHANGES
    6.  Purging Can Cause a Lot of Locking
    7.  Transfer Won't Rerun While Running
    8.  Concurrent Transfers Use a Purge Lock


1.  Transaction Load OK

Your transaction load seems to be well within the capability of ROR to handle.
I conclude this from the time you say it takes for a transfer to execute.

2.  Replication Frequency OK

The transfer repetition rate of once every five minutes also seems to be ok.
Transfers are able to complete well before the next is due to execute.

3.  Beware Long-Running Applications and Db Key Scope

There are three reasons we know of that can cause the size of the RDB$CHANGES
table to grow large:

    a.  A long-running transaction with many inserts, updates and deletes
    b.  An application that stays attached to the database indefinitely
    c.  An application that uses database key scope is attach

An application that stays attached indefinitely never sees any of the space
freed up by the purging process.  It always allocates space ahead of itself.
Eventually it bumps into the end of the table space and the RDB$CHANGES table
has to be extended.  I add this comment from Ian Smith, "I believe the current
search for FREE space looks forward in the current area, and extends if needed.
The assumption is that it has already scanned from the start and found no free
space.  So scanning again seems fruitless in the general case.  However, the
freed space will be used by a future attach."

Whenever there is at least one application running with database key scope is
attach, database keys for deleted rows cannot be reclaimed.  Space in the data-
base page for the database key's line index and TSN index remain allocated.
This reduces the space available on the page and eventually there might not be
enough space there to add a new row for the RDB$CHANGES table.  Even though none
of the applications might be considered to be long-running, so long as their
execution overlaps so that there is always one with database key scope is at-
tach, this problem will continue to exist.

4.  Copy Process Protected Read Can Cause Stalls

When a copy process first starts a replication update run, it starts two trans-
actions in separate attachments to the source database.  First it does a pro-
tected read on RDB$CHANGES.  Once that transaction starts, it does a read only
transaction and rolls back the first one.  The protected read forces a quiet
point where there are no active writers to the RDB$CHANGES table.  If the copy
process has to wait for some applicaiton to commit its read write transaction,
any other applications that try to start a read write transaction afterwards
and write to RDB$CHANGES will also be delayed.

It was said that the transfers run every five minutes and complete in about
15 seconds.  The transfers therefore run about 5% of the time.  None of this
seems likely to cause any serious delays to the other applcations or cause
them to timeout (unles you have a very short timeout interval defined).

The protected read behavior can be disabled.  There is a logical that can be
defined to do so.  However, I don't recommend this be used unless you meet the
requirements stated in the manual.  Otherwise you will encounter periodic
problems caused by transfers missing updates.  We are considering ways to elim-
inate the need to do a protected read, but for now it remains a fact of life.

5.  Beware Indexes on RDB$CHANGES

Indexing the RDB$CHANGES table does indeed dramatically improve the execution
speed of transfers.  But it can lead to a lot of locking problems for concur-
rently-executing applcations.  One is better off, I feel, keeping the size of
the RDB$CHANGES table down.

6.  Purging Can Cause a Lot of Locking

When a copy process purges the RDB$CHNAGES table, it does so using a read write
transaction.  First it reads all the rows in the table, sorts them, and then
decides which rows to delete.  This can cause a lot of locking while the purging
is underway.  Improvements to the purging process are being implemented now.
Since the changes are extensive, they will only be made available to the latest
version of ROR.

7.  Transfer Won't Rerun While Running

Someone asked what happens if a transfer takes longer to run than usual so that
it is still running when someone issues a request to rerun it.  The answer is,
nothing.  If a transfer is already running and you ask to run it, the request
will be ignored.

8.  Concurrent Transfers Use a Purge Lock

Someone asked how concurrently-executing transfer keep from purging the
RDB$CHANGES table in parallel.  They do so by unsing a cluster-wide lock.  Only
one transfer can obtain the lock at any given time.  The one which gets the
lock does the purging.  Any others will give up right away, noting that there
is no point in having parallel purging taking place.