[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference orarep::nomahs::dec_data_distributor

Title:	The Replication Option for Rdb
Notice:	Product renamed to Replication Option for Rdb
Moderator:	BROKE::PROTEAU

Created:	Wed Mar 02 1994
Last Modified:	Wed Jun 04 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	287
Total number of notes:	1231

270.0. "performance degradation when Replication is used" by 8292::PJACOB (Patrick [email protected]) Tue Mar 11 1997 09:11

An important customer of mine experiences performance problems with Replication
option for Rdb V6.0 with Rdb 6.0-1. Every 2 days, response time becomes 
intolerable due to locks, deadlocks and excessive I/Os caused by DDD.

Replication is organised as follows: an Rdb database on VAX is replicated on
an Rdb database on Alpha. Every 5 minutes on the Alpha ( except 1 time by day
during 1 hour to achieve an non-interruptible task ), a remote process triggers
an SQL START TRANSFER transfer-name NOW NOWAIT on the VAX. The REPLICATION 
transfer is a MOVE TABLES transfer of 74 tables moved by SELECT * .
Without defining DDAL$CHANGES_INDEX to USE, it was totally unusable. After
defining DDAL$CHANGES_INDEX, performance start at a correct level but is
getting worse and worse. DDAL$PURGE is not defined. There is about 1000 
transactions by hours : 640 updates + 120 inserts + 40 deletes + 200 reads
Tables are not compressed. 

The customer has a spy program which retrieves time and resource consumption. He
detected that times and I/Os numbers are getting greater and greater until he
stopped DDD activity by ALTER TRANSFER + DROP TRANSFER. He noted also with
RMU/ANALYZE that number of pages in RDB$CHANGES increased on and on : chart
used/used+free vs # pages shows 0 data records for 52890 pages at 0-10% 

I am not a ROP expert thus any help will be welcomed.

Patrick

T.R	Title	User	Personal Name	Date	Lines
270.1		HOTRDB::PMEAD	Paul, [email protected], 719-577-8032	`Wed Mar 12 1997 09:09`	1
	I suggest upgrading to V7.0 and using Hot Standby.
270.2		ORAREP::HERON::GODFRIND	Oracle Rdb Engineering	`Thu Mar 13 1997 02:54`	49
	When you say that performance degrades, do you mean the applications that use the source database ? Or is it the copy process itself? >getting worse and worse. DDAL$PURGE is not defined. There is about 1000 What is the default action when DDAL$PURGE is not defined ? The documentation is not clear (it does not say what the default action is). I would expect it to be to purge transfered rows after each transfer ... If the changes table did not get purge regularly, then that would explain the fact that it grows a lot. The logs produduced by the copy processes should indicate wether purging takes place or not. >An important customer of mine experiences performance problems with Replication >option for Rdb V6.0 with Rdb 6.0-1. Every 2 days, response time becomes >intolerable due to locks, deadlocks and excessive I/Os caused by DDD. What do they do to return to good performence ? Where do they get deadlock errors ? In their application code ? Against what resource ? Or is it the copy processes that get deadlocks ? >Replication is organised as follows: an Rdb database on VAX is replicated on >an Rdb database on Alpha. Every 5 minutes on the Alpha ( except 1 time by day >during 1 hour to achieve an non-interruptible task ), a remote process triggers >an SQL START TRANSFER transfer-name NOW NOWAIT on the VAX. The REPLICATION So you mean they do not have any schedule defined for that transfer (i.e. the scheduling is not controlled by the transfer monitor, but entirely by the application ?). How long does noe transfer operation take ? Could it be that multiple copy processes could run at the same time (assuming a transfer lasts more than 5 minutes [not sure if this is allowed at all by the transfer monitor]) Another point: you say there is a one hour period every day when no transfer activity takes place. This means that the changes for that one hour period will accumulate in the changes log. That has two consequences: - the first transfer to run after that hour will have much more to do. Could it be that it is still running when the application requests the next transfer, 5 minutes later ? Not sure what that would cause ... - the changes table will grow to accomodate all the rows accumulated during that hour. Even if they are all purged, the table itself will not shrink (i.e. it will still occupy the same number of pages). Subsequent scans of the changes table will take longer. The table will never shrink on its own. /albert
270.3	Default purging	BROKE::GREEN		`Thu Mar 13 1997 09:55`	21
	I'll take a stab at the default purging behavior, and if I'm wrong someone will let me know I am sure. 8^) When a replication update transfer completes it somehow (don't know how technically) checks to see if another replication update transfer is executing. If there is another transfer in progress then the completed transfer does not attempt to purge RDB$CHANGES. If there is no other active transfer then the completed transfer purges RDB$CHANGES. The process reads from the source table (RDB$TRANSFERS) and gets the lowest TSER value in the table. It will then purge up to that lowest TSER value. So, if you happened to have a transfer which is never run anymore you should drop that transfer, or else you'll never purge beyond the dormant transfer's TSER value. There are logicals that can be defined to change the default purging behavior (section B-13 of the V7 Rdb Replication Option Handbook). Again, someone please let me know if I'm wrong in my assumptions! Don
270.4	problems occur when RDB$CHANGES is filled	8292::PJACOB	Patrick [email protected]	`Fri Mar 14 1997 10:17`	78
	Hi Paul , Albert and Don, here is some additional information: > I suggest upgrading to V7.0 and using Hot Standby. If replication does not work, they will investigate this possibility. Previously he used a solution based on AIJ application on a clone database. But they also experienced [other] problems ( at least bug 376068 ). Therefore, they abandonned this option to adopt replication option. >When you say that performance degrades, do you mean the applications that use >the source database ? Or is it the copy process itself? It is the application which is impacted by the degradation. The transfer process and the purge process still continue to execute within acceptable range of time. >What is the default action when DDAL$PURGE is not defined ? The documentation >is not clear (it does not say what the default action is). I would expect it to >be to purge transfered rows after each transfer ... Replication Option for Rdb Handbook release 6.0 appendix B specify that if DDAL$PURGE is not defined or defined to something which does not begin by "N" or "n", the default is to purge. The customer checked in log file that purges are done. >What do they do to return to good performence ? > >Where do they get deadlock errors ? In their application code ? Against what >resource ? Or is it the copy processes that get deadlocks ? Maybe I was not clear enough on this points. Let me describe another way. They experienced performance problem on the application which accessed the Rdb database they want to replicate. They noted that some transactions which used to take about a hundred I/Os took 16000 I/Os. They noted also a lot of locks and a lot of deadlocks. RMU/SHOW LOCKS/MODE=BLOCKING revealed that DDD index was blocking. To workaround the performance problem, they simply dropped the transfer( DROP TRANSFER ); and performances came back to reasonable level but there is no more replication ! >So you mean they do not have any schedule defined for that transfer (i.e. the >scheduling is not controlled by the transfer monitor, but entirely by the >application ?). Yes, this is exactly what I meant. >How long does noe transfer operation take ? Could it be that multiple copy >processes could run at the same time (assuming a transfer lasts more than 5 >minutes [not sure if this is allowed at all by the transfer monitor]) Transfer + purge takes around 15 seconds . This is << 5 minutes. >- the first transfer to run after that hour will have much more to do. Could it > be that it is still running when the application requests the next transfer, > 5 minutes later ? Not sure what that would cause ... The first transfer after the interruption takes about 3 minutes; that is it takes less than the 5 minutes before the next transfer. >- the changes table will grow to accomodate all the rows accumulated during > that hour. Even if they are all purged, the table itself will not shrink > (i.e. it will still occupy the same number of pages). Subsequent scans of > the changes table will take longer. The table will never shrink on its own. But shouldn't the pages be reused once the purge has completed ? Patrick
270.5		ORAREP::HERON::GODFRIND	Oracle Rdb Engineering	`Mon Mar 17 1997 06:51`	58
	>> I suggest upgrading to V7.0 and using Hot Standby. > > If replication does not work, they will investigate this possibility. > Previously he used a solution based on AIJ application on a clone > database. But they also experienced [other] problems ( at least bug > 376068 ). Therefore, they abandonned this option to adopt replication > option. If they need a full clone of the database (i.e. strictly identical down to the physical level), then the hot standby is the best option. > Maybe I was not clear enough on this points. Let me describe another > way. They experienced performance problem on the application which > accessed the Rdb database they want to replicate. They noted that > some transactions which used to take about a hundred I/Os took 16000 > I/Os. They noted also a lot of locks and a lot of deadlocks. RMU/SHOW > LOCKS/MODE=BLOCKING revealed that DDD index was blocking. To workaround > the performance problem, they simply dropped the transfer( DROP > TRANSFER ); and performances came back to reasonable level but there > is no more replication ! OK. That makes it clearer. Did you/they investigate those extra I/Os ? Are they against rdb$system ? Is there a high ratio of pages checked vs pages stored ? >>- the changes table will grow to accomodate all the rows accumulated during >> that hour. Even if they are all purged, the table itself will not shrink >> (i.e. it will still occupy the same number of pages). Subsequent scans of >> the changes table will take longer. The table will never shrink on its own. > > But shouldn't the pages be reused once the purge has completed ? Not necessarily immediately. The changes table is just like any other uniform table. The copy process will delete the rows (via the purge action) and free up space on the pages, but: - other connected processes will not notice and keep on storing rows "at the end" of the logical area, so it will keep expanding. This is the expected behavior when storing records in a uniform area: the "target page" pointer moves forward; it will go back only if the process deletes some rows "earlier" in the table (but that never happens with the changes table, since the copy processes do all the purging), or if the process disconnects and reconnects. - even though the pages are empty, they are still allocated to the changes table, so any operation that requires scanning the whole table will have to read all those pages (even the empty ones). I don't know if any of this explains what the customer sees. Looks like there is not too much data to transfer, so the changes table should never be too large. Maybe it did grow very large at some point, as a result of some reorganisation; You can try physically shrinking it. Check out the RMU/REPAIR /INITIALIZE=FREE_PAGES command. How is the application structured in terms of processes ? Do the updating processes stay attached for very long intervals ? Or do they connect and disconnect frequently ? /albert
270.6		NOVA::SMITHI	Don't understate or underestimate Rdb!	`Mon Mar 17 1997 10:16`	9
	~- even though the pages are empty, they are still allocated to the changes ~table, so any operation that requires scanning the whole table will have to ~read all those pages (even the empty ones). To explain a little more. While we know that the page is strictly empty, the data structures only record if the page is not-full. i.e. a not full page may contain data. Ian
270.7	Observations	BROKE::PROTEAU	Jean-Claude Proteau	`Mon Mar 24 1997 12:26`	103
	Some comments on the base note and its replies plus answers to a few of the questions: 1. Transaction Load OK 2. Replication Frequency OK 3. Beware Long-Running Applications and Db Key Scope 4. Copy Process Protected Read Can Cause Stalls 5. Beware Indexes on RDB$CHANGES 6. Purging Can Cause a Lot of Locking 7. Transfer Won't Rerun While Running 8. Concurrent Transfers Use a Purge Lock 1. Transaction Load OK Your transaction load seems to be well within the capability of ROR to handle. I conclude this from the time you say it takes for a transfer to execute. 2. Replication Frequency OK The transfer repetition rate of once every five minutes also seems to be ok. Transfers are able to complete well before the next is due to execute. 3. Beware Long-Running Applications and Db Key Scope There are three reasons we know of that can cause the size of the RDB$CHANGES table to grow large: a. A long-running transaction with many inserts, updates and deletes b. An application that stays attached to the database indefinitely c. An application that uses database key scope is attach An application that stays attached indefinitely never sees any of the space freed up by the purging process. It always allocates space ahead of itself. Eventually it bumps into the end of the table space and the RDB$CHANGES table has to be extended. I add this comment from Ian Smith, "I believe the current search for FREE space looks forward in the current area, and extends if needed. The assumption is that it has already scanned from the start and found no free space. So scanning again seems fruitless in the general case. However, the freed space will be used by a future attach." Whenever there is at least one application running with database key scope is attach, database keys for deleted rows cannot be reclaimed. Space in the data- base page for the database key's line index and TSN index remain allocated. This reduces the space available on the page and eventually there might not be enough space there to add a new row for the RDB$CHANGES table. Even though none of the applications might be considered to be long-running, so long as their execution overlaps so that there is always one with database key scope is at- tach, this problem will continue to exist. 4. Copy Process Protected Read Can Cause Stalls When a copy process first starts a replication update run, it starts two trans- actions in separate attachments to the source database. First it does a pro- tected read on RDB$CHANGES. Once that transaction starts, it does a read only transaction and rolls back the first one. The protected read forces a quiet point where there are no active writers to the RDB$CHANGES table. If the copy process has to wait for some applicaiton to commit its read write transaction, any other applications that try to start a read write transaction afterwards and write to RDB$CHANGES will also be delayed. It was said that the transfers run every five minutes and complete in about 15 seconds. The transfers therefore run about 5% of the time. None of this seems likely to cause any serious delays to the other applcations or cause them to timeout (unles you have a very short timeout interval defined). The protected read behavior can be disabled. There is a logical that can be defined to do so. However, I don't recommend this be used unless you meet the requirements stated in the manual. Otherwise you will encounter periodic problems caused by transfers missing updates. We are considering ways to elim- inate the need to do a protected read, but for now it remains a fact of life. 5. Beware Indexes on RDB$CHANGES Indexing the RDB$CHANGES table does indeed dramatically improve the execution speed of transfers. But it can lead to a lot of locking problems for concur- rently-executing applcations. One is better off, I feel, keeping the size of the RDB$CHANGES table down. 6. Purging Can Cause a Lot of Locking When a copy process purges the RDB$CHNAGES table, it does so using a read write transaction. First it reads all the rows in the table, sorts them, and then decides which rows to delete. This can cause a lot of locking while the purging is underway. Improvements to the purging process are being implemented now. Since the changes are extensive, they will only be made available to the latest version of ROR. 7. Transfer Won't Rerun While Running Someone asked what happens if a transfer takes longer to run than usual so that it is still running when someone issues a request to rerun it. The answer is, nothing. If a transfer is already running and you ask to run it, the request will be ignored. 8. Concurrent Transfers Use a Purge Lock Someone asked how concurrently-executing transfer keep from purging the RDB$CHANGES table in parallel. They do so by unsing a cluster-wide lock. Only one transfer can obtain the lock at any given time. The one which gets the lock does the purging. Any others will give up right away, noting that there is no point in having parallel purging taking place.