If anybody is still interested in this, here's some information on catalog
size and performance.
First, some background.
The "catalog" contains 3 main pieces.
The first is the "common" AOE (Archive Object Entity), or file header
type information that does not change from day to day, (such as the
node name, file name, directory, date created, and so on, and a unique
identifier for this file).
The second is the "instance" AOE information, or information that is
specific to this particular instance of the backup for this file. This
information would be things like date of the backup or archive, archive
expiration date, file position, and so on. It also contains the unique id
for the "common" information about this file.
The third is the "TLE" (transaction log entity) information. This
information is information about the archive event. It contains such things
as the date of the archive, the save set id, the file system root,
the volume set used, and so on. It contains information that would be
needed to actually do a restore from this archive or save event.
These three pieces are each stored in an indexed sequential file in the
directory abs$catalog. The files have the following suffixes:
_BAOE.DAT (the "common" AOE info), _BAOE_INSNC.DAT (the "instance" AOE
info), and _BTLE.DAT (the TLE info).
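To make those relationships concrete, here is a rough in-memory sketch
in Python. The field names are only illustrative, taken from the
descriptions above; they are not the actual RMS record layouts.

# Rough sketch of the three record types (illustrative fields only).
from dataclasses import dataclass

@dataclass
class CommonAOE:        # one record per unique object (_BAOE.DAT)
    unique_id: str
    node_name: str
    directory: str
    file_name: str
    date_created: str

@dataclass
class InstanceAOE:      # one record per backup of an object (_BAOE_INSNC.DAT)
    common_id: str      # the unique_id of the matching CommonAOE record
    backup_date: str
    expiration_date: str
    file_position: int
    saveset_id: str     # assumed here; used later to reach the TLE record

@dataclass
class TLE:              # one record per backup operation (_BTLE.DAT)
    saveset_id: str
    archive_date: str
    file_system_root: str
    volume_set: str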
So, every time we back up a file (an object), we create an entry in the
AOE "instance" file, and the first time we back up a file we also create
an entry in the "common" AOE file. Each of these files has several keys
so the records can be tied together and quickly searched for specific
items.
The TLE file is written once for each backup operation. It also has keys
to make it easy to find specific information, if the desired information
is known.
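Continuing the sketch above, what one backup operation adds to the
catalog looks roughly like this. The lists stand in for the three
indexed files, and the helper itself is hypothetical.

# Hypothetical sketch of the bookkeeping for one backup operation,
# using the toy record classes above and lists in place of the files.
def record_backup(common_file, instance_file, tle_file, tle, objects):
    tle_file.append(tle)                  # one TLE record per operation
    for obj in objects:                   # obj is a CommonAOE description
        matches = [c for c in common_file
                   if c.file_name == obj.file_name
                   and c.node_name == obj.node_name]
        if not matches:                   # first backup of this object
            common_file.append(obj)
            matches = [obj]
        instance_file.append(InstanceAOE( # one instance record per backup
            common_id=matches[0].unique_id,
            backup_date=tle.archive_date,
            expiration_date="",           # would come from the archive policy
            file_position=0,
            saveset_id=tle.saveset_id))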
Now, let's say we want to do a lookup on a particular file. For a
particular file, we look in the "common" AOE file using the filename as
the key. We find the record; usually there is just one, but there can be
several if the same file exists on multiple systems saved in this
catalog, or if multiple versions of the file exist. We pick up the
unique ID for this file and use it to look up entries in the "instance"
file. There may be multiple entries, one for each time this file has
been backed up. Once we find the specific instance record, usually by
matching dates, we use the save set id as a key to find the record in
the TLE file. That record gives us the volume information (tape id) we
need to restore the file.
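In terms of the toy records above, that lookup chain looks something
like this:

# Filename -> common record -> instance record -> TLE record -> volume.
def find_volume(common_file, instance_file, tle_file, file_name, backup_date):
    for common in common_file:
        if common.file_name != file_name:
            continue
        for inst in instance_file:
            if (inst.common_id == common.unique_id
                    and inst.backup_date == backup_date):
                for tle in tle_file:
                    if tle.saveset_id == inst.saveset_id:
                        return tle.volume_set   # what the restore needs
    return None

In the real catalog each of those inner loops is a keyed read on the
indexed file rather than a linear scan; that is what the keys mentioned
above buy us.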
From the above, we can see that if we are backing up a large number of
files, we may have many entries in the "common" AOE file, especially if
the file names are constantly changing. In addition, we will have a
record in the "instance" AOE file for each time we back up each file.
(So, if we back up 10 files for 10 days, we will have 100 entries in
that file, maybe fewer if we are doing incrementals and the files have
not changed.)
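As a back-of-the-envelope estimate (full backups only; incrementals add
fewer records when files have not changed):

# Rough growth of the "instance" AOE file under full backups.
def instance_records(files_per_backup, number_of_backups):
    return files_per_backup * number_of_backups

print(instance_records(10, 10))   # the 10-files-for-10-days example: 100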
So, now for the questions in the base note.
First, wildcard lookups: why do they take so long?
The time for a wildcard lookup is a function of the size of the catalog
and the method we use for returning the results of the lookup. The size
of the catalog is determined primarily by the number of files that are
backed up. That is, a catalog for a system with 1 million files will be
larger than a catalog for a system with only 10,000 files, other things
being equal.
With a wildcard lookup, we must essentially read the entire "common" AOE
file to see which filenames match the wildcard pattern specified by the
user. We then have to find the corresponding "instance" AOE entries.
So, the bigger the files, the longer it takes. Then there is the
"method" that was chosen for returning the entries to the user: go
through the files, build a giant structure that contains all of the
matching entries, and then return the entries one at a time to the
"client" side. So, before we return the first entry, we find them all,
and this takes time.
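To illustrate the difference, here is the shape of the two approaches in
Python (this is not the ABS code, and Unix-style wildcard matching is
only a stand-in for the real pattern matching):

import fnmatch

# Both must scan every filename in the "common" AOE file, but the first
# returns nothing until the complete result structure is built, while
# the second hands back matches as they are found.
def lookup_build_all(all_filenames, pattern):
    return [name for name in all_filenames
            if fnmatch.fnmatch(name, pattern)]  # caller waits for the whole list

def lookup_streaming(all_filenames, pattern):
    for name in all_filenames:
        if fnmatch.fnmatch(name, pattern):
            yield name                          # caller sees this match right away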
There are some historical reasons why this method was chosen, and
although it may not be the best for performance in a wildcard search, it
keeps things simple for other types of queries, where specific
information is requested and there is not much data to return.
We are currently examining this, and we are continuously looking for
ways to improve catalog performance, both at insertion time and at
lookup time.
In the meantime, since the three catalog files are simply indexed
sequential RMS files, you can use standard RMS tools (such as
ANALYZE/RMS_FILE) to help tune them, looking at things like extents,
bucket splits, index levels, and so on. CONVERT could then be used to
create new files that might improve performance for the individual
customer.
The second question: the size of the catalog files. This is really just
determined by the number of files being backed up, the number of times
the backups are done, and whether they are full or incremental backups.
In addition, if a staging catalog is used, there is a brief period,
while the staging catalog entries are being added to the main catalog,
during which the user will have both the staging file and the growing
main catalog file on disk. Once the staging update is complete, the
staging file is deleted and the file space is returned.
Hope this helps.
Jim and Kim, ABS engineering