If anybody is still interested in this, here's some information on catalog
size and performance.
First, some background.
The "catalog" contains 3 main pieces.
The first is the "common" AOE (Archive Object Entity), or file header
type information that does not change from day to day, (such as the
node name, file name, directory, date created, and so on, and a unique
identifier for this file).
The second is the "instance" AOE information, or information that is
specific to this particular instance of the backup for this file. This
information would be things like date of the backup or archive, archive
expiration date, file position, and so on. It also contains the unique id
for the "common" information about this file.
The third is the "TLE" (transaction log entity) information. This
information is information about the archive event. It contains such things
as the date of the archive, the save set id, the file system root,
the volume set used, and so on. It contains information that would be
needed to actually do a restore from this archive or save event.
These three pieces are each stored in an indexed sequential file in the
directory abs$catalog. The files have the following suffixes:
_BAOE.DAT (the "common" AOE info), _BAOE_INSNC.DAT (the "instance" AOE
info), and _BTLE.DAT (the TLE info).
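To make those relationships concrete, here is a rough in-memory sketch
in Python. The field names are only illustrative, taken from the
descriptions above; they are not the actual RMS record layouts.

# Rough sketch of the three record types (illustrative fields only).
from dataclasses import dataclass

@dataclass
class CommonAOE:        # one record per unique object (_BAOE.DAT)
    unique_id: str
    node_name: str
    directory: str
    file_name: str
    date_created: str

@dataclass
class InstanceAOE:      # one record per backup of an object (_BAOE_INSNC.DAT)
    common_id: str      # the unique_id of the matching CommonAOE record
    backup_date: str
    expiration_date: str
    file_position: int
    saveset_id: str     # assumed here; used later to reach the TLE record

@dataclass
class TLE:              # one record per backup operation (_BTLE.DAT)
    saveset_id: str
    archive_date: str
    file_system_root: str
    volume_set: str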
So, every time we back up a file (an object), we create an entry in the
AOE "instance" file, and the first time we back up a file we also create
an entry in the "common" AOE file. Each of these files has several keys
so the records can be tied together and quickly searched for specific
items.
The TLE file is written once for each backup operation. It also has keys
to make it easy to find specific information, if the desired information
is known.
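Continuing the sketch above, what one backup operation adds to the
catalog looks roughly like this. The lists stand in for the three
indexed files, and the helper itself is hypothetical.

# Hypothetical sketch of the bookkeeping for one backup operation,
# using the toy record classes above and lists in place of the files.
def record_backup(common_file, instance_file, tle_file, tle, objects):
    tle_file.append(tle)                  # one TLE record per operation
    for obj in objects:                   # obj is a CommonAOE description
        matches = [c for c in common_file
                   if c.file_name == obj.file_name
                   and c.node_name == obj.node_name]
        if not matches:                   # first backup of this object
            common_file.append(obj)
            matches = [obj]
        instance_file.append(InstanceAOE( # one instance record per backup
            common_id=matches[0].unique_id,
            backup_date=tle.archive_date,
            expiration_date="",           # would come from the archive policy
            file_position=0,
            saveset_id=tle.saveset_id))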
Now, let's say we want to do a lookup on a particular file. For a
particular file, we look in the "common" AOE file using the filename as
the key. We find the record; usually there is just one, but there can be
several if the same file exists on multiple systems saved in this
catalog, or if multiple versions of the file exist. We pick up the
unique ID for this file and use it to look up entries in the "instance"
file. There may be multiple entries, one for each time this file has
been backed up. Once we find the specific instance record, usually by
matching dates, we use the save set id as a key to find the record in
the TLE file. That record gives us the volume information (tape id) we
need to restore the file.
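In terms of the toy records above, that lookup chain looks something
like this:

# Filename -> common record -> instance record -> TLE record -> volume.
def find_volume(common_file, instance_file, tle_file, file_name, backup_date):
    for common in common_file:
        if common.file_name != file_name:
            continue
        for inst in instance_file:
            if (inst.common_id == common.unique_id
                    and inst.backup_date == backup_date):
                for tle in tle_file:
                    if tle.saveset_id == inst.saveset_id:
                        return tle.volume_set   # what the restore needs
    return None

In the real catalog each of those inner loops is a keyed read on the
indexed file rather than a linear scan; that is what the keys mentioned
above buy us.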
From the above, we can see that if we are backing up a large number of
files, we may have many entries in the "common" AOE file, especially if
the file names are constantly changing. In addition, we will have a
record in the "instance" AOE file for each time we back up each file.
(So, if we back up 10 files for 10 days, we will have 100 entries in
that file, maybe fewer if we are doing incrementals and the files have
not changed.)
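As a back-of-the-envelope estimate (full backups only; incrementals add
fewer records when files have not changed):

# Rough growth of the "instance" AOE file under full backups.
def instance_records(files_per_backup, number_of_backups):
    return files_per_backup * number_of_backups

print(instance_records(10, 10))   # the 10-files-for-10-days example: 100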
So, now for the questions in the base note.
First, wildcard lookups: why do they take so long?
The time for a wildcard lookup is a function of the size of the catalog
and the method we use for returning the results of the lookup. The size
of the catalog is determined primarily by the number of files that are
backed up. That is, a catalog for a system with 1 million files will be
larger than a catalog for a system with only 10,000 files, other things
being equal.
With a wildcard lookup, we must essentially read the entire "common" AOE
file to see which filenames match the wildcard pattern specified by the
user. We then have to find the corresponding "instance" AOE entries.
So, the bigger the files, the longer it takes. Then there is the
"method" that was chosen for returning the entries to the user: go
through the files, build a giant structure that contains all of the
matching entries, and then return the entries one at a time to the
"client" side. So, before we return the first entry, we find them all,
and this takes time.
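To illustrate the difference, here is the shape of the two approaches in
Python (this is not the ABS code, and Unix-style wildcard matching is
only a stand-in for the real pattern matching):

import fnmatch

# Both must scan every filename in the "common" AOE file, but the first
# returns nothing until the complete result structure is built, while
# the second hands back matches as they are found.
def lookup_build_all(all_filenames, pattern):
    return [name for name in all_filenames
            if fnmatch.fnmatch(name, pattern)]  # caller waits for the whole list

def lookup_streaming(all_filenames, pattern):
    for name in all_filenames:
        if fnmatch.fnmatch(name, pattern):
            yield name                          # caller sees this match right away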
There are some historical reasons why this method was chosen, and
although it may not be the best for performance in a wildcard search, it
keeps things simple for other types of queries, where specific
information is requested and there is not much data to return.
We are currently examining this, and we are continuously looking for
ways to improve catalog performance, both at insertion time and at
lookup time.
In the meantime, since the three catalog files are simply indexed
sequential RMS files, you can use standard RMS tools (such as
ANALYZE/RMS_FILE) to help tune them, looking at things like extents,
bucket splits, index levels, and so on. CONVERT could then be used to
create new files that might improve performance for the individual
customer.
The second question: the size of the catalog files. This is really just
determined by the number of files being backed up, the number of times
the backups are done, and whether they are full or incremental backups.
In addition, if a staging catalog is used, there is a brief period,
while the staging catalog entries are being added to the main catalog,
during which the user will have both the staging file and the growing
main catalog file on disk. Once the staging update is complete, the
staging file is deleted and the file space is returned.
Hope this helps.
Jim and Kim, ABS engineering