T.R | Title | User | Personal Name | Date | Lines |
---|
114.1 | | CLT::GILBERT | eager like a child | Sun Nov 02 1986 10:25 | 10 |
| Check the journals.
Many DEC compilers use a balanced binary tree for the symbol table,
since neither performance nor storage are critical aspects. The
VMS logical name table (and DCL symbol symbol table?) use hashing
-- there, performance *is* critical.
There's also been some work on finding perfect and near-perfect hash
functions, for a (usually given) small set of identifiers. Again,
check the journals.
|
114.2 | Textbooks; open hashing; bad assumption | TURRIS::AMARTIN | Alan H. Martin | Sun Nov 02 1986 12:32 | 17 |
| Yes, perfect hashing has been a popular topic during the '80s, and the
journals are probably the place to look for information.
Textbooks like Knuth, any subset of {Aho, Hopcroft, Sethi, Ullman},
Horowitz and Sahni, etc., contain more than enough information about
hash tables for the average application.
All the compilers I've worked on have worked on used open hash tables,
which never become full - they only get progressively more cretinous
performance. I'd say that the deciding factors for using open hashing
were the availability of heap storage for the table entries, and the desire
to avoid having to deal with tables that can become full.
I think it is wrong to assume that all languages have the same characteristic
set of identifiers.
/AHM
|
114.3 | | SMOP::GLOSSOP | Kent Glossop | Sun Nov 02 1986 16:36 | 12 |
| I attended a talk on randomized hash functions when I was at Michigan.
The person that worked on it was from IBM (TJW?). Anyway, they found
that a general hashing function with randomly chosen constants out-
performed the "hand-tuned" hash algorithm in the IBM FORTRAN compiler
that was supposed to do very well for programs that have the "typical"
FORTRAN identifiers like I, J, K, etc. I can't give you any references
offhand, but it might make you think twice about spending a lot of time
pursuing the "ideal" hashing function... Also, if you haven't done it
already, you should use PCA to find your hot spots. It may be that
your time would be much better spent in other areas.
Kent Glossop (VAX PL/I)
|
114.4 | TPU's solution | CASEE::CLARK | Ward Clark | Mon Nov 10 1986 04:59 | 16 |
| From: DSSDEV::TANNENBAUM "TPU Developer"
TPU has similar needs. The algorithm we're in the process of
implementing gets around the problems of growth by keeping a linked
list of symbol tables. When one fills up, we just add another.
Each symbol table contains a 32 hash buckets. We choose our bucket by
XORing the first and last character of the symbol, after stripping
any known prefixes (like 'TPU$_' and 'EVE$_'). Each symbol entry
is null padded to a longwork boundry, so we can do longword compares
(for speed). The symbol entries are stored in symbol length order, so
we can drop out early.
If you want more details, feel free to send mail.
- Barry
|
114.5 | My hashing algorithm (copyright) | COMICS::DEMORGAN | Richard De Morgan, UK CSC/CS | Thu Sep 24 1987 12:16 | 28 |
| I have always used chained hashing, where all the names with the
same hash code are chained together. For VAX, I advocate the following
version (it is written in Ada; cshift is a circular (left is positive)
shift and exor is exclusive OR). It generates a hash code in the
range 0 .. 255, and has been found to be unbiased (i.e. the
distribution of hash chain lengths is a Poisson distribution) over
samples of up to 250,000 names. This function is taken from a
recently-written Ada program that used a task-driven random number
generator.
FUNCTION hash(name : IN string, length : IN integer)
RETURN integer IS
hash_code : integer := 0;
BEGIN
FOR character_index IN 1 .. length
LOOP
hash_code := exor(cshift(hash_code, 4),
char_to_int(name(character_index)));
END LOOP;
RETURN cshift(hash_code, -2*(length - 1)) MOD 256;
END hash;
N.B. It has the interesting property that the hash of a single charcter
identifier is the same as the character value. It is also not stem
or tail sensitive. We used this in the PEARL compiler (32 bit version).
For 16 bit version, shift left 2 and right 1 respectively.
|
114.6 | | TLE::RMEYERS | Randy Meyers | Thu Sep 24 1987 17:55 | 15 |
| Re: .5
I typically use a very similar algorithm based on circular shifts and
exclusive or. The differences are that I usually only shift by one
in the inner loop, I don't shift at all outside of the loop, and I
divide by a prime number to get the hash index.
Dividing by a prime number has the effect that the resulting index
depends on every bit in the result of the loop.
Like you, I have noticed the rotate-xor family of hashing algorithms
does produce very even use of all the buckets in the hash table.
Would you care to comment on why you shift by 4 in the inner loop,
and the function of the finial shift before the division?
|
114.7 | some reasons | COMICS::DEMORGAN | Richard De Morgan, UK CSC/CS | Fri Sep 25 1987 10:53 | 7 |
| There are a number of reasons, but the main ones are that it spreads
the bits of all the characters around, taking into account the
character encoding. The final shift ensures that single character
identifiers have unique hashes. The algorithm was also designed
to take into account the possibility of implementation on machines
with simple instruction sets, possibly no integer divide (the MOD
should be optimized to an AND).
|