[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference turris::languages

Title:	Languages
Notice:	Speaking In Tongues
Moderator:	TLE::TOKLAS::FELDMAN

Created:	Sat Jan 25 1986
Last Modified:	Wed May 21 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	394
Total number of notes:	2683

114.0. "Battle of the hashing algorithms" by YIPPEE::DELISLE () Sat Nov 01 1986 17:57

    After reading Note 1.0, I am not sure if this is the right  conference
    for this topic but I know not where else to go.  I have always assumed
    that a conference named  LANGUAGES  would  attract  the  attention  of
    developers of DEC languages.  So to the question...

    Assuming that most (all?) DEC compilers incorporate a  hash  table  to
    store  user-defined  identifiers, is it possible that we might discuss
    the *ultimate* hashing algorithm for the typical VMS identifier  given
    the VAX instruction set.

    If this were some other computer company, I imagine that a  task-force
    of  compiler  experts  would  have  solved this problem and would have
    *mandated* a certain algorithm for all their compiler  groups.   Since
    we  use  our  *brains* here at DEC, instead of *edicts*, we should put
    those brains together and *do the right thing* when possible.

    This whole issue just may be more important to me than other  compiler
    developers  since the compiler that I am responsible for must maintain
    its symbol table not only during the compiler  phase  but  during  the
    execution  of  the compiled image as well!  I even have to worry about
    what to do when the symbol table is full!  Anybody out there have  any
    good ideas about *virtual* symbol tables that just keep on growing?

    Not to be a sluggard, here is the algorithm that I am currently facing
    (in FORGOL :-))

    hash = 0
    REPEAT FOR i = 1 to LENGTH OF THE IDENTIFIER
        hash = hash * 3       (make sensitive the order of the characters)
        hash = hash + char(i) (add the ASCII value of the next character)
    END REPEAT
    hash = hash * 5**13       (multiply by a randomizing value)
    hash = ABSOLUTE_VALUE(hash) MOD tablesize

    New OPS5 project leader,
    Uncle Ray

T.R	Title	User	Personal Name	Date	Lines
114.1		CLT::GILBERT	eager like a child	`Sun Nov 02 1986 10:25`	10
	Check the journals. Many DEC compilers use a balanced binary tree for the symbol table, since neither performance nor storage are critical aspects. The VMS logical name table (and DCL symbol symbol table?) use hashing -- there, performance is critical. There's also been some work on finding perfect and near-perfect hash functions, for a (usually given) small set of identifiers. Again, check the journals.
114.2	Textbooks; open hashing; bad assumption	TURRIS::AMARTIN	Alan H. Martin	`Sun Nov 02 1986 12:32`	17
	Yes, perfect hashing has been a popular topic during the '80s, and the journals are probably the place to look for information. Textbooks like Knuth, any subset of {Aho, Hopcroft, Sethi, Ullman}, Horowitz and Sahni, etc., contain more than enough information about hash tables for the average application. All the compilers I've worked on have worked on used open hash tables, which never become full - they only get progressively more cretinous performance. I'd say that the deciding factors for using open hashing were the availability of heap storage for the table entries, and the desire to avoid having to deal with tables that can become full. I think it is wrong to assume that all languages have the same characteristic set of identifiers. /AHM
114.3		SMOP::GLOSSOP	Kent Glossop	`Sun Nov 02 1986 16:36`	12
	I attended a talk on randomized hash functions when I was at Michigan. The person that worked on it was from IBM (TJW?). Anyway, they found that a general hashing function with randomly chosen constants out- performed the "hand-tuned" hash algorithm in the IBM FORTRAN compiler that was supposed to do very well for programs that have the "typical" FORTRAN identifiers like I, J, K, etc. I can't give you any references offhand, but it might make you think twice about spending a lot of time pursuing the "ideal" hashing function... Also, if you haven't done it already, you should use PCA to find your hot spots. It may be that your time would be much better spent in other areas. Kent Glossop (VAX PL/I)
114.4	TPU's solution	CASEE::CLARK	Ward Clark	`Mon Nov 10 1986 04:59`	16
	From: DSSDEV::TANNENBAUM "TPU Developer" TPU has similar needs. The algorithm we're in the process of implementing gets around the problems of growth by keeping a linked list of symbol tables. When one fills up, we just add another. Each symbol table contains a 32 hash buckets. We choose our bucket by XORing the first and last character of the symbol, after stripping any known prefixes (like 'TPU$_' and 'EVE$_'). Each symbol entry is null padded to a longwork boundry, so we can do longword compares (for speed). The symbol entries are stored in symbol length order, so we can drop out early. If you want more details, feel free to send mail. - Barry
114.5	My hashing algorithm (copyright)	COMICS::DEMORGAN	Richard De Morgan, UK CSC/CS	`Thu Sep 24 1987 11:16`	28
	I have always used chained hashing, where all the names with the same hash code are chained together. For VAX, I advocate the following version (it is written in Ada; cshift is a circular (left is positive) shift and exor is exclusive OR). It generates a hash code in the range 0 .. 255, and has been found to be unbiased (i.e. the distribution of hash chain lengths is a Poisson distribution) over samples of up to 250,000 names. This function is taken from a recently-written Ada program that used a task-driven random number generator. FUNCTION hash(name : IN string, length : IN integer) RETURN integer IS hash_code : integer := 0; BEGIN FOR character_index IN 1 .. length LOOP hash_code := exor(cshift(hash_code, 4), char_to_int(name(character_index))); END LOOP; RETURN cshift(hash_code, -2*(length - 1)) MOD 256; END hash; N.B. It has the interesting property that the hash of a single charcter identifier is the same as the character value. It is also not stem or tail sensitive. We used this in the PEARL compiler (32 bit version). For 16 bit version, shift left 2 and right 1 respectively.
114.6		TLE::RMEYERS	Randy Meyers	`Thu Sep 24 1987 16:55`	15
	Re: .5 I typically use a very similar algorithm based on circular shifts and exclusive or. The differences are that I usually only shift by one in the inner loop, I don't shift at all outside of the loop, and I divide by a prime number to get the hash index. Dividing by a prime number has the effect that the resulting index depends on every bit in the result of the loop. Like you, I have noticed the rotate-xor family of hashing algorithms does produce very even use of all the buckets in the hash table. Would you care to comment on why you shift by 4 in the inner loop, and the function of the finial shift before the division?
114.7	some reasons	COMICS::DEMORGAN	Richard De Morgan, UK CSC/CS	`Fri Sep 25 1987 09:53`	7
	There are a number of reasons, but the main ones are that it spreads the bits of all the characters around, taking into account the character encoding. The final shift ensures that single character identifiers have unique hashes. The algorithm was also designed to take into account the possibility of implementation on machines with simple instruction sets, possibly no integer divide (the MOD should be optimized to an AND).