[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference turris::languages

Title:Languages
Notice:Speaking In Tongues
Moderator:TLE::TOKLAS::FELDMAN
Created:Sat Jan 25 1986
Last Modified:Wed May 21 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:394
Total number of notes:2683

114.0. "Battle of the hashing algorithms" by YIPPEE::DELISLE () Sat Nov 01 1986 17:57

    After reading Note 1.0, I am not sure if this is the right  conference
    for this topic but I know not where else to go.  I have always assumed
    that a conference named  LANGUAGES  would  attract  the  attention  of
    developers of DEC languages.  So to the question...

    Assuming that most (all?) DEC compilers incorporate a  hash  table  to
    store  user-defined  identifiers, is it possible that we might discuss
    the *ultimate* hashing algorithm for the typical VMS identifier  given
    the VAX instruction set.

    If this were some other computer company, I imagine that a  task-force
    of  compiler  experts  would  have  solved this problem and would have
    *mandated* a certain algorithm for all their compiler  groups.   Since
    we  use  our  *brains* here at DEC, instead of *edicts*, we should put
    those brains together and *do the right thing* when possible.

    This whole issue just may be more important to me than other  compiler
    developers  since the compiler that I am responsible for must maintain
    its symbol table not only during the compiler  phase  but  during  the
    execution  of  the compiled image as well!  I even have to worry about
    what to do when the symbol table is full!  Anybody out there have  any
    good ideas about *virtual* symbol tables that just keep on growing?

    Not to be a sluggard, here is the algorithm that I am currently facing
    (in FORGOL :-))

    hash = 0
    REPEAT FOR i = 1 to LENGTH OF THE IDENTIFIER
        hash = hash * 3       (make sensitive the order of the characters)
        hash = hash + char(i) (add the ASCII value of the next character)
    END REPEAT
    hash = hash * 5**13       (multiply by a randomizing value)
    hash = ABSOLUTE_VALUE(hash) MOD tablesize

    New OPS5 project leader,
    Uncle Ray
T.RTitleUserPersonal
Name
DateLines
114.1CLT::GILBERTeager like a childSun Nov 02 1986 10:2510
    Check the journals.

    Many DEC compilers use a balanced binary tree for the symbol table,
    since neither performance nor storage are critical aspects.  The
    VMS logical name table (and DCL symbol symbol table?) use hashing
    -- there, performance *is* critical.

    There's also been some work on finding perfect and near-perfect hash
    functions, for a (usually given) small set of identifiers.  Again,
    check the journals.
114.2Textbooks; open hashing; bad assumptionTURRIS::AMARTINAlan H. MartinSun Nov 02 1986 12:3217
Yes, perfect hashing has been a popular topic during the '80s, and the
journals are probably the place to look for information.

Textbooks like Knuth, any subset of {Aho, Hopcroft, Sethi, Ullman},
Horowitz and Sahni, etc., contain more than enough information about
hash tables for the average application.

All the compilers I've worked on have worked on used open hash tables,
which never become full - they only get progressively more cretinous
performance.  I'd say that the deciding factors for using open hashing
were the availability of heap storage for the table entries, and the desire
to avoid having to deal with tables that can become full.


I think it is wrong to assume that all languages have the same characteristic
set of identifiers.
				/AHM
114.3SMOP::GLOSSOPKent GlossopSun Nov 02 1986 16:3612
    I attended a talk on randomized hash functions when I was at Michigan.
    The person that worked on it was from IBM (TJW?).  Anyway, they found
    that a general hashing function with randomly chosen constants out-
    performed the "hand-tuned" hash algorithm in the IBM FORTRAN compiler
    that was supposed to do very well for programs that have the "typical"
    FORTRAN identifiers like I, J, K, etc.  I can't give you any references
    offhand, but it might make you think twice about spending a lot of time
    pursuing the "ideal" hashing function...  Also, if you haven't done it
    already, you should use PCA to find your hot spots.  It may be that
    your time would be much better spent in other areas.

    Kent Glossop (VAX PL/I)
114.4TPU's solutionCASEE::CLARKWard ClarkMon Nov 10 1986 04:5916
From:  DSSDEV::TANNENBAUM "TPU Developer"

    TPU has similar needs.  The algorithm we're in the process of
    implementing gets around the problems of growth by keeping a linked
    list of symbol tables.  When one fills up, we just add another.
    
    Each symbol table contains a 32 hash buckets.  We choose our bucket by
    XORing the first and last character of the symbol, after stripping
    any known prefixes (like 'TPU$_' and 'EVE$_').  Each symbol entry
    is null padded to a longwork boundry, so we can do longword compares
    (for speed).  The symbol entries are stored in symbol length order, so
    we can drop out early. 
    
    If you want more details, feel free to send mail.
    
    	- Barry
114.5My hashing algorithm (copyright)COMICS::DEMORGANRichard De Morgan, UK CSC/CSThu Sep 24 1987 12:1628
    I have always used chained hashing, where all the names with the
    same hash code are chained together. For VAX, I advocate the following
    version (it is written in Ada; cshift is a circular (left is positive)
    shift and exor is exclusive OR). It generates a hash code in the
    range 0 .. 255, and has been found to be unbiased (i.e. the
    distribution of hash chain lengths is a Poisson distribution) over
    samples of up to 250,000 names. This function is taken from a
    recently-written Ada program that used a task-driven random number
    generator.
    
        FUNCTION hash(name : IN string, length : IN integer)
    	    RETURN integer IS
    
    	    hash_code : integer := 0;
    
    	BEGIN
    	    FOR character_index IN 1 .. length
    	    LOOP
    		hash_code := exor(cshift(hash_code, 4),
    		    char_to_int(name(character_index)));
    	    END LOOP;
    	    RETURN cshift(hash_code, -2*(length - 1)) MOD 256;
        END hash;
    
    N.B. It has the interesting property that the hash of a single charcter
    identifier is the same as the character value. It is also not stem
    or tail sensitive. We used this in the PEARL compiler (32 bit version).
    For 16 bit version, shift left 2 and right 1 respectively.
114.6TLE::RMEYERSRandy MeyersThu Sep 24 1987 17:5515
Re: .5

I typically use a very similar algorithm based on circular shifts and
exclusive or.  The differences are that I usually only shift by one
in the inner loop, I don't shift at all outside of the loop, and I
divide by a prime number to get the hash index.

Dividing by a prime number has the effect that the resulting index
depends on every bit in the result of the loop.

Like you, I have noticed the rotate-xor family of hashing algorithms
does produce very even use of all the buckets in the hash table.

Would you care to comment on why you shift by 4 in the inner loop,
and the function of the finial shift before the division?
114.7some reasonsCOMICS::DEMORGANRichard De Morgan, UK CSC/CSFri Sep 25 1987 10:537
    There are a number of reasons, but the main ones are that it spreads
    the bits of all the characters around, taking into account the
    character encoding. The final shift ensures that single character
    identifiers have unique hashes. The algorithm was also designed
    to take into account the possibility of implementation on machines
    with simple instruction sets, possibly no integer divide (the MOD
    should be optimized to an AND).