[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

1120.0. "Rating a Hashing function" by TRACE::GILBERT (Ownership Obligates) Fri Sep 15 1989 11:27

Is there a formula for rating a hashing function?  Hashing a set of values
should give a set of numbers uniformly distributed in the range 0..N-1.
Given the distribution, the number of numbers that fall into each of these
N buckets, how may the uniformity of the distribution be rated?
T.RTitleUserPersonal
Name
DateLines
1120.1I've seen...VMSDEV::HALLYBThe Smart Money was on GoliathFri Sep 15 1989 11:547
    I suppose you could use a chi-square test, but that ignores the inner
    workings of the hash function.
    
    Typically you would run empirical tests with the table 95% loaded,
    and compute the mean (and max) number of probes to look up an entry.
    
      John
1120.2Simulate input distribution, then use chi-squareCOOKIE::PBERGHPeter Bergh, DTN 523-3007Fri Sep 15 1989 13:4933
           <<< Note 1120.0 by TRACE::GILBERT "Ownership Obligates" >>>
                         -< Rating a Hashing function >-

>> Is there a formula for rating a hashing function?  Hashing a set of values
>> should give a set of numbers uniformly distributed in the range 0..N-1.
>> Given the distribution, the number of numbers that fall into each of these
>> N buckets, how may the uniformity of the distribution be rated?
    
    The performance of any hashing function is crucially dependent on the
    distribution of the input numbers; if the distribution is such that you
    get plenty of collisions with your given hashing algorithm, you will
    get very bad performance even though the table may be nowhere near
    full.
    
    Having stated the standard caveat, I would suggest that you do:
    
    1.	Get a representative sample of the input distribution.
    
    2.	Feed the representative sample through the hash function and note
    	the number of elements that fall in each bucket.
    
    3.	See if the number of collisions is excessive (it's up to you to
    	decide what "excessive" means).  If the number of collisions is
    	excessive, get a new algorithm.
    
    4.	Run a chi-square test against a uniform distribution.  (There are
    	some tests that don't depend on an underlying distribution, but �/
    	these tests tend to be less powerful than the "standard" ones (because
    	they make fewer assumptions) and �/ the standard tests tend to give
    	fairly good results even when some of their underlying assumptions are
    	violated.)