[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:	Mathematics at DEC

Moderator:	RUSURE::EDP

Created:	Mon Feb 03 1986
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2083
Total number of notes:	14613

1120.0. "Rating a Hashing function" by TRACE::GILBERT (Ownership Obligates) Fri Sep 15 1989 10:27

Is there a formula for rating a hashing function?  Hashing a set of values
should give a set of numbers uniformly distributed in the range 0..N-1.
Given the distribution, the number of numbers that fall into each of these
N buckets, how may the uniformity of the distribution be rated?

T.R	Title	User	Personal Name	Date	Lines
1120.1	I've seen...	VMSDEV::HALLYB	The Smart Money was on Goliath	`Fri Sep 15 1989 10:54`	7
	I suppose you could use a chi-square test, but that ignores the inner workings of the hash function. Typically you would run empirical tests with the table 95% loaded, and compute the mean (and max) number of probes to look up an entry. John
1120.2	Simulate input distribution, then use chi-square	COOKIE::PBERGH	Peter Bergh, DTN 523-3007	`Fri Sep 15 1989 12:49`	33
	<<< Note 1120.0 by TRACE::GILBERT "Ownership Obligates" >>> -< Rating a Hashing function >- >> Is there a formula for rating a hashing function? Hashing a set of values >> should give a set of numbers uniformly distributed in the range 0..N-1. >> Given the distribution, the number of numbers that fall into each of these >> N buckets, how may the uniformity of the distribution be rated? The performance of any hashing function is crucially dependent on the distribution of the input numbers; if the distribution is such that you get plenty of collisions with your given hashing algorithm, you will get very bad performance even though the table may be nowhere near full. Having stated the standard caveat, I would suggest that you do: 1. Get a representative sample of the input distribution. 2. Feed the representative sample through the hash function and note the number of elements that fall in each bucket. 3. See if the number of collisions is excessive (it's up to you to decide what "excessive" means). If the number of collisions is excessive, get a new algorithm. 4. Run a chi-square test against a uniform distribution. (There are some tests that don't depend on an underlying distribution, but �/ these tests tend to be less powerful than the "standard" ones (because they make fewer assumptions) and �/ the standard tests tend to give fairly good results even when some of their underlying assumptions are violated.)