
Conference rusure::math

Title: Mathematics at DEC
Moderator: RUSURE::EDP
Created: Mon Feb 03 1986
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 2083
Total number of notes: 14613

1815.0. "Sample size statistics problem" by AOSG::TOM () Thu Nov 11 1993 13:20

    
    This question is similar to the polling people use when they
    predict election results, but I'm not sure how to set up the
    equation.
    
    I have a population of 2**64 addresses, laid out from 0 to
    (2**64)-1 in increments of 1.  Each address contains a value
    that is labeled either "good" or "bad."  I do not know the
    distribution of "good" and "bad", but I can say that the vast
    majority are probably "good."
    How many addresses do I have to sample to make a statement 
    of the form "I'm xyz% confident that there are no 'bad' 
    addresses" and be statistically correct?  
    
    Exit polls on elections do this all the time, but I don't know
    how they set it up.  
    
    Thanks for any help!
    
    /tom
    
T.R  Title  User  Personal Name  Date  Lines
1815.1  It's a long time since I studied statistics, but  KERNEL::JACKSON  Peter Jackson - UK CSC TP/IM  Fri Nov 12 1993 12:15  14
    I don't think this is a similar problem to the exit polls. The
    technique I learnt at school involves the assumption (or proof) that the
    distribution of the sample results is normal. Your sample results will
    have a binomial distribution, but the test would be against the
    hypothesis that there was 1 bad address, giving a very skewed
    distribution for which the normal distribution is not a good
    approximation. If there is one bad address out of 2^64, then the chance
    of a sample of size n (each address drawn independently at random)
    having no bad addresses is (1-2^-64)^n. I would say that you are
    (1-(1-2^-64)^n)*100% confident that there is not one error if you
    find no errors with a sample size of n.
    So if you want to be p% confident you need (1-2^-64)^n to equal 1-p/100.
    So n=log(1-p/100)/log(1-2^-64).
    
    Peter
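
As a rough illustration of the formula above, here is a minimal Python sketch. The function name samples_needed and its parameters are made up for this example and are not from the original note. One practical wrinkle: in double-precision arithmetic 1-2^-64 rounds to exactly 1.0, so evaluating log(1-2^-64) directly gives 0 and the division fails. The sketch uses math.log1p, which computes log(1+x) accurately for tiny x.

```python
import math

def samples_needed(p_percent, population=2**64):
    """Evaluate n = log(1 - p/100) / log(1 - 1/population).

    Hypothetical helper for illustration; assumes each sample is an
    independent random draw and that exactly one address is bad.
    """
    if not 0 < p_percent < 100:
        raise ValueError("p_percent must be strictly between 0 and 100")
    # log1p(-x) is log(1 - x); it stays accurate even though
    # 1 - 1/2**64 is not distinguishable from 1.0 as a double.
    return math.log(1 - p_percent / 100) / math.log1p(-1 / population)

if __name__ == "__main__":
    for p in (50, 95, 99):
        n = samples_needed(p)
        print(f"{p}% confident: n ~ {n:.3e} ({n / 2**64:.2f} times the population)")
```

Under these assumptions, the sample size for 95% confidence comes out at roughly three times the population size, because the independent draws revisit addresses.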
1815.2  in case you are still wondering about this...  ICARUS::NEILSEN  Wally Neilsen-Steinhardt  Wed Nov 24 1993 12:58  35
There is a fairly simple-minded answer you can find in any statistics book.

Take M randomly chosen items from the population of size N.  If just one
of the N is bad, you have a probability of 

	P(M,N) = ( 1 - 1/N )^M

of getting a completely good sample.  Your goal is to drive this probability
below some suitable threshold.

	X > P(M,N) 

The calculations will be easier if you take logs

	log X > log P(M,N) = M * log ( 1 - 1/N )

and solve for M

	M > ( log X ) / ( log ( 1 - 1/N ) )

Unfortunately, I think you will find that M is comparable to N.  Since 
log ( 1 - 1/N ) is very nearly -1/N, M comes out at about N * log ( 1/X ),
using natural logs.

Why is your answer different from the exit polls, which usually can use small 
samples?  There are two reasons.

First, exit polls use a lot of prior knowledge about voters to sample by 
age, income, race and so forth.  Then they use complex models to combine 
these samples into a total prediction.  You could use similar techniques if 
you can learn something about the distribution of GOOD and BAD.

Second, exit polls usually aim for an error of a few percent or so.  You 
are actually aiming for an error of 1 in N.  If this is really your aim, you
might as well test all N and be done with it.  If your aim is really to show
that the BAD count is less than some number B, fairly far from both 1 and N,
then you would find that M is a much smaller number.
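
To make the last point concrete, here is a minimal Python sketch. It assumes the same independent-draw model as the formula above, but with a fraction B/N of the addresses bad instead of 1/N, so that an all-good sample has probability ( 1 - B/N )^M. The function name samples_for_bound and the example figures are illustrative only and not part of the original reply.

```python
import math

def samples_for_bound(bad_fraction, x):
    """Smallest integer M with (1 - bad_fraction)**M < x.

    Hypothetical helper: bad_fraction plays the role of B/N and x is
    the threshold X; sampling is assumed to be with replacement.
    """
    if not (0 < bad_fraction < 1 and 0 < x < 1):
        raise ValueError("bad_fraction and x must be strictly between 0 and 1")
    # M > log(x) / log(1 - bad_fraction); log1p keeps precision for small fractions.
    ratio = math.log(x) / math.log1p(-bad_fraction)
    return math.floor(ratio) + 1

if __name__ == "__main__":
    for frac in (1e-2, 1e-4, 1e-6):
        m = samples_for_bound(frac, 0.05)
        print(f"bad fraction {frac:g}: M = {m} clean samples for X = 0.05")
```

In this model the required M depends only on the ratio B/N and the threshold X, not on N itself. For a bad fraction of one in a million and X = 0.05, M is about three million samples, tiny compared with 2^64.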