[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:	Mathematics at DEC

Moderator:	RUSURE::EDP

Created:	Mon Feb 03 1986
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2083
Total number of notes:	14613

1815.0. "Sample size statistics problem" by AOSG::TOM () Thu Nov 11 1993 13:20

    
    This question is similar to the polling people use when they
    predict election results, but I'm not sure how to set up the
    equation.
    
    If I have a population of 2**64 addresses, and they are layed
    out from 0 to (2**64)-1 in increments of 1.  These addresses
    contain a value that is either labeled as "good" or "bad."
    I do not know the distribution of the "good" or "bad", but I
    can say that the vast majority are probably "good."  
    How many addresses do I have to sample to make a statement 
    of the form "I'm xyz% confident that there are no 'bad' 
    addresses" and be statistically correct?  
    
    Exit polls on elections do this all the time, but I don't know
    how they set it up.  
    
    Thanks for any help!
    
    /tom

T.R	Title	User	Personal Name	Date	Lines
1815.1	It's a long time since I studied statistics, but	KERNEL::JACKSON	Peter Jackson - UK CSC TP/IM	`Fri Nov 12 1993 12:15`	14
	I don't think this is a similar problem to the exit polls. The technique I learnt at school involves the assumption (or proof) that the distribution of the sample results be normal. Your sample results will have a binominal distribution, but the test would be against the hypothesis that there was 1 bad address, giving a very skewed distribution for which the normal distribution is not a good approximation. If there is one bad address out of 2^64 then he chance of a sample of size n having no bad addresses is (1-2^-64)^n. I would say that could say that you are (1-(1-2^-64)^n)*100% confident that there is not one error if you find no errors with a sample size of n. So if you want to p% confident you need (1-2^-64)^n to equal 1-p/100. So n=log(1-p/100)/log(1-2^-64). Peter
1815.2	in case you are still wondering about this...	ICARUS::NEILSEN	Wally Neilsen-Steinhardt	`Wed Nov 24 1993 12:58`	35
	There is a fairly simple minded answer you can find in any statistics book. Take M randomly chosen items from the population, of size N. If just one of the N is bad, you have a probability of P(M,N) = ( 1 - 1/N )^M of getting a completely good sample. Your goal is to drive this probability below some suitable threshold. X > P(M,N) The calculations will be easier if you take logs log X > log P(M,N) = M * log ( 1 - 1/N ) and solve for M M > ( log X ) / ( log ( 1 - 1/N ) ) Unfortunately, I think you will find that M is extremely close to N. Why is your answer different from the exit polls, which usually can use small samples? There are two reasons. First, exit polls use a lot of prior knowledge about voters to sample by age, income, race and so forth. Then they use complex models to combine these samples into a total prediction. You could use similar techniques if you can learn something about the distribution of GOOD and BAD. Second, exit polls usually aim for an error of a few percent or so. You are actually aiming for an error of 1 in N. If this is really your aim, you might as well test all N and be done with it. If your aim is really to show that the BAD count is less than some number B, fairly far from both 1 and N, then you would find that M is a much smaller number.