
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

1071.0. "Std. Deviation of Boolean data?" by BESS::NAGARAJAN () Mon May 01 1989 13:02

             
    This is a statistics problem.
    
    This may really sound like a fundamental/stupid question .. 
    
    I have a problem wherein I have data from a survey. It
    is a YES or NO kind of data. Specifically, "Do you have problems
    printing?" 61% of the people surveyed said YES, the rest said NO.
    I am planning to do a hypothesis test on this data to determine
    if indeed there is a problem, using a one-tail test.
    
    My question is, how do you calculate the standard deviation of such
    data? I recall vaguely that one can use a fixed std. deviation
    (forget how much it is..) when the std deviation is not given or is
    unknown. Does anyone remember what this figure is?
    
    Better yet, how can I find the std deviation of the above data?
                          
    Thanx for any pointers!!
    
    -gopal-
    
    
    
1071.1. "There are only stupid answers..." by CADSYS::COOPER (Topher Cooper) Mon May 01 1989 13:35
    It's fundamental and elementary but not at all stupid.
    
    Calculate the following:
    
    		 O - pN
    	    Z = ---------
    		SQRT(Npq)
    
    If your sample is large enough, then this will follow the standard
    normal distribution.  The variance is Npq and the standard deviation is
    SQRT(Npq).  Here, O is the number of yes answers (or no answers if you
    prefer), N is the number of replies, p is the proportion of answers you
    would expect by chance to be yes (or no), and q is (1-p).  "Large
    enough" means that both pN and qN are greater than 5 (or better yet
    10).
    
    If the Z-score calculated above is greater than 1.65, then you can
    reject at the 95% confidence level the possibility of chance causing
    the observed deviation.
    
    Be careful with one-tailed tests -- be sure you understand what it
    means to your problem if O - pN is negative (I assumed above that you
    were looking for a positive deviation from chance levels; if you were
    looking for one smaller than chance, then change the numerator to
    (pN - O) and proceed as before).
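    
    For concreteness, a minimal sketch of this calculation in Python
    (the sample size N = 100 and chance level p = 0.5 are illustrative
    assumptions, not figures from the survey):
    
        import math

        N = 100     # number of replies (assumed for illustration)
        O = 61      # number of yes answers out of N
        p = 0.5     # proportion of yes answers expected by chance
        q = 1.0 - p

        # "Large enough" check: both pN and qN should exceed 5.
        assert p * N > 5 and q * N > 5

        # Z-score of the observed count against the chance expectation.
        Z = (O - p * N) / math.sqrt(N * p * q)

        # One-tailed test: reject chance at the 95% level if Z > 1.65.
        print(Z, Z > 1.65)   # 2.2, True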
    
    						Topher
1071.2. "a slightly different approach" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Wed May 03 1989 19:02
re: < Note 1071.0 by BESS::NAGARAJAN >
             
>    I have a problem wherein I have data from a survey. It
>    is a YES or NO kind of data. Specifically, "Do you have problems
>    printing?" 61% of the people surveyed said YES, the rest said NO.
>    I am planning to do a hypothesis test on this data to determine
>    if indeed there is a problem, using a one-tail test.
    
    My first reaction was to talk about Bayesian analysis and decision
    theory, and I'll get back to that in a minute, but first
    
    	YES!!! we have a problem
    
    Assuming any reasonable form of your survey question, if 61% of
    the users say they have problems printing, then we have a problem.
    Time to close the stat book and solve the problem with printing.
    
    But let's assume that you need cooperation from others and you expect
    those others to raise a few statistical questions, and you want
    to be prepared.
    
    I suggest that the best approach to this kind of situation is to 
    ignore hypothesis testing and follow a path through point estimate, 
    interval estimate and decision theory, stopping wherever you feel
    comfortable.  This is not a disagreement with .1, but a refocusing
    of the problem.

    I will assume, as .1 does, that there is some population of users
    out there, some fraction of whom have problems printing, and that
    you are interested in what you have learned about this fraction.
    I also assume that you have done all the right things to get a valid
    sample and asked the right questions in the right way.

    
    POINT ESTIMATE
    
    The best estimate of percent of the population having problems 
    printing is the sample percentage or 61%.
    
    
    INTERVAL ESTIMATE
    
    Now we need the standard deviation
    
>    My question is, how do you calculate the standard deviation of such
>    data? I recall vaguely that one can use a fixed std. deviation
>    (forget how much it is..) when the std deviation is not given or is
>    unknown. Does anyone remember what this figure is?
    
>    Better yet, how can I find the std deviation of the above data?
    
    The standard deviation is given by
    
    	SQRT( p * q / n )
    
    where
    
    	p = fraction answering yes
    	q = 1-p = fraction answering no
    	n = sample size
    
    provided n is reasonably large, usually taken to be n>30 or so.
    Note the similarity to .1.
    
    Assuming that there were 100 people in your sample
    
    	SD = SQRT ( .61 * .39 / 100 ) = 0.049 or 4.9%
    
    You can combine this with a table of cumulative standard normal
    probabilities to define the confidence interval.  For example, the
    95% confidence interval is at 2 times SD, or in this case
    
    	[ 56.1% , 65.9% ]
    
    or what may be more relevant to your situation, there is less than
    a 0.3% chance that less than 56% of the population would give a
    yes answer.  In other words, you can be virtually sure that the
    majority of users have problems printing.
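    
    A sketch of this arithmetic in Python (the sample size of 100 is
    the same assumption as above):
    
        import math

        p = 0.61                      # fraction answering yes
        q = 1.0 - p                   # fraction answering no
        n = 100                       # assumed sample size

        # Standard deviation of the sample *proportion* (large n).
        SD = math.sqrt(p * q / n)
        print(SD)                     # about 0.049, i.e. 4.9%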
    
    I think this is probably enough to convince people to take action,
    but if not, read on.
    
    
    DECISION THEORY
    
    I can't be very explicit in this area without more information
    about your problem, so this is going to get kind of sketchy.
    
    The simplest approach is simply to ask how much customer
    dissatisfaction we are willing to tolerate.  If the answer is none,
    then we close the stat book and solve the printing problem.  If
    the answer is some positive number then we decide whether it is
    above, inside or below our confidence interval and take action
    accordingly.  Note that hypothesis testing may become relevant here,
    if the tolerable value is close to the end point of the interval.
    
    If the answer is that it depends on how much it will cost us to
    leave it alone and how much it will cost us to fix, then we can
    apply decision theory.  We figure out the cost of trying to fix
    a non-existent problem and the cost of not fixing a real problem.
    Then we characterize the two situations (non-existent problem and
    real problem) in terms of the expected outcome of a survey.  We
    apply Bayesian analysis to determine the probability of a real problem,
    and decision theory to compute the expected cost of fixing vs doing
    nothing.  The arithmetic, when we are done, looks a lot like the
    interval analysis above.
    
    Hmm.  That last paragraph was brief to the point of inscrutability,
    but if you need to know more, you can look it up or ask in another
    reply.
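
    As a toy sketch of the expected-cost comparison described above
    (every number here is a made-up placeholder, not something from .0):

        # Probability the problem is real, e.g. from the interval or
        # Bayesian analysis above (hypothetical value).
        p_real = 0.95

        cost_fix    = 50.0    # cost of attempting a fix (hypothetical)
        cost_ignore = 200.0   # cost of leaving a real problem alone

        # Expected cost of each action: fixing is paid regardless;
        # ignoring costs us only if the problem turns out to be real.
        E_fix    = cost_fix
        E_ignore = p_real * cost_ignore

        print("fix" if E_fix < E_ignore else "do nothing")   # "fix"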
1071.3. by AITG::DERAMO (Daniel V. {AITG,ZFC}:: D'Eramo) Wed May 03 1989 19:37
	re .2

>>    probabilities to define the confidence interval.  For example, the
>>    95% confidence interval is at 2 times SD, or in this case
>>    
>>    	[ 56.1% , 65.9% ]
>>    
>>    or what may be more relevant to your situation, there is less than
>>    a 0.3% chance that less than 56% of the population would give a

	You meant "less than a 3% chance" there -- half of what's left
	over when you aren't in the 95% confidence interval (half of 5%).

	Dan
1071.4. "correction accepted" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Thu May 04 1989 14:35
re: < Note 1071.3 by AITG::DERAMO "Daniel V. {AITG,ZFC}:: D'Eramo" >

>	You meant "less than a 3% chance" there 
    
    Correct.  Arithmetic is not my strong point.
1071.5. "Building confidence." by CADSYS::COOPER (Topher Cooper) Thu May 04 1989 15:29
RE: 1071.2 (Wally Neilsen-Steinhardt):

    Using parameter estimates and decision theory rather than hypothesis
    testing -- 100% correct, I should have listened to the underlying
    question rather than the surface one.

    Interval estimate -- sorry, not quite correct.  Comments and corrections:

    First off, a caveat.  The method you give is an approximation of an
    approximation.  Technically it is the confidence interval for the
    sample proportion when the underlying population proportion is known.
    That is, if we know that the real proportion is 61% then we can be
    95% sure that the proportion found in any particular sample of N will
    lie between the bounds you calculate.  What we really want is the
    confidence interval for the unknown population proportion given that we
    know a single sample proportion is 61%.  The latter is a much more
    complex calculation even after we have assumed a normal approximation.
    However, when not only N, but y*N and (1-y)*N (with y being the
    proportion of yes answers in a particular "experiment") are large
    enough, then the former confidence interval is a good approximation
    for the latter, swapping y and p around.  I would say that both y*N and
    (1-y)*N should be greater than 5, or, better yet, greater than 10.

    Second: a clarification for some who might need it.  The SD I gave in
    .1, SQRT(pqN), is the standard deviation in the *count* of the number
    of yes answers, while the SD you gave, SQRT(pq/N) is the standard
    deviation in the *proportion* of the number of yes answers.  It is
    gotten by dividing the count SD by N.

    Third: An actual error.  You goofed on the Z value.

    What we want here is a value for Z, so that the cumulative probability
    of a standard normal variate being between -Z and +Z is .95.  Most
    tables of the cumulative standard normal distribution give the
    probability that a particular sample will be less than a given Z.
    Some manipulation based on the symmetry of the normal distribution gives
    us that for a confidence interval of size C (e.g., C=.95), we look
    for the Z value with a probability of 1 - (1-C)/2.  For C=.95, we
    want the Z value with cumulative probability of .975, which is just
    about 1.96, or approximately 2 (as you said).  But that is 2 standard
    deviations below and two above for a total "width" of 4 standard
    deviations, rather than 1 below and 1 above for a width of 2.

    The confidence interval should be, therefore:

	    [51.2%, 70.8%]
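
    A small sketch of the corrected calculation in Python (again
    assuming a sample of 100, as in .2):

        import math

        p, n = 0.61, 100                  # sample proportion, sample size
        SD = math.sqrt(p * (1 - p) / n)   # SD of the proportion, per .2

        # The .975 point of the standard normal is about 1.96; the
        # rounder value 2 reproduces the interval above.
        Z = 2.0
        print(p - Z * SD, p + Z * SD)     # about (0.512, 0.708)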

    What may have confused you is the custom in physics of reporting values
    with +or- "error bounds" of one standard deviation.  If the errors
    follow a normal distribution this corresponds to a confidence interval
    of the totally arbitrary size of .68, and when they do not it is
    generally completely meaningless.  Since there is usually no indication
    of the distributions of the errors given (and contrary to belief 60
    years or so ago when the practice grew up, normal is *not* normal for
    errors) use of these quantities is a demonstration of the irrationality
    of physicists (but wait, we're having *that* discussion in another
    conference).

					Topher
1071.6. "Bayesian intervals on proportions" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Fri May 05 1989 12:56
re: < Note 1071.5 by CADSYS::COOPER "Topher Cooper" >

>    First off, a caveat.  The method you give is an approximation of an
>    approximation.  
    
    What I did is actually much more correct than it looks, because
    I did not burden my initial reply with a lot of details.
    
    What I really did was compute the interval for the unknown population
    proportion.  As a good Bayesian, I have no trouble doing this, although
    any frequentist readers might find it hard to swallow.  As a good
    Bayesian, I am supposed to call this a credible interval, but that's
    too pedantic for me.
    
    The calculation is actually amazingly simple, since this is one
    of those cases where a conjugate prior distribution has been found.
    In this case it is the beta distribution, which you can look up,
    since if I try to type in the definition, I'll just forget an r-1
    somewhere.  For small n, r, or n-r, I can use the exact form of
    the beta distribution, but if they are all large, I can use a normal
    approximation.  Amazingly, the normal approximation to the beta
    is quite close to the normal approximation to the binomial, so it
    looks like I have just done the 'approximation to an approximation'
    described above.  But really, I have solved a problem exactly in
    terms of a beta distribution, and then approximated that with a
    normal distribution.
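    
    A sketch of that comparison using scipy (a uniform prior and the
    61-of-100 figures are assumed):
    
        from scipy.stats import beta, norm

        n, x = 100, 61                  # assumed sample size and yes-count

        # Uniform prior plus binomial data gives a Beta(x+1, n-x+1)
        # posterior for the population proportion.
        a, b = x + 1, n - x + 1
        exact = beta.ppf([0.025, 0.975], a, b)

        # Normal approximation to the same interval.
        p_hat = x / n
        sd = (p_hat * (1 - p_hat) / n) ** 0.5
        approx = norm.ppf([0.025, 0.975], loc=p_hat, scale=sd)

        print(exact, approx)            # agree to about a percentage point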

>    Second: a clarification for some who might need it.  The SD I gave in
>    .1, SQRT(pqN), is the standard deviation in the *count* of the number
>    of yes answers, while the SD you gave, SQRT(pq/N) is the standard
>    deviation in the *proportion* of the number of yes answers.  It is
>    gotten by dividing the count SD by N.
    
    quite correct.

>    Third: An actual error.  You goofed on the Z value.

    Alas, this is correct too.  .5 has an interesting hypothesis on
    how I could have made this error in a smart way, but I've made that
    smart error too often, so I checked my table carefully.  But then
    I made the dumb error: I just forgot to multiply the SD by 2 in
    my calculator.  Those who wish to avoid the smart error should read
    .5 carefully.  To those who wish to avoid the dumb error, I can
    offer no advice.
1071.7. by AITG::DERAMO (Daniel V. {AITG,ZFC}:: D'Eramo) Fri May 05 1989 14:14
>>	>    First off, a caveat.  The method you give is an approximation of an
>>	>    approximation.  

	This was talking about taking the standard deviation,
	sqrt(NPQ), where P is the probability of a yes answer,
	Q = 1 - P, and N is the number of people asked.  The
	problem was that P was not known.  So instead of the
	unknowns P and Q = 1 - P, he used p and q = 1 - p where
	p = number_of_yes_answers / N.  So p is a good estimate
	for P and q is a good estimate for Q.  The caveat was
	about having done this -- you didn't use the real standard
	deviation.

	An alternative is to notice that the standard deviation
	is largest when PQ = P(1 - P) is largest, which happens
	when P = Q = 0.5.  In that case SD <= sqrt(N / 4) =
	sqrt(N) / 2.  This method "plays it safe" by using the
	largest possible standard deviation of a binomial
	distribution.  The error bounds determined this way may
	end up being a little larger than necessary.  But that's
	okay if it is more important that the bounds include
	the (unknown) value of P than that the bounds be small.
	(The bounds being "P is xx% likely to be in the range
	p +/- yy standard deviations.")

	Dan
1071.8. "asymptotic approach to clarity" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Mon May 08 1989 13:04
re: < Note 1071.7 by AITG::DERAMO "Daniel V. {AITG,ZFC}:: D'Eramo" >

    Dan,
    
    I think we are talking at cross-purposes here.  Can you please reread
    .6?  I thought that I answered this comment there.  If not, I'll
    try again.  Let me know what you have comments about, questions
    on, or disagree with.
    
    Wally
1071.9. by AITG::DERAMO (Daniel V. {AITG,ZFC}:: D'Eramo) Mon May 08 1989 13:46
     Wally,
     
     .6 didn't say what the caveat was about so I described it,
     not remembering that it was all spelled out in .5.  My
     alternative answer to it was directed to those who don't
     know what a beta distribution is or why it applies -- just
     use the known upper bound on the standard deviation instead.
     
>>	Let me know what you have ... questions on ....
     
     What's a beta distribution and why does it apply here? :-)
     
     Dan
1071.10. "You beta your life!" by CADSYS::COOPER (Topher Cooper) Mon May 08 1989 14:24
RE: .9 (Dan)
    
    Basically, the beta distribution is a continuous generalization of the
    binomial distribution.  Where the binomial distribution uses factorial,
    the beta distribution uses the gamma function.  Since beta is
    continuous it is integrable and differentiable, unlike the binomial
    distribution.  This allows various problems to be solved for it, which
    can be re-specialized to integer values -- i.e., to the binomial
    distribution.
    
    Actually, solving the problem in terms of the beta distribution
    results in a solution expressible in terms of the more familiar Fisher
    F distribution (which is another special case of the beta
    distribution).  Tables of the F distribution can be found in most
    elementary statistics books.
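    
    A quick numerical spot-check of that relationship using scipy (the
    parameter values are arbitrary):
    
        from scipy.stats import beta, f

        # If X ~ F(d1, d2) then d1*X / (d1*X + d2) ~ Beta(d1/2, d2/2).
        d1, d2, x = 6.0, 10.0, 2.5
        lhs = f.cdf(x, d1, d2)
        rhs = beta.cdf(d1 * x / (d1 * x + d2), d1 / 2, d2 / 2)
        print(lhs, rhs)               # the two probabilities agree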
    
    					Topher
1071.11. "If I follow you..." by CADSYS::COOPER (Topher Cooper) Mon May 08 1989 16:01
RE: .6
    
>>    First off, a caveat.  The method you give is an approximation of an
>>    approximation.
>
>		...
>
>    The calculation is actually amazingly simple, since this is one
>    of those cases where a conjugate prior distribution has been found.
    
    OK, I was wrong.  It wasn't an approximation of an approximation of
    a classical confidence interval.  It's an approximation to the exact
    solution to an approximation of a Bayesian confidence interval (a.k.a.
    credibility interval or credible interval). ;-)  (Unless I've
    misinterpreted what you were saying, of course.)
    
    Use of a conjugate prior distribution is only justified in general when
    the amount of evidence to be considered is large enough so that the
    effect of the specific choice of initial prior distribution effectively
    vanishes.  In the case of binomial sampling, the amount of evidence
    is N, the number of samples.
    
For the non-Bayesians in the crowd:
    
    In some cases, for a particular Bayesian calculation, you can find a
    family of distributions with the property that if the *prior*
    distribution is a member of that family then the *final* distribution
    is also.  Such a family is called a conjugate distribution family for
    the distribution on which the calculation is based (actually there is
    a bit more to it than that, but that's the main part).
    
    Use of a member of a conjugate distribution family as the prior
    distribution results in clean analytic solutions to Bayesian inference
    problems.  In particular, sometimes there is a member of the family
    which you can extrapolate down to representing "no prior information",
    and this makes for particularly clean solutions (note, however, that
    in general, assuming this specific member of the family involves
    assuming that one value is *a priori* more likely than another without
    any justification at all, and that a different way of accounting for
    the evidence would call for a different initial distribution which
    would give contradictory likelihoods for those same values).
    
    Sometimes the family is rich enough so that you can frequently pick
    a member which closely approximates your "real" prior expectations,
    and sometimes you have enough evidence so that it doesn't matter very
    much what your "real" prior was.  But unless you can justify a family
    member as the exact prior distribution, use of the conjugate families
    results only in an approximation.
    
    In general, the only distribution justified non-subjectively is an
    appropriate uniform one -- that being justified by the "principle of
    prior ignorance."
    
Further comments:
    
    In actuality there is no effective difference in this situation
    between the values found for the classical confidence interval, the
    Bayesian confidence interval for uniform prior distribution, or for
    the so called "fiducial confidence interval" (a sort of classical
    imitation of Bayesian statistics).  Under reasonable assumptions they
    give the same results (though interpreted differently).
    
    My caveat stands: it's a handy approximation for hand calculation when
    you have both enough "yes" and "no" answers, but better formulas exist
    and, if you are writing a program, should be used.
    
    For those who are interested the exact solution for the classical
    symmetric confidence interval (which is, I'm reasonably certain, the
    same as the uniform prior Bayesian confidence interval) is
    
    		      x
    	p1 = --------------------
    	      x + (n - x + 1)*F1
    
    		 (x + 1)*F2
    	p2 = --------------------
    	      n - x + (x + 1)*F2
    
    where
    
    	x is the number of "yes" answers, n is the total number of answers
    and
    
    	F1 = F[df1 = 2*(n-x+1), df2 = 2*x, C]
    	F2 = F[df1 = 2*(x+1), df2 = 2*(n-x), C]
    
    and C is the confidence factor desired (e.g., .95 for a 95% confidence
    interval), and F[df1, df2, C] is the value of the F distribution with
    the given parameters (df1 and df2) with the given cumulative
    probability.
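    
    A sketch of these formulas using scipy, taking the F quantiles at
    the symmetric point 1 - (1-C)/2 = .975 for C = .95, with the
    equivalent beta-quantile form as a cross-check (n = 100 assumed):
    
        from scipy.stats import beta, f

        n, x = 100, 61                   # assumed totals
        C = 0.95
        q = 1 - (1 - C) / 2              # .975

        F1 = f.ppf(q, 2 * (n - x + 1), 2 * x)
        F2 = f.ppf(q, 2 * (x + 1), 2 * (n - x))
        p1 = x / (x + (n - x + 1) * F1)
        p2 = (x + 1) * F2 / (n - x + (x + 1) * F2)

        # Same interval from beta quantiles directly.
        b1 = beta.ppf(1 - q, x, n - x + 1)
        b2 = beta.ppf(q, x + 1, n - x)

        print((p1, p2), (b1, b2))        # both roughly (0.51, 0.71)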
    
    					Topher
1071.12. "more on the beta" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Mon May 08 1989 18:32
    My beta distribution is rather different from that described in
    .10, and my procedure is rather different as well.
    
    My beta is a parametric family of functions with variable p and
    parameters n and r:
    
    
    	f(p; n,r) = 	(n-1)! * p^(r-1) * (1-p)^(n-r-1)
    			--------------------------------
    			(r-1)! * (n-r-1)!
    
    where 0<=p<=1 and the function is zero elsewhere.
    
re: < Note 1071.11 by CADSYS::COOPER "Topher Cooper" >

>    In particular, sometimes there is a member of the family
>    which you can extrapolate down to representing "no prior information",
>    and this makes for particularly clean solutions 
    
    And this is the case for the beta above, where r=1, n=2 gives a
    uniform function of p, or uniform prior distribution.

>    But unless you can justify a family
>    member as the exact prior distribution, use of the conjugate families
>    results only in an approximation.
    
    Correct, this is another approximation, but for the n>30 case I
    stipulated, it does not matter much.
    
>    In actuality there is no effective difference in this situation
>    between the values found for the classical confidence interval, the
>    Bayesian confidence interval for uniform prior distribution, or for
>    the so called "fiducial confidence interval" (a sort of classical
>    imitation of Bayesian statistics).  Under reasonable assumptions they
>    give the same results (though interpreted differently).
    
    The differences become visible when the experimental outcome (survey
    results in this case) conveys information which is not large relative
    to the prior information.  Most folks just design their experiments
    so that won't be true, but sometimes you don't have that choice.
    
>    My caveat stands: it's a handy approximation for hand calculation when
>    you have both enough "yes" and "no" answers, but better formulas exist
>    and, if you are writing a program, should be used.
    
>    For those who are interested the exact solution for the classical
>    symmetric confidence interval (which is, I'm reasonably certain, the
>    same as the uniform prior Bayesian confidence interval) is
    
>    		      x
>    	p1 = --------------------
>    	      x + (n - x + 1)*F1
>    
>    		 (x + 1)*F2
>    	p2 = --------------------
>    	      n - x + (x + 1)*F2
>    
>    where
>    
>    	x is the number of "yes" answers, n is the total number of answers
>    and
>    
>    	F1 = F[df1 = 2*(n-x+1), df2 = 2*x, C]
>    	F2 = F[df1 = 2*(x+1), df2 = 2*(n-x), C]
>    
>    and C is the confidence factor desired (e.g., .95 for a 95% confidence
>    interval), and F[df1, df2, C] is the value of the F distribution with
>    the given parameters (df1 and df2) with the given cumulative
>    probability.
    
    Personally, I regard the above as an approximation to the beta
    distribution, satisfactory for large enough x, n, and n-x.  If I
    wanted an exact interval estimate, I would use the beta distribution
    and give the user the opportunity to select a prior distribution
    which reflects prior knowledge, if any, or no prior knowledge, if
    none.
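    
    A minimal sketch of that exact procedure using scipy, in the
    parametrization above (uniform prior r=1, n=2; the 61-of-100
    survey figures are assumed):
    
        from scipy.stats import beta

        # Prior: r = 1, n = 2, i.e. the uniform "no prior knowledge"
        # member of the family.
        r, n = 1, 2

        # Conjugate update: add the yes-count to r, the sample size to n.
        yes, total = 61, 100
        r, n = r + yes, n + total        # posterior: r = 62, n = 102

        # Exact 95% credible interval, no normal approximation.
        print(beta.ppf([0.025, 0.975], r, n - r))   # roughly [0.51, 0.70]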
    
    If I get a chance, I may check this equivalence.
    
    I'll still stick with the answer in .2:  for a real business situation,
    which .0 sounds like, all these statistical refinements are beside
    the point.  Furthermore, where real accuracy is required, decision 
    theory is usually more relevant than interval estimates.