
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

1071.0. "Std. Deviation of Boolean data?" by BESS::NAGARAJAN () Mon May 01 1989 13:02

             
    This is a statistics problem.
    
    This may really sound like a fundamental/stupid question .. 
    
    I have a problem wherein I have data from a survey. It
    is a YES or NO kind of data. Specifically, "Do you have problems
    printing?" 61% of the people surveyed said YES, the rest said NO.
    I am planning to do a hypothesis test on this data to determine
    if indeed there is a problem, using a one-tail test.
    
    My question is, how do you calculate the standard deviation of such
    data? I recall vaguely that one can use a fixed std. deviation
    (forget how much it is..) when the std deviation is not given or is
    unknown. Does anyone remember what this figure is?
    
    Better yet, how can I find the std deviation of the above data?
                          
    Thanx for any pointers!!
    
    -gopal-
    
    
    
1071.1. "There are only stupid answers..." by CADSYS::COOPER (Topher Cooper) Mon May 01 1989 13:35
    It's fundamental and elementary but not at all stupid.
    
    Calculate the following:
    
    		 O - pN
    	    Z = ---------
    		SQRT(Npq)
    
    If your sample is large enough, then this will follow the standard
    normal distribution.  The variance is Npq and the standard deviation is
    SQRT(Npq).  Here, O is the number of yes answers (or no answers if you
    prefer), N is the number of replies, p is the proportion of answers you
    would expect by chance to be yes (or no), and q is (1-p).  "Large
    enough" means that both pN and qN are greater than 5 (or better yet
    10).
    
    If the Z-score calculated above is greater than 1.65, then you can
    reject at the 95% confidence level the possibility of chance causing
    the observed deviation.
    
    Be careful with one-tailed tests -- be sure you understand what it
    means to your problem if O - pN is negative (I assumed above that you
    were looking for a positive deviation from chance levels; if you were
    looking for one smaller than chance, then change the numerator to
    (pN - O) and proceed as before).
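    
    For concreteness, a minimal sketch of this calculation in Python
    (the sample size N = 100 and chance level p = 0.5 are illustrative
    assumptions, not figures from the survey):
    
        import math

        N = 100     # number of replies (assumed for illustration)
        O = 61      # number of yes answers out of N
        p = 0.5     # proportion of yes answers expected by chance
        q = 1.0 - p

        # "Large enough" check: both pN and qN should exceed 5.
        assert p * N > 5 and q * N > 5

        # Z-score of the observed count against the chance expectation.
        Z = (O - p * N) / math.sqrt(N * p * q)

        # One-tailed test: reject chance at the 95% level if Z > 1.65.
        print(Z, Z > 1.65)   # 2.2, True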
    
    						Topher
1071.2. "a slightly different approach" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Wed May 03 1989 19:02
re: < Note 1071.0 by BESS::NAGARAJAN >
             
>    I have a problem wherein I have data from a survey. It
>    is a YES or NO kind of data. Specifically, "Do you have problems
>    printing?" 61% of the people surveyed said YES, the rest said NO.
>    I am planning to do a hypothesis test on this data to determine
>    if indeed there is a problem, using a one-tail test.
    
    My first reaction was to talk about Bayesian analysis and decision
    theory, and I'll get back to that in a minute, but first
    
    	YES!!! we have a problem
    
    Assuming any reasonable form of your survey question, if 61% of
    the users say they have problems printing, then we have a problem.
    Time to close the stat book and solve the problem with printing.
    
    But let's assume that you need cooperation from others and you expect
    those others to raise a few statistical questions, and you want
    to be prepared.
    
    I suggest that the best approach to this kind of situation is to 
    ignore hypothesis testing and follow a path through point estimate, 
    interval estimate and decision theory, stopping wherever you feel
    comfortable.  This is not a disagreement with .1, but a refocusing
    of the problem.

    I will assume, as .1 does, that there is some population of users
    out there, some fraction of whom have problems printing, and that
    you are interested in what you have learned about this fraction.
    I also assume that you have done all the right things to get a valid
    sample and asked the right questions in the right way.

    
    POINT ESTIMATE
    
    The best estimate of percent of the population having problems 
    printing is the sample percentage or 61%.
    
    
    INTERVAL ESTIMATE
    
    Now we need the standard deviation
    
>    My question is, how do you calculate the standard deviation of such
>    data? I recall vaguely that one can use a fixed std. deviation
>    (forget how much it is..) when the std deviation is not given or is
>    unknown. Does anyone remember what this figure is?
    
>    Better yet, how can I find the std deviation of the above data?
    
    The standard deviation is given by
    
    	SQRT( p * q / n )
    
    where
    
    	p = fraction answering yes
    	q = 1-p = fraction answering no
    	n = sample size
    
    provided n is reasonably large, usually taken to be n>30 or so.
    Note the similarity to .1.
    
    Assuming that there were 100 people in your sample
    
    	SD = SQRT ( .61 * .39 / 100 ) = 0.049 or 4.9%
    
    You can combine this with a table of cumulative standard normal
    probabilities to define the confidence interval.  For example, the
    95% confidence interval is at 2 times SD, or in this case
    
    	[ 56.1% , 65.9% ]
    
    or what may be more relevant to your situation, there is less than
    a 0.3% chance that less than 56% of the population would give a
    yes answer.  In other words, you can be virtually sure that the
    majority of users have problems printing.
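    
    A sketch of this arithmetic in Python (the sample size of 100 is
    the same assumption as above):
    
        import math

        p = 0.61                      # fraction answering yes
        q = 1.0 - p                   # fraction answering no
        n = 100                       # assumed sample size

        # Standard deviation of the sample *proportion* (large n).
        SD = math.sqrt(p * q / n)
        print(SD)                     # about 0.049, i.e. 4.9%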
    
    I think this is probably enough to convince people to take action,
    but if not, read on.
    
    
    DECISION THEORY
    
    I can't be very explicit in this area without more information
    about your problem, so this is going to get kind of sketchy.
    
    The simplest approach is simply to ask how much customer
    dissatisfaction we are willing to tolerate.  If the answer is none,
    then we close the stat book and solve the printing problem.  If
    the answer is some positive number then we decide whether it is
    above, inside or below our confidence interval and take action
    accordingly.  Note that hypothesis testing may become relevant here,
    if the tolerable value is close to the end point of the interval.
    
    If the answer is that it depends on how much it will cost us to
    leave it alone and how much it will cost us to fix, then we can
    apply decision theory.  We figure out the cost of trying to fix
    a non-existent problem and the cost of not fixing a real problem.
    Then we characterize the two situations (non-existent problem and
    real problem) in terms of the expected outcome of a survey.  We
    apply Bayesian analysis to determine the probability of a real problem,
    and decision theory to compute the expected cost of fixing vs doing
    nothing.  The arithmetic, when we are done, looks a lot like the
    interval analysis above.
    
    Hmm.  That last paragraph was brief to the point of inscrutability,
    but if you need to know more, you can look it up or ask in another
    reply.
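
    As a toy sketch of the expected-cost comparison described above
    (every number here is a made-up placeholder, not something from .0):

        # Probability the problem is real, e.g. from the interval or
        # Bayesian analysis above (hypothetical value).
        p_real = 0.95

        cost_fix    = 50.0    # cost of attempting a fix (hypothetical)
        cost_ignore = 200.0   # cost of leaving a real problem alone

        # Expected cost of each action: fixing is paid regardless;
        # ignoring costs us only if the problem turns out to be real.
        E_fix    = cost_fix
        E_ignore = p_real * cost_ignore

        print("fix" if E_fix < E_ignore else "do nothing")   # "fix"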
1071.3. by AITG::DERAMO (Daniel V. {AITG,ZFC}:: D'Eramo) Wed May 03 1989 19:37
	re .2

>>    probabilities to define the confidence interval.  For example, the
>>    95% confidence interval is at 2 times SD, or in this case
>>    
>>    	[ 56.1% , 65.9% ]
>>    
>>    or what may be more relevant to your situation, there is less than
>>    a 0.3% chance that less than 56% of the population would give a

	You meant "less than a 3% chance" there -- half of what's left
	over when you aren't in the 95% confidence interval (half of 5%).

	Dan
1071.4. "correction accepted" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Thu May 04 1989 14:35
re: < Note 1071.3 by AITG::DERAMO "Daniel V. {AITG,ZFC}:: D'Eramo" >

>	You meant "less than a 3% chance" there 
    
    Correct.  Arithmetic is not my strong point.
1071.5. "Building confidence." by CADSYS::COOPER (Topher Cooper) Thu May 04 1989 15:29
RE: 1071.2 (Wally Neilsen-Steinhardt):

    Using parameter estimates and decision theory rather than hypothesis
    testing -- 100% correct, I should have listened to the underlying
    question rather than the surface one.

    Interval estimate -- sorry, not quite correct.  Comments and corrections:

    First off, a caveat.  The method you give is an approximation of an
    approximation.  Technically it is the confidence interval for the
    sample proportion when the underlying population proportion is known.
    That is, if we know that the real proportion is 61% then we can be
    95% sure that the proportion found in any particular sample of N will
    lie between the bounds you calculate.  What we really want is the
    confidence interval for the unknown population proportion given that we
    know a single sample proportion is 61%.  The latter is a much more
    complex calculation even after we have assumed a normal approximation.
    However, when not only N, but y*N and (1-y)*N (with y being the
    proportion of yes answers in a particular "experiment") are large
    enough, then the former confidence interval is a good approximation
    for the latter, swapping y and p around.  I would say that both y*N and
    (1-y)*N should be greater than 5, or, better yet, greater than 10.

    Second: a clarification for some who might need it.  The SD I gave in
    .1, SQRT(pqN), is the standard deviation in the *count* of the number
    of yes answers, while the SD you gave, SQRT(pq/N) is the standard
    deviation in the *proportion* of the number of yes answers.  It is
    gotten by dividing the count SD by N.

    Third: An actual error.  You goofed on the Z value.

    What we want here is a value for Z, so that the cumulative probability
    of a standard normal variate being between -Z and +Z is .95.  Most
    tables of the cumulative standard normal distribution give the
    probability that a particular sample will be less than a given Z.
    Some manipulation based on the symmetry of the normal distribution gives
    us that for a confidence interval of size C (e.g., C=.95), we look
    for the Z value with a probability of 1 - (1-C)/2.  For C=.95, we
    want the Z value with cumulative probability of .975, which is just
    about 1.96, or approximately 2 (as you said).  But that is 2 standard
    deviations below and two above for a total "width" of 4 standard
    deviations, rather than 1 below and 1 above for a width of 2.

    The confidence interval should be, therefore:

	    [51.2%, 70.8%]
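
    A small sketch of the corrected calculation in Python (again
    assuming a sample of 100, as in .2):

        import math

        p, n = 0.61, 100                  # sample proportion, sample size
        SD = math.sqrt(p * (1 - p) / n)   # SD of the proportion, per .2

        # The .975 point of the standard normal is about 1.96; the
        # rounder value 2 reproduces the interval above.
        Z = 2.0
        print(p - Z * SD, p + Z * SD)     # about (0.512, 0.708)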

    What may have confused you is the custom in physics of reporting values
    with +or- "error bounds" of one standard deviation.  If the errors
    follow a normal distribution this corresponds to a confidence interval
    of the totally arbitrary size of .68, and when they do not it is
    generally completely meaningless.  Since there is usually no indication
    of the distributions of the errors given (and contrary to belief 60
    years or so ago when the practice grew up, normal is *not* normal for
    errors) use of these quantities is a demonstration of the irrationality
    of physicists (but wait, we're having *that* discussion in another
    conference).

					Topher
1071.6. "Bayesian intervals on proportions" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Fri May 05 1989 12:56
re: < Note 1071.5 by CADSYS::COOPER "Topher Cooper" >

>    First off, a caveat.  The method you give is an approximation of an
>    approximation.  
    
    What I did is actually much more correct than it looks, because
    I did not burden my initial reply with a lot of details.
    
    What I really did was compute the interval for the unknown population
    proportion.  As a good Bayesian, I have no trouble doing this, although
    any frequentist readers might find it hard to swallow.  As a good
    Bayesian, I am supposed to call this a credible interval, but that's
    too pedantic for me.
    
    The calculation is actually amazingly simple, since this is one
    of those cases where a conjugate prior distribution has been found.
    In this case it is the beta distribution, which you can look up,
    since if I try to type in the definition, I'll just forget an r-1
    somewhere.  For small n, r, or n-r, I can use the exact form of
    the beta distribution, but if they are all large, I can use a normal
    approximation.  Amazingly, the normal approximation to the beta
    is quite close to the normal approximation to the binomial, so it
    looks like I have just done the 'approximation to an approximation'
    described above.  But really, I have solved a problem exactly in
    terms of a beta distribution, and then approximated that with a
    normal distribution.
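    
    A sketch of that comparison using scipy (a uniform prior and the
    61-of-100 figures are assumed):
    
        from scipy.stats import beta, norm

        n, x = 100, 61                  # assumed sample size and yes-count

        # Uniform prior plus binomial data gives a Beta(x+1, n-x+1)
        # posterior for the population proportion.
        a, b = x + 1, n - x + 1
        exact = beta.ppf([0.025, 0.975], a, b)

        # Normal approximation to the same interval.
        p_hat = x / n
        sd = (p_hat * (1 - p_hat) / n) ** 0.5
        approx = norm.ppf([0.025, 0.975], loc=p_hat, scale=sd)

        print(exact, approx)            # agree to about a percentage point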

>    Second: a clarification for some who might need it.  The SD I gave in
>    .1, SQRT(pqN), is the standard deviation in the *count* of the number
>    of yes answers, while the SD you gave, SQRT(pq/N) is the standard
>    deviation in the *proportion* of the number of yes answers.  It is
>    gotten by dividing the count SD by N.
    
    quite correct.

>    Third: An actual error.  You goofed on the Z value.

    Alas, this is correct too.  .5 has an interesting hypothesis on
    how I could have made this error in a smart way, but I've made that
    smart error too often, so I checked my table carefully.  But then
    I made the dumb error: I just forgot to multiply the SD by 2 in
    my calculator.  Those who wish to avoid the smart error should read
    .5 carefully.  To those who wish to avoid the dumb error, I can
    offer no advice.
1071.7. by AITG::DERAMO (Daniel V. {AITG,ZFC}:: D'Eramo) Fri May 05 1989 14:14
>>	>    First off, a caveat.  The method you give is an approximation of an
>>	>    approximation.  

	This was talking about taking the standard deviation,
	sqrt(NPQ), where P is the probability of a yes answer,
	Q = 1 - P, and N is the number of people asked.  The
	problem was that P was not known.  So instead of the
	unknowns P and Q = 1 - P, he used p and q = 1 - p where
	p = number_of_yes_answers / N.  So p is a good estimate
	for P and q is a good estimate for Q.  The caveat was
	about having done this -- you didn't use the real standard
	deviation.

	An alternative is to notice that the standard deviation
	is largest when PQ = P(1 - P) is largest, which happens
	when P = Q = 0.5.  In that case SD <= sqrt(N / 4) =
	sqrt(N) / 2.  This method "plays it safe" by using the
	largest possible standard deviation of a binomial
	distribution.  The error bounds determined this way may
	end up being a little larger than necessary.  But that's
	okay if it is more important that the bounds include
	the (unknown) value of P than that the bounds be small.
	(The bounds being "P is xx% likely to be in the range
	p +/- yy standard deviations.")

	Dan
1071.8. "asymptotic approach to clarity" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Mon May 08 1989 13:04
re: < Note 1071.7 by AITG::DERAMO "Daniel V. {AITG,ZFC}:: D'Eramo" >

    Dan,
    
    I think we are talking at cross-purposes here.  Can you please reread
    .6?  I thought that I answered this comment there.  If not, I'll
    try again.  Let me know what you have comments about, questions
    on, or disagree with.
    
    Wally
1071.9. by AITG::DERAMO (Daniel V. {AITG,ZFC}:: D'Eramo) Mon May 08 1989 13:46
     Wally,
     
     .6 didn't say what the caveat was about so I described it,
     not remembering that it was all spelled out in .5.  My
     alternative answer to it was directed to those who don't
     know what a beta distribution is or why it applies -- just
     use the known upper bound on the standard deviation instead.
     
>>	Let me know what you have ... questions on ....
     
     What's a beta distribution and why does it apply here? :-)
     
     Dan
1071.10. "You beta your life!" by CADSYS::COOPER (Topher Cooper) Mon May 08 1989 14:24
RE: .9 (Dan)
    
    Basically, the beta distribution is a continuous generalization of the
    binomial distribution.  Where the binomial distribution uses factorial,
    the beta distribution uses the gamma function.  Since beta is
    continuous it is integrable and differentiable, unlike the binomial
    distribution.  This allows various problems to be solved for it, which
    can be re-specialized to integer values -- i.e., to the binomial
    distribution.
    
    Actually, solving the problem in terms of the beta distribution
    results in a solution expressible in terms of the more familiar Fisher
    F distribution (which is another special case of the beta
    distribution).  Tables of the F distribution can be found in most
    elementary statistics books.
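    
    A quick numerical spot-check of that relationship using scipy (the
    parameter values are arbitrary):
    
        from scipy.stats import beta, f

        # If X ~ F(d1, d2) then d1*X / (d1*X + d2) ~ Beta(d1/2, d2/2).
        d1, d2, x = 6.0, 10.0, 2.5
        lhs = f.cdf(x, d1, d2)
        rhs = beta.cdf(d1 * x / (d1 * x + d2), d1 / 2, d2 / 2)
        print(lhs, rhs)               # the two probabilities agree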
    
    					Topher
1071.11. "If I follow you..." by CADSYS::COOPER (Topher Cooper) Mon May 08 1989 16:01
RE: .6
    
>>    First off, a caveat.  The method you give is an approximation of an
>>    approximation.
>
>		...
>
>    The calculation is actually amazingly simple, since this is one
>    of those cases where a conjugate prior distribution has been found.
    
    OK, I was wrong.  It wasn't an approximation of an approximation of
    a classical confidence interval.  It's an approximation to the exact
    solution to an approximation of a Bayesian confidence interval (a.k.a.
    credibility interval or credible interval). ;-)  (Unless I've
    misinterpreted what you were saying, of course.)
    
    Use of a conjugate prior distribution is only justified in general when
    the amount of evidence to be considered is large enough so that the
    effect of the specific choice of initial prior distribution effectively
    vanishes.  In the case of binomial sampling, the amount of evidence
    is N, the number of samples.
    
For the non-Bayesians in the crowd:
    
    In some cases, for a particular Bayesian calculation, you can find a
    family of distributions with the property that if the *prior*
    distribution is a member of that family then the *final* distribution
    is also.  Such a family is called a conjugate distribution family for
    the distribution on which the calculation is based (actually there is
    a bit more to it than that, but that's the main part).
    
    Use of a member of a conjugate distribution family as the prior
    distribution results in clean analytic solutions to Bayesian inference
    problems.  In particular, sometimes there is a member of the family
    which you can extrapolate down to representing "no prior information",
    and this makes for particularly clean solutions (note, however, that
    in general, assuming this specific member of the family involves
    assuming that one value is *a priori* more likely than another without
    any justification at all, and that a different way of accounting for
    the evidence would call for a different initial distribution which
    would give contradictory likelihoods for those same values).
    
    Sometimes the family is rich enough so that you can frequently pick
    a member which closely approximates your "real" prior expectations,
    and sometimes you have enough evidence so that it doesn't matter very
    much what your "real" prior was.  But unless you can justify a family
    member as the exact prior distribution, use of the conjugate families
    results only in an approximation.
    
    In general, the only distribution justified non-subjectively is an
    appropriate uniform one -- that being justified by the "principle of
    prior ignorance."
    
Further comments:
    
    In actuality there is no effective difference in this situation
    between the values found for the classical confidence interval, the
    Bayesian confidence interval for uniform prior distribution, or for
    the so called "fiducial confidence interval" (a sort of classical
    imitation of Bayesian statistics).  Under reasonable assumptions they
    give the same results (though interpreted differently).
    
    My caveat stands: it's a handy approximation for hand calculation when
    you have both enough "yes" and "no" answers, but better formulas exist
    and, if you are writing a program, should be used.
    
    For those who are interested the exact solution for the classical
    symmetric confidence interval (which is, I'm reasonably certain, the
    same as the uniform prior Bayesian confidence interval) is
    
    		      x
    	p1 = --------------------
    	      x + (n - x + 1)*F1
    
    		 (x + 1)*F2
    	p2 = --------------------
    	      n - x + (x + 1)*F2
    
    where
    
    	x is the number of "yes" answers, n is the total number of answers
    and
    
    	F1 = F[df1 = 2*(n-x+1), df2 = 2*x, C]
    	F2 = F[df1 = 2*(x+1), df2 = 2*(n-x), C]
    
    and C is the confidence factor desired (e.g., .95 for a 95% confidence
    interval), and F[df1, df2, C] is the value of the F distribution with
    the given parameters (df1 and df2) with the given cumulative
    probability.
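    
    A sketch of these formulas using scipy, taking the F quantiles at
    the symmetric point 1 - (1-C)/2 = .975 for C = .95, with the
    equivalent beta-quantile form as a cross-check (n = 100 assumed):
    
        from scipy.stats import beta, f

        n, x = 100, 61                   # assumed totals
        C = 0.95
        q = 1 - (1 - C) / 2              # .975

        F1 = f.ppf(q, 2 * (n - x + 1), 2 * x)
        F2 = f.ppf(q, 2 * (x + 1), 2 * (n - x))
        p1 = x / (x + (n - x + 1) * F1)
        p2 = (x + 1) * F2 / (n - x + (x + 1) * F2)

        # Same interval from beta quantiles directly.
        b1 = beta.ppf(1 - q, x, n - x + 1)
        b2 = beta.ppf(q, x + 1, n - x)

        print((p1, p2), (b1, b2))        # both roughly (0.51, 0.71)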
    
    					Topher
1071.12. "more on the beta" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Mon May 08 1989 18:32
    My beta distribution is rather different from that described in
    .10, and my procedure is rather different as well.
    
    My beta is a parametric family of functions with variable p and
    parameters n and r:
    
    
    	f(p; n,r) = 	(n-1)! * p^(r-1) * (1-p)^(n-r-1)
    			--------------------------------
    			(r-1)! * (n-r-1)!
    
    where 0<=p<=1 and the function is zero elsewhere.
    
re: < Note 1071.11 by CADSYS::COOPER "Topher Cooper" >

>    In particular, sometimes there is a member of the family
>    which you can extrapolate down to representing "no prior information",
>    and this makes for particularly clean solutions 
    
    And this is the case for the beta above, where r=1, n=2 gives a
    uniform function of p, or uniform prior distribution.

>    But unless you can justify a family
>    member as the exact prior distribution, use of the conjugate families
>    results only in an approximation.
    
    Correct, this is another approximation, but for the n>30 case I
    stipulated, it does not matter much.
    
>    In actuality there is no effective difference in this situation
>    between the values found for the classical confidence interval, the
>    Bayesian confidence interval for uniform prior distribution, or for
>    the so called "fiducial confidence interval" (a sort of classical
>    imitation of Bayesian statistics).  Under reasonable assumptions they
>    give the same results (though interpreted differently).
    
    The differences become visible when the experimental outcome (survey
    results in this case) conveys information which is not large relative
    to the prior information.  Most folks just design their experiments
    so that won't be true, but sometimes you don't have that choice.
    
>    My caveat stands: it's a handy approximation for hand calculation when
>    you have both enough "yes" and "no" answers, but better formulas exist
>    and, if you are writing a program, should be used.
    
>    For those who are interested the exact solution for the classical
>    symmetric confidence interval (which is, I'm reasonably certain, the
>    same as the uniform prior Bayesian confidence interval) is
    
>    		      x
>    	p1 = --------------------
>    	      x + (n - x + 1)*F1
>    
>    		 (x + 1)*F2
>    	p2 = --------------------
>    	      n - x + (x + 1)*F2
>    
>    where
>    
>    	x is the number of "yes" answers, n is the total number of answers
>    and
>    
>    	F1 = F[df1 = 2*(n-x+1), df2 = 2*x, C]
>    	F2 = F[df1 = 2*(x+1), df2 = 2*(n-x), C]
>    
>    and C is the confidence factor desired (e.g., .95 for a 95% confidence
>    interval), and F[df1, df2, C] is the value of the F distribution with
>    the given parameters (df1 and df2) with the given cumulative
>    probability.
    
    Personally, I regard the above as an approximation to the beta
    distribution, satisfactory for large enough x, n, and n-x.  If I
    wanted an exact interval estimate, I would use the beta distribution
    and give the user the opportunity to select a prior distribution
    which reflects prior knowledge, if any, or no prior knowledge, if
    none.
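    
    A minimal sketch of that exact procedure using scipy, in the
    parametrization above (uniform prior r=1, n=2; the 61-of-100
    survey figures are assumed):
    
        from scipy.stats import beta

        # Prior: r = 1, n = 2, i.e. the uniform "no prior knowledge"
        # member of the family.
        r, n = 1, 2

        # Conjugate update: add the yes-count to r, the sample size to n.
        yes, total = 61, 100
        r, n = r + yes, n + total        # posterior: r = 62, n = 102

        # Exact 95% credible interval, no normal approximation.
        print(beta.ppf([0.025, 0.975], r, n - r))   # roughly [0.51, 0.70]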
    
    If I get a chance, I may check this equivalence.
    
    I'll still stick with the answer in .2:  for a real business situation,
    which .0 sounds like, all these statistical refinements are beside
    the point.  Furthermore, where real accuracy is required, decision 
    theory is usually more relevant than interval estimates.