T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---
1071.1 | There are only stupid answers... | CADSYS::COOPER | Topher Cooper | Mon May 01 1989 13:35 | 27 |
| It's fundamental and elementary but not at all stupid.
Calculate the following:
         O - pN
    Z = ---------
        SQRT(Npq)
If your sample is large enough, then this will follow the standard
normal distribution. The variance is Npq and the standard deviation is
SQRT(Npq). Here, O is the number of yes answers (or no answers if you
prefer), N is the number of replies, p is the proportion of answers you
would expect by chance to be yes (or no), and q is (1-p). "Large
enough" means that both pN and qN are greater than 5 (or better yet
10).
If the Z-score calculated above is greater than 1.65, then you can
reject, at the 95% confidence level, the hypothesis that chance alone
caused the observed deviation.
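As an illustration, a minimal sketch of that calculation in Python (the
100-reply sample and the 61 yes answers are hypothetical, chosen only to
show the mechanics):

from math import sqrt

# Hypothetical survey: 100 replies, 61 "yes", chance level p = 0.5.
O, N, p = 61, 100, 0.5
q = 1 - p

# Large-sample check: both p*N and q*N should exceed 5 (better yet, 10).
assert p * N > 5 and q * N > 5

Z = (O - p * N) / sqrt(N * p * q)
print(Z)             # 2.2
print(Z > 1.65)      # True: reject chance at the 95% level (one-tailed)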
Be careful with one-tailed tests -- be sure you understand what it
means for your problem if O - pN is negative. (I assumed above that you
were looking for a positive deviation from chance levels; if you were
looking for one smaller than chance, change the numerator to
(pN - O) and proceed as before.)
Topher
|
1071.2 | a slightly different approach | PULSAR::WALLY | Wally Neilsen-Steinhardt | Wed May 03 1989 19:02 | 110 |
| re: < Note 1071.0 by BESS::NAGARAJAN >
> I have a problem wherein I have data from a survey. It
> is a YES or NO kind of data. Specifically, Do you have problems
> printing? 61% of the people surveyed said YES, the rest said NO.
> I am planning to do a hypothesis testing on this data to determine
> if indeed there is a problem, using the one-tail test.
My first reaction was to talk about Bayesian analysis and decision
theory, and I'll get back to that in a minute, but first
YES!!! we have a problem
Assuming any reasonable form of your survey question, if 61% of
the users say they have problems printing, then we have a problem.
Time to close the stat book and solve the problem with printing.
But let's assume that you need cooperation from others and you expect
those others to raise a few statistical questions, and you want
to be prepared.
I suggest that the best approach to this kind of situation is to
ignore hypothesis testing and follow a path through point estimate,
interval estimate and decision theory, stopping wherever you feel
comfortable. This is not a disagreement with .1, but a refocusing
of the problem.
I will assume, as .1 does, that there is some population of users
out there, some fraction of whom have problems printing, and that
you are interested in what you have learned about this fraction.
I also assume that you have done all the right things to get a valid
sample and asked the right questions in the right way.
POINT ESTIMATE
The best estimate of percent of the population having problems
printing is the sample percentage or 61%.
INTERVAL ESTIMATE
Now we need the standard deviation.
> My question is, how do you calculate the standard deviation of such
> a data?? I recall vaguely that one can use a fixed std. deviation
> (forget how much it is..) when std deviation is not given or is
> unknown. Does any one remember what this figure is?
> Better yet, how can I find the std deviation of the above data.
The standard deviation is given by
SQRT( p * q / n )
where
p = fraction answering yes
q = 1-p = fraction answering no
n = sample size
provided n is reasonably large, usually taken to be n>30 or so.
Note the similarity to .1.
Assuming that there were 100 people in your sample,
SD = SQRT ( .61 * .39 / 100 ) = 0.049 or 4.9%
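As a quick check of that arithmetic in Python:

from math import sqrt

p, n = 0.61, 100      # sample proportion and (assumed) sample size
sd = sqrt(p * (1 - p) / n)
print(sd)             # 0.0488..., i.e. about 4.9%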
You can combine this with a table of cumulative standard normal
probabilities to define the confidence interval. For example, the
95% confidence interval is at 2 times SD, or in this case
[ 56.1% , 65.9% ]
or what may be more relevant to your situation, there is less than
a 0.3% chance that less than 56% of the population would give a
yes answer. In other words, you can be virtually sure that the
majority of users have problems printing.
I think this is probably enough to convince people to take action,
but if not, read on.
DECISION THEORY
I can't be very explicit in this area without more information
about your problem, so this is going to get kind of sketchy.
The simplest approach is simply to ask how much customer
dissatisfaction we are willing to tolerate. If the answer is none,
then we close the stat book and solve the printing problem. If
the answer is some positive number then we decide whether it is
above, inside or below our confidence interval and take action
accordingly. Note that hypothesis testing may become relevant here,
if the tolerable value is close to the end point of the interval.
If the answer is that it depends on how much it will cost us to
leave it alone and how much it will cost us to fix, then we can
apply decision theory. We figure out the cost of trying to fix
a non-existent problem and the cost of not fixing a real problem.
Then we characterize the two situations (non-existent problem and
real problem) in terms of the expected outcome of a survey. We
apply Bayesian analysis to determine the probability of a real problem,
and decision theory to compute the expected cost of fixing vs doing
nothing. The arithmetic, when we are done, looks a lot like the
interval analysis above.
Hmm. That last paragraph was brief to the point of inscrutability,
but if you need to know more, you can look it up or ask in another
reply.
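In the spirit of looking it up, here is a toy Python sketch of that
cost comparison; every number in it (costs, prior, likelihoods) is
hypothetical and only illustrates the shape of the calculation:

# Hypothetical inputs -- replace with real figures for your situation.
cost_fix_nonproblem = 10_000   # cost of "fixing" a non-existent problem
cost_ignore_problem = 50_000   # cost of leaving a real problem alone
prior_problem = 0.5            # prior probability of a real problem

# Hypothetical likelihoods of seeing a survey result this bad (61% yes).
p_data_if_problem = 0.90
p_data_if_no_problem = 0.05

# Bayes' rule: probability of a real problem given the survey result.
num = p_data_if_problem * prior_problem
p_problem = num / (num + p_data_if_no_problem * (1 - prior_problem))

# Expected cost of each action; choose the cheaper one.
exp_cost_fix = cost_fix_nonproblem * (1 - p_problem)
exp_cost_ignore = cost_ignore_problem * p_problem
print(p_problem)                                   # about 0.95
print("fix" if exp_cost_fix < exp_cost_ignore else "leave alone")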
|
1071.3 | | AITG::DERAMO | Daniel V. {AITG,ZFC}:: D'Eramo | Wed May 03 1989 19:37 | 14 |
| re .2
>> probabilities to define the confidence interval. For example, the
>> 95% confidence interval is at 2 times SD, or in this case
>>
>> [ 56.1% , 65.9% ]
>>
>> or what may be more relevant to your situation, there is less than
>> a 0.3% chance that less than 56% of the population would give a
You meant "less than a 3% chance" there -- half of what's left
over when you aren't in the 95% confidence interval (half of 5%).
Dan
|
1071.4 | correction accepted | PULSAR::WALLY | Wally Neilsen-Steinhardt | Thu May 04 1989 14:35 | 5 |
| re: < Note 1071.3 by AITG::DERAMO "Daniel V. {AITG,ZFC}:: D'Eramo" >
> You meant "less than a 3% chance" there
correct. arithmetic is not my strong point
|
1071.5 | Building confidence. | CADSYS::COOPER | Topher Cooper | Thu May 04 1989 15:29 | 59 |
| RE: 1071.2 (Wally Neilsen-Steinhardt):
Using parameter estimates and decision theory rather than hypothesis
testing -- 100% correct; I should have listened to the underlying
question rather than the surface one.
Interval estimate -- sorry, not quite correct. Comments and corrections:
First off, a caveat. The method you give is an approximation of an
approximation. Technically it is the confidence interval for the
sample proportion when the underlying population proportion is known.
That is, if we know that the real proportion is 61% then we can be
95% sure that the proportion found in any particular sample of N will
lie between the bounds you calculate. What we really want is the
confidence interval for the unknown population proportion given that we
know a single sample proportion is 61%. The latter is a much more
complex calculation even after we have assumed a normal approximation.
However, when not only N, but also y*N and (1-y)*N (with y being the
proportion of yes answers in a particular "experiment") are large
enough, then the former confidence interval is a good approximation
for the latter, swapping y and p around. I would say that both y*N and
(1-y)*N should be greater than 5, or, better yet, greater than 10.
Second: a clarification for some who might need it. The SD I gave in
.1, SQRT(pqN), is the standard deviation in the *count* of the number
of yes answers, while the SD you gave, SQRT(pq/N) is the standard
deviation in the *proportion* of the number of yes answers. It is
gotten by dividing the count SD by N.
Third: An actual error. You goofed on the Z value.
What we want here is a value for Z, so that the cumulative probability
of a standard normal variate being between -Z and +Z is .95. Most
tables of the cumulative standard normal distribution give the
probability that a particular sample will be less than a given Z.
Some manipulation based on the symmetry of the normal distribution gives
us that for a confidence interval of size C (e.g., C=.95), we look
for the Z value with a probability of 1 - (1-C)/2. For C=.95, we
want the Z value with cumulative probability of .975, which is just
about 1.96, or approximately 2 (as you said). But that is 2 standard
deviations below and two above for a total "width" of 4 standard
deviations, rather than 1 below and 1 above for a width of 2.
The confidence interval should be, therefore:
[51.2%, 70.8%]
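For the record, a quick Python check of that corrected arithmetic
(assuming, as before, a sample of 100 with 61% yes):

from math import sqrt

p, n = 0.61, 100
sd = sqrt(p * (1 - p) / n)     # about 0.0488

z = 2.0    # rounding 1.96, the 97.5% point of the standard normal
print(p - z * sd, p + z * sd)  # about 0.512 and 0.708: [51.2%, 70.8%]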
What may have confused you is the custom in physics of reporting values
with +or- "error bounds" of one standard deviation. If the errors
follow a normal distribution this corresponds to a confidence interval
of the totally arbitrary size of .68, and when they do not it is
generally completely meaningless. Since there is usually no indication
given of the distribution of the errors (and contrary to belief 60
years or so ago when the practice grew up, normal is *not* normal for
errors) use of these quantities is a demonstration of the irrationality
of physicists (but wait, we're having *that* discussion in another
conference).
Topher
|
1071.6 | Bayesian intervals on proportions | PULSAR::WALLY | Wally Neilsen-Steinhardt | Fri May 05 1989 12:56 | 44 |
| re: < Note 1071.5 by CADSYS::COOPER "Topher Cooper" >
> First off, a caveat. The method you give is an approximation of an
> approximation.
What I did is actually much more correct than it looks, because
I did not burden my initial reply with a lot of details.
What I really did was compute the interval for the unknown population
proportion. As a good Bayesian, I have no trouble doing this, although
any frequentist readers might find it hard to swallow. As a good
Bayesian, I am supposed to call this a credible interval, but that's
too pedantic for me.
The calculation is actually amazingly simple, since this is one
of those cases where a conjugate prior distribution has been found.
In this case it is the beta distribution, which you can look up,
since if I try to type in the definition, I'll just forget an r-1
somewhere. For small n, r, or n-r, I can use the exact form of
the beta distribution, but if they are all large, I can use a normal
approximation. Amazingly, the normal approximation to the beta
is quite close to the normal approximation to the binomial, so it
looks like I have just done the 'approximation to an approximation'
described above. But really, I have solved a problem exactly in
terms of a beta distribution, and then approximated that with a
normal distribution.
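To make that concrete, a minimal sketch using scipy's beta
distribution, assuming a uniform Beta(1,1) prior and the hypothetical
61-of-100 sample:

from scipy.stats import beta

x, n = 61, 100    # "yes" answers and sample size (hypothetical)
a, b = 1, 1       # uniform prior, Beta(1, 1)

# Conjugacy: the posterior for the proportion is Beta(a + x, b + n - x).
posterior = beta(a + x, b + n - x)
print(posterior.ppf(0.025), posterior.ppf(0.975))  # roughly 0.51 and 0.70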
> Second: a clarification for some who might need it. The SD I gave in
> .1, SQRT(pqN), is the standard deviation in the *count* of the number
> of yes answers, while the SD you gave, SQRT(pq/N) is the standard
> deviation in the *proportion* of the number of yes answers. It is
> gotten by dividing the count SD by N.
quite correct.
> Third: An actual error. You goofed on the Z value.
Alas, this is correct too. .5 has an interesting hypothesis on
how I could have made this error in a smart way, but I've made that
smart error too often, so I checked my table carefully. But then
I made the dumb error: I just forgot to multiply the SD by 2 in
my calculator. Those who wish to avoid the smart error should read
.5 carefully. To those who wish to avoid the dumb error, I can
offer no advice.
|
1071.7 | | AITG::DERAMO | Daniel V. {AITG,ZFC}:: D'Eramo | Fri May 05 1989 14:14 | 26 |
| >> > First off, a caveat. The method you give is an approximation of an
>> > approximation.
This was talking about taking the standard deviation,
sqrt(NPQ), where P is the probability of a yes answer,
Q = 1 - P, and N is the number of people asked. The
problem was that P was not known. So instead of the
unknowns P and Q = 1 - P, he used p and q = 1 - p where
p = number_of_yes_answers / N. So p is a good estimate
for P and q is a good estimate for Q. The caveat was
about having done this -- you didn't use the real standard
deviation.
An alternative is to notice that the standard deviation
is largest when PQ = P(1 - P) is largest, which happens
when P = Q = 0.5. In that case SD <= sqrt(N / 4) =
sqrt(N) / 2. This method "plays it safe" by using the
largest possible standard deviation of a binomial
distribution. The error bounds determined this way may
end up being a little larger than necessary. But that's
okay if it is more important that the bounds include
the (unknown) value of P than that the bounds be small.
(The bounds being "P is xx% likely to be in the range
p +/- yy standard deviations.")
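A small sketch of that conservative shortcut in Python (using the
thread's hypothetical 61-of-100 sample):

from math import sqrt

n, yes = 100, 61
p = yes / n

sd_estimated = sqrt(n * p * (1 - p))   # count SD, using p to estimate P
sd_conservative = sqrt(n) / 2          # largest possible count SD (P = 0.5)
print(sd_estimated, sd_conservative)   # about 4.88 vs 5.0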
Dan
|
1071.8 | asymptotic approach to clarity | PULSAR::WALLY | Wally Neilsen-Steinhardt | Mon May 08 1989 13:04 | 10 |
| re: < Note 1071.7 by AITG::DERAMO "Daniel V. {AITG,ZFC}:: D'Eramo" >
Dan,
I think we are talking at cross-purposes here. Can you please reread
.6? I thought that I answered this comment there. If not, I'll
try again. Let me know what you have comments about, questions
on, or disagree with.
Wally
|
1071.9 | | AITG::DERAMO | Daniel V. {AITG,ZFC}:: D'Eramo | Mon May 08 1989 13:46 | 13 |
| Wally,
.6 didn't say what the caveat was about so I described it,
not remembering that it was all spelled out in .5. My
alternative answer to it was directed to those who don't
know what a beta distribution is or why it applies -- just
use the known upper bound on the standard deviation instead.
>> Let me know what you have ... questions on ....
What's a beta distribution and why does it apply here? :-)
Dan
|
1071.10 | You beta your life! | CADSYS::COOPER | Topher Cooper | Mon May 08 1989 14:24 | 17 |
| RE: .9 (Dan)
Basically, the beta distribution is a continuous generalization of the
binomial distribution. Where the binomial distribution uses the
factorial, the beta distribution uses the gamma function. Since beta is
continuous it is integrable and differentiable, unlike the binomial
distribution. This allows various problems to be solved for it, which
can be re-specialized to integer values -- i.e., to the binomial
distribution.
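For instance, the gamma function interpolates the factorial (a quick
check in Python):

from math import gamma, factorial

# gamma(k + 1) == k! for non-negative integers k.
for k in range(6):
    print(k, factorial(k), gamma(k + 1))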
Actually, solving the problem in terms of the beta distribution
results in a solution expressible in terms of the more familiar Fisher
F distribution (which is another special case of the beta
distribution). Tables of the F distribution can be found in most
elementary statistics books.
Topher
|
1071.11 | If I follow you... | CADSYS::COOPER | Topher Cooper | Mon May 08 1989 16:01 | 92 |
| RE: .6
>> First off, a caveat. The method you give is an approximation of an
>> approximation.
>
> ...
>
> The calculation is actually amazingly simple, since this is one
> of those cases where a conjugate prior distribution has been found.
OK, I was wrong. It wasn't an approximation of an approximation of
a classical confidence interval. It's an approximation to the exact
solution of an approximation of a Bayesian confidence interval (a.k.a.
credibility interval or credible interval). ;-) (Unless I've
misinterpreted what you were saying, of course.)
Use of a conjugate prior distribution is only justified in general when
the amount of evidence to be considered is large enough so that the
effect of the specific choice of initial prior distribution effectively
vanishes. In the case of binomial sampling, the amount of evidence
is N, the number of samples.
For the non-Bayesians in the crowd:
In some cases, for a particular Bayesian calculation, you can find a
family of distributions with the property that if the *prior*
distribution is a member of that family than the *final* distribution
is also. Such a family is called a conjugate distribution family for
the distribution on which the calculation is based (actually there is
a bit more to it than that, but that's the main part).
Use of a member of a conjugate distribution family as the prior
distribution results in clean analytic solutions to Bayesian inference
problems. In particular, sometimes there is a member of the family
which you can extrapolate down to representing "no prior information",
and this makes for particularly clean solutions (note, however, that
in general, assuming this specific member of the family involves
assuming that one value is *a priori* more likely than another without
any justification at all, and that a different way of accounting for
the evidence would call for a different initial distribution which
would give contradictory likelihoods for those same values).
Sometimes the family is rich enough so that you can frequently pick
a member which closely approximates your "real" prior expectations,
and sometimes you have enough evidence so that it doesn't matter very
much what your "real" prior was. But unless you can justify a family
member as the exact prior distribution, use of the conjugate families
results in only an approximation.
In general, the only distribution justified non-subjectively is an
appropriate uniform one -- that being justified by the "principle of
prior ignorance."
Further comments:
In actuality there is no effective difference in this situation
between the values found for the classical confidence interval, the
Bayesian confidence interval for a uniform prior distribution, or for
the so-called "fiducial confidence interval" (a sort of classical
imitation of Bayesian statistics). Under reasonable assumptions they
give the same results (though interpreted differently).
My caveat stands: it's a handy approximation for hand calculation when
you have enough of both "yes" and "no" answers, but better formulas
exist and, if you are writing a program, should be used.
For those who are interested the exact solution for the classical
symmetric confidence interval (which is, I'm reasonably certain, the
same as the uniform prior Bayesian confidence interval) is
              x
p1 = --------------------
     x + (n - x + 1)*F1

          (x + 1)*F2
p2 = --------------------
     n - x + (x + 1)*F2
where
x is the number of "yes" answers, n is the total number of answers
and
F1 = F[df1 = 2*(n-x+1), df2 = 2*x, C]
F2 = F[df1 = 2*(x+1), df2 = 2*(n-x), C]
and C is the confidence factor desired (e.g., .95 for a 95% confidence
interval), and F[df1, df2, C] is the value of the F distribution with
the given parameters (df1 and df2) and the given cumulative
probability.
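A Python sketch of that exact interval, using the equivalent
beta-quantile form of the same bounds (this is the Clopper-Pearson
interval; the 61-of-100 figures are, as before, hypothetical):

from scipy.stats import beta

x, n, C = 61, 100, 0.95
alpha = 1 - C

p1 = beta.ppf(alpha / 2, x, n - x + 1)      # lower bound
p2 = beta.ppf(1 - alpha / 2, x + 1, n - x)  # upper bound
print(p1, p2)                               # roughly 0.51 and 0.71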
Topher
|
1071.12 | more on the beta | PULSAR::WALLY | Wally Neilsen-Steinhardt | Mon May 08 1989 18:32 | 83 |
| My beta distribution is rather different from that described in
.10, and my procedure is rather different as well.
My beta is a parametric family of functions with variable p and
parameters n and r:
            (n-1)! * p^(r-1) * (1-p)^(n-r-1)
f(p; n,r) = --------------------------------
                  (r-1)! * (n-r-1)!
where 0<=p<=1 and the function is zero elsewhere.
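A quick numeric check of that density (a sketch using Python's
math.gamma, since gamma(k + 1) == k! for integer k):

from math import gamma

def f(p, n, r):
    # The beta density above, with the factorials written via gamma.
    return (gamma(n) / (gamma(r) * gamma(n - r))
            * p**(r - 1) * (1 - p)**(n - r - 1))

print(f(0.25, 2, 1), f(0.75, 2, 1))   # 1.0 1.0 -- the uniform case

# Crude check that f(p; 5, 2) integrates to 1 over [0, 1].
steps = 100_000
print(sum(f((i + 0.5) / steps, 5, 2) for i in range(steps)) / steps)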
re: < Note 1071.11 by CADSYS::COOPER "Topher Cooper" >
> In particular, sometimes there is a member of the family
> which you can extrapolate down to representing "no prior information",
> and this makes for particularly clean solutions
And this is the case for the beta above, where r=1, n=2 gives a
uniform function of p, or uniform prior distribution.
> But unless you can justify a family
> member as the exact prior distribution, use of the conjugate families
> results in only an approximation.
Correct, this is another approximation, but for the n>30 case I
stipulated, it does not matter much.
> In actuality there is no effective difference in this situation
> between the values found for the classical confidence interval, the
> Bayesian confidence interval for a uniform prior distribution, or for
> the so-called "fiducial confidence interval" (a sort of classical
> imitation of Bayesian statistics). Under reasonable assumptions they
> give the same results (though interpreted differently).
The differences become visible when the experimental outcome (survey
results in this case) conveys information which is not large relative
to the prior information. Most folks just design their experiments
so that won't be true, but sometimes you don't have that choice.
> My caveat stands: it's a handy approximation for hand calculation when
> you have enough of both "yes" and "no" answers, but better formulas
> exist and, if you are writing a program, should be used.
> For those who are interested the exact solution for the classical
> symmetric confidence interval (which is, I'm reasonably certain, the
> same as the uniform prior Bayesian confidence interval) is
>               x
>  p1 = --------------------
>       x + (n - x + 1)*F1
>
>           (x + 1)*F2
>  p2 = --------------------
>       n - x + (x + 1)*F2
>
> where
>
> x is the number of "yes" answers, n is the total number of answers
> and
>
> F1 = F[df1 = 2*(n-x+1), df2 = 2*x, C]
> F2 = F[df1 = 2*(x+1), df2 = 2*(n-x), C]
>
> and C is the confidence factor desired (e.g., .95 for a 95% confidence
> interval), and F[df1, df2, C] is the value of the F distribution with
> the given parameters (df1 and df2) and the given cumulative
> probability.
Personally, I regard the above as an approximation to the beta
distribution, satisfactory for large enough x, n, and n-x. If I
wanted an exact interval estimate, I would use the beta distribution
and give the user the opportunity to select a prior distribution
which reflects prior knowledge, if any, or no prior knowledge, if
none.
If I get a chance, I may check this equivalence.
I'll still stick with the answer in .2: for a real business situation,
which .0 sounds like, all these statistical refinements are beside
the point. Furthermore, where real accuracy is required, decision
theory is usually more relevant than interval estimates.
|