T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
1145.1 | | AITG::DERAMO | like a candle in the wind | Mon Oct 30 1989 23:03 | 40 |
| I thought that the Chi Squared test was for "goodness of
fit" as opposed to correlation. You suspect an
experiment can have one of k results, with probabilities
p[1] through p[k]. You perform N independent trials, and
observe each of the k results Obs[i] times, 1 <= i <= k.
The expected value of the number of times result i
occurred is Exp[i] = N * p[i]. The Chi Squared statistic
for this is the sum over 1 <= i <= k of
(Obs[i] - Exp[i])^2 / Exp[i].
As N -> oo, the value computed for this statistic will
approach a distribution known as Chi Squared with k-1
degrees of freedom (the distribution is that of the sum
of k-1 independent normally distributed random values) if
the assumptions were correct. If the true probabilities
weren't the p[i], or the trials weren't independent, then
the computed statistic tends to differ more and more from
this as N increases.
You can use this to show that two things are "correlated"
by assuming they are independent and doing a Chi Squared
test that "fails", i.e., is so out of range that you feel
safe in concluding the two factors are not independent.
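For concreteness, here is a minimal sketch of the one-dimensional test just
described, in Python (scipy is assumed to be available; the probabilities and
counts are invented, e.g. 120 rolls of a die suspected to be fair):

    # One-dimensional chi-squared goodness-of-fit test (sketch).
    # p[] and Obs[] are invented for illustration.
    import numpy as np
    from scipy.stats import chi2

    p   = np.array([1/6.0] * 6)                 # hypothesized probabilities, sum to 1
    Obs = np.array([25, 17, 21, 18, 22, 17])    # observed counts, N = 120 trials
    N   = Obs.sum()
    Exp = N * p                                 # expected counts under the hypothesis

    stat = ((Obs - Exp) ** 2 / Exp).sum()       # sum of (Obs - Exp)^2 / Exp
    df   = len(p) - 1                           # k - 1 degrees of freedom
    pval = chi2.sf(stat, df)                    # P(Chi Squared with df >= stat)
    print(stat, df, pval)

A small p-value is the "out of range" result mentioned above; a moderate one is
consistent with the assumed p[i].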
The above described a one dimensional test. You can set
up a two dimensional test as follows.  One dimension is
one of k1 classes with probabilities p[i].  The other is
one of k2 classes with probabilities q[j].  Now "cell i,j"
for 1 <= i <= k1, 1 <= j <= k2, has observed count
Obs[i,j] and expected count N * p[i] * q[j] after N
independent trials. Again compute the sum over all of
the cells of (Obs - Exp)^2 / Exp. In this case, if the
assumptions are correct, then as N -> oo the computed
statistic has a Chi Squared distribution with (k1 - 1)*(k2 - 1)
degrees of freedom. [One of the "assumptions" is that
the class of the first factor is independent of the class
of the second factor. If the test shows they are not
independent then in some sense they are "correlated".]
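Here is a corresponding sketch of the two-dimensional (independence) test,
again in Python with invented counts.  Note that scipy's chi2_contingency
estimates the p[i] and q[j] from the observed row and column totals, which is
the usual contingency-table form of the test and is what gives the
(k1 - 1)*(k2 - 1) degrees of freedom:

    # Two-dimensional chi-squared test of independence (sketch).
    # The 2 x 3 table of counts is invented for illustration.
    import numpy as np
    from scipy.stats import chi2_contingency

    Obs = np.array([[30, 14, 16],
                    [20, 26, 24]])      # Obs[i, j] after N = 130 trials

    stat, pval, df, Exp = chi2_contingency(Obs, correction=False)
    print(stat, df, pval)               # df = (2 - 1) * (3 - 1) = 2

A small p-value is evidence against independence, i.e. that the two factors
are "correlated" in the sense above.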
Dan
|
1145.2 | I happen to have this problem ... | VMSDEV::HALLYB | The Smart Money was on Goliath | Wed May 22 1991 14:39 | 5 |
| OK, I have 80 "bins" that should uniformly share 281 samples.
I calculate the chi-square value as 57.5
Tables only go to 30 or so, but it looks like I have 79 degrees O'freedom.
What next?
|
1145.3 | | GUESS::DERAMO | Be excellent to each other. | Wed May 22 1991 14:59 | 34 |
| re .1
>> As N -> oo, the value computed for this statistic will
>> approach a distribution known as Chi Squared with k-1
>> degrees of freedom (the distribution is that of the sum
>> of k-1 independent normally distributed random values) if
>> the assumptions were correct.
Eeeek!  Make that, the distribution is that of the sum of
the squares of k-1 independent standard normal random
variables.
One thing I left out of .1 is that it is suggested that each
bin have an expected count of at least 5.  If you
don't have enough independent trials for that, then
combine bins or get more independent trials.
re .2
>> OK, I have 80 "bins" that should uniformly share 281 samples.
Ummm, gee, you should combine bins or use more samples. :-)
>> Tables only go to 30 or so, but it looks like I have 79 degrees O'freedom.
I thought the tables usually gave a formula for computing
the levels at higher degrees of freedom. If so, then
plug in to the formula. If not, then a Chi squared with
f degrees of freedom has mean f and variance 2f
[variance, not standard deviation] and for large f can
itself be approximated by a normal distribution with
those parameters. I don't know if 79 is large enough.
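For the numbers in .2 (statistic 57.5 with 79 degrees of freedom), a quick
check of this normal approximation against the exact tail probability might
look like the following sketch (scipy assumed):

    # Compare the exact chi-squared upper tail with the N(f, 2f)
    # approximation, for the values quoted in .2.
    from math import sqrt
    from scipy.stats import chi2, norm

    x, f = 57.5, 79
    exact  = chi2.sf(x, f)                   # exact P(Chi Squared with f df >= x)
    approx = norm.sf((x - f) / sqrt(2 * f))  # normal approx, mean f, variance 2f
    print(exact, approx)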
Dan
|
1145.4 | formula in Knuth v2, 2ed. | TOOK::CBRADLEY | Chuck Bradley | Thu May 23 1991 10:42 | 7 |
|
>> Tables only go to 30 or so, but it looks like I have 79 degrees O'freedom.
I had a similar problem last year.  The tables I consulted did not have
a formula for large v.  I finally found one in Knuth v2, in the section on
testing random number generators. Beware, the formula in 1st ed. is wrong.
Get the latest edition.
|
1145.5 | Formulas. | CADSYS::COOPER | Topher Cooper | Thu May 23 1991 16:21 | 22 |
| The usual approximation given in tables of chi-square for df>30 is
that
sqrt(2*X) - sqrt(2*df - 1)
is approximately standard normal.  This will give you about two
decimal places of accuracy for df>30 and p<.995. A better
approximation (especially if you are going to put it in code) is that:
        ((X/df)^(1/3) - (1 - 2/(9*df))) / sqrt(2/(9*df))
is also approximately standard normal.  This will give you about
4 places of accuracy under the same conditions (yes, I looked it up).
The straight approximation of (X-df)/df is not terribly useful. With
79 degrees of freedom, the .99 alpha point (i.e., the correct answer
should be .99) comes out as .65 -- a tad off.
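As a sketch (scipy assumed), both approximations can be checked against the
exact chi-square tail for the numbers in .2:

    # The two normal approximations above, applied to the upper-tail
    # probability, versus the exact value (statistic 57.5, df 79 from .2).
    from math import sqrt
    from scipy.stats import chi2, norm

    def fisher_approx(x, df):
        # sqrt(2X) - sqrt(2 df - 1) is approximately standard normal
        return norm.sf(sqrt(2 * x) - sqrt(2 * df - 1))

    def wilson_hilferty(x, df):
        # ((X/df)^(1/3) - (1 - 2/(9 df))) / sqrt(2/(9 df)) is approx. std normal
        z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / sqrt(2 / (9 * df))
        return norm.sf(z)

    x, df = 57.5, 79
    print(chi2.sf(x, df), fisher_approx(x, df), wilson_hilferty(x, df))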
Topher
|
1145.6 | from some other tables | CSSE::NEILSEN | Wally Neilsen-Steinhardt | Tue May 28 1991 13:17 | 12 |
| .2> What next?
Find some new tables. Winkler and Hays, oft cited by me, includes a table that
goes up to 100, in steps of 10.
For n=80:  57.2 corresponds to a fractile of 0.025
           60.4 corresponds to a fractile of 0.05
As .3 says, you have too few samples to satisfy the usual rule of thumb of
5 predicted samples per bin.
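Those table values are easy to check with an inverse CDF; a one-line sketch
(scipy assumed):

    # Fractiles of chi-squared with 80 degrees of freedom.
    from scipy.stats import chi2

    print(chi2.ppf(0.025, 80))   # roughly 57.2
    print(chi2.ppf(0.05, 80))    # roughly 60.4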
|
1145.7 | X^2 for uniform distribution problem | VMSDEV::HALLYB | Fish have no concept of fire | Mon Nov 25 1991 14:40 | 18 |
| Let me go back to one of the questions in .0 -- how do you "choose"
whether to use .01 or .05? If the answer is "subjectively", might
there be some sort of reference work of past case studies where tests
are described and the authors say "We picked a confidence level of 95%
because ..." ?
Is it correct to say the choice must be made before the result is known?
In some sense isn't it "cheating" to look up the confidence level before
deciding what level to use, thereby allowing one to choose the largest
successful level?
If so, can we carry this one step further as follows: given 5 "bins"
and an expectation of 80 per bin, I observe one bin comes in at 45.
Without reference to the tables I'm already influenced by that datum
and am tempted to use, say, 99.9% as my confidence level. Is that
"cheating"?
John
|
1145.8 | | COOKIE::PBERGH | Peter Bergh, DTN 523-3007 | Mon Nov 25 1991 16:23 | 19 |
| <<< Note 1145.7 by VMSDEV::HALLYB "Fish have no concept of fire" >>>
-< X^2 for uniform distribution problem >-
>> Let me go back to one of the questions in .0 -- how do you "choose"
>> whether to use .01 or .05?
The confidence level is an estimate of the probability of being wrong if we
reject the null hypothesis (i.e., in the case of X^2, "the same distribution
as ..."). Thus, you don't choose; your data "tell" you.
>> Is it correct to say the choice must be made before the result is known?
No; the data tell you what the confidence level is.
When using the results of the test, however, you have to decide what risk of
being wrong you're willing to accept. Thus, if some important decision hinges
on the outcome of the test, you'll want a high level of confidence (i.e., a low
probability of being wrong). What risk you're willing to accept is outside the
realm of statistics.
|
1145.9 | confidence and significance. | CADSYS::COOPER | Topher Cooper | Mon Nov 25 1991 17:46 | 63 |
| RE: .7, .8
There is some terminological confusion here -- not surprising
because in informal discussion terms which are technically quite
distinct are used interchangeably.
"Confidence" levels refer to "confidence intervals" which is part of
parameter estimation rather than to hypothesis testing.
When one performs a statistical test one calculates a "p-value" which
is sometimes referred to as the "significance" of the test applied to
the data. Informally one might even refer to it as the "significance
level" of the test applied to the data.
Some contemporary statisticians recommend that you stop there -- that
the end point of the statistical procedure is a p-value, and what
follows is interpretation.  This is referred to as "significance
testing". There is some justice to this, but also a sort of
indefiniteness to the results.
The more traditional approach is "hypothesis testing". One selects in
advance -- before even seeing the data -- a particular significance
criterion, or alpha, or sometimes significance level. If the p-value
is less than that criterion then the results are declared "significant"
which means, more or less, "we can consider them as real rather than
as a chance fluctuation". Frequently, a result will be reported with
the criterion expressed explicitly, in a phrase similar to "the results
are significant at the .05 level". Sometimes, in practice more than
one criterion will be used, such as:
        p > .1            non-significant
        .1 >= p > .05     suggestive
        .05 >= p > .01    significant
        .01 >= p          highly significant
Labels other than "significant" are sometimes intended only informally.
E.g., a suggestive test is formally insignificant, but there is an
indication to potential replicators that perhaps there is an effect
which just requires a somewhat larger sample size. "Significant" and
"highly significant" both mean "significant", but "highly significant"
is less likely to turn out to just be a weird coincidence upon
replication.
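A tiny helper reflecting those informal labels (the cutoffs are just the
conventional ones quoted above, nothing universal):

    # Map a p-value to the informal labels listed above.
    def label(p):
        if p > 0.10:
            return "non-significant"
        elif p > 0.05:
            return "suggestive"
        elif p > 0.01:
            return "significant"
        else:
            return "highly significant"

    print(label(0.03))   # "significant"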
The (formal) significance criterion is part of the test. In theory one
should no more select it after one has performed the test or even seen
the data than one should select any other aspect of the statistical
test on the basis of the results. Anything else is indeed "cheating"
(in some contexts the quotes are not needed).
In practice, in the social sciences, the way the criterion is selected
is: it's .05 unless there is a perception of some risk -- real or
intellectual -- from a false rejection of the null hypothesis, in which
case the .01 level is used. For example, in parapsychology, until
relatively recently, the standard criterion used was .01 and
occasionally higher (i.e., smaller). It was realized, however, that
since even the staunchest (knowledgeable) critics accepted that the
anomaly was statistically significant (the argument was and is whether
the statistical anomaly was/is attributable to "conventional" causes)
this was probably counterproductive. Useful data for resolving the
questions about what is going on was being thrown out. Now the .05
level is generally used in parapsychological research.
Topher
|
1145.10 | other complications | PULPO::BELDIN_R | Pull us together, not apart | Tue Nov 26 1991 12:25 | 28 |
| re all
Just to expand on Topher's comments, in particular in reference to the
Chi-squared tests.
I use the plural because the probability distribution which gives this kind
of test its name can arise in many, many different kinds of experiments
which are logically unrelated. One can use the Chi-squared distribution for
many tests, just as you can use dice to play Monopoly or backgammon, or to
shoot craps.
It is rarely clear from the title "Chi-squared test" what the writer means.
Some applications are very gross approximations, others have some level of
robustness, and others can be shown to follow the Chi-squared distribution
exactly given some assumptions.
As suggested by others, some statisticians demand that you have chosen the
test procedure completely before collecting any data. This may include
everything right down to the text of the summary with a zero-one variable
used to decide which text you include and triggered by the (ultimate) result
of your observations.
In summary, there are no unanimous doctrines in statistics. We are all
free to make our own mistakes. :-)
Dick
|
1145.11 | Only one "real" chi-square test. | CADSYS::COOPER | Topher Cooper | Tue Nov 26 1991 13:19 | 27 |
| RE: .10 (Dick)
You have a point, but I don't entirely agree. There are a number of
quite distinct things which may be referred to as "a chi-square test",
because they make use of the chi-square distribution. For example,
a couple of weeks ago I was comparing a single sample variance to
an expected value using "a chi-square test".
But there is only one chi-square test which is generally called that
without qualifier or warning -- the chi-square frequency goodness-of-fit
test.  It is extremely powerful, flexible, and, in its general form,
at times tricky to apply correctly, but it is essentially a single
test. It is used when you have a model of a situation which makes a
prediction about the frequency of independent events, and which you
wish to compare to some observed frequencies. One common use of this
is with models which assume the independence of two variables, and
so the "chi-square test of independence" is sometimes treated as a
separate test -- but it is, basically the same test. The trick to
applying it in atypical situations is being sure that the frequencies
*are* independent (i.e., that a chance fluctuation in one frequency
cell does not produce a similar or opposite fluctuation in another
except as accounted for in your model) and that you have the correct
number of degrees of freedom (i.e., that you have properly incorporated
into the test the degree to which the model forces the data to
conform).
Topher
|
1145.12 | further discussion of choosing a level of significance | CSSE::NEILSEN | Wally Neilsen-Steinhardt | Tue Nov 26 1991 13:48 | 37 |
| I agree with Topher in .9 and disagree with Peter in .8.
But there is a little more that can be said.
.7> Let me go back to one of the questions in .0 -- how do you "choose"
> whether to use .01 or .05? If the answer is "subjectively", might
> there be some sort of reference work of past case studies where tests
> are described and the authors say "We picked a confidence level of 95%
> because ..." ?
You would probably have to look pretty far to find this kind of statement.
Mostly the level of confidence is chosen by a social convention: all the
researchers in a field use 0.01 or 0.05 or whatever. I have occasionally
seen statistical papers which argued that the customary choice (for a
particular test in a particular field) is wrong, for one reason or another,
and a different choice should be made. I have never followed the discussion
to see if researchers in the field began using the new recommended level.
It is possible in principle to use a branch of decision theory to start from
statements like
accepting a false positive will cost me $x
rejecting a true positive (i.e., failing to detect a real effect) will cost me $y
the cost of each sample is $z
the probability of the null hypothesis before testing is w%
and calculate the p-value at which you should reject the null hypothesis. As
a bonus, you also calculate the sample size you need. In practice, this is
seldom done, because the $ numbers and a priori probability are usually
difficult to estimate, and if you are going to guess, why not just guess
at % significance? Also, if you calculate a number which is within your
social convention, then you have wasted your time. If you calculate a
different number, then you either forget about it or plan to spend all your
time defending it.
However, this decision theory is usually behind discussions of what is the
best level of significance for a given test in a given field.
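As an illustration only, here is a toy version of that calculation: a
one-sided z-test for a shift of assumed size in a unit-variance normal, with
made-up costs and prior (every number below is an assumption, not a
recommendation):

    # Pick the alpha and sample size that minimize expected cost,
    # under invented costs, prior, and effect size.
    from math import sqrt
    from scipy.stats import norm

    cost_false_pos  = 1000.0   # $x: cost of rejecting a true null hypothesis
    cost_false_neg  = 5000.0   # $y: cost of missing a real effect
    cost_per_sample = 2.0      # $z: cost of each sample
    prior_null      = 0.5      # w: prior probability that the null is true
    effect          = 0.3      # assumed size of the effect, if it is real

    def expected_cost(alpha, n):
        z_crit = norm.isf(alpha)                     # one-sided rejection cutoff
        power  = norm.sf(z_crit - effect * sqrt(n))  # P(reject | effect is real)
        beta   = 1 - power                           # Type II error rate
        return (prior_null * alpha * cost_false_pos
                + (1 - prior_null) * beta * cost_false_neg
                + n * cost_per_sample)

    best = min((expected_cost(a, n), a, n)
               for a in (0.10, 0.05, 0.01, 0.001)
               for n in range(10, 501, 10))
    print(best)    # (minimum expected cost, alpha, sample size)

With different guesses for the costs and the prior the chosen alpha moves
around, which is exactly the "defending it" problem mentioned above.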
|
1145.13 | Fun with statistics | VMSDEV::HALLYB | Fish have no concept of fire | Tue Nov 26 1991 17:02 | 12 |
| Thanks for some very enlightening and helpful comments.
Let me ask further about a comment Wally made:
> the cost of each sample is $z
If one is looking at historical data, this question is difficult to
address. Basically the cost is zero but the supply is limited.
I presume this only affects the decision-theoretic methodology for
arriving at the best p-value, not the validity of the test itself.
John
|
1145.14 | Form of function. | CADSYS::COOPER | Topher Cooper | Tue Nov 26 1991 17:46 | 32 |
| RE: .13 (John)
I would say that you shouldn't get too hung up in the details of a
decision theoretic justification. A frequent criticism of decision
theory as a model of actual decision making, or as a universally
applicable method for reaching real decisions, is that it requires
assignment of "hard" utility numbers to attempt to capture "soft"
values. Some Bayesian decision theory attempts to solve the problem by
allowing the utility of a decision to have a distribution weighted
by (possibly subjective) probability. So instead of saying that
"the cost of each sample is $z", one says, essentially that "the
probability that each sample will be $z0 is p0, the probability that
each sample will be $z1 is p1, etc."
As a militant Bayesian, I'm sure Wally would rather not attempt to
justify non-Bayesian decision theory (which has been shown to be
inferior to Bayesian decision theory on very broad criteria). His
basic point, however, is accurate. Justifications for deviations
from customary alpha-levels take the *form* of decision theory (or
cost/benefit analysis), whether or not they deal with the nitty
gritty details sufficiently to be called decision theoretic
justifications.
Your particular point is easily dealt with, however. For historic
data, the cost of each of the first N samples is very low (perhaps, for
convenience, $0), while the cost of each additional sample is very
high (perhaps, for convenience, infinite). Same applies to other
situations which predetermine the sample size (e.g., there are only
50 states, even if you have not yet collected the relevant information
about each of them).
Topher
|
1145.15 | decisions and decision theories | CSSE::NEILSEN | Wally Neilsen-Steinhardt | Wed Nov 27 1991 12:11 | 19 |
| Topher gives the right answer to the question in .13: assume zero cost for the
data you have and infinite cost for the data you cannot get.  This determines
the sample size, so all you have left is to set the level of significance.
.14> As a militant Bayesian, I'm sure Wally would rather not attempt to
I don't recognize myself in this description. I personally prefer the Bayesian
interpretation of probability, because it fits the way I usually use it.
I don't think it is the only interpretation, or the best for everyone.
> justify non-Bayesian decision theory (which has been shown to be
I know several decision theories, and a lot of less formal approaches to
making decisions.  A few are never the best choice, but most have some set
of decision problems for which they are the best approach.  The Bayesian
approach happens to be the best answer to the question "How could I rigorously
design a statistical test and set a level of significance, assuming I could
get all the relevant information?" It is not the best answer to the question
"What level of significance should I use here?"
|
1145.16 | In defense of Bayesian Decision Theory. | CADSYS::COOPER | Topher Cooper | Wed Nov 27 1991 16:41 | 21 |
| RE: .15 (Wally)
Gee, I must have made you defensive -- never thought I'd see the day
where I would be supporting Bayesian statistics when Wally wasn't.
:-) (very much so).
Given that prior estimates -- of utilities (costs/benefits) and
likelihoods -- can be said to be at least approximately "rational"
(basically, non-self-contradictory), Bayesian decision procedures are
optimal over the long haul -- in the sense of making best use of
whatever accuracy is in those prior rational guesses to arrive at the
most positive result. This even applies to selecting the best
significance criterion for non-Bayesian hypothesis-testing -- unless
you wish to assume that the correlation between prior estimates of
outcome and utility and the actual probabilities of outcome and
utility is negative.
It may not, of course, be the most practical when you factor in the
cost of the Bayesian computations.
Topher
|
1145.17 | Different languages for different folks | CORREO::BELDIN_R | Pull us together, not apart | Tue Dec 03 1991 09:24 | 11 |
| re <<< Note 1145.11 by CADSYS::COOPER "Topher Cooper" >>>
-< Only one "real" chi-square test. >-
I'll agree to that terminology for all those who can handle the abstraction
to a linear model with constraints. Unfortunately, the average consumer of
chi-squared statistics has not been taught to express his models that way,
but as separate models. As one reads the textbooks produced for social
and behavioral scientists, the level of abstraction is much lower than that of
the General Linear Model.
Dick
|