T.R | Title | User | Personal Name | Date | Lines |
---|
1735.1 | | STAR::ABBASI | i am therfore i think | Fri Mar 26 1993 15:06 | 42 |
|
Some more notes to look at, in case they might be relevant:
406 TOOLS::STAN 14-DEC-1985 0 Math Notes Statistics
738 NAC::PICKETT 23-JUL-1987 1 Statistical Inference
823 SPYDER::TURNER 3-FEB-1988 1 Statistical Packages Available?
1065 BEING::POSTPISCHIL 21-APR-1989 9 Statistics Books
1145 MILKWY::JANZEN 30-OCT-1989 17 Chi-Square X^2 calculation (sta
1184 DWOVAX::YOUNG 22-JAN-1990 10 HELP! Compound Statistical Esti
1394 SMEGIT::ARNOLD 5-MAR-1991 24 Elementary (?) Statistics
1447 ATSE::GOODWIN 23-MAY-1991 23 Statistics can be (mis)used to
1481 CSSE::NEILSEN 15-AUG-1991 11 another statistical conundrum
1483 SOLVIT::DESMARAIS 22-AUG-1991 4 Statistical Data Anal in the Co
1489 MEIS::SCHAUBER 4-SEP-1991 3 Statistical Functions Lib.
1610 MINDER::WIGLEYA 18-MAY-1992 2 Wanted: Sanity check on my rust
1625 EPIK::FINNERTY 10-JUN-1992 4 Computing the t statistic for r
1665 LARVAE::TREVENNOR_A 17-SEP-1992 4 Calculating a statistical likel
1710 MARVA2::RAK 13-JAN-1993 8 Needed: Slick Statistic Trick
1735 GOCELT::RAK 26-MAR-1993 0 Statistical Mean and Variance T
End of requested listing
|
1735.2 | Mean variance is the mean of the variances. | CADSYS::COOPER | Topher Cooper | Fri Mar 26 1993 16:39 | 63 |
| I'm going to answer your second question first, about the proper
"average" variance to use.
Brief answer: use the sum of the variances divided by n. You can
take the square root of that and get an "average" standard deviation,
if you wish.
Explanation: If you have a bunch of ordinary variables x1, x2, ..., xn
and you want to take the "average", you add them together and then
"rescale" the sum to the same units as the originals by dividing by n.
What you end up with is something that is a kind of rough stand-in
for the separate values, which we will call x*.
Now imagine that you do the same thing over and over again. Each time
you take the average. Each set of values xi is a sample. There will
be a distribution of values that each of the xi will take (perhaps
each of the xi will have the same distribution of values, perhaps
not). In other words, we are now talking about a set of *random*
variables, which we will call X1, X2, ..., Xn. The x* (the average)
produced with each sample also will have a distribution associated
with it (which can, in principle at least, be derived from the
distributions of the Xi), and is therefore "really" a random variable
X*. The mean (pretend that doesn't mean the same thing as "average")
and the variance/standard deviation are descriptors of a distribution,
the mean being an indicator of its "location" and the
variance/standard deviation being indicators of its "scale". We want
to know what the mean and the variance of X* are.
Let's add one more random variable, X+, which is the sum of the Xi's.
X* is then X+/n. What is the mean of X*? Well, the mean is a "linear
operator". That has two important consequences:
1) The mean of the sum of random variables is the sum of their
means. ( E(A+B) = E(A) + E(B) ).
2) The mean of a random variable divided by a constant is the
same as dividing the mean of the random variable by the
constant. ( E(A/c) = E(A)/c )
So the mean of X+ is the sum of the means of the Xi, and the mean of
X* (= X+/n) is the mean-of-X+ divided by n. This is what justifies
the procedure of finding the "average" mean by taking the average of
the means.
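A quick numerical check of this (a sketch in Python; numpy and the
particular sizes and seed are my own assumptions, not part of the
original note):

    import numpy as np

    rng = np.random.default_rng(0)
    # four subsamples of equal size 50 (equal sizes matter; with
    # unequal sizes you need a weighted average of the means)
    samples = [rng.normal(loc=10.0, scale=2.0, size=50) for _ in range(4)]

    pooled = np.concatenate(samples)
    mean_of_means = np.mean([s.mean() for s in samples])

    print(pooled.mean())    # mean of all 200 values
    print(mean_of_means)    # average of the four subsample means
    # the two printed values agree, up to floating-point rounding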
What about the *variance*? Here you have to be careful about which
variance is meant. The variance is *not* linear the way the mean is:
Var(A+B) = Var(A) + Var(B) holds only for independent variables, and
Var(A/c) = Var(A)/c^2, so the variance of X* itself is the average of
the variances divided by n. But the variance you actually want (the
one that describes the pooled data) *is* a mean: it is the mean of
the squared deviations from the overall mean. So, provided the
subsample means agree with the overall mean, the variance of the
pooled data is the average of the subsample variances; if they do not
agree, you must also add in the variance of the subsample means (the
"between" component).
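To make that caveat concrete, here is a small check of the
decomposition (again a Python sketch with invented data; the "law of
total variance" framing is mine, not the original note's):

    import numpy as np

    rng = np.random.default_rng(1)
    # three equal-sized subsamples, one with a shifted mean
    samples = [rng.normal(loc=mu, scale=2.0, size=50)
               for mu in (10.0, 10.0, 14.0)]
    pooled = np.concatenate(samples)

    within  = np.mean([s.var() for s in samples])   # mean of the variances
    between = np.var([s.mean() for s in samples])   # variance of the means
    print(pooled.var())      # total variance of the pooled data
    print(within + between)  # identical (ddof=0 keeps the identity exact)
    # when the subsample means agree, between ~ 0 and the mean of the
    # variances alone reproduces the pooled variance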
The standard deviation is not even that well behaved. Since it does
represent a "scale", you can multiply it by a scalar and get something
meaningful, but you should never add standard deviations to each
other; the sum doesn't mean anything. Average the variances first,
then take the square root.
Note that since the "average" is the same thing as the "mean", the
above says that you will get the same result however you divide the
sample up into equal-sized subsamples and take the averages of the
averages: you will recover the mean and the variance of the whole
thing.
I'll try to address your other problem as soon as I have a moment.
Topher
|
1735.3 | How about using SPC? | YIELD::FANG | | Tue Mar 30 1993 14:59 | 49 |
| Another way to look at your questions is via the methodology of
Statistical Process Control (SPC). As long as you're collecting
independent samples from a stable process, you can check the mean and
variance of each new sample against your past historical data. SPC
provides a nice simple graphical technique to apply the tests. I'll
attempt to describe the tests as best I can.
The mean test would be accomplished by testing each new sample mean
against a confidence interval (usually .9973) described by the process
mean +/- 3*sigma/sqrt(n), where n=50 in this case and sigma is the
estimate of the population sigma. The process mean is the mean of all
the values in your database, or the mean of your sample means (where
each sample represents 50 values). Sigma can be estimated from all
your historical data using the standard formula; i.e., there is no
need to deal with sample variances or standard deviations when you
have all the historical data from which to calculate the total sigma.
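As a sketch of that mean test in Python (numpy is assumed, and the
historical data and n=50 here are invented placeholders):

    import numpy as np

    rng = np.random.default_rng(2)
    # stand-in for the historical database (invented numbers)
    historical = rng.normal(loc=10.0, scale=2.0, size=5000)
    n = 50                                  # size of each new sample

    mu    = historical.mean()              # estimate of the process mean
    sigma = historical.std(ddof=1)         # estimate of the population sigma

    ucl = mu + 3 * sigma / np.sqrt(n)      # upper control limit
    lcl = mu - 3 * sigma / np.sqrt(n)      # lower control limit

    def mean_in_control(sample):
        """True when a new sample's mean lies inside the 3-sigma limits."""
        return lcl <= np.mean(sample) <= ucl

    print(mean_in_control(rng.normal(10.0, 2.0, n)))   # in control: True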
The variance test can be accomplished by testing the variance against
limits set by:

                      ______
       Upper Limit =  VAR(x)/(n-1) * Chi^2(alpha/2, n-1)

                      ______
       Center Line =  VAR(x)

                      ______
       Lower Limit =  VAR(x)/(n-1) * Chi^2(1-alpha/2, n-1)

where alpha is the risk of a Type I error,
      Chi^2(alpha,n) is the upper alpha percentage point of the
          chi-square distribution with n degrees of freedom, and
      ______
      VAR(x) is the mean variance.
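In code the variance limits might look like this (a sketch; scipy is
assumed, and the value of the mean variance is made up):

    from scipy.stats import chi2

    n       = 50      # sample size
    alpha   = 0.0027  # Type I risk matching the 3-sigma mean test
    var_bar = 4.2     # mean sample variance from historical data (invented)

    # Chi^2(alpha/2, n-1) above is an *upper* percentage point, while
    # scipy's ppf is the lower-tail quantile, hence 1 - alpha/2 here.
    ucl = var_bar / (n - 1) * chi2.ppf(1 - alpha / 2, n - 1)
    lcl = var_bar / (n - 1) * chi2.ppf(alpha / 2, n - 1)
    print(lcl, var_bar, ucl)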
Actually, it is more common to test the standard deviation rather than
the variance in SPC.
The SPC approach was started in the 1920s by Shewhart. There are
multiple `tests' one can apply to the mean and sigma to look for
anything unusual in the sample, all with the intent of keeping the
alpha and beta risks of the tests as small as possible.
Hope this is of some small help.
Peter
|
1735.4 | Question on SPC rejection region | MARVA1::RAK | | Thu Apr 01 1993 10:25 | 7 |
|
Thanks for the information on SPC. It will help us. From your
explanation of the mean test, I wonder why the rejection region
is so small (.0027) compared to the conventional .05 in
textbook hypothesis testing.
Lam
|
1735.5 | False rejections more costly. | CADSYS::COOPER | Topher Cooper | Thu Apr 01 1993 18:09 | 39 |
| RE: .4
Different purposes. The number concerned represents the rate of
"false positives" you are willing to tolerate.
In conventional hypothesis testing the false positive is a false
rejection of the null hypothesis. That is, you will tentatively
conclude that something "interesting" is going on (tentatively because
if it is important enough there will presumably be at least one
replication, and if it is less important it will still probably lead
to later contradictions that will cause it to be challenged; only if
it is completely unimportant will it not be re-examined).
The cost of a false positive is relatively low -- when it is not,
stricter criteria are used (e.g., the second standard of .01). A high
alpha means fewer false negatives -- you are less likely to miss
something interesting.
In SPC if you get a "positive" then you throw away the batch, shut
down the line for inspection, send everyone to the bomb shelters, order
additional, costly tests or whatever. A false positive is expensive
and you cannot tolerate as high a level as you can in research.
Therefore you set it low.
I'm not sure just where that particular standard value came from. It's
not clear that a standard value is appropriate at all. In research the
costs and the rewards are quite intangible. It is, therefore, hard to
do detailed cost/benefit analysis to determine the proper alpha. A
rule of thumb is therefore appropriate (though whether the one we have
ended up with is the one that we *should* be using is an interesting
question). In SPC, generally, the costs and benefits of various
choices are much easier to quantify and so cost/benefit analysis would
seem to be the way to go.
(This is probably argued in the SPC literature, of which I am almost
totally ignorant.)
Topher
|
1735.6 | And... | YIELD::FANG | | Tue Apr 06 1993 16:30 | 19 |
| I think the previous reply (.5) summed it up quite nicely. I would only
add the following:
The .0027 is with only 1 `rule' applied. Very commonly, the Western
Electric rules and/or Nelson rules are applied which increase alpha to
a level ~.01 or greater. Examples are: 2 of 3 points beyond the 2-sigma
limit, 6 consecutive points increasing or decreasing in the same
direction. So, the alpha risk isn't really that low. The advantage of
applying more rules (weighed against the increased alpha) is the
greater `statistical power' of the test: the ability to reject Ho when
H1 is true.
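For illustration, the first of those rules might be coded like this (a
plain-Python sketch; the center line mu, sigma, and the sample points
are my assumptions):

    def two_of_three_beyond_2sigma(points, mu, sigma):
        """Western Electric rule: flag when 2 of 3 consecutive points
        fall beyond the 2-sigma limit on the same side of center."""
        for i in range(len(points) - 2):
            window = points[i:i + 3]
            high = sum(1 for p in window if p > mu + 2 * sigma)
            low  = sum(1 for p in window if p < mu - 2 * sigma)
            if high >= 2 or low >= 2:
                return True
        return False

    # e.g. two of three points above mu + 2*sigma = 14:
    print(two_of_three_beyond_2sigma([14.5, 9.8, 14.2], 10.0, 2.0))  # True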
Topher is right in that the economics of the situation should drive the
choices. There is a chapter in Montgomery's ``Introduction to
Statistical Quality Control'' (Wiley) entitled "Economic Design of
Control Charts". You can guess what this chapter is about!
Peter
|