T.R | Title | User | Personal Name | Date | Lines |
--------------------------------------------------------------------------
652.1 | variance^1/2 | MODEL::YARBROUGH | | Mon Jan 19 1987 15:59 | 12 |
| > What is the formula for the standard deviation for n samples?
It's the square root of the mean of the squares of the differences between
the observed values and the population mean.
pop.mean = (sum(1..n) x[i])/n
variance = (sum(1..n) (pop.mean-x[i])^2)/n
std. dev. = sqrt (variance)
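For example, for the eight observations 2, 4, 4, 4, 5, 5, 7, 9:
    pop.mean  = (2+4+4+4+5+5+7+9)/8 = 40/8 = 5
    variance  = (9+1+1+1+0+0+4+16)/8 = 32/8 = 4
    std. dev. = sqrt(4) = 2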
Caveat: this calculation is subject to severe rounding errors.
|
652.2 | another version | ESTORE::ROOS | | Tue Jan 20 1987 14:39 | 16 |
| Two things:
1. Concerning .1's reply: the standard deviation for the population
   has an n in the denominator, but the standard deviation for a
   sample of the population has an n-1 in the denominator.
2. Another version for S.D.:
   variance = (sum(1..n) x[i]^2 - (sum(1..n) x[i])^2 / n) / n
              (for the population)
   variance = (sum(1..n) x[i]^2 - (sum(1..n) x[i])^2 / n) / (n-1)
              (for a sample of a population)
S.D. = sqrt (variance)
|
652.3 | | CLT::GILBERT | eager like a child | Wed Jan 21 1987 01:12 | 2 |
| I seem to recall that the standard deviation in .1 *is* numerically
stable. The version in .2 can suffer from large round-off errors.
|
652.4 | | COGITO::ROTH | | Wed Jan 21 1987 09:09 | 4 |
| I also agree with .3; the version in .2 is sometimes convenient for
analysis, but round-off errors can leave a negative value under the
square root.
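As a quick illustration (my sketch, in Pascal; 'real' is assumed to
carry roughly 15-16 significant digits, and how badly things go depends
on the floating point format in use), apply both formulas to data whose
mean is large relative to its spread. The true population variance of
{offset+1, offset+2, offset+3} is 2/3, which the two-pass form of .1
recovers while the shortcut form of .2 loses to cancellation:

    program cancel;
    const
      n = 3;
      offset = 1000000000.0;        { large mean, tiny spread }
    var
      i : integer;
      x, sum, sumsq, mean, ss : real;
    begin
      sum := 0; sumsq := 0;
      for i := 1 to n do            { one pass: accumulate sums }
      begin
        x := offset + i;
        sum := sum + x;
        sumsq := sumsq + x * x
      end;
      mean := sum / n;
      writeln('shortcut variance = ', (sumsq - sum * sum / n) / n);
      ss := 0;
      for i := 1 to n do            { second pass: squared deviations }
      begin
        x := offset + i;
        ss := ss + (x - mean) * (x - mean)
      end;
      writeln('two-pass variance = ', ss / n)
    end.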
- Jim
|
652.5 | One Pass Algorithm with Two Pass Accuracy | VAXAGB::BELDIN | Dick Beldin - 'Truth will Out' | Mon May 04 1987 12:33 | 46 |
|
A common problem is to calculate the standard deviation from data
stored in a file in a single pass, using a fixed amount of memory.
The following algorithm provides accuracy equivalent to the two-pass
calculation (1st pass: mean; 2nd pass: mean squared deviation).
Let n symbolize the number of observations already seen,
x = the most recently read value from the file,
mean = the arithmetic mean of all values read so far,
sumsquares = the sum of squared deviations from the mean.
Initialize the following (real) variables (in Pascal notation):
mean := 0;
sumsquares := 0;
n := 0;
Then, for each observation, execute the following:
begin
Read_an_Observation(x);
n := n+1;
      d := ( x - mean ) / n;
      mean := mean + d;
      sumsquares := sumsquares + n * (n-1) * d * d;
end;
After the last observation is processed, calculate
Population_Variance := sumsquares / n;
Sample_Variance := sumsquares / (n-1);
and
the standard deviations are the square roots of the respective
variances.
This algorithm has been known for some twenty years. I no longer have
any references to it.
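For completeness, here is one way the recurrence might be packaged as a
full program - my sketch, not part of the original note - assuming one
observation per input line. (The recurrence is usually attributed to
B. P. Welford, 1962.)

    program runstats;
    var
      n : integer;
      x, d, mean, sumsquares : real;
    begin
      mean := 0; sumsquares := 0; n := 0;
      while not eof do
      begin
        readln(x);                          { Read_an_Observation }
        n := n + 1;
        d := (x - mean) / n;
        mean := mean + d;
        sumsquares := sumsquares + n * (n-1) * d * d
      end;
      if n > 1 then
      begin
        writeln('mean               = ', mean);
        writeln('population std dev = ', sqrt(sumsquares / n));
        writeln('sample std dev     = ', sqrt(sumsquares / (n-1)))
      end
    end.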
|
652.6 | Single-pass calculation of quantiles | SSDEVO::LARY | | Wed May 06 1987 18:54 | 12 |
| In a similar vein, there is an article in the October 1985 issue of
Communications of the ACM on a heuristic algorithm for calculating
arbitrary quantiles (a p-quantile of a distribution, 0<=p<=1, is the value
below which 100p percent of the distribution lies - the 0.5-quantile is
the median) with a single pass through the data and a very small amount
of working storage. It is claimed that this algorithm, run on a set of
samples of a distribution, approximates any quantile essentially as well
as the brute-force approach (which involves partially ordering the data,
and so takes as much memory as sorting it), provided the distribution
does not have a discontinuity near the desired quantile.
One of the authors of the paper, Raj Jain, works for Digital.
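If memory serves, the paper is R. Jain and I. Chlamtac, "The P-square
algorithm for dynamic calculation of quantiles and histograms without
storing observations". For contrast, here is a sketch of the brute-force
approach it improves on (mine, not from the paper): store everything,
sort, and index. The capacity and the index formula are assumptions;
the latter is one common convention among several.

    program quantile;
    const
      maxn = 1000;                  { assumed capacity }
    var
      a : array[0..maxn] of real;
      n, i, j, k : integer;
      p, t : real;
    begin
      p := 0.5;                     { which quantile: 0.5 = median }
      a[0] := -1.0e30;              { sentinel for the insertion sort }
      n := 0;
      while (n < maxn) and not eof do
      begin
        n := n + 1;
        readln(a[n])
      end;
      for i := 2 to n do            { insertion sort }
      begin
        t := a[i];
        j := i - 1;
        while a[j] > t do
        begin
          a[j+1] := a[j];
          j := j - 1
        end;
        a[j+1] := t
      end;
      if n > 0 then
      begin
        k := trunc(p * (n - 1)) + 1;
        writeln('p-quantile = ', a[k])
      end
    end.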
|
652.7 | What Comprises The STD DEVIATION ? | ADCSRV::RBROWN | Are there no work houses ? | Mon Jul 01 1991 11:13 | 16 |
| Picking up on the standard deviation question: we're looking at it as
a measure of "confidence" for performance data. That is, the closer the
std deviation is to 0, the more likely our performance numbers are to
be what they should be. The std deviation is overlaid on a series of
bar charts; hence, if the std deviation is low, the corresponding bar
is probably more likely to be true. A bar with a high std deviation may
mean that the bar represents a large series of peaks/valleys, perhaps a
runaway process, etc. - such a bar would be brought into question.
Question, though: what percentage of the overall data is represented by
the std deviation? We believe it to be 80%, that is, 80 percent of
the numbers will fall into the range being specified. I've checked
through several books and can't seem to find a figure for this.
Thanks !
|
652.8 | 67% is a better (but not perfect) coverage rate | PULPO::BELDIN_R | | Mon Jul 01 1991 11:57 | 20 |
| There is no simple answer.
When the distribution is approximately normal, about 2/3 of the
observations will be within one standard deviation of the mean.
Any skew, excessive flattening or peakedness will distort this figure.
You can see the impact by running the calculations with the distribution
functions given in most introductory texts on probability and statistics.
Approximate normality is common where the deviations are numerous, small,
and as likely to be high as low. If a single very large deviation
dominates the randomness, normality will typically be violated.
As long as you don't base any critical decisions on the 2/3 figure, it
is a reasonable approximation for practical work.
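If you want to check the 2/3 figure empirically, here is a quick Monte
Carlo sketch (mine, not part of the reply above). It assumes a Random
function returning a uniform real in [0,1), as many Pascal dialects -
though not ISO Pascal - provide, and uses the Box-Muller transform to
manufacture normal deviates:

    program coverage;
    const
      trials = 20000;
      pi = 3.14159265358979;
    var
      i, hits : integer;
      u1, u2, z : real;
    begin
      hits := 0;
      for i := 1 to trials do
      begin
        u1 := Random;               { assumed: uniform real in [0,1) }
        u2 := Random;
        { Box-Muller: two uniforms -> one N(0,1) deviate }
        z := sqrt(-2.0 * ln(1.0 - u1)) * cos(2.0 * pi * u2);
        if abs(z) < 1.0 then        { within one sigma of the mean }
          hits := hits + 1
      end;
      writeln('fraction within one sigma = ', hits / trials)
      { expect about 0.68 for a normal distribution }
    end.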
Dick
|
652.9 | | VMSDEV::HALLYB | The Smart Money was on Goliath | Mon Jul 01 1991 22:22 | 12 |
| You really should consider pitching some percentage of your datapoints
as outliers. In performance work especially, you get oddball timings
that represent disk read errors or spurious datacomm path outages or...
While the -frequency- of such outliers may be important, their -value-
is almost surely unreliable and will tend to distort your other points.
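A minimal sketch of the trimming idea (my illustration - the 10% trim
fraction and the data are made up, and the sample is assumed already
sorted):

    program trimmed;
    const
      n = 10;
    var
      a : array[1..n] of real;
      i, lo, hi, m : integer;
      sum, mean, ss : real;
    begin
      { a sorted sample with one wild timing at the end }
      a[1] := 1.0;  a[2] := 1.1;  a[3] := 1.2;  a[4] := 1.2;
      a[5] := 1.3;  a[6] := 1.3;  a[7] := 1.4;  a[8] := 1.5;
      a[9] := 1.6;  a[10] := 95.0;
      lo := 1 + n div 10;           { drop the lowest 10% ... }
      hi := n - n div 10;           { ... and the highest 10% }
      m := hi - lo + 1;
      sum := 0;
      for i := lo to hi do
        sum := sum + a[i];
      mean := sum / m;
      ss := 0;
      for i := lo to hi do
        ss := ss + (a[i] - mean) * (a[i] - mean);
      writeln('trimmed mean = ', mean,
              '  trimmed s.d. = ', sqrt(ss / (m - 1)))
    end.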
Also be sure of the distribution you are measuring. Interarrival times,
for example, are often exponential in nature and the standard deviation
really isn't very helpful there, if you know what I "mean".
John
|
652.10 | Is it normal? | PAKORA::PFANG | | Tue Jul 02 1991 05:31 | 16 |
| You can get an idea of how closely your data follow a normal (aka
Gaussian) distribution by plotting them on a normal probability axis. If
the points fall approximately on a straight line, you get some confidence
that the data may be normally distributed. If the line has curvature to
it, you may be dealing with a different distribution, for example the
exponential (as mentioned in the previous reply). If you get points at
the ends that don't fall on the line, you may have outliers (also
mentioned previously).
Do you really want the standard deviation, or do you want some kind of
confidence interval for your data? The standard deviation has a direct
interpretation if your data is normally distributed. But if you have
another situation (not normal and/or outliers) then there are more
`robust' measures of the variation of the data.
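Here is a rough sketch of computing the normal scores for such a plot.
The inverse normal uses approximation 26.2.23 from Abramowitz & Stegun
(absolute error below 4.5e-4); the (i - 0.5)/n plotting position is one
common convention, and the sample data are made up:

    program nplot;
    const
      n = 5;
    var
      a : array[1..n] of real;
      i : integer;
    function invnorm(q : real) : real;
    { inverse normal CDF, 0 < q < 1, via A&S 26.2.23 }
    var
      p, t, z : real;
    begin
      if q < 0.5 then p := q else p := 1.0 - q;
      t := sqrt(-2.0 * ln(p));
      z := t - (2.515517 + t * (0.802853 + t * 0.010328)) /
               (1.0 + t * (1.432788 + t * (0.189269 + t * 0.001308)));
      if q < 0.5 then invnorm := -z else invnorm := z
    end;
    begin
      { a made-up sample, assumed already sorted }
      a[1] := 9.8; a[2] := 10.1; a[3] := 10.2;
      a[4] := 10.4; a[5] := 10.9;
      for i := 1 to n do            { value, normal score }
        writeln(a[i], '  ', invnorm((i - 0.5) / n))
    end.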
Peter
|
652.12 | what am I missing? | NOVA::FINNERTY | lies, damned lies, and the CAPM | Wed Jul 27 1994 16:26 | 8 |
|
re: expected value
what does E(r) = .80 mean? Do you mean that the average outcome
over all possible outcomes is a 20% loss? Sounds unattractive,
to say the least!
|
652.13 | "30" is a private jokelet | VMSDEV::HALLYB | Fish have no concept of fire | Wed Jul 27 1994 17:31 | 15 |
| I think E(r) == 0 implies a fair game, so a loss would be negative.
You probably need to know the underlying distribution to answer the
question. If you assume a near-normal distribution then
... use a Monte Carlo simulation to compare any two strategies. With a
little bit of work you could probably whip up a neat 3-d graph showing
probability of ruin-before-doubling as a function of mean and variance.
Unless Jim ("On the other hand, maybe 30 is about right" :-) Finnerty
comes up with a closed form solution.
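In case anyone wants to try it before Jim does, here is a bare-bones
sketch of that simulation. Assumptions: normal per-round returns,
illustrative mu and sigma, ruin at 0 and doubling at 2 from a start of
1, and a Random function returning a uniform real in [0,1):

    program ruin;
    const
      trials = 2000;
      mu = 0.0;                     { mean return per round: fair game }
      sigma = 0.05;                 { std deviation per round }
      pi = 3.14159265358979;
    var
      t, ruins : integer;
      bank, u1, u2 : real;
    begin
      ruins := 0;
      for t := 1 to trials do
      begin
        bank := 1.0;
        while (bank > 0.0) and (bank < 2.0) do
        begin
          u1 := Random;
          u2 := Random;
          { Box-Muller normal deviate, scaled and shifted }
          bank := bank + mu +
                  sigma * sqrt(-2.0 * ln(1.0 - u1)) * cos(2.0 * pi * u2)
        end;
        if bank <= 0.0 then
          ruins := ruins + 1
      end;
      writeln('P(ruin before doubling) approx ', ruins / trials)
      { with mu = 0 and a start halfway, expect about 1/2 }
    end.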
John (forget the thinking, just go for 30)
|