T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
1464.1 | | GUESS::DERAMO | duly noted | Thu Jun 27 1991 01:43 | 13 |
| Since
1/(harmonic mean) = arithmetic mean of reciprocals,
then I would guess that the harmonic standard deviation
is given by
1/(harmonic sd) = standard deviation of reciprocals
It sounds logical; however, I do not recall ever having
seen the term "harmonic standard deviation" before.
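The guess above can be checked numerically; a minimal Python sketch (the sample values here are made up for illustration):

```python
# Sketch of the guess: harmonic mean via reciprocals, and the guessed
# "harmonic sd" as 1/(sd of the reciprocals).  Sample data is made up.
from statistics import mean, pstdev

sample = [5.0, 6.0, 7.0]
recips = [1.0 / x for x in sample]

harmonic_mean = 1.0 / mean(recips)          # = n / sum(1/x)
guessed_harmonic_sd = 1.0 / pstdev(recips)  # the guess: 1/(sd of reciprocals)

print(harmonic_mean)   # about 5.888
print(guessed_harmonic_sd)
```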
Dan
|
1464.2 | Doubts | IMTDEV::ROBERTS | Reason, Purpose, Self-esteem | Thu Jun 27 1991 15:26 | 30 |
| Hi, Dan
Well, that's pretty much what I guessed, too, but I'm doubtful.
Looking at the arithmetic mean (7.716667) and the harmonic mean
(7.626850), they're quite close.
But looking at the standard deviation (1.016667) and 1/(sd of
reciprocals) (57.388320), there's quite a disparity. Also consider that
the closer the values of the samples, the larger 1/(sd of reciprocals)
grows!
I don't know that there is such a thing as a harmonic standard
deviation, but I'd expect there to be some way to describe the
dispersion around a harmonic mean. We should be able to solve
problems like the following:
Given two sample sets of times:
A (a sample of the population ALPHA) = { 5, 6, 7 }
B (a sample of the population BETA) = { 4, 6, infinity }
what's the probability the average time of BETA is less than the
average time of ALPHA?
The harmonic mean of A is 5.88785; that of B is 7.20000. But what are
the standard deviations?
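The two harmonic means quoted above can be verified directly; in Python, `float('inf')` makes the reciprocal of an infinite time come out as exactly zero:

```python
# Checking the harmonic means of the two sample sets above.  The
# reciprocal of float('inf') is 0, so infinite times drop out of the sum.
A = [5.0, 6.0, 7.0]
B = [4.0, 6.0, float('inf')]

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

print(harmonic_mean(A))  # about 5.88785
print(harmonic_mean(B))  # about 7.2
```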
Dwayne
|
1464.3 | A Possible Approach | VAXRT::BRIDGEWATER | Eclipsing the past | Thu Jun 27 1991 21:49 | 38 |
| I don't know the actual underlying situation that's being modeled
here, so what follows may not be at all useful. I'd think that if a
harmonic mean is more useful than an arithmetic mean, it means that
you are really interested in 1/units rather than units (for example
rate rather than time). Using this view, the regular standard
deviation of the reciprocals of the sample points would be most useful
as a measure of dispersion. It is measured in 1/units. You could
translate this to units by taking the reciprocal, but it's not
meaningful as a measure of dispersion in any way that I can tell.
Continuing in this vein, you might want to calculate (in units) points
lying a certain number of standard deviations from the mean. This
can be done by calculating everything using the reciprocal distribution
and then taking the reciprocal of the result to convert it from 1/units
to units.
An example:
s[1] = 1 units
s[2] = 2 units
s[3] = 3 units
measured in units
arithmetic mean = 2.000000 std dev = 0.816497
-z sd = 2 - 0.816497*z -1 sd = 1.183503
+z sd = 2 + 0.816497*z +1 sd = 2.816497
measured in 1/units
reciprocal mean = 0.611111 reciprocal std dev = 0.283279
-z sd = 0.611111 - 0.283279*z -1 sd = 0.327832
+z sd = 0.611111 + 0.283279*z +1 sd = 0.894390
measured in units
harmonic mean = 1.636364 (z=0)
-z sd = 1 / (0.611111 - 0.283279*z) -1 sd = 3.050341
+z sd = 1 / (0.611111 + 0.283279*z) +1 sd = 1.118081
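The example above can be reproduced in Python; `statistics.pstdev` is the population standard deviation used here, and everything is computed on the reciprocals before converting back to units:

```python
# Reproducing the example: all statistics are computed on the
# reciprocals (1/units), and only at the end converted back to units.
from statistics import mean, pstdev

samples = [1.0, 2.0, 3.0]
recips = [1.0 / s for s in samples]

r_mean = mean(recips)  # 11/18 = 0.611111
r_sd = pstdev(recips)  # sqrt(26)/18 = 0.283279 (population sd)

def point_at(z):
    """Point lying z reciprocal-sd's from the reciprocal mean,
    converted back to units by taking the reciprocal."""
    return 1.0 / (r_mean + z * r_sd)

print(point_at(0))   # harmonic mean, 1.636364
print(point_at(-1))  # 3.050341
print(point_at(+1))  # 1.118081
```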
- Don
|
1464.4 | | IMTDEV::ROBERTS | Reason, Purpose, Self-esteem | Mon Jul 01 1991 12:04 | 16 |
| Thanks, Don.
I have two questions about your example, please:
1. How did you arrive at the reciprocal std dev of 0.283279? I can't
seem to come up with the same number. Are you taking the s dev of {1/1,
1/2, 1/3}?
2. In the last "measured in units" section, you show the harmonic mean
to be 1.636364. But then you use the reciprocal mean 0.611111 in the
expressions for z sd. Was this intentional?
Again, I appreciate your help.
Dwayne
|
1464.5 | | VAXRT::BRIDGEWATER | Eclipsing the past | Wed Jul 03 1991 06:15 | 41 |
| Re: .4
>1. How did you arrive at the reciprocal std dev of 0.283279? I can't
>seem to come up with the same number. Are you taking the s dev of {1/1,
>1/2, 1/3}?
Yes, the arithmetic mean of {1/1, 1/2, 1/3} is 11/18 = 0.611111 and the
standard deviation is sqrt(26)/18 = 0.283279. Actually, if these are
samples, we should use n-1 = 2 as the divisor inside the square root
for the standard deviation, but I calculated the "population" standard
deviation which uses n = 3 as the divisor. So, in detail:
s = sqrt (((1/1-11/18)**2 + (1/2-11/18)**2 + (1/3-11/18)**2)/3)
= sqrt (((7/18)**2 + (-2/18)**2 + (-5/18)**2)/3)
= sqrt ((49+4+25)/(3*18*18))
= sqrt (78/3) / 18
= sqrt (26)/18 = 0.283279
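A quick check of the divisor point above, using Python's statistics module (`pstdev` divides by n, `stdev` by n-1):

```python
# The divisor question: statistics.pstdev uses n (population),
# statistics.stdev uses n-1 (sample).
import math
from statistics import pstdev, stdev

recips = [1/1, 1/2, 1/3]
print(pstdev(recips))      # sqrt(26)/18 = 0.283279
print(math.sqrt(26) / 18)  # same value
print(stdev(recips))       # larger, since the divisor is n-1 = 2
```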
>2. In the last "measured in units" section, you show the harmonic mean
>to be 1.636364. But then you use the reciprocal mean 0.611111 in the
>expressions for z sd. Was this intentional?
Yes, I'm using the reciprocal mean there. Remember this is because my
analysis uses the distribution and statistics of the reciprocals as
the frame of reference. In this view there isn't a single number that
represents a "harmonic standard deviation". Instead, the best that you
can do is calculate the point that represents z standard deviations above
or below the mean using the reciprocals and then take the reciprocal of
the result to express the result in the original units of the sample.
This kind of calculation would generally be useful if you were trying
to calculate confidence intervals. But to use them successfully in this
way, you'd also need to know more about the probability distribution of
the sample reciprocals.
- Don
|
1464.6 | Idle Statistics Churning, FWIW: | CHOSRV::YOUNG | Still billing, after all these years. | Wed Jul 03 1991 18:07 | 26 |
| Since:
    Am = SUM(i=1,N) S(i) / N                    : "Am" is the Arithmetic Mean
and
    Hm = N / ( SUM(i=1,N) 1/S(i) )              : "Hm" is the Harmonic Mean
and
    Sd = SQRT( ( SUM(i=1,N) (Am-S(i))^2 ) / N ) : "Sd" is the Standard Deviation
Then I would assume:
    Hd = SQRT( N / ( SUM(i=1,N) 1/(Hm-S(i))^2 ) )
Where "Hd" is the Harmonic Deviation. For the values that you gave,
this results in a value of .15434, which is much closer to your
Standard Deviation.
Of course I am using the formulas for a population not a sample, so you
may have to adjust this.
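The Hd formula transcribes directly into Python. Since the data from .0 isn't reproduced in this topic, the small set {5, 6, 7} from .2 is used below instead, so the .15434 figure isn't reproduced here:

```python
# Direct transcription of the proposed Hd formula, tried on the
# {5, 6, 7} sample from .2 (the data behind .15434 is in .0).
import math

def harmonic_mean(S):
    return len(S) / sum(1.0 / s for s in S)

def harmonic_deviation(S):
    Hm = harmonic_mean(S)
    return math.sqrt(len(S) / sum(1.0 / (Hm - s) ** 2 for s in S))

print(harmonic_deviation([5.0, 6.0, 7.0]))  # about 0.1918
```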
-- Barry
|
1464.7 | | VAXRT::BRIDGEWATER | Eclipsing the past | Mon Jul 08 1991 13:09 | 11 |
| Re: .6
This is an interesting approach I hadn't thought of. Unfortunately,
it appears to be sensitive to how close any one of the sample points is
to the harmonic mean. That is, if any one sample point is very close
to the harmonic mean, the "harmonic standard deviation" is very small
regardless of how distant the other samples are. If you are unlucky
enough to get one sample point equal to the harmonic mean, then the
harmonic standard deviation is zero.
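The sensitivity described above is easy to demonstrate. In this made-up Python example, a sample containing a huge outlier still gets a smaller "harmonic deviation" than a modest sample, because one of its points happens to lie near the harmonic mean:

```python
# Demonstrating the sensitivity: one point near the harmonic mean
# makes Hd tiny no matter how far away the other samples are.
# (Sample values are made up.)
import math

def harmonic_mean(S):
    return len(S) / sum(1.0 / s for s in S)

def harmonic_deviation(S):
    Hm = harmonic_mean(S)
    return math.sqrt(len(S) / sum(1.0 / (Hm - s) ** 2 for s in S))

spread_out = [1.0, 2.0, 3.0]
with_outlier = [1.0, 1.9, 300.0]  # huge outlier, but 1.9 is near Hm

print(harmonic_deviation(spread_out))
print(harmonic_deviation(with_outlier))  # smaller, despite the outlier
```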
- Don
|
1464.8 | What is the problem that you are trying to solve? | CHOSRV::YOUNG | Still billing, after all these years. | Tue Jul 09 1991 12:07 | 12 |
| True. But the question that is begging to be asked here is: "Why are
you using Harmonic Means in the first place?"
Harmonic means are a somewhat unusual Figure of Merit for a dataset in
the first place. Presumably they were chosen with some reason in mind,
and that reason should lead us to some logic for determining an
appropriate Deviation measurement.
Without this, you really cannot say that one measure is better than
another, or that one measure of Deviation is better than another.
-- Barry
|
1464.9 | Rats! | IMTDEV::ROBERTS | Reason, Purpose, Self-esteem | Tue Jul 09 1991 18:28 | 24 |
| Harmonics were chosen because the sample set can include infinite
values, but no zero values; that is, {5, 10, 100, oo, oo} is possible.
In this example, the arithmetic mean is oo; but the harmonic mean is
16.1. The units involved are minutes. "oo" means it didn't finish.
If you want a concrete example, suppose you're measuring the time it
takes two batches of rats to navigate a maze. Batch "A" takes vitamin
Z, while batch "B" is the control. Each batch has 9 rats. The sample
times in minutes are:
r a t
1 2 3 4 5 6 7 8 9
A 1.7 1.0 oo 2.1 1.6 1.2 0.9 1.3 1.8
B 1.6 oo 1.9 oo 2.6 1.8 2.1 1.0 2.2
What's the probability that vitamin Z helps rats navigate the maze?
The harmonic mean of batch A's times is 1.5 minutes; that of batch B
is 2.2 minutes.
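The batch means above can be checked directly; a Python sketch where "oo" (did not finish) becomes `float('inf')`, whose reciprocal is zero:

```python
# Checking the batch harmonic means from the rat-maze table above.
oo = float('inf')
A = [1.7, 1.0, oo, 2.1, 1.6, 1.2, 0.9, 1.3, 1.8]
B = [1.6, oo, 1.9, oo, 2.6, 1.8, 2.1, 1.0, 2.2]

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

print(round(harmonic_mean(A), 1))  # 1.5
print(round(harmonic_mean(B), 1))  # 2.2
```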
Dwayne
|
1464.10 | you gotta know the territory | CSSE::NEILSEN | Wally Neilsen-Steinhardt | Wed Jul 10 1991 13:58 | 48 |
| This topic illustrates a general rule in statistics: the more you understand
the process the better you can apply statistics to it. The general question in
.0 has no answer, because too little is known to define an answer.
Based on the example in .9, I would suggest two things:
1 - compute the fraction of rats finishing in each case. There are good tests
to determine whether the differences are statistically significant.
2 - look at the distribution of times for rats who finish in each case. Check
the form of this distribution against some standard forms, like normal
(Gaussian), log-normal, exponential, and so forth. If necessary, transform
your coordinate to one which gives a more nearly normal distribution. Use the
standard hypothesis test to determine whether the difference in times for rats
who finish is statistically significant.
Several more random comments:
All the above is based on .9 and my general ignorance of rats and vitamins. If
I knew more, for example, the usual distribution of maze running times, the
usual fraction of rats not completing, the common effects of vitamins, and so
forth, I would want to incorporate that additional information in my
statistical approach. Similarly, if your real problem is one of computer
hardware performance analysis, then I would look for completely different
information, and incorporate that into my statistical analysis.
If there is a usual practice in your field of study, you should follow it.
Particularly in government and commercial work, it is usually better to follow
standard practice, even if there is a statistical method which is in theory
superior.
You can't really put down infinity for any times, since infinity is well known
to be longer than the average lifetime of a graduate student. Some of those
rats might eventually have found their way out of the maze. All you really know
is the time at which you terminated that run. If you include that time, you
are introducing an extraneous factor which may affect your results; this is why
I recommended analyzing only the times for the rats who finish.
Your sample distribution must be close to normal in order to apply the usual
statistical tests of hypotheses. If your underlying distribution is far from
normal, it may take a lot of samples to make the sample distribution close to
normal. This is why it helps to transform your coordinates to bring the
underlying distribution closer to normal. .3 is one example of this, using
the reciprocal transform y = 1/x.
Note that "sample distribution" in the paragraph above is sometimes, and more
correctly, called the distribution of sample means.
|
1464.11 | or maybe a Z-deficiency | VMSDEV::HALLYB | The Smart Money was on Goliath | Thu Jul 11 1991 12:25 | 10 |
| If you stop the experiment after 3 minutes, then clearly you don't
want to (nor did you) say the rat took 3 minutes to finish.
That would be wrong, because the rat didn't finish in 3 minutes.
By the same logic you can't put down oo either, because you have no
proof that it would take the rat an infinite time to navigate the maze.
An interesting special case would be if the rat died while trying to
navigate the maze (perhaps an overdose of Vitamin Z :-) but it would
probably be wisest to discard those special cases.
John
|
1464.12 | Comparing samples | BHUNA::PFANG | | Fri Jul 12 1991 09:53 | 38 |
| RE: <<< Note 1464.9 by IMTDEV::ROBERTS "Reason, Purpose, Self-esteem" >>>
-< Rats! >-
So the real question is not "What's the mean and spread?"
The real question is:
"Is sample A significantly different from sample B?"
If you want to answer the 2nd question, then don't worry about which
descriptive statistic to use to represent the location (e.g. harmonic
mean) or spread (harmonic standard deviation?). The question, then,
becomes which comparison test is appropriate for comparing the two
groups of data.
For example, if the data followed a simple `normal' distribution,
one might investigate using the t-test to compare the samples. In this
case, we don't know much about the distributions, but it seems certain
that the data don't come from a normal distribution.
A fairly robust comparison test is called the Mann-Whitney or
Wilcoxon test. It is robust in that it is not sensitive to the shape of
the underlying distribution, nor is it sensitive to `wide' variations
(like infinities) in the data. Essentially what the test does is rank
each data value based on where it falls in the total order of both
groups. In this case, the infinities rank just higher than the next
highest value (2.6 minutes).
The proper way to do this comparison would be to state, a priori,
what risk one is willing to take in making an incorrect conclusion. For
example, a typical statement would be
`I want to say vitamin Z helps rats navigate the maze, and I want to
take no more than a 5% risk of being wrong.'
The answer in this case would be `Vitamin Z does not help.'
This is because, based on the Mann-Whitney test, there is a .1615
probability that the A and B samples were drawn from a single
distribution.
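The ranking step described above can be sketched in pure Python. Tied values, including the infinities, share the average of their ranks; the .1615 p-value would additionally require the null distribution of U, which isn't computed here:

```python
# Sketch of the Mann-Whitney ranking on the rat-maze data from .9.
# Tied values (including the infinities) share the average of their ranks.
oo = float('inf')
A = [1.7, 1.0, oo, 2.1, 1.6, 1.2, 0.9, 1.3, 1.8]
B = [1.6, oo, 1.9, oo, 2.6, 1.8, 2.1, 1.0, 2.2]

def ranks(values):
    """Map each distinct value to the average of the ranks it occupies."""
    ordered = sorted(values)
    r = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # 0-based positions i..j-1 correspond to ranks i+1..j; average them
        r[ordered[i]] = (i + 1 + j) / 2.0
        i = j
    return r

r = ranks(A + B)
rank_sum_A = sum(r[x] for x in A)
U_A = rank_sum_A - len(A) * (len(A) + 1) / 2.0
print(U_A)  # 21.0 -- a small U means A tends to rank lower (faster times)
```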
|
1464.13 | Un-asking the question | AGOUTL::BELDIN | Pull us together, not apart | Fri Jul 12 1991 11:10 | 102 |
| re from .9 on
Your consultants are giving you good advice. Think about the way
the data is generated and how to use standard parametric or
non-parametric methods before you invoke the harmonics. Here is
what I would say to a client consulting me with this problem.
Comparing two samples of maze running times
The problem is to determine whether two samples of experimental animals,
previously treated with treatments A and B, differ statistically in
1) the length of time taken to run the maze, or
2) the fraction of animals able to complete the test.
1. Let us start by defining two families of random variables,
X(i) and Y(i).
1.1 Y(i) is a Boolean variable which records whether or
not the i'th subject completed the maze.
1.2 Define X(i) conditionally on Y(i) being true, as the
length of time required by the i'th subject to complete
the maze. If Y(i) is false, then X(i) is undefined.
2. Extend the notation with a parameter S which represents
the treatment to which an animal is subjected. S can
assume the values A and B. With this addition, we
describe the random variables X and Y as
2.1 Y(i,S) is the Boolean response to S of the i'th
subject.
2.2 X(i,S) is the conditionally defined response time of
the i'th subject to stimulus S.
3. The hypotheses which interest us are:
3.1 The distribution of Y(i,S) for i=1..N is independent
of the choice of S.
3.2 The distribution of X(i,S) for i in {k|Y(k,S)} is
independent of the choice of S.
4. We formulate the response models as:
4.1 Y(i,A) is a Bernoulli variable with parameter a, and
Y(i,B) is a Bernoulli variable with parameter b.
4.2 X(i,A) is a linear function, A + Z(i) and X(i,B) is a
linear function, B + Z(i).
4.3 Z(i) are independent random variables with mean m and
standard deviation s. Note that you can't estimate m by
itself because it will be confounded by A or B. The best
you can do is estimate A+m, B+m, or A-B. You don't need to
use harmonic means and standard deviations (for which there
is no reliable theory).
5. Next we need to specify how the stimuli, S, were assigned
to the subjects. There are several cases:
5.1 The subjects were randomly subdivided into two
sub-samples and each member of one sub-sample was treated
with A and each member of the other sub-sample was treated
with B.
5.2 The subjects were individually and independently
assigned treatment (or stimulus) A or B.
5.3 The subjects were presented to the experimenters in
serial fashion, each subject was then assigned A or B.
5.4 Subjects were classified on several traits which might
be correlated with X and Y. Then random assignments of A
and B were made, attempting to balance the distribution of
the correlated traits.
5.5 The same subjects were treated, sometimes with A and
sometimes with B. (This design is only useful if the
effect is expected to be temporary.)
5.6 The subjects were allowed to self-select whether they
would receive treatment A or B (by making both available,
for example).
All of this is just to show that we can study the set of data to
a variety of levels of detail and for a variety of experimental
designs, all of which appear to be consistent with the original
description, but which in no case require (or even suggest) the
use of the harmonic mean.
Dick
|
1464.14 | agreement with previous two | CSSE::NEILSEN | Wally Neilsen-Steinhardt | Fri Jul 12 1991 14:51 | 13 |
| .12 and .13 have some good stuff.
An extension to .12 is that you can also ask what risk you are willing to take
of incorrectly making the statement
Vitamin Z has no effect
The two risks together can be combined with your understanding of the system
you are studying to decide how to design your experiment:
"I'll accept a 5% risk type I error (as in .12) and a 10% risk of
making a type II error (as above), and the assumptions of Mann-Whitney
look correct for my system, so I'll need to run 38 trials."
|