T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
1464.1 | | GUESS::DERAMO | duly noted | Thu Jun 27 1991 01:43 | 13 |
| Since
1/(harmonic mean) = arithmetic mean of reciprocals,
then I would guess that the harmonic standard deviation
is given by
1/(harmonic sd) = standard deviation of reciprocals
It sounds logical; however, I do not recall ever having
seen the term "harmonic standard deviation" before.
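The guess above can be checked numerically; a minimal Python sketch (the sample values here are made up for illustration):

```python
# Sketch of the guess: harmonic mean via reciprocals, and the guessed
# "harmonic sd" as 1/(sd of the reciprocals).  Sample data is made up.
from statistics import mean, pstdev

sample = [5.0, 6.0, 7.0]
recips = [1.0 / x for x in sample]

harmonic_mean = 1.0 / mean(recips)          # = n / sum(1/x)
guessed_harmonic_sd = 1.0 / pstdev(recips)  # the guess: 1/(sd of reciprocals)

print(harmonic_mean)   # about 5.888
print(guessed_harmonic_sd)
```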
Dan
|
1464.2 | Doubts | IMTDEV::ROBERTS | Reason, Purpose, Self-esteem | Thu Jun 27 1991 15:26 | 30 |
| Hi, Dan
Well, that's pretty much what I guessed, too, but I'm doubtful.
Looking at the arithmetic mean (7.716667) and the harmonic mean
(7.626850), they're quite close.
But looking at the standard deviation (1.016667) and 1/(sd of
reciprocals) (57.388320), there's quite a disparity. Also consider that
the closer the values of the samples, the larger 1/(sd of reciprocals)
grows!
I don't know that there is such a thing as a harmonic standard
deviation, but I'd expect there to be some way to describe the
dispersion around a harmonic mean. We should be able to solve
problems like the following:
Given two sample sets of times:
A (a sample of the population ALPHA) = { 5, 6, 7 }
B (a sample of the population BETA) = { 4, 6, infinity }
what's the probability the average time of BETA is less than the
average time of ALPHA?
The harmonic mean of A is 5.88785; that of B is 7.20000. But what are
the standard deviations?
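The two harmonic means quoted above can be verified directly; in Python, `float('inf')` makes the reciprocal of an infinite time come out as exactly zero:

```python
# Checking the harmonic means of the two sample sets above.  The
# reciprocal of float('inf') is 0, so infinite times drop out of the sum.
A = [5.0, 6.0, 7.0]
B = [4.0, 6.0, float('inf')]

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

print(harmonic_mean(A))  # about 5.88785
print(harmonic_mean(B))  # about 7.2
```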
Dwayne
|
1464.3 | A Possible Approach | VAXRT::BRIDGEWATER | Eclipsing the past | Thu Jun 27 1991 21:49 | 38 |
| I don't know the actual underlying situation that's being modeled
here, so what follows may not be at all useful. I'd think that if a
harmonic mean is more useful than an arithmetic mean, it means that
you are really interested in 1/units rather than units (for example
rate rather than time). Using this view, the regular standard
deviation of the reciprocals of the sample points would be most useful
as a measure of dispersion. It is measured in 1/units. You could
translate this to units by taking the reciprocal, but it's not
meaningful as a measure of dispersion in any way that I can tell.
Continuing in this vein, you might want to calculate (in units) points
lying a certain number of standard deviations from the mean. This
can be done by calculating everything using the reciprocal distribution
and then taking the reciprocal of the result to convert it from 1/units
to units.
An example:
s[1] = 1 units
s[2] = 2 units
s[3] = 3 units
measured in units
arithmetic mean = 2.000000 std dev = 0.816497
-z sd = 2 - 0.816497*z -1 sd = 1.183503
+z sd = 2 + 0.816497*z +1 sd = 2.816497
measured in 1/units
reciprocal mean = 0.611111 reciprocal std dev = 0.283279
-z sd = 0.611111 - 0.283279*z -1 sd = 0.327832
+z sd = 0.611111 + 0.283279*z +1 sd = 0.894390
measured in units
harmonic mean = 1.636364 (z=0)
-z sd = 1 / (0.611111 - 0.283279*z) -1 sd = 3.050341
+z sd = 1 / (0.611111 + 0.283279*z) +1 sd = 1.118081
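The example above can be reproduced in Python; `statistics.pstdev` is the population standard deviation used here, and everything is computed on the reciprocals before converting back to units:

```python
# Reproducing the example: all statistics are computed on the
# reciprocals (1/units), and only at the end converted back to units.
from statistics import mean, pstdev

samples = [1.0, 2.0, 3.0]
recips = [1.0 / s for s in samples]

r_mean = mean(recips)  # 11/18 = 0.611111
r_sd = pstdev(recips)  # sqrt(26)/18 = 0.283279 (population sd)

def point_at(z):
    """Point lying z reciprocal-sd's from the reciprocal mean,
    converted back to units by taking the reciprocal."""
    return 1.0 / (r_mean + z * r_sd)

print(point_at(0))   # harmonic mean, 1.636364
print(point_at(-1))  # 3.050341
print(point_at(+1))  # 1.118081
```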
- Don
|
1464.4 | | IMTDEV::ROBERTS | Reason, Purpose, Self-esteem | Mon Jul 01 1991 12:04 | 16 |
| Thanks, Don.
I have two questions about your example, please:
1. How did you arrive at the reciprocal std dev of 0.283279? I can't
seem to come up with the same number. Are you taking the s dev of {1/1,
1/2, 1/3}?
2. In the last "measured in units" section, you show the harmonic mean
to be 1.636364. But then you use the reciprocal mean 0.611111 in the
expressions for z sd. Was this intentional?
Again, I appreciate your help.
Dwayne
|
1464.5 | | VAXRT::BRIDGEWATER | Eclipsing the past | Wed Jul 03 1991 06:15 | 41 |
| Re: .4
>1. How did you arrive at the reciprocal std dev of 0.283279? I can't
>seem to come up with the same number. Are you taking the s dev of {1/1,
>1/2, 1/3}?
Yes, the arithmetic mean of {1/1, 1/2, 1/3} is 11/18 = 0.611111 and the
standard deviation is sqrt(26)/18 = 0.283279. Actually, if these are
samples, we should use n-1 = 2 as the divisor inside the square root
for the standard deviation, but I calculated the "population" standard
deviation which uses n = 3 as the divisor. So, in detail:
s = sqrt (((1/1-11/18)**2 + (1/2-11/18)**2 + (1/3-11/18)**2)/3)
= sqrt (((7/18)**2 + (-2/18)**2 + (-5/18)**2)/3)
= sqrt ((49+4+25)/(3*18*18))
= sqrt (78/3) / 18
= sqrt (26)/18 = 0.283279
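A quick check of the divisor point above, using Python's statistics module (`pstdev` divides by n, `stdev` by n-1):

```python
# The divisor question: statistics.pstdev uses n (population),
# statistics.stdev uses n-1 (sample).
import math
from statistics import pstdev, stdev

recips = [1/1, 1/2, 1/3]
print(pstdev(recips))      # sqrt(26)/18 = 0.283279
print(math.sqrt(26) / 18)  # same value
print(stdev(recips))       # larger, since the divisor is n-1 = 2
```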
>2. In the last "measured in units" section, you show the harmonic mean
>to be 1.636364. But then you use the reciprocal mean 0.611111 in the
>expressions for z sd. Was this intentional?
Yes, I'm using the reciprocal mean there. Remember this is because my
analysis uses the distribution and statistics of the reciprocals as
the frame of reference. In this view there isn't a single number that
represents a "harmonic standard deviation". Instead, the best that you
can do is calculate the point that represents z standard deviations above
or below the mean using the reciprocals and then take the reciprocal of
the result to express the result in the original units of the sample.
This kind of calculation would generally be useful if you were trying
to calculate confidence intervals. But to use them successfully in this
way, you'd also need to know more about the probability distribution of
the sample reciprocals.
- Don
|
1464.6 | Idle Statistics Churning, FWIW: | CHOSRV::YOUNG | Still billing, after all these years. | Wed Jul 03 1991 18:07 | 26 |
| Since:
    Am = SUM(i=1,N) S(i) / N                    : "Am" is the Arithmetic Mean
and
    Hm = N / ( SUM(i=1,N) 1/S(i) )              : "Hm" is the Harmonic Mean
and
    Sd = SQRT( ( SUM(i=1,N) (Am-S(i))^2 ) / N ) : "Sd" is the Standard Deviation
Then I would assume:
    Hd = SQRT( N / ( SUM(i=1,N) 1/(Hm-S(i))^2 ) )
Where "Hd" is the Harmonic Deviation. For the values that you gave,
this results in a value of .15434, which is much closer to your
Standard Deviation.
Of course I am using the formulas for a population not a sample, so you
may have to adjust this.
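The Hd formula transcribes directly into Python. Since the data from .0 isn't reproduced in this topic, the small set {5, 6, 7} from .2 is used below instead, so the .15434 figure isn't reproduced here:

```python
# Direct transcription of the proposed Hd formula, tried on the
# {5, 6, 7} sample from .2 (the data behind .15434 is in .0).
import math

def harmonic_mean(S):
    return len(S) / sum(1.0 / s for s in S)

def harmonic_deviation(S):
    Hm = harmonic_mean(S)
    return math.sqrt(len(S) / sum(1.0 / (Hm - s) ** 2 for s in S))

print(harmonic_deviation([5.0, 6.0, 7.0]))  # about 0.1918
```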
-- Barry
|
1464.7 | | VAXRT::BRIDGEWATER | Eclipsing the past | Mon Jul 08 1991 13:09 | 11 |
| Re: .6
This is an interesting approach I hadn't thought of. Unfortunately,
it appears to be sensitive to how close any one of the sample points is
to the harmonic mean. That is, if any one sample point is very close
to the harmonic mean, the "harmonic standard deviation" is very small
regardless of how distant the other samples are. If you are unlucky
enough to get one sample point equal to the harmonic mean, then the
harmonic standard deviation is zero.
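The sensitivity described above is easy to demonstrate. In this made-up Python example, a sample containing a huge outlier still gets a smaller "harmonic deviation" than a modest sample, because one of its points happens to lie near the harmonic mean:

```python
# Demonstrating the sensitivity: one point near the harmonic mean
# makes Hd tiny no matter how far away the other samples are.
# (Sample values are made up.)
import math

def harmonic_mean(S):
    return len(S) / sum(1.0 / s for s in S)

def harmonic_deviation(S):
    Hm = harmonic_mean(S)
    return math.sqrt(len(S) / sum(1.0 / (Hm - s) ** 2 for s in S))

spread_out = [1.0, 2.0, 3.0]
with_outlier = [1.0, 1.9, 300.0]  # huge outlier, but 1.9 is near Hm

print(harmonic_deviation(spread_out))
print(harmonic_deviation(with_outlier))  # smaller, despite the outlier
```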
- Don
|
1464.8 | What is the problem that you are trying to solve? | CHOSRV::YOUNG | Still billing, after all these years. | Tue Jul 09 1991 12:07 | 12 |
| True. But the question that is begging to be asked here is: "Why are
you using Harmonic Means in the first place?"
Harmonic means are a somewhat unusual Figure of Merit for a dataset in
the first place. Presumably they were chosen with some reason in mind,
and that reason should lead us to some logic for determining an
appropriate Deviation measurement.
Without this, you really cannot say that one measure is better than
another, or that one measure of Deviation is better than another.
-- Barry
|
1464.9 | Rats! | IMTDEV::ROBERTS | Reason, Purpose, Self-esteem | Tue Jul 09 1991 18:28 | 24 |
| Harmonics were chosen because the sample set can include infinite
values, but no zero values; that is, {5, 10, 100, oo, oo} is possible.
In this example, the arithmetic mean is oo; but the harmonic mean is
16.1. The units involved are minutes. "oo" means it didn't finish.
If you want a concrete example, suppose you're measuring the time it
takes two batches of rats to navigate a maze. Batch "A" takes vitamin
Z, while batch "B" is the control. Each batch has 9 rats. The sample
times in minutes are:
r a t
1 2 3 4 5 6 7 8 9
A 1.7 1.0 oo 2.1 1.6 1.2 0.9 1.3 1.8
B 1.6 oo 1.9 oo 2.6 1.8 2.1 1.0 2.2
What's the probability that vitamin Z helps rats navigate the maze?
The harmonic mean of batch A's times is 1.5 minutes; that of batch B
is 2.2 minutes.
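The batch means above can be checked directly; a Python sketch where "oo" (did not finish) becomes `float('inf')`, whose reciprocal is zero:

```python
# Checking the batch harmonic means from the rat-maze table above.
oo = float('inf')
A = [1.7, 1.0, oo, 2.1, 1.6, 1.2, 0.9, 1.3, 1.8]
B = [1.6, oo, 1.9, oo, 2.6, 1.8, 2.1, 1.0, 2.2]

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

print(round(harmonic_mean(A), 1))  # 1.5
print(round(harmonic_mean(B), 1))  # 2.2
```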
Dwayne
|
1464.10 | you gotta know the territory | CSSE::NEILSEN | Wally Neilsen-Steinhardt | Wed Jul 10 1991 13:58 | 48 |
| This topic illustrates a general rule in statistics: the more you understand
the process the better you can apply statistics to it. The general question in
.0 has no answer, because too little is known to define an answer.
Based on the example in .9, I would suggest two things:
1 - compute the fraction of rats finishing in each case. There are good tests
to determine whether the differences are statistically significant.
2 - look at the distribution of times for rats who finish in each case. Check
the form of this distribution against some standard forms, like normal
(Gaussian), log-normal, exponential, and so forth. If necessary, transform
your coordinate to one which gives a more nearly normal distribution. Use the
standard hypothesis test to determine whether the difference in times for rats
who finish is statistically significant.
Several more random comments:
All the above is based on .9 and my general ignorance of rats and vitamins. If
I knew more, for example, the usual distribution of maze running times, the
usual fraction of rats not completing, the common effects of vitamins, and so
forth, I would want to incorporate that additional information in my
statistical approach. Similarly, if your real problem is one of computer
hardware performance analysis, then I would look for completely different
information, and incorporate that into my statistical analysis.
If there is a usual practice in your field of study, you should follow it.
Particularly in government and commercial work, it is usually better to follow
standard practice, even if there is a statistical method which is in theory
superior.
You can't really put down infinity for any times, since infinity is well known
to be longer than the average lifetime of a graduate student. Some of those
rats might eventually have found their way out of the maze. All you really know
is the time at which you terminated that run. If you include that time, you
are introducing an extraneous factor which may affect your results; this is why
I recommended analyzing only the times for the rats who finish.
Your sample distribution must be close to normal in order to apply the usual
statistical tests of hypotheses. If your underlying distribution is far from
normal, it may take a lot of samples to make the sample distribution close to
normal. This is why it helps to transform your coordinates to bring the
underlying distribution closer to normal. .3 is one example of this, using
the reciprocal transform y = 1/x.
Note that "sample distribution" in the paragraph above is sometimes, and more
correctly, called the distribution of sample means.
|
1464.11 | or maybe a Z-deficiency | VMSDEV::HALLYB | The Smart Money was on Goliath | Thu Jul 11 1991 12:25 | 10 |
| If you stop the experiment after 3 minutes, then clearly you don't
want to (nor did you) say the rat took 3 minutes to finish.
That would be wrong, because the rat didn't finish in 3 minutes.
By the same logic you can't put down oo either, because you have no
proof that it would take the rat an infinite time to navigate the maze.
An interesting special case would be if the rat died while trying to
navigate the maze (perhaps an overdose of Vitamin Z :-) but it would
probably be wisest to discard those special cases.
John
|
1464.12 | Comparing samples | BHUNA::PFANG | | Fri Jul 12 1991 09:53 | 38 |
| RE: <<< Note 1464.9 by IMTDEV::ROBERTS "Reason, Purpose, Self-esteem" >>>
-< Rats! >-
So the real question is not "What's the mean and spread?"
The real question is:
"Is sample A significantly different from sample B?"
If you want to answer the 2nd question, then don't worry about which
descriptive statistic to use to represent the location (e.g. harmonic
mean) or spread (harmonic standard deviation?). The question, then,
becomes which comparison test is appropriate for comparing the two
groups of data.
For example, if the data followed a simple `normal' distribution,
one might investigate using the t-test to compare the samples. In this
case, we don't know much about the distributions, but it seems certain
that the data don't come from a normal distribution.
A fairly robust comparison test is called the Mann-Whitney or
Wilcoxon test. It is robust in that it is not sensitive to the shape of
the underlying distribution, nor is it sensitive to `wide' variations
(like infinities) in the data. Essentially what the test does is rank
each data value based on where it falls in the total order of both
groups. In this case, the infinities rank just higher than the next
highest value (2.6 minutes).
The proper way to do this comparison would be to state, a priori,
what risk one is willing to take in making an incorrect conclusion. For
example, a typical statement would be
`I want to say vitamin Z helps rats navigate the maze, and I want to
take no more than a 5% risk of being wrong.'
The answer in this case would be `Vitamin Z does not help.'
This is because, based on the Mann-Whitney test, there is a .1615
probability that the A and B samples were drawn from a single
distribution.
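The ranking step described above can be sketched in pure Python. Tied values, including the infinities, share the average of their ranks; the .1615 p-value would additionally require the null distribution of U, which isn't computed here:

```python
# Sketch of the Mann-Whitney ranking on the rat-maze data from .9.
# Tied values (including the infinities) share the average of their ranks.
oo = float('inf')
A = [1.7, 1.0, oo, 2.1, 1.6, 1.2, 0.9, 1.3, 1.8]
B = [1.6, oo, 1.9, oo, 2.6, 1.8, 2.1, 1.0, 2.2]

def ranks(values):
    """Map each distinct value to the average of the ranks it occupies."""
    ordered = sorted(values)
    r = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # 0-based positions i..j-1 correspond to ranks i+1..j; average them
        r[ordered[i]] = (i + 1 + j) / 2.0
        i = j
    return r

r = ranks(A + B)
rank_sum_A = sum(r[x] for x in A)
U_A = rank_sum_A - len(A) * (len(A) + 1) / 2.0
print(U_A)  # 21.0 -- a small U means A tends to rank lower (faster times)
```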
|
1464.13 | Un-asking the question | AGOUTL::BELDIN | Pull us together, not apart | Fri Jul 12 1991 11:10 | 102 |
| re from .9 on
Your consultants are giving you good advice. Think about the way
the data is generated and how to use standard parametric or
non-parametric methods before you invoke the harmonics. Here is
what I would say to a client consulting me with this problem.
Comparing two samples of maze running times
The problem is to determine whether two samples of experimental animals,
previously treated with treatments A and B, differ statistically in
1) the length of time taken to run the maze, or
2) the fraction of animals able to complete the test.
1. Let us start by defining two families of random variables,
X(i) and Y(i).
1.1 Y(i) is a Boolean variable which records whether or
not the i'th subject completed the maze.
1.2 Define X(i) conditionally on Y(i) being true, as the
length of time required by the i'th subject to complete
the maze. If Y(i) is false, then X(i) is undefined.
2. Extend the notation with a parameter S which represents
the treatment to which an animal is subjected. S can
assume the values A and B. With this addition, we
describe the random variables X and Y as
2.1 Y(i,S) is the Boolean response to S of the i'th
subject.
2.2 X(i,S) is the conditionally defined response time of
the i'th subject to stimulus S.
3. The hypotheses which interest us are:
3.1 The distribution of Y(i,S) for i=1..N is independent
of the choice of S.
3.2 The distribution of X(i,S) for i in {k|Y(k,S)} is
independent of the choice of S.
4. We formulate the response models as:
4.1 Y(i,A) is a Bernoulli variable with parameter a, and
Y(i,B) is a Bernoulli variable with parameter b.
4.2 X(i,A) is a linear function, A + Z(i) and X(i,B) is a
linear function, B + Z(i).
4.3 Z(i) are independent random variables with mean m and
standard deviation s. Note that you can't estimate m by
itself because it will be confounded by A or B. The best
you can do is estimate A+m, B+m, or A-B. You don't need to
use harmonic means and standard deviations (for which there
is no reliable theory).
5. Next we need to specify how the stimuli, S, were assigned
to the subjects. There are several cases:
5.1 The subjects were randomly subdivided into two
sub-samples and each member of one sub-sample was treated
with A and each member of the other sub-sample was treated
with B.
5.2 The subjects were individually and independently
assigned treatment (or stimulus) A or B.
5.3 The subjects were presented to the experimenters in
serial fashion, each subject was then assigned A or B.
5.4 Subjects were classified on several traits which might
be correlated with X and Y. Then random assignments of A
and B were made, attempting to balance the distribution of
the correlated traits.
5.5 The same subjects were treated, sometimes with A and
sometimes with B. (This design is only useful if the
effect is expected to be temporary.)
5.6 The subjects were allowed to self-select whether they
would receive treatment A or B (by making both available,
for example).
All of this is just to show that we can study the set of data to
a variety of levels of detail and for a variety of experimental
designs, all of which appear to be consistent with the original
description, but which in no case require (or even suggest) the
use of the harmonic mean.
Dick
|
1464.14 | agreement with previous two | CSSE::NEILSEN | Wally Neilsen-Steinhardt | Fri Jul 12 1991 14:51 | 13 |
| .12 and .13 have some good stuff.
An extension to .12 is that you can also ask what risk you are willing to take
of incorrectly making the statement
Vitamin Z has no effect
The two risks together can be combined with your understanding of the system
you are studying to decide how to design your experiment:
"I'll accept a 5% risk type I error (as in .12) and a 10% risk of
making a type II error (as above), and the assumptions of Mann-Whitney
look correct for my system, so I'll need to run 38 trials."
|