
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

1464.0. "Harmonic Standard Deviation" by IMTDEV::ROBERTS (Reason, Purpose, Self-esteem) Wed Jun 26 1991 17:58

    I need to calculate some simple statistics on some sample times. The
    stats I need are the harmonic mean and harmonic standard deviation. I
    have no problem calculating the harmonic mean. The harmonic standard
    deviation is giving me some trouble.

    Harmonic mean:                  n
                        ------------------------
                        1/s1 + 1/s2 + ... + 1/sn

    where s1 through sn are the n samples.

    In my experiment, the first three s's are

    s1=8.733333
    s2=7.716667
    s3=6.700000

    The harmonic mean is 7.626850.

    How do I calculate the standard deviation?

    Dwayne
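A minimal Python sketch of the harmonic mean formula in .0, checked against the three sample times quoted there. The function name `harmonic_mean` is chosen here for illustration; the standard library's `statistics.harmonic_mean` computes the same quantity for positive inputs.

```python
def harmonic_mean(samples):
    """Return n / (1/s1 + 1/s2 + ... + 1/sn)."""
    return len(samples) / sum(1.0 / s for s in samples)

# The three sample times from .0
times = [8.733333, 7.716667, 6.700000]
hm = harmonic_mean(times)
print(f"harmonic mean = {hm:.6f}")
```

The result agrees with the 7.626850 quoted in the note to within the rounding of the inputs.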

T.R  Title  User  Personal Name  Date  Lines
1464.1. by GUESS::DERAMO (duly noted) Thu Jun 27 1991 01:43 (13 lines)
        Since
        
        	1/(harmonic mean) = arithmetic mean of reciprocals,
        
        then I would guess that the harmonic standard deviation
        is given by
        
        	1/(harmonic sd) = standard deviation of reciprocals
        
        It sounds logical; however, I do not recall ever having
        seen the term "harmonic standard deviation" before.
        
        Dan
1464.2. "Doubts" by IMTDEV::ROBERTS (Reason, Purpose, Self-esteem) Thu Jun 27 1991 15:26 (30 lines)
    Hi, Dan

    Well, that's pretty much what I guessed, too, but I'm doubtful.

    Looking at the arithmetic mean (7.716667) and the harmonic mean
    (7.626850), they're quite close.

    But looking at the standard deviation (1.016667) and 1/(sd of
    reciprocals) (57.388320), there's quite a disparity. Also consider that
    the closer the values of the samples, the larger 1/(sd of reciprocals)
    grows!

    I don't know that there is such a thing as a harmonic standard
    deviation, but I'd expect there to be some way to describe the
    dispersion around a harmonic mean. We should be able to solve
    problems like the following:

    Given two sample sets of times:

    	A (a sample of the population ALPHA) = { 5, 6, 7 }
    	B (a sample of the population BETA)  = { 4, 6, infinity }

    what's the probability the average time of BETA is less than the
    average time of ALPHA?

    The harmonic mean of A is 5.88785; that of B is 7.20000. But what are
    the standard deviations?

    Dwayne
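The disparity described in .2 can be reproduced with a short Python sketch. It assumes the sample (n-1) standard deviation, which is what matches the 1.016667 figure; `inverse_sd_of_reciprocals` is an illustrative name for the guess from .1, not an established statistic, and the `tight` data set is made up to show the growth effect.

```python
import statistics

def inverse_sd_of_reciprocals(samples):
    """1 / (sample standard deviation of the reciprocals), the guess from .1."""
    return 1.0 / statistics.stdev([1.0 / s for s in samples])

times = [8.733333, 7.716667, 6.700000]
sd = statistics.stdev(times)
inv = inverse_sd_of_reciprocals(times)

# Hypothetical, tightly clustered samples: the reciprocals barely vary,
# so the inverse of their standard deviation becomes very large.
tight = [7.7, 7.716667, 7.73]
inv_tight = inverse_sd_of_reciprocals(tight)

print(f"sd = {sd:.6f}")
print(f"1/(sd of reciprocals) = {inv:.3f}")
print(f"same quantity for tighter samples = {inv_tight:.1f}")
```

This supports the objection in .2: the quantity moves in the wrong direction for a measure of spread.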

1464.3. "A Possible Approach" by VAXRT::BRIDGEWATER (Eclipsing the past) Thu Jun 27 1991 21:49 (38 lines)
    I don't know the actual underlying situation that's being modeled
    here, so what follows may not all be useful.  I'd think that if a
    harmonic mean is more useful than an arithmetic mean, it means that
    you are really interested in 1/units rather than units (for example
    rate rather than time).  Using this view, the regular standard
    deviation of the reciprocals of the sample points would be most useful
    as a measure of dispersion.  It is measured in 1/units.  You could
    translate this to units by taking the reciprocal, but it's not
    meaningful as a measure of dispersion in any way that I can tell.

    Continuing in this vein, you might want to calculate (in units) points
    lying a certain number of standard deviations from the mean.  This
    can be done by calculating everything using the reciprocal distribution
    and then taking the reciprocal of the result to convert it from 1/units
    to units.

    An example:

    s[1] = 1 units
    s[2] = 2 units
    s[3] = 3 units

    measured in units
        arithmetic mean = 2.000000            std dev = 0.816497
        -z sd = 2 - 0.816497*z                -1 sd = 1.183503
        +z sd = 2 + 0.816497*z                +1 sd = 2.816497

    measured in 1/units
        reciprocal mean = 0.611111            reciprocal std dev = 0.283279
        -z sd = 0.611111 - 0.283279*z         -1 sd = 0.327832
        +z sd = 0.611111 + 0.283279*z         +1 sd = 0.894390

    measured in units
        harmonic mean   = 1.636364 (z=0)
        -z sd = 1 / (0.611111 - 0.283279*z)   -1 sd = 3.050341
        +z sd = 1 / (0.611111 + 0.283279*z)   +1 sd = 1.118081

    - Don
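The procedure in .3 can be sketched in Python as follows. It assumes the population (divisor n) standard deviation, which is what reproduces the figures in the example; the function name `reciprocal_points` is hypothetical.

```python
import statistics

def reciprocal_points(samples, z):
    """Points z standard deviations below and above the reciprocal mean,
    converted back to the original units by taking reciprocals.

    Returns (minus_z, plus_z). Note that "minus z" in 1/units becomes the
    larger value in units, because the reciprocal reverses the ordering.
    """
    recips = [1.0 / s for s in samples]
    mean = statistics.mean(recips)
    sd = statistics.pstdev(recips)  # population sd (divisor n), as in .3
    below = mean - z * sd
    above = mean + z * sd
    if below <= 0:
        raise ValueError("mean - z*sd is not positive; no finite point in units")
    return 1.0 / below, 1.0 / above

minus_1, plus_1 = reciprocal_points([1, 2, 3], z=1)
print(f"-1 sd = {minus_1:.6f}, +1 sd = {plus_1:.6f}")
```

The guard reflects a limitation of the approach: for large z the lower point in 1/units can reach zero or go negative, and then there is no meaningful value in the original units.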
1464.4. by IMTDEV::ROBERTS (Reason, Purpose, Self-esteem) Mon Jul 01 1991 12:04 (16 lines)
    Thanks, Don.
    
    I have two questions about your example, please:
    
    1. How did you arrive at the reciprocal std dev of 0.283279? I can't
    seem to come up with the same number. Are you taking the s dev of {1/1,
    1/2, 1/3}?
    
    2. In the last "measured in units" section, you show the harmonic mean
    to be 1.636364. But then you use the reciprocal mean 0.611111 in the
    expressions for z sd. Was this intentional?
    
    Again, I appreciate your help.
    
    Dwayne
    
1464.5. by VAXRT::BRIDGEWATER (Eclipsing the past) Wed Jul 03 1991 06:15 (41 lines)
    Re: .4

    >1. How did you arrive at the reciprocal std dev of 0.283279? I can't
    >seem to come up with the same number. Are you taking the s dev of {1/1,
    >1/2, 1/3}?
    
    Yes, the arithmetic mean of {1/1, 1/2, 1/3} is 11/18 = 0.611111 and the
    standard deviation is sqrt(26)/18 = 0.283279.  Actually, if these are
    samples, we should use n-1 = 2 as the divisor inside the square root
    for the standard deviation, but I calculated the "population" standard
    deviation which uses n = 3 as the divisor.  So, in detail:

    s = sqrt (((1/1-11/18)**2 + (1/2-11/18)**2 + (1/3-11/18)**2)/3)

      = sqrt (((7/18)**2 + (-2/18)**2 + (-5/18)**2)/3)

      = sqrt ((49+4+25)/(3*18*18))

      = sqrt (78/3) / 18

      = sqrt (26)/18 = 0.283279


    >2. In the last "measured in units" section, you show the harmonic mean
    >to be 1.636364. But then you use the reciprocal mean 0.611111 in the
    >expressions for z sd. Was this intentional?
    
    Yes, I'm using the reciprocal mean there.  Remember this is because my
    analysis uses the distribution and statistics of the reciprocals as
    the frame of reference.  In this view there isn't a single number that
    represents a "harmonic standard deviation".  Instead, the best that you
    can do is calculate the point that represents z standard deviations above
    or below the mean using the reciprocals and then take the reciprocal of
    the result to express the result in the original units of the sample.

    This kind of calculation would generally be useful if you were trying
    to calculate confidence intervals.  But to use them successfully in this
    way, you'd also need to know more about the probability distribution of
    the sample reciprocals.

    - Don
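The arithmetic in .5 can be checked exactly with rational numbers. This is only a verification sketch; it also shows the sample (divisor n-1) standard deviation that .5 mentions but does not compute.

```python
from fractions import Fraction
import math

recips = [Fraction(1, s) for s in (1, 2, 3)]
n = len(recips)

mean = sum(recips) / n                          # exactly 11/18
sum_sq = sum((r - mean) ** 2 for r in recips)   # exactly 78/324 = 13/54

pop_sd = math.sqrt(sum_sq / n)           # divisor n, the "population" form
sample_sd = math.sqrt(sum_sq / (n - 1))  # divisor n - 1, the "sample" form

print(mean, sum_sq)
print(f"population sd = {pop_sd:.6f}")
print(f"sample sd     = {sample_sd:.6f}")
```

The population form reproduces sqrt(26)/18 = 0.283279; the sample form is larger, about 0.3469.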
1464.6. "Idle Statistics Churning, FWIW:" by CHOSRV::YOUNG (Still billing, after all these years.) Wed Jul 03 1991 18:07 (26 lines)
    Since:					Where:
    
    	Am =  SUM(i=1,N) S(i)/N			:  "Am" is Arithmetic Mean
    
    and
    
    	Hm = N/( SUM(i=1,N) 1/S(i) )		:  "Hm" is Harmonic mean
    
    
    And
    
    	Sd = SQRT( ( SUM(i=1,N) (Am-S(i))^2 )/N )
    						:  "Sd" is Standard Deviation
    
    Then I would assume:
    
    	Hd = SQRT( N/( SUM(i=1,N) 1/(Hm-S(i))^2 ) )
    
    Where "Hd" is the Harmonic Deviation.  For the values that you gave,
    this results in a value of .15434, which is much closer to your
    Standard Deviation.
    
    Of course I am using the formulas for a population not a sample, so you
    may have to adjust this.
    
    --  Barry
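A short Python sketch of the Hd formula proposed in .6, applied to the three sample times from .0. The function names are chosen for illustration; "harmonic deviation" is the note's own conjecture, not a standard statistic.

```python
import math

def harmonic_mean(samples):
    """Hm = N / SUM(1/S(i))."""
    return len(samples) / sum(1.0 / s for s in samples)

def harmonic_deviation(samples):
    """Hd = SQRT( N / SUM( 1/(Hm - S(i))^2 ) ), population form as in .6."""
    hm = harmonic_mean(samples)
    return math.sqrt(len(samples) / sum(1.0 / (hm - s) ** 2 for s in samples))

hd = harmonic_deviation([8.733333, 7.716667, 6.700000])
print(f"Hd = {hd:.5f}")
```

This reproduces the .15434 quoted above. Most of that value comes from the middle sample, which lies only about 0.09 from the harmonic mean.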
1464.7. by VAXRT::BRIDGEWATER (Eclipsing the past) Mon Jul 08 1991 13:09 (11 lines)
    Re: .6

    This is an interesting approach I hadn't thought of.  Unfortunately,
    it appears to be sensitive to how close any one of the sample points is
    to the harmonic mean.  That is, if any one sample point is very close
    to the harmonic mean, the "harmonic standard deviation" is very small
    regardless of how distant the other samples are.  If you are unlucky
    enough to get one sample point equal to the harmonic mean, then the
    harmonic standard deviation is zero.

    - Don
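The sensitivity described in .7 can be demonstrated with a small Python sketch. It uses exact rational arithmetic so that "a sample equal to the harmonic mean" can be tested reliably; the two data sets are made up for illustration, and returning 0.0 in the degenerate case is an assumption that follows the limit argument in .7.

```python
from fractions import Fraction
import math

def harmonic_deviation(samples):
    """Hd from .6, computed exactly; returns 0.0 if a sample equals Hm."""
    xs = [Fraction(s) for s in samples]
    hm = len(xs) / sum(1 / x for x in xs)
    if any(x == hm for x in xs):
        # One term 1/(Hm - S(i))^2 is infinite, so Hd collapses to zero.
        return 0.0
    return math.sqrt(len(xs) / sum(1 / (hm - x) ** 2 for x in xs))

# {2, 3, 6} has harmonic mean exactly 3, which is one of the samples.
hd_equal = harmonic_deviation([2, 3, 6])

# {1, 100, 1.98}: widely spread, but 1.98 sits very close to Hm.
hd_near = harmonic_deviation([1, 100, Fraction("1.98")])

print(hd_equal, hd_near)
```

In the second data set the values range from 1 to 100, yet Hd comes out well below 0.001 because a single point nearly coincides with the harmonic mean.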
1464.8. "What is the problem that you are trying to solve?" by CHOSRV::YOUNG (Still billing, after all these years.) Tue Jul 09 1991 12:07 (12 lines)
    True.  But the question that is begging to be asked here is: "Why are
    you using Harmonic Means in the first place?"
    
    Harmonic means are a somewhat unusual Figure of Merit for a dataset in
    the first place.  Presumably they were chosen with some reason in mind
    and that reason should lead us to some logic for determining an
    appropriate Deviation measurement.
    
    Without this, you really cannot say that one measure is better than
    another, or that one measure of Deviance is better than another.
    
    --  Barry
1464.9. "Rats!" by IMTDEV::ROBERTS (Reason, Purpose, Self-esteem) Tue Jul 09 1991 18:28 (24 lines)
    Harmonics were chosen because the sample set can include infinite
    values, but no zero values; that is, {5, 10, 100, oo, oo} is possible.
    In this example, the arithmetic mean is oo; but the harmonic mean is
    16.1. The units involved are minutes. "oo" means it didn't finish.

    If you want a concrete example, suppose you're measuring the time it
    takes two batches of rats to navigate a maze. Batch "A" takes vitamin
    Z, while batch "B" is the control. Each batch has 9 rats. The sample
    times in minutes are:

    				      r a t
    	1	2	3	4	5	6	7	8	9

    A	1.7	1.0	oo	2.1	1.6	1.2	0.9	1.3	1.8

    B	1.6	oo	1.9	oo	2.6	1.8	2.1	1.0	2.2

    What's the probability that vitamin Z helps rats navigate the maze?

    The harmonic mean of batch A's times is 1.5 minutes; that of batch B
    is 2.2 minutes.

    Dwayne
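A Python sketch of how the harmonic mean absorbs the "oo" entries in .9: a run that did not finish contributes 1/oo = 0 to the sum of reciprocals but still counts toward n. Representing "oo" as `math.inf` is an assumption of this sketch.

```python
import math

def harmonic_mean(samples):
    """n / sum(1/s); math.inf marks a run that did not finish (1/inf == 0)."""
    return len(samples) / sum(1.0 / s for s in samples)

oo = math.inf
example = [5, 10, 100, oo, oo]
batch_a = [1.7, 1.0, oo, 2.1, 1.6, 1.2, 0.9, 1.3, 1.8]
batch_b = [1.6, oo, 1.9, oo, 2.6, 1.8, 2.1, 1.0, 2.2]

hm_example = harmonic_mean(example)
hm_a = harmonic_mean(batch_a)
hm_b = harmonic_mean(batch_b)
print(f"{hm_example:.1f}  {hm_a:.1f}  {hm_b:.1f}")
```

Rounded to one decimal place these agree with the 16.1, 1.5, and 2.2 minutes quoted in the note.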

1464.10. "you gotta know the territory" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Wed Jul 10 1991 13:58 (48 lines)
This topic illustrates a general rule in statistics: the more you understand
the process the better you can apply statistics to it.  The general question in
.0 has no answer, because too little is known to define an answer.

Based on the example in .9, I would suggest two things:

1 - compute the fraction of rats finishing in each case.  There are good tests
to determine whether the differences are statistically significant.

2 - look at the distribution of times for rats who finish in each case.  Check
the form of this distribution against some standard forms, like normal 
(Gaussian), log-normal, exponential and so forth.  If necessary, transform 
your coordinate to one which gives a more nearly normal distribution.  Use the
standard hypothesis test to determine whether the difference in times for rats
who finish is statistically significant.


Several more random comments:

All the above is based on .9 and my general ignorance of rats and vitamins.  If
I knew more, for example, the usual distribution of maze running times, the 
usual fraction of rats not completing, the common effects of vitamins, and so
forth, I would want to incorporate that additional information in my 
statistical approach.  Similarly, if your real problem is one of computer
hardware performance analysis, then I would look for completely different
information, and incorporate that into my statistical analysis.

If there is a usual practice in your field of study, you should follow it. 
Particularly in government and commercial work, it is usually better to follow
standard practice, even if there is a statistical method which is in theory
superior.

You can't really put down infinity for any times, since infinity is well known
to be longer than the average lifetime of a graduate student.  Some of those
rats might eventually have found their way out of the maze.  All you really know
is the time at which you terminated that run.  If you include that time, you
are introducing an extraneous factor which may affect your results; this is why
I recommended analyzing only the times for the rats who finish.

Your sample distribution must be close to normal in order to apply the usual
statistical tests of hypotheses.  If your underlying distribution is far from
normal, it may take a lot of samples to make the sample distribution close to
normal.  This is why it helps to transform your coordinates to bring the 
underlying distribution closer to normal.  .3 is one example of this, using
the reciprocal transform y = 1/x.

Note that sample distribution in the paragraph above is sometimes, and more 
correctly, called the distribution of sample means.
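Suggestion 1 in .10 can be sketched with Fisher's exact test, one common choice for comparing two small proportions. The note does not name a specific test, so that choice is an assumption here, as is the function name. The counts come from .9: 8 of 9 rats finished in batch A and 7 of 9 in batch B.

```python
from math import comb

def fisher_one_sided(finished_a, n_a, finished_b, n_b):
    """Probability that group A gets at least `finished_a` of the finishers
    if the finishers were distributed between the groups purely by chance
    (hypergeometric tail, with the total number of finishers held fixed)."""
    total = n_a + n_b
    total_finished = finished_a + finished_b
    not_finished = total - total_finished
    upper = min(n_a, total_finished)
    favourable = sum(
        comb(total_finished, k) * comb(not_finished, n_a - k)
        for k in range(finished_a, upper + 1)
    )
    return favourable / comb(total, n_a)

p = fisher_one_sided(8, 9, 7, 9)
print(f"one-sided p = {p:.3f}")
```

With these counts the tail probability is 0.5, so the finishing fractions alone give no evidence of a difference between the batches.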
1464.11. "or maybe a Z-deficiency" by VMSDEV::HALLYB (The Smart Money was on Goliath) Thu Jul 11 1991 12:25 (10 lines)
    If you stop the experiment after 3 minutes, then clearly you don't
    want to (nor did you) say the rat took 3 minutes to finish.  
    That would be wrong, because the rat didn't finish in 3 minutes.
    By the same logic you can't put down oo either, because you have no 
    proof that it would take the rat an infinite time to navigate the maze.  
    An interesting special case would be if the rat died while trying to
    navigate the maze (perhaps an overdose of Vitamin Z :-) but it would
    probably be wisest to discard those special cases.
    
      John
1464.12. "Comparing samples" by BHUNA::PFANG Fri Jul 12 1991 09:53 (38 lines)
    RE:       <<< Note 1464.9 by IMTDEV::ROBERTS "Reason, Purpose, Self-esteem" >>>
                                   -< Rats! >-

    So the real question is not "What's the mean and spread?"
    
    The real question is: 
    	"Is sample A significantly different from sample B?"
    
    If you want to answer the 2nd question, then don't worry about which
    descriptive statistic to use to represent the location (e.g. harmonic
    mean) or spread (harmonic standard deviation?). The question, then
    becomes which comparison test is appropriate for comparing the two
    groups of data.
    
    	For example, if the data followed a simple `normal' distribution,
    one might investigate using the t-test to compare the samples. In this
    case, we don't know much about the distributions, but it seems certain
    that the data do not come from a normal distribution.
    
    	A fairly robust comparison test is called the Mann-Whitney or
    Wilcoxon test. It is robust in that it is not sensitive to the shape of
    the underlying distribution, nor is it sensitive to `wide' variations
    (like infinities) in the data. Essentially what the test does is rank
    each data value based on where it falls in the total order of both
    groups. In this case, the infinities rank just higher than the next
    highest value (2.6 minutes).
    
        The proper way to do this comparison would be to state, a priori,
    what risk one is willing to take of reaching an incorrect conclusion. For
    example, a typical statement would be 
    `I want to say vitamin Z helps rats navigate the maze, and I want to
    take no more than a 5% risk of being wrong.'
    
    The answer in this case would be `The data do not show that Vitamin Z
    helps.'
    
    This is because, based on the Mann-Whitney test, there is a .1615
    probability of seeing a difference at least this large if the A and B
    samples were drawn from a single distribution, which exceeds the 5%
    risk stated above.
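The ranking step of the Mann-Whitney test described in .12 can be sketched in Python as follows. Tied values share the average of their ranks, and the infinities simply rank above 2.6. The helper name `midranks` is hypothetical. The sketch stops at the rank sums and U statistics; it does not attempt to reproduce the .1615 figure, which depends on choices the note does not state (tie handling, one-sided or two-sided, exact or approximate).

```python
import math

def midranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

oo = math.inf
a = [1.7, 1.0, oo, 2.1, 1.6, 1.2, 0.9, 1.3, 1.8]
b = [1.6, oo, 1.9, oo, 2.6, 1.8, 2.1, 1.0, 2.2]

ranks = midranks(a + b)
rank_sum_a = sum(ranks[:len(a)])
rank_sum_b = sum(ranks[len(a):])
u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
u_b = rank_sum_b - len(b) * (len(b) + 1) / 2
print(rank_sum_a, rank_sum_b, u_a, u_b)
```

The rank sums are 66 for A and 105 for B, giving U values of 21 and 60. A significance level is then obtained by comparing the smaller U against its null distribution, for example from tables or a statistics package.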
1464.13. "Un-asking the question" by AGOUTL::BELDIN (Pull us together, not apart) Fri Jul 12 1991 11:10 (102 lines)
    re from .9 on

Your consultants are giving you good advice.  Think about the way
the data is generated and how to use standard parametric or
non-parametric methods before you invoke the harmonics.  Here is
what I would say to a client consulting me with this problem.

    

           Comparing two samples of maze running times


   The problem is to determine whether two samples of experimental animals,
   previously treated with treatments A and B, differ statistically in 
   
      1) the length of time taken to run the maze, or 
      
      2) the fraction of animals able to complete the test.
      

   1. Let us start by defining two families of random variables, 
      X(i) and Y(i).
   
      1.1 Y(i) is a Boolean variable which records whether or
      not the i'th subject completed the maze.
         
      1.2 Define X(i) conditionally on Y(i)  being true, as the
      length of time required by the i'th subject to complete
      the maze.  If Y(i) is false, then X(i) is undefined.
         
   2. Extend the notation with a parameter S which represents
      the treatment to which an animal is subjected.  S can
      assume the values A and B.  With this addition, we
      describe the random variables X and Y as
      
      2.1 Y(i,S) is the Boolean response to S of the i'th
      subject.
      
      2.2 X(i,S) is the conditionally defined response time of
      the i'th subject to stimulus S.
         

   3. The hypotheses which interest us are:
   
      3.1 The distribution of Y(i,S) for i=1..N is independent
      of the choice of S.
         
      3.2 The distribution of X(i,S) for i in {k|Y(k,S)} is
      independent of the choice of S.
      
   
   4. We formulate the response models as:
   
      4.1 Y(i,A) is a Bernoulli variable with parameter a, and
      Y(i,B) is a Bernoulli variable with parameter b.
         
      4.2 X(i,A) is a linear function, A + Z(i) and X(i,B) is a
      linear function, B + Z(i).
         
      4.3 Z(i) are independent random variables with mean m and
      standard deviation s.  Note that you can't estimate m by
      itself because it will be confounded by A or B.  The best
      you can do is estimate A+m, B+m, or A-B.  You don't need to
      use harmonic means and standard deviations (for which there
      is no reliable theory).
   
      
   5. Next we need to specify how the stimuli, S, were assigned
      to the subjects.  There are several cases:
      
      5.1 The subjects were randomly subdivided into two
      sub-samples and each member of one sub-sample was treated
      with A and each member of the other sub-sample was treated
      with B.
      
      5.2 The subjects were individually and independently
      assigned treatment (or stimulus) A or B.
      
      5.3 The subjects were presented to the experimenters in
      serial fashion, each subject was then assigned A or B.
      
      5.4 Subjects were classified on several traits which might
      be correlated with X and Y.  Then random assignments of A
      and B were made, attempting to balance the distribution of
      the correlated traits.
      
      5.5 The same subjects were treated, sometimes with A and
      sometimes with B.  (This design is only useful if the
      effect is expected to be temporary.)
      
      5.6 The subjects were allowed to self-select whether they
      would receive treatment A or B (by making both available,
      for example).
        

All of this is just to show that we can study the set of data to
a variety of levels of detail and for a variety of experimental
designs, all of which appear to be consistent with the original
description, but which in no case require (or even suggest) the
use of the harmonic mean.

Dick
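The response models in item 4 suggest simple estimates, sketched here in Python on the data from .9. The finishing fraction estimates the Bernoulli parameter (a or b), and the mean time among finishers estimates A+m or B+m, so the difference of those means estimates A-B. The function name `summarize` and the use of `math.inf` for "did not finish" are assumptions of this sketch, and it makes no claim about significance.

```python
import math
from statistics import mean

def summarize(times):
    """Return (fraction that finished, mean time among finishers)."""
    finished = [t for t in times if math.isfinite(t)]
    return len(finished) / len(times), mean(finished)

oo = math.inf
batch_a = [1.7, 1.0, oo, 2.1, 1.6, 1.2, 0.9, 1.3, 1.8]
batch_b = [1.6, oo, 1.9, oo, 2.6, 1.8, 2.1, 1.0, 2.2]

frac_a, mean_a = summarize(batch_a)  # estimates a and A + m
frac_b, mean_b = summarize(batch_b)  # estimates b and B + m
diff = mean_a - mean_b               # estimates A - B; m cancels

print(f"a ~ {frac_a:.3f}, b ~ {frac_b:.3f}")
print(f"A+m ~ {mean_a:.3f}, B+m ~ {mean_b:.3f}, A-B ~ {diff:.3f}")
```

Keeping the Boolean outcome Y separate from the conditional time X is what removes the need for infinite values, and with them the need for a harmonic mean.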
1464.14. "agreement with previous two" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Fri Jul 12 1991 14:51 (13 lines)
.12 and .13 have some good stuff.

An extension to .12 is that you can also ask what risk you are willing to take
of incorrectly making the statement

	Vitamin Z has no effect

The two risks together can be combined with your understanding of the system
you are studying to decide how to design your experiment:

        "I'll accept a 5% risk of making a type I error (as in .12) and a 10%
        risk of making a type II error (as above), and the assumptions of
        Mann-Whitney look correct for my system, so I'll need to run 38 trials."