
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

1184.0. "HELP! Compound Statistical Estimation..." by DWOVAX::YOUNG (If it bleeds, we can kill it.) Mon Jan 22 1990 23:59

    I have a tricky problem in Probability and Statistics that I need some
    serious help on.
    
    I am developing an application that samples different data on a
    regular, periodic basis.  It will do something like making K
    collections of different data, each containing N samples:
    
    Item 1:   (S11, S12, ... S1n)
    Item 2:   (S21, S22, ... S2n)
     ...
    Item K:   (Sk1, Sk2, ... Skn)
    
    What each of these Items (of N samples each) represents is a
    component value of a larger sum which is what I am really interested
    in:
    
    	Sum of ( Item (i) )	For (i = 1 to K )
    
    Now from this data, I need to calculate two statistical "results".  The
    first is average Sum, and the second is the "Peak" Sum.
    
    Average Sum is pretty straightforward: it represents the average
    (arithmetic mean) of the Sums in the sample population.  This is easily
    calculated by taking the average of the samples for each Item and then
    summing those averages.
    
    The Peak Sum, however, is a problem.  The "Peak" Sum is supposed to
    represent something like the 80th percentile value of the Sums in the
    sample population.  (Actually, the number could be made higher, like
    95%, if I can still get sufficient accuracy.)  So how do I calculate 
    this from my data?  
    
    The straightforward approach, I figure, would be to compute the Sums
    of all of the combinations of the values that I collected, then sort
    them and count down through them until I reach the 80th percentile
    mark.  The problem with this is that it must be a fairly efficient
    computation, and my number of samples and Items would make that
    prohibitive.  The Average Sum function is O(K*N), but my Peak Sum
    function is *at least* O(N**K).
    
    For instance, if I have 7 Items of 4 samples each, the Average
    computation would take 28 operations, whereas the Peak computation would
    take over 16000 operations (assuming that I could sort in linear time). 
    And it is likely that I will in fact be dealing with more items and
    samples than this.
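
    (Just to make the blow-up concrete, here is a rough C sketch of that
    brute-force enumeration -- the array names are made up, and you would
    not actually want to run this for large K:)

    #include <stdlib.h>
    #include <math.h>

    #define K 7                     /* number of Items  */
    #define N 4                     /* samples per Item */

    static double samples[K][N];    /* filled in elsewhere */
    static double *all_sums;        /* all N**K possible sums */
    static long   nsums;

    /* pick one sample from each Item in turn and record the total */
    static void enumerate(int item, double partial)
    {
        int j;
        if (item == K) {
            all_sums[nsums++] = partial;
            return;
        }
        for (j = 0; j < N; j++)
            enumerate(item + 1, partial + samples[item][j]);
    }

    static int dblcmp(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    double brute_force_peak(double pct)     /* pct = 0.80 for the 80th percentile */
    {
        long total = (long)pow((double)N, (double)K);   /* 4**7 = 16384 here */
        double peak;

        all_sums = malloc(total * sizeof(double));
        nsums = 0;
        enumerate(0, 0.0);

        qsort(all_sums, total, sizeof(double), dblcmp);
        peak = all_sums[(long)ceil(pct * total) - 1];   /* count down to the mark */
        free(all_sums);
        return peak;
    }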
    
    SO:
    	Can anyone come up with a *significantly* more efficient way to
    make this computation?  I am willing to sacrifice some accuracy (but
    not too much) for speed here.
    
    --  Barry
1184.1. by KOBAL::GILBERT (Ownership Obligates) Tue Jan 23 1990 12:08 (10 lines)
    I'm sure I don't understand something here.  What's wrong with this? --

    Compute the K sums in O(N*K) time.  Find the 80th percentile by (say)
    sorting the K sums in O(K*logK) time, and count to the 80th percentile.
    That's certainly not as bad as O(N**K).

						- Gilbert

    BTW, faster methods are available for finding the Kth largest element
    in an N-element collection.
1184.2. "Raj Jain's algorithm?" by WEEKS::HALLYB (The Smart Money was on Goliath) Tue Jan 23 1990 12:32 (5 lines)
    Why not use the P² algorithm for estimating percentiles on the fly,
    in real time, with O(1) extra storage required?  You sacrifice some,
    but not much, accuracy.
    
      John
1184.3. "the central limit theorem helps here" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Tue Jan 23 1990 12:42 (77 lines)
    re:       <<< Note 1184.0 by DWOVAX::YOUNG "If it bleeds, we can kill it." >>>
    
>    I am developing an application that samples different data on a 
>    regular, periodic basis.  It will do something like making K
>    collections of different data, each containing N samples:
>    
    
    What are typical values of N and K?  
    
>    Now from this data, I need to calculate two statistical "results".  The
>    first is average Sum, and the second is the "Peak" Sum.
    
    Do you need to evaluate Average and Peak just once, or do you want to
    repeat this process many times?  When repeating, how much information
    will carry over from one "experiment" to another?  
    
>    The Peak Sum, however, is a problem.  The "Peak" Sum is supposed to
>    represent something like the 80th percentile value of the Sums in the
>    sample population.  (Actually, the number could be made higher, like
>    95%, if I can still get sufficient accuracy.)  So how do I calculate 
>    this from my data?  
    
    This is definitely a strange number.  I would seriously ask why this
    number is needed.  How will it be used next?  I suspect the answer is
    something like "We need to design a process which will work for
    everything but extreme cases, so we want to know the value we can
    design for where it will only fail 20% (or maybe 5%) of the time."
    In general, this is bad design and bad statistics.  A much better
    approach is to understand the cost of improving the design and the cost
    of failure.  Then use the statistics available to estimate the
    probability of failure and thus the expected cost of the design. 
    Improve the design until the marginal cost of improvement equals the
    marginal cost of failure.
    
    The statistics I will give below will actually allow you to look at
    either probability of failure or Peak Sum.
    
    Start by computing the mean and variance for each of the K items. 
    Compute the mean and variance of the sum using
    
    	Mean of Sum = SUM Mean( Item I )		I = 1 ... K
    
    	Variance of Sum = SUM Variance( Item I )	I = 1 ... K
    
    The first is an equation you already had, and the second is a similar
    equation, obtained by a similar proof.  These equations require that
    the Items be statistically independent.  In practice, they sometimes
    fail, particularly for very low probability events, because the Items
    are slightly dependent on each other.  You should look carefully at
    your real problem to be sure the items are independent.
    
    Now you would like to use the Central Limit Theorem, which says that
    the probability distribution of the Sum will be approximately normal.
    This will be a good approximation if K is large, say over 30.  It may
    also be a good approximation if K is small, but the probability
    distributions of the items are well behaved: a single hump and rapidly
    decreasing tails.  Check this before you go further.
    
    If the distribution of the sum is normal, then you know all about it 
    once you know its mean and variance.  You can use a table of the normal
    distribution to answer questions like
    
    	what is the probability of a sum greater than x?
    	what is the x such that the probability of a sum greater than x
    		is exactly 20%?
    
    Given typical cost functions, you can also answer questions like
    
    	what is the probability of failure for this design?
    	what is the expected cost of failure for this design?
    	what is the marginal decrease in cost for improving this design?
    
    If the distribution of the sum is not normal, then all is not lost, but
    you need to obtain this distribution and work with it directly.  Even
    if it is not normal, it may have a recognized form, like Gamma or
    Weibull.  In this case, you may want to understand the real situation
    producing the Item distributions, and use that understanding.
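
    As a concrete sketch in C (array names invented; each Item's variance
    is estimated from its N samples, and 0.8416 is the 80% point of the
    standard normal):

    #include <math.h>

    #define K 40                 /* number of Items   */
    #define N 6                  /* samples per Item  */

    #define Z80 0.8416           /* P(Z < 0.8416) = 0.80 for a standard normal */

    /* samples[i][j] = j-th sample of Item i */
    void normal_peak(double samples[K][N], double *avg_sum, double *peak_sum)
    {
        double mean_sum = 0.0, var_sum = 0.0;
        int i, j;

        for (i = 0; i < K; i++) {
            double mean = 0.0, ss = 0.0;

            for (j = 0; j < N; j++)
                mean += samples[i][j];
            mean /= N;

            for (j = 0; j < N; j++)
                ss += (samples[i][j] - mean) * (samples[i][j] - mean);

            mean_sum += mean;            /* Mean of Sum = SUM of Item means         */
            var_sum  += ss / (N - 1);    /* Variance of Sum = SUM of Item variances */
        }

        *avg_sum  = mean_sum;
        *peak_sum = mean_sum + Z80 * sqrt(var_sum);   /* ~80th percentile of Sum */
    }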
1184.4. "there are N sums to compute" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Tue Jan 23 1990 12:47 (9 lines)
    re:            <<< Note 1184.1 by KOBAL::GILBERT "Ownership Obligates" >>>

>    Compute the K sums in O(N*K) time.  Find the 80th percentile by (say)
>    sorting the K sums in O(K*logK) time, and count to the 80th percentile.
>    That's certainly not as bad as O(N**K).
    
    A minor error, if I understand you.  There are N sums to compute, doing
    it your way, because the sum runs from 1 to K, and there are N samples
    available.
1184.5. "Answers & more questions..." by DWOVAX::YOUNG (If it bleeds, we can kill it.) Tue Jan 23 1990 23:51 (47 lines)
    Re .1:
    
    Wally is right in .4, Peter.  The individual component Items are
    different and cannot really be compared.  It is the sum of those items
    that is important, and the resultant sums that I was talking about
    sorting.
    
    Re .2:
    
    Sounds interesting.  So what is P²?
    
    Re .3:
    
>    This is definitely a strange number.  I would seriously ask why this
>    number is needed.  How will it be used next?  I suspect the answer is
>    something like "We need to design a process which will work for
>    everything but extreme cases, so we want to know the value we can
>    design for where it will only fail 20% (or maybe 5%) of the time."
>    In general, this is bad design and bad statistics.  A much better
    
    Sorry, Wally.  It is not really anything like that.  If it helps to
    think of it like this, that's fine, in which case I would have to say
    that the process is already cast in stone and we cannot change the
    design.  We do need to dynamically monitor what it is doing however.
    
    To answer some of your other questions:
    
    	Typical values of N are 4 to 8.
    
    	Typical values of K are 20 to 60.
    
    	We need to evaluate the Average and Peak just once for each 
    	sample population, but there are MANY sample populations, and more
    	being created all of the time, making speed important.  No data
    	carries over from one sample population to the next.
    
    	The distribution of the samples within an individual Item is mostly,
    	but not 100%, independent of the other Items.  The distribution of
    	samples within an Item is well-behaved (i.e., "Normal-Like").
    
    	If I remember correctly, Variance is just the square of the
    Standard Deviation, right?
    
    As for looking up the normal distribution points in a table, I would
    rather use an approximation formula, can anyone suggest one?
    
    --  Barry
1184.6. "P² in more detail" by VMSDEV::HALLYB (The Smart Money was on Goliath) Wed Jan 24 1990 11:30 (22 lines)
>    Sounds interesting.  So what is P²?
    
    Developed by Raj ERLANG::JAIN, it's an algorithm for estimating
    percentiles (actually, histograms) in real time.  You tell it
    "OK, I've got N buckets.  Here's observation I." for I = 1..K.
    Note that only 2N longwords need be stored; N for the bucket
    boundaries (or is it N+1) and N for the count of elements in
    each bucket.
    
    As each observation is encountered the algorithm modifies the
    bucket boundaries so that the final result is close to what you
    would get if you stored and sorted everything.  It thus runs in
    O(K) time.
    
    Sounds almost hocus-pocus, but WAS published in CACM, attacked
    and well-defended in letters to the editor.
    
    PL/I source code is available from Raj.  It's not long, and should
    be easily recodable into FORTRAN, C, etc.  Or take the MACRO output
    from the PL/I compiler, if you have a really thrifty customer...
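
    I don't have Raj's PL/I handy to post here, but the sketch below is
    the single-quantile form of P² as I remember it from the CACM paper
    (five markers, parabolic adjustment).  Treat it as an illustration of
    the idea, not as Raj's code, and check it against the paper before
    trusting it:

    #include <stdlib.h>

    /* Five markers track one p-quantile of a stream of observations in
     * O(1) storage.  (Single-quantile P-squared; the histogram version
     * keeps more markers but works the same way.)                      */
    typedef struct {
        double p;        /* quantile to estimate, e.g. 0.80        */
        long   count;    /* observations seen so far               */
        double q[5];     /* marker heights                         */
        double n[5];     /* actual marker positions                */
        double np[5];    /* desired marker positions               */
        double dn[5];    /* per-observation increments to np[]     */
    } p2_state;

    static int dblcmp(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    void p2_init(p2_state *s, double p)
    {
        s->p = p;
        s->count = 0;
        s->dn[0] = 0.0;              s->dn[1] = p / 2.0;  s->dn[2] = p;
        s->dn[3] = (1.0 + p) / 2.0;  s->dn[4] = 1.0;
    }

    void p2_add(p2_state *s, double x)
    {
        int i, k;

        if (s->count < 5) {                  /* first five: just collect them */
            s->q[s->count++] = x;
            if (s->count == 5) {
                qsort(s->q, 5, sizeof(double), dblcmp);
                for (i = 0; i < 5; i++) s->n[i] = i + 1;
                s->np[0] = 1.0;               s->np[1] = 1.0 + 2.0 * s->p;
                s->np[2] = 1.0 + 4.0 * s->p;  s->np[3] = 3.0 + 2.0 * s->p;
                s->np[4] = 5.0;
            }
            return;
        }
        s->count++;

        /* find the cell the observation falls in, stretching the end markers */
        if (x < s->q[0])       { s->q[0] = x; k = 0; }
        else if (x >= s->q[4]) { s->q[4] = x; k = 3; }
        else for (k = 0; k < 3; k++) if (x < s->q[k + 1]) break;

        for (i = k + 1; i < 5; i++) s->n[i] += 1.0;        /* markers above shift    */
        for (i = 0; i < 5; i++)     s->np[i] += s->dn[i];  /* desired positions move */

        /* nudge the three middle markers back toward their desired positions */
        for (i = 1; i <= 3; i++) {
            double d = s->np[i] - s->n[i];

            if ((d >= 1.0 && s->n[i + 1] - s->n[i] > 1.0) ||
                (d <= -1.0 && s->n[i - 1] - s->n[i] < -1.0)) {
                double sgn = (d >= 0.0) ? 1.0 : -1.0;
                double qi;

                /* parabolic (P-squared) adjustment */
                qi = s->q[i] + sgn / (s->n[i + 1] - s->n[i - 1]) *
                     ((s->n[i] - s->n[i - 1] + sgn) * (s->q[i + 1] - s->q[i]) /
                          (s->n[i + 1] - s->n[i]) +
                      (s->n[i + 1] - s->n[i] - sgn) * (s->q[i] - s->q[i - 1]) /
                          (s->n[i] - s->n[i - 1]));

                /* fall back to linear interpolation if that overshoots */
                if (qi <= s->q[i - 1] || qi >= s->q[i + 1]) {
                    int j = i + (int)sgn;
                    qi = s->q[i] + sgn * (s->q[j] - s->q[i]) / (s->n[j] - s->n[i]);
                }
                s->q[i] = qi;
                s->n[i] += sgn;
            }
        }
    }

    double p2_result(p2_state *s)    /* current estimate of the p-quantile */
    {
        return s->q[2];
    }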
    
      John
1184.7. "Comments" by CADSYS::COOPER (Topher Cooper) Wed Jan 24 1990 12:16 (63 lines)
RE: .1, .4, .5 (Cost of sort and count)
    
    The point still stands.  The cost of doing it this way should be
    
    		O(N*K + N*logN)
    
    which is considerably less than the O(N^K) mentioned.  Given the
    typical sizes of N and K, this is a better result than the one
    mentioned in .1.
    
    This would be the correct method to use if you knew nothing about the
    distribution of the N sums, i.e., if you could not guarantee their
    conformance to normality or any other well behaved distribution.
    
    Given rough normality, Wally's method is both faster and more accurate.
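
    Concretely, the sort-and-count version is just this (names invented;
    it assumes the K*N samples are already in memory):

    #include <stdlib.h>
    #include <math.h>

    #define K 40                /* Items            */
    #define N 6                 /* samples per Item */

    static int dblcmp(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    /* Form the N sums (sample j of every Item taken together), sort them,
     * and count down to the requested percentile.                        */
    double sorted_peak(double samples[K][N], double pct)
    {
        double sums[N];
        int i, j;

        for (j = 0; j < N; j++) {                 /* O(N*K)     */
            sums[j] = 0.0;
            for (i = 0; i < K; i++)
                sums[j] += samples[i][j];
        }
        qsort(sums, N, sizeof(double), dblcmp);   /* O(N*logN)  */

        return sums[(int)ceil(pct * N) - 1];      /* pct = 0.80, say */
    }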
    
RE: .5
    
    > As for looking up the normal distribution points in a table, I would
    > rather use an approximation formula...
    
    That can only slow down your application.  If you treat your value
    as strictly normally distributed, then you only need a *single* value
    for the application (two if you calculate *both* the 80% and the 95%
    figure).  Given the mean sum M, and the variance of the sums V, your
    peak is simply:
    
    		P[80%] = M + SQRT(V)*0.84
    
    or
    
    		P[95%] = M + SQRT(V)*1.645
    
    (but see below)
    
    > If I remember correctly, Variance is just the square of the Standard
    > Deviation, right?
    
    Basically, although technically it's the other way around: in general
    the Standard Deviation is defined to be the square-root of the
    Variance.
    
However:
    
    The normal distribution should not be used here.  It should only be
    used when the standard deviation is known in advance, rather than
    estimated by looking at the sample.  What should be used is the
    "t" distribution with N-1 degrees of freedom.  The formual above
    then becomes:
    
    		P[80%] = M + SQRT(V)*T80[N-1];
    
    where T80 is an appropriately initialized vector.  It is probably
    simpler to type in the values needed than to include a formula,
    especially if you can set a reasonable upper limit on N like 10.
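
    For example (the t values below are transcribed from a standard
    one-sided table, so double-check them; N is assumed to stay between
    2 and 11):

    #include <math.h>

    /* One-sided points of Student's t: P(T < t) = 0.80 and 0.95, for
     * 1 through 10 degrees of freedom (from a standard t table).     */
    static const double T80[10] =
        { 1.376, 1.061, 0.978, 0.941, 0.920, 0.906, 0.896, 0.889, 0.883, 0.879 };
    static const double T95[10] =
        { 6.314, 2.920, 2.353, 2.132, 2.015, 1.943, 1.895, 1.860, 1.833, 1.812 };

    /* M = mean of the N sums, V = variance of the N sums, 2 <= N <= 11 */
    double peak80(double M, double V, int N)
    {
        return M + sqrt(V) * T80[N - 2];          /* N-1 degrees of freedom */
    }

    double peak95(double M, double V, int N)
    {
        return M + sqrt(V) * T95[N - 2];
    }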
    
    It might be better (depending on details that are not too clear to
    me in your situation) to take each sum as representing an estimated
    true sum, which would therefore contribute variance to the overall
    estimate.  I would have to think about how to implement that, however,
    and since I don't know that it is necessary, I won't bother.
    
    					Topher
1184.8. "a few more answers and questions" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Wed Jan 24 1990 12:51 (107 lines)
    Re:      <<< Note 1184.5 by DWOVAX::YOUNG "If it bleeds, we can kill it." >>>

    
>    Sorry, Wally.  It is not really anything like that.  If it helps to
>    think of it like this, that's fine, in which case I would have to say
>    that the process is already cast in stone and we cannot change the
>    design.  We do need to dynamically monitor what it is doing however.
    
    No apology needed.  I guessed based on my experience with people who
    use percentiles, but I guessed wrong here.  I am still not clear on
    what you are doing and why you want this number.  So let me take
    another guess, and you can tell me where I go wrong.  Note that this is
    important.  Nobody can give you good statistical advice unless they
    understand the context.
    
    You have a process of some kind which generates a batch of N samples,
    each sample containing K items.  You need to monitor the process in
    real time and react to it somehow. You expect some large number M of
    batches, where M is over 100. The statistic of most interest to you is
    the sum of values of the K items within each sample.   You are
    concerned about one tail of this distribution.
    
    Assuming I am right so far, I have a few questions.
    
    	Can you express "what it is doing" as a state of the process?
    
    	Can you define the actions you would take, based on the state of
    	the process: continue monitoring, discard output, shut it down, 
    	adjust it, run like hell, and so forth?
    
    	Can you create a cost, payoff or loss table which shows the cost of
    	every action in every state of the process?
    
    	Can you relate your statistics, like Average Sum and Peak Sum, to
    	the state of the system?
    
    	Can you derive decision rules which say take this action if the
    	statistics fall within this range?
    
    The above questions reflect the usual way of approaching problems like
    this:  Define the states, the actions, the payoff table, the relation
    of samples and statistics to the states, and the decision rules.  Even
    if you don't want to be formal or detailed about this, it is a good
    structure for understanding what you are doing.
    
>    	Typical values of N are 4 to 8.
    
    This is rather a small number.  If this reflects a significant
    limitation on your sampling, like cost, then you want to substitute the
    Student's t distribution for the normal distribution in all further
    discussion of the sampling distribution.  You usually won't have enough
    information to make a good estimate of the mean and variance.
    
>    	Typical values of K are 20 to 60.
    
    This is a nice large number.  In combination with your later comment
    about the niceness of the item distributions, this is probably enough
    to ensure that the underlying distribution of Sum is approximately
    normal.
    
>    	We need to evaluate the Average and Peak just once for each 
>    	sample population, but there are MANY sample populations, and more
>    	being created all of the time, making speed important.  
    
    So, in my terms above M is a large number?
    
    > No data carries over from one sample population to the next.
   
    You may want to look more closely at this.  The question is what
    knowledge carries over, not what data.  Assuming that there is a single
    physical process behind all this, then some knowledge is likely to
    carry over.  In many cases it can be shown that the form of the
    probability distributions is constant.  In some cases it can be shown
    that the variance is relatively constant, and the mean varies from
    sample to sample.  In other cases, the mean is constant and the
    variance changes with time.  This is worth looking closely at, since
    this carry-over knowledge is really the basis of whatever statistical
    analysis you are doing. 
    
>    	The distribution of the samples within the individual is mostly,
>    	but not 100%, independent of the other Items. 
    
    This may present a risk.  By the rules of statistical analysis, you
    should just look at the N sum-over-item values you get from each batch
    of samples.  You should do nothing at all with the individual item
    values.  And the rules for means and variances I gave in my earlier
    reply do not apply here.  If you apply those rules, you will get
    different (probably smaller) tails than really exist.  Since you are
    looking at peak behavior, you want the tails to be correct.
    
    If you can convince yourself that the dependence among the Items is
    "small enough" then you could still use the normal distribution.
    
>    	If I remember correctly, Variance is just the square of the
>    Standard Deviation, right?
    
    Right.
    
>    As for looking up the normal distribution points in a table, I would
>    rather use an approximation formula, can anyone suggest one?
    
    There is another topic (maybe several) with pointers to approximations
    to the normal aka gaussian distribution.  Note that depending on what
    you want to do, you may need only one point, which you can hard code
    into your program:  5% of the area of a normal distribution lies above
    the mean plus 1.645 times the standard deviation.  Note that the normal
    distribution may not be the one you need, based on the small size of N.
1184.9. by DWOVAX::YOUNG (If it bleeds, we can kill it.) Fri Jan 26 1990 00:47 (31 lines)
    Re .7:
    
    OK, I now understand what Peter and you are talking about.  However, N
    seems to be too small in my case for this to be a good approach; 4 to 6
    points do not seem to be much to work with when searching for an 80th
    percentile (let alone a 95th).
    
    And, yes, after I wrote my last note I did figure out that I only
    needed one point on the Normal Distribution Table to make the
    translation.
    
    Student's t distribution is new to me; I am going to look it over in my
    Probability & Statistics books.  I'll be back if I have any questions.
    
    Re .8:
    
    Hmmm...  Well, I don't think that I can really answer most of your
    questions, Wally.  But trying to answer them has given me a lot to think
    about.
    
    What do we do when our numbers get really bad?  Basically we start
    examining in more detail, trying to find the cause and correct it, if
    we can.  We never shut it down during regular operations, because we are
    supplying "product" to other processes, and no matter how bad our
    product is, it is still better than none at all.
    
    Right now, I am leaning towards "Wally's Method", but I do have a lot
    to think about here.  I'll let you know when I come to a conclusion, or
    have more questions.
    
    --  Barry
1184.10. "two simple approaches" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Fri Jan 26 1990 12:21 (37 lines)
    Re:      <<< Note 1184.9 by DWOVAX::YOUNG "If it bleeds, we can kill it." >>>

>    What do we do when our numbers get really bad?  Basically we start
>    examining in more detail, trying to find the cause and correct it, if
>    we can.  We never shut it down during regular operations, because we are
>    supplying "product" to other processes, and no matter how bad our
>    product is, it is still better than none at all.
    
    This is beginning to sound a lot like it needs a traditional control
    chart.  Analyze a whole bunch of historical data to create control
    charts for mean and standard deviation.  You can use the standard rules
    of thumb or your thinking about .8 to set the control limits.  When you
    get a sample, take your N sums, compute a mean and standard deviation 
    from those numbers, and plot them on the control charts.  When the
    charts tell you the process is out of control, start your detailed
    examination.  There is a lot of literature on this, usually involving
    words like statistical quality control or industrial process control.
    It generally does not assume much statistical knowledge in the reader.
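
    In code the per-sample check is nothing fancier than this (the limit
    values are placeholders; they would come out of your historical
    analysis):

    /* Control limits computed, elsewhere, from a pile of historical batches.
     * The numbers here are placeholders, not recommendations.              */
    #define MEAN_LCL   90.0     /* lower/upper control limits for the mean    */
    #define MEAN_UCL  110.0
    #define SDEV_UCL   12.0     /* upper control limit for the std deviation  */

    /* Given the mean and standard deviation of one batch of N sums,
     * return nonzero when the process looks out of control.          */
    int out_of_control(double batch_mean, double batch_sdev)
    {
        if (batch_mean < MEAN_LCL || batch_mean > MEAN_UCL)
            return 1;
        if (batch_sdev > SDEV_UCL)
            return 1;
        return 0;
    }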
    
    Another approach is two-dimensional, and allows for the interaction of
    mean and standard deviation.  For each sample of N sums, plot its
    mean on the x axis and its standard deviation (or variance) on the y
    axis.  Now divide the plane into green zone, yellow zone and red zone.
    The dividing lines will probably run from upper left to lower right,
    with green in lower left and red in upper right.  You can draw the
    lines based on your historical data, gut feel and/or the answers to
    questions in .8.  It is possible to calculate formally the position of
    the zones, but the results are often no better than gut feel.  When a
    sample falls in the green zone, do nothing.  When it falls in the
    yellow zone, start to watch carefully.  When it falls in the red zone,
    start your detailed examination.
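
    A sketch of that zone test (again, the slope and intercepts are
    placeholders -- draw your own from historical data or gut feel):

    enum zone { GREEN, YELLOW, RED };

    /* The dividing lines run from upper left to lower right in the
     * (mean, standard deviation) plane: sdev = b + SLOPE*mean, with
     * SLOPE negative.  Slope and intercepts below are placeholders.  */
    #define SLOPE    (-0.5)
    #define B_YELLOW  100.0
    #define B_RED     130.0

    enum zone classify(double batch_mean, double batch_sdev)
    {
        /* intercept of the dividing line that would pass through this point */
        double b = batch_sdev - SLOPE * batch_mean;

        if (b > B_RED)    return RED;
        if (b > B_YELLOW) return YELLOW;
        return GREEN;
    }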
    
    Why use the standard deviation instead of the Peak?  There are rather
    obscure arguments which suggest that mean and standard deviation
    together do a better job of summarizing small samples drawn from
    roughly normal distributions than alternatives, like peak, median or
    range.