
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

1184.0. "HELP! Compound Statistical Estimation..." by DWOVAX::YOUNG (If it bleeds, we can kill it.) Mon Jan 22 1990 23:59

    I have a tricky problem in Probability and Statistics that I need some
    serious help on.
    
    I am developing an application that samples different data on a
    regular, periodic basis.  It will do something like making K
    collections of different data, each containing N samples:
    
    Item 1:   (S11, S12, ... S1n)
    Item 2:   (S21, S22, ... S2n)
     ...
    Item K:   (Sk1, Sk2, ... Skn)
    
    What each of these Items (of N samples each) represents is a
    component value of a larger sum which is what I am really interested
    in:
    
    	Sum of ( Item (i) )	For (i = 1 to K )
    
    Now from this data, I need to calculate two statistical "results".  The
    first is average Sum, and the second is the "Peak" Sum.
    
    Average Sum is pretty straightforward: it represents the average
    (arithmetic mean) of the Sums in the sample population.  This is easily
    calculated by taking the average of the samples for each Item and then
    summing those averages.
    
    The Peak Sum, however, is a problem.  The "Peak" Sum is supposed to
    represent something like the 80th percentile value of the Sums in the
    sample population.  (Actually, the number could be made higher, like
    95%, if I can still get sufficient accuracy.)  So how do I calculate 
    this from my data?  
    
    The straightforward approach, I figure, would be to compute the Sums
    of all of the combinations of the values that I collected, then sort
    them and count down through them until I reach the 80th percentile
    mark.  The problem with this is that it must be a fairly efficient
    computation, and my number of samples and Items would make that
    prohibitive.  The Average Sum function is O(K*N), but my Peak Sum
    function is *at least* O(N**K).
    
    For instance, if I have 7 Items of 4 samples each, the Average
    computation would take 28 operations, whereas the Peak computation would
    take over 16000 operations (assuming that I could sort in linear time). 
    And it is likely that I will in fact be dealing with more items and
    samples than this.
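
    (Just to make the blow-up concrete, here is a rough C sketch of that
    brute-force enumeration -- the array names are made up, and you would
    not actually want to run this for large K:)

    #include <stdlib.h>
    #include <math.h>

    #define K 7                     /* number of Items  */
    #define N 4                     /* samples per Item */

    static double samples[K][N];    /* filled in elsewhere */
    static double *all_sums;        /* all N**K possible sums */
    static long   nsums;

    /* pick one sample from each Item in turn and record the total */
    static void enumerate(int item, double partial)
    {
        int j;
        if (item == K) {
            all_sums[nsums++] = partial;
            return;
        }
        for (j = 0; j < N; j++)
            enumerate(item + 1, partial + samples[item][j]);
    }

    static int dblcmp(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    double brute_force_peak(double pct)     /* pct = 0.80 for the 80th percentile */
    {
        long total = (long)pow((double)N, (double)K);   /* 4**7 = 16384 here */
        double peak;

        all_sums = malloc(total * sizeof(double));
        nsums = 0;
        enumerate(0, 0.0);

        qsort(all_sums, total, sizeof(double), dblcmp);
        peak = all_sums[(long)ceil(pct * total) - 1];   /* count down to the mark */
        free(all_sums);
        return peak;
    }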
    
    SO:
    	Can anyone come up with a *significantly* more efficient way to
    make this computation?  I am willing to sacrifice some accuracy (but
    not too much) for speed here.
    
    --  Barry
1184.1. by KOBAL::GILBERT (Ownership Obligates) Tue Jan 23 1990 12:08 (10 lines)
    I'm sure I don't understand something here.  What's wrong with this? --

    Compute the K sums in O(N*K) time.  Find the 80th percentile by (say)
    sorting the K sums in O(K*logK) time, and count to the 80th percentile.
    That's certainly not as bad as O(N**K).

						- Gilbert

    BTW, faster methods are available for finding the Kth largest element
    in an N-element collection.
1184.2. "Raj Jain's algorithm?" by WEEKS::HALLYB (The Smart Money was on Goliath) Tue Jan 23 1990 12:32 (5 lines)
    Why not use the P² algorithm for estimating percentiles on the fly,
    in real time, with O(1) extra storage required?  You sacrifice some,
    but not much, accuracy.
    
      John
1184.3. "the central limit theorem helps here" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Tue Jan 23 1990 12:42 (77 lines)
    re:       <<< Note 1184.0 by DWOVAX::YOUNG "If it bleeds, we can kill it." >>>
    
>    I am developing an application that samples different data on a 
>    regular, periodic basis.  It will do something like making K
>    collections of different data, each containing N samples:
>    
    
    What are typical values of N and K?  
    
>    Now from this data, I need to calculate two statistical "results".  The
>    first is average Sum, and the second is the "Peak" Sum.
    
    Do you need to evaluate Average and Peak just once, or do you want to
    repeat this process many times?  When repeating, how much information
    will carry over from one "experiment" to another?  
    
>    The Peak Sum, however, is a problem.  The "Peak" Sum is supposed to
>    represent something like the 80th percentile value of the Sums in the
>    sample population.  (Actually, the number could be made higher, like
>    95%, if I can still get sufficient accuracy.)  So how do I calculate 
>    this from my data?  
    
    This is definitely a strange number.  I would seriously ask why this
    number is needed.  How will it be used next?  I suspect the answer is
    something like "We need to design a process which will work for
    everything but extreme cases, so we want to know the value we can
    design for where it will only fail 20% (or maybe 5%) of the time."
    In general, this is bad design and bad statistics.  A much better
    approach is to understand the cost of improving the design and the cost
    of failure.  Then use the statistics available to estimate the
    probability of failure and thus the expected cost of the design. 
    Improve the design until the marginal cost of improvement equals the
    marginal cost of failure.
    
    The statistics I will give below will actually allow you to look at
    either probability of failure or Peak Sum.
    
    Start by computing the mean and variance for each of the K items. 
    Compute the mean and variance of the sum using
    
    	Mean of Sum = SUM Mean( Item I )		I = 1 ... K
    
    	Variance of Sum = SUM Variance( Item I )	I = 1 ... K
    
    The first is an equation you already had, and the second is a similar
    equation, obtained by a similar proof.  These equations require that
    the Items be statistically independent.  In practice, they sometimes
    fail, particularly for very low probability events, because the Items
    are slightly dependent on each other.  You should look carefully at
    your real problem to be sure the items are independent.
    
    Now you would like to use the Central Limit Theorem, which says that
    the probability distribution of the Sum will be approximately normal.
    This will be a good approximation if K is large, say over 30.  It may
    also be a good approximation if K is small, but the probability
    distributions of the items are well behaved: a single hump and rapidly
    decreasing tails.  Check this before you go further.
    
    If the distribution of the sum is normal, then you know all about it 
    once you know its mean and variance.  You can use a table of the normal
    distribution to answer questions like
    
    	what is the probability of a sum greater than x?
    	what is the x such that the probability of a sum greater than x
    		is exactly 20%?
    
    Given typical cost functions, you can also answer questions like
    
    	what is the probability of failure for this design?
    	what is the expected cost of failure for this design?
    	what is the marginal decrease in cost for improving this design?
    
    If the distribution of the sum is not normal, then all is not lost, but
    you need to obtain this distribution and work with it directly.  Even
    if it is not normal, it may have a recognized form, like Gamma or
    Weibull.  In this case, you may want to understand the real situation
    producing the Item distributions, and use that understanding.
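
    As a concrete sketch in C (array names invented; each Item's variance
    is estimated from its N samples, and 0.8416 is the 80% point of the
    standard normal):

    #include <math.h>

    #define K 40                 /* number of Items   */
    #define N 6                  /* samples per Item  */

    #define Z80 0.8416           /* P(Z < 0.8416) = 0.80 for a standard normal */

    /* samples[i][j] = j-th sample of Item i */
    void normal_peak(double samples[K][N], double *avg_sum, double *peak_sum)
    {
        double mean_sum = 0.0, var_sum = 0.0;
        int i, j;

        for (i = 0; i < K; i++) {
            double mean = 0.0, ss = 0.0;

            for (j = 0; j < N; j++)
                mean += samples[i][j];
            mean /= N;

            for (j = 0; j < N; j++)
                ss += (samples[i][j] - mean) * (samples[i][j] - mean);

            mean_sum += mean;            /* Mean of Sum = SUM of Item means         */
            var_sum  += ss / (N - 1);    /* Variance of Sum = SUM of Item variances */
        }

        *avg_sum  = mean_sum;
        *peak_sum = mean_sum + Z80 * sqrt(var_sum);   /* ~80th percentile of Sum */
    }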
1184.4. "there are N sums to compute" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Tue Jan 23 1990 12:47 (9 lines)
    re:            <<< Note 1184.1 by KOBAL::GILBERT "Ownership Obligates" >>>

>    Compute the K sums in O(N*K) time.  Find the 80th percentile by (say)
>    sorting the K sums in O(K*logK) time, and count to the 80th percentile.
>    That's certainly not as bad as O(N**K).
    
    A minor error, if I understand you.  There are N sums to compute, doing
    it your way, because the sum runs from 1 to K, and there are N samples
    available.
1184.5. "Answers & more questions..." by DWOVAX::YOUNG (If it bleeds, we can kill it.) Tue Jan 23 1990 23:51 (47 lines)
    Re .1:
    
    Wally is right in .4, Peter.  The individual component Items are
    different and cannot really be compared.  It is the sum of those items
    that is important, and the resultant sums that I was talking about
    sorting.
    
    Re .2:
    
    Sounds interesting.  So what is P²?
    
    Re .3:
    
>    This is definitely a strange number.  I would seriously ask why this
>    number is needed.  How will it be used next?  I suspect the answer is
>    something like "We need to design a process which will work for
>    everything but extreme cases, so we want to know the value we can
>    design for where it will only fail 20% (or maybe 5%) of the time."
>    In general, this is bad design and bad statistics.  A much better
    
    Sorry, Wally.  It is not really anything like that.  If it helps to
    think of it like this, that's fine, in which case I would have to say
    that the process is already cast in stone and we cannot change the
    design.  We do need to dynamically monitor what it is doing however.
    
    To answer some of your other questions:
    
    	Typical values of N are 4 to 8.
    
    	Typical values of K are 20 to 60.
    
    	We need to evaluate the Average and Peak just once for each 
    	sample population, but there are MANY sample populations, and more
    	being created all of the time, making speed important.  No data
    	carries over from one sample population to the next.
    
    	The distribution of the samples within an individual Item is mostly,
    	but not 100%, independent of the other Items.  The distribution of
    	samples within an Item is well-behaved (i.e., "Normal-Like").
    
    	If I remember correctly, Variance is just the square of the
    Standard Deviation, right?
    
    As for looking up the normal distribution points in a table, I would
    rather use an approximation formula, can anyone suggest one?
    
    --  Barry
1184.6. "P² in more detail" by VMSDEV::HALLYB (The Smart Money was on Goliath) Wed Jan 24 1990 11:30 (22 lines)
>    Sounds interesting.  So what is P²?
    
    Developed by Raj ERLANG::JAIN, it's an algorithm for estimating
    percentiles (actually, histograms) in real time.  You tell it
    "OK, I've got N buckets.  Here's observation I." for I = 1..K.
    Note that only 2N longwords need be stored; N for the bucket
    boundaries (or is it N+1) and N for the count of elements in
    each bucket.
    
    As each observation is encountered the algorithm modifies the
    bucket boundaries so that the final result is close to what you
    would get if you stored and sorted everything.  It thus runs in
    O(K) time.
    
    Sounds almost hocus-pocus, but WAS published in CACM, attacked
    and well-defended in letters to the editor.
    
    PL/I source code is available from Raj.  It's not long, and should
    be easily recodable into FORTRAN, C, etc.  Or take the MACRO output
    from the PL/I compiler, if you have a really thrifty customer...
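
    I don't have Raj's PL/I handy to post here, but the sketch below is
    the single-quantile form of P² as I remember it from the CACM paper
    (five markers, parabolic adjustment).  Treat it as an illustration of
    the idea, not as Raj's code, and check it against the paper before
    trusting it:

    #include <stdlib.h>

    /* Five markers track one p-quantile of a stream of observations in
     * O(1) storage.  (Single-quantile P-squared; the histogram version
     * keeps more markers but works the same way.)                      */
    typedef struct {
        double p;        /* quantile to estimate, e.g. 0.80        */
        long   count;    /* observations seen so far               */
        double q[5];     /* marker heights                         */
        double n[5];     /* actual marker positions                */
        double np[5];    /* desired marker positions               */
        double dn[5];    /* per-observation increments to np[]     */
    } p2_state;

    static int dblcmp(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    void p2_init(p2_state *s, double p)
    {
        s->p = p;
        s->count = 0;
        s->dn[0] = 0.0;              s->dn[1] = p / 2.0;  s->dn[2] = p;
        s->dn[3] = (1.0 + p) / 2.0;  s->dn[4] = 1.0;
    }

    void p2_add(p2_state *s, double x)
    {
        int i, k;

        if (s->count < 5) {                  /* first five: just collect them */
            s->q[s->count++] = x;
            if (s->count == 5) {
                qsort(s->q, 5, sizeof(double), dblcmp);
                for (i = 0; i < 5; i++) s->n[i] = i + 1;
                s->np[0] = 1.0;               s->np[1] = 1.0 + 2.0 * s->p;
                s->np[2] = 1.0 + 4.0 * s->p;  s->np[3] = 3.0 + 2.0 * s->p;
                s->np[4] = 5.0;
            }
            return;
        }
        s->count++;

        /* find the cell the observation falls in, stretching the end markers */
        if (x < s->q[0])       { s->q[0] = x; k = 0; }
        else if (x >= s->q[4]) { s->q[4] = x; k = 3; }
        else for (k = 0; k < 3; k++) if (x < s->q[k + 1]) break;

        for (i = k + 1; i < 5; i++) s->n[i] += 1.0;        /* markers above shift    */
        for (i = 0; i < 5; i++)     s->np[i] += s->dn[i];  /* desired positions move */

        /* nudge the three middle markers back toward their desired positions */
        for (i = 1; i <= 3; i++) {
            double d = s->np[i] - s->n[i];

            if ((d >= 1.0 && s->n[i + 1] - s->n[i] > 1.0) ||
                (d <= -1.0 && s->n[i - 1] - s->n[i] < -1.0)) {
                double sgn = (d >= 0.0) ? 1.0 : -1.0;
                double qi;

                /* parabolic (P-squared) adjustment */
                qi = s->q[i] + sgn / (s->n[i + 1] - s->n[i - 1]) *
                     ((s->n[i] - s->n[i - 1] + sgn) * (s->q[i + 1] - s->q[i]) /
                          (s->n[i + 1] - s->n[i]) +
                      (s->n[i + 1] - s->n[i] - sgn) * (s->q[i] - s->q[i - 1]) /
                          (s->n[i] - s->n[i - 1]));

                /* fall back to linear interpolation if that overshoots */
                if (qi <= s->q[i - 1] || qi >= s->q[i + 1]) {
                    int j = i + (int)sgn;
                    qi = s->q[i] + sgn * (s->q[j] - s->q[i]) / (s->n[j] - s->n[i]);
                }
                s->q[i] = qi;
                s->n[i] += sgn;
            }
        }
    }

    double p2_result(p2_state *s)    /* current estimate of the p-quantile */
    {
        return s->q[2];
    }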
    
      John
1184.7. "Comments" by CADSYS::COOPER (Topher Cooper) Wed Jan 24 1990 12:16 (63 lines)
RE: .1, .4, .5 (Cost of sort and count)
    
    The point still stands.  The cost of doing it this way should be
    
    		O(N*K + N*logN)
    
    which is considerably less than the O(N^K) mentioned.  Given the
    typical sizes of N and K, this is a better result than the one
    mentioned in .1.
    
    This would be the correct method to use if you knew nothing about the
    distribution of the N sums, i.e., if you could not guarantee their
    conformance to normality or any other well behaved distribution.
    
    Given rough normality, Wally's method is both faster and more accurate.
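
    Concretely, the sort-and-count version is just this (names invented;
    it assumes the K*N samples are already in memory):

    #include <stdlib.h>
    #include <math.h>

    #define K 40                /* Items            */
    #define N 6                 /* samples per Item */

    static int dblcmp(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    /* Form the N sums (sample j of every Item taken together), sort them,
     * and count down to the requested percentile.                        */
    double sorted_peak(double samples[K][N], double pct)
    {
        double sums[N];
        int i, j;

        for (j = 0; j < N; j++) {                 /* O(N*K)     */
            sums[j] = 0.0;
            for (i = 0; i < K; i++)
                sums[j] += samples[i][j];
        }
        qsort(sums, N, sizeof(double), dblcmp);   /* O(N*logN)  */

        return sums[(int)ceil(pct * N) - 1];      /* pct = 0.80, say */
    }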
    
RE: .5
    
    > As for looking up the normal distribution points in a table, I would
    > rather use an approximation formula...
    
    That can only slow down your application.  If you treat your value
    as strictly normally distributed, then you only need a *single* value
    for the application (two if you calculate *both* the 80% and the 95%
    figure).  Given the mean sum M, and the variance of the sums V, your
    peak is simply:
    
    		P[80%] = M + SQRT(V)*0.84
    
    or
    
    		P[95%] = M + SQRT(V)*1.645
    
    (but see below)
    
    > If I remember correctly, Variance is just the square of the Standard
    > Deviation, right?
    
    Basically, although technically it's the other way around: in general
    the Standard Deviation is defined to be the square-root of the
    Variance.
    
However:
    
    The normal distribution should not be used here.  It should only be
    used when the standard deviation is known in advance, rather than
    estimated by looking at the sample.  What should be used is the
    "t" distribution with N-1 degrees of freedom.  The formual above
    then becomes:
    
    		P[80%] = M + SQRT(V)*T80[N-1];
    
    where T80 is an appropriately initialized vector.  It is probably
    simpler to type in the values needed than to include a formula,
    especially if you can set a reasonable upper limit on N like 10.
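
    For example (the t values below are transcribed from a standard
    one-sided table, so double-check them; N is assumed to stay between
    2 and 11):

    #include <math.h>

    /* One-sided points of Student's t: P(T < t) = 0.80 and 0.95, for
     * 1 through 10 degrees of freedom (from a standard t table).     */
    static const double T80[10] =
        { 1.376, 1.061, 0.978, 0.941, 0.920, 0.906, 0.896, 0.889, 0.883, 0.879 };
    static const double T95[10] =
        { 6.314, 2.920, 2.353, 2.132, 2.015, 1.943, 1.895, 1.860, 1.833, 1.812 };

    /* M = mean of the N sums, V = variance of the N sums, 2 <= N <= 11 */
    double peak80(double M, double V, int N)
    {
        return M + sqrt(V) * T80[N - 2];          /* N-1 degrees of freedom */
    }

    double peak95(double M, double V, int N)
    {
        return M + sqrt(V) * T95[N - 2];
    }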
    
    It might be better (depending on details that are not too clear to
    me in your situation) to take each sum as representing an estimated
    true sum, which would therefore contribute variance to the overall
    estimate.  I would have to think about how to implement that, however,
    and since I don't know that it is necessary, I won't bother.
    
    					Topher
1184.8. "a few more answers and questions" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Wed Jan 24 1990 12:51 (107 lines)
    Re:      <<< Note 1184.5 by DWOVAX::YOUNG "If it bleeds, we can kill it." >>>

    
>    Sorry, Wally.  It is not really anything like that.  If it helps to
>    think of it like this, that's fine, in which case I would have to say
>    that the process is already cast in stone and we cannot change the
>    design.  We do need to dynamically monitor what it is doing however.
    
    No apology needed.  I guessed based on my experience with people who
    use percentiles, but I guessed wrong here.  I am still not clear on
    what you are doing and why you want this number.  So let me take
    another guess, and you can tell me where I go wrong.  Note that this is
    important.  Nobody can give you good statistical advice unless they
    understand the context.
    
    You have a process of some kind which generates a batch of N samples,
    each sample containing K items.  You need to monitor the process in
    real time and react to it somehow. You expect some large number M of
    batches, where M is over 100. The statistic of most interest to you is
    the sum of values of the K items within each sample.   You are
    concerned about one tail of this distribution.
    
    Assuming I am right so far, I have a few questions.
    
    	Can you express "what it is doing" as a state of the process?
    
    	Can you define the actions you would take, based on the state of
    	the process: continue monitoring, discard output, shut it down, 
    	adjust it, run like hell, and so forth?
    
    	Can you create a cost, payoff or loss table which shows the cost of
    	every action in every state of the process?
    
    	Can you relate your statistics, like Average Sum and Peak Sum, to
    	the state of the system?
    
    	Can you derive decision rules which say take this action if the
    	statistics fall within this range?
    
    The above questions reflect the usual way of approaching problems like
    this:  Define the states, the actions, the payoff table, the relation
    of samples and statistics to the states, and the decision rules.  Even
    if you don't want to be formal or detailed about this, it is a good
    structure for understanding what you are doing.
    
>    	Typical values of N are 4 to 8.
    
    This is rather a small number.  If this reflects a significant
    limitation on your sampling, like cost, then you want to substitute the
    Student's t distribution for the normal distribution in all further
    discussion of the sampling distribution.  You usually won't have enough
    information to make a good estimate of the mean and variance.
    
>    	Typical values of K are 20 to 60.
    
    This is a nice large number.  In combination with your later comment
    about the niceness of the item distributions, this is probably enough
    to ensure that the underlying distribution of Sum is approximately
    normal.
    
>    	We need to evaluate the Average and Peak just once for each 
>    	sample population, but there are MANY sample populations, and more
>    	being created all of the time, making speed important.  
    
    So, in my terms above M is a large number?
    
    > No data carries over from one sample population to the next.
   
    You may want to look more closely at this.  The question is what
    knowledge carries over, not what data.  Assuming that there is a single
    physical process behind all this, then some knowledge is likely to
    carry over.  In many cases it can be shown that the form of the
    probability distributions is constant.  In some cases it can be shown
    that the variance is relatively constant, and the mean varies from
    sample to sample.  In other cases, the mean is constant and the
    variance changes with time.  This is worth looking closely at, since
    this carry-over knowledge is really the basis of whatever statistical
    analysis you are doing. 
    
>    	The distribution of the samples within the individual is mostly,
>    	but not 100%, independent of the other Items. 
    
    This may present a risk.  By the rules of statistical analysis, you
    should just look at the N sum-over-item values you get from each batch
    of samples.  You should do nothing at all with the individual item
    values.  And the rules for means and variances I gave in my earlier
    reply do not apply here.  If you apply those rules, you will get
    different (probably smaller) tails than really exist.  Since you are
    looking at peak behavior, you want the tails to be correct.
    
    If you can convince yourself that the dependence among the Items is
    "small enough" then you could still use the normal distribution.
    
>    	If I remember correctly, Variance is just the square of the
>    Standard Deviation, right?
    
    Right.
    
>    As for looking up the normal distribution points in a table, I would
>    rather use an approximation formula, can anyone suggest one?
    
    There is another topic (maybe several) with pointers to approximations
    to the normal aka gaussian distribution.  Note that depending on what
    you want to do, you may need only one point, which you can hard code
    into your program:  5% of the area of a normal distribution lies above
    the mean plus 1.645 times the standard deviation.  Note that the normal
    distribution may not be the one you need, based on the small size of N.
1184.9. by DWOVAX::YOUNG (If it bleeds, we can kill it.) Fri Jan 26 1990 00:47 (31 lines)
    Re .7:
    
    OK, I now understand what Peter and you are talking about.  However, N
    seems to be too small in my case for this to be a good approach; 4 to 6
    points do not seem to be much to work with when searching for an 80th
    percentile (let alone a 95th).
    
    And, yes, after I wrote my last note I did figure out that I only
    needed one point on the Normal Distribution Table to make the
    translation.
    
    Student's t distribution is new to me; I am going to look it over in my
    Probability & Statistics books.  I'll be back if I have any questions.
    
    Re .8:
    
    Hmmm...  Well, I don't think that I can really answer most of your
    questions, Wally.  But trying to answer them has given me a lot to think
    about.
    
    What do we do when our numbers get really bad?  Basically we start
    examining in more detail, trying to find the cause and correct it, if
    we can.  We never shut it down during regular operations, because we are
    supplying "product" to other processes, and no matter how bad our
    product is, it is still better than none at all.
    
    Right now, I am leaning towards "Wally's Method", but I do have a lot
    to think about here.  I'll let you know when I come to a conclusion, or
    have more questions.
    
    --  Barry
1184.10. "two simple approaches" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Fri Jan 26 1990 12:21 (37 lines)
    Re:      <<< Note 1184.9 by DWOVAX::YOUNG "If it bleeds, we can kill it." >>>

>    What do we do when our numbers get really bad?  Basically we start
>    examining in more detail, trying to find the cause and correct it, if
>    we can.  We never shut it down during regular operations, because we are
>    supplying "product" to other processes, and no matter how bad our
>    product is, it is still better than none at all.
    
    This is beginning to sound a lot like it needs a traditional control
    chart.  Analyze a whole bunch of historical data to create control
    charts for mean and standard deviation.  You can use the standard rules
    of thumb or your thinking about .8 to set the control limits.  When you
    get a sample, take your N sums, compute a mean and standard deviation 
    from those numbers, and plot them on the control charts.  When the
    charts tell you the process is out of control, start your detailed
    examination.  There is a lot of literature on this, usually involving
    words like statistical quality control or industrial process control.
    It generally does not assume much statistical knowledge in the reader.
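
    In code the per-sample check is nothing fancier than this (the limit
    values are placeholders; they would come out of your historical
    analysis):

    /* Control limits computed, elsewhere, from a pile of historical batches.
     * The numbers here are placeholders, not recommendations.              */
    #define MEAN_LCL   90.0     /* lower/upper control limits for the mean    */
    #define MEAN_UCL  110.0
    #define SDEV_UCL   12.0     /* upper control limit for the std deviation  */

    /* Given the mean and standard deviation of one batch of N sums,
     * return nonzero when the process looks out of control.          */
    int out_of_control(double batch_mean, double batch_sdev)
    {
        if (batch_mean < MEAN_LCL || batch_mean > MEAN_UCL)
            return 1;
        if (batch_sdev > SDEV_UCL)
            return 1;
        return 0;
    }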
    
    Another approach is two-dimensional, and allows for the interaction of
    mean and standard deviation.  For each sample of N sums, plot its
    mean on the x axis and its standard deviation (or variance) on the y
    axis.  Now divide the plane into green zone, yellow zone and red zone.
    The dividing lines will probably run from upper left to lower right,
    with green in lower left and red in upper right.  You can draw the
    lines based on your historical data, gut feel and/or the answers to
    questions in .8.  It is possible to calculate formally the position of
    the zones, but the results are often no better than gut feel.  When a
    sample falls in the green zone, do nothing.  When it falls in the
    yellow zone, start to watch carefully.  When it falls in the red zone,
    start your detailed examination.
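
    A sketch of that zone test (again, the slope and intercepts are
    placeholders -- draw your own from historical data or gut feel):

    enum zone { GREEN, YELLOW, RED };

    /* The dividing lines run from upper left to lower right in the
     * (mean, standard deviation) plane: sdev = b + SLOPE*mean, with
     * SLOPE negative.  Slope and intercepts below are placeholders.  */
    #define SLOPE    (-0.5)
    #define B_YELLOW  100.0
    #define B_RED     130.0

    enum zone classify(double batch_mean, double batch_sdev)
    {
        /* intercept of the dividing line that would pass through this point */
        double b = batch_sdev - SLOPE * batch_mean;

        if (b > B_RED)    return RED;
        if (b > B_YELLOW) return YELLOW;
        return GREEN;
    }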
    
    Why use the standard deviation instead of the Peak?  There are rather
    obscure arguments which suggest that mean and standard deviation
    together do a better job of summarizing small samples drawn from
    roughly normal distributions than alternatives, like peak, median or
    range.