T.R | Title | User | Personal Name | Date | Lines |
--------------------------------------------------------------------------------
1184.1 | | KOBAL::GILBERT | Ownership Obligates | Tue Jan 23 1990 12:08 | 10 |
| I'm sure I don't understand something here. What's wrong with this? --
Compute the K sums in O(N*K) time. Find the 80th percentile by (say)
sorting the K sums in O(K*logK) time, and count to the 80th percentile.
That's certainly not as bad as O(N**K).
- Gilbert
BTW, faster methods are available for finding the Kth largest element
in an N-element collection.
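For what it's worth, here is a rough C sketch of one such method
(quickselect, i.e. Hoare's selection algorithm), average O(N) with no
full sort. The routine name and the rank convention are illustrative,
not anything from the application:

    /* Quickselect: returns the value that would sit at index `rank`
     * (0-based) if v[] were sorted ascending. */
    static void swapd(double *a, double *b) { double t = *a; *a = *b; *b = t; }

    double select_rank(double v[], int n, int rank)
    {
        int lo = 0, hi = n - 1;

        while (lo < hi) {
            double pivot = v[(lo + hi) / 2];
            int i = lo, j = hi;

            while (i <= j) {                     /* partition around pivot */
                while (v[i] < pivot) i++;
                while (v[j] > pivot) j--;
                if (i <= j) { swapd(&v[i], &v[j]); i++; j--; }
            }
            if (rank <= j)      hi = j;          /* answer is in left part  */
            else if (rank >= i) lo = i;          /* answer is in right part */
            else                break;           /* v[rank] equals the pivot */
        }
        return v[rank];
    }

For the 80th percentile of N sums you would call it with something
like rank = (8*N)/10 - 1, depending on which percentile convention
you settle on.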
|
1184.2 | Raj Jain's algorithm? | WEEKS::HALLYB | The Smart Money was on Goliath | Tue Jan 23 1990 12:32 | 5 |
| Why not use the P² algorithm for estimating percentiles on the fly,
in real time, with O(1) extra storage required? You sacrifice some,
but not much, accuracy.
John
|
1184.3 | the central limit theorem helps here | PULSAR::WALLY | Wally Neilsen-Steinhardt | Tue Jan 23 1990 12:42 | 77 |
| re: <<< Note 1184.0 by DWOVAX::YOUNG "If it bleeds, we can kill it." >>>
> I am developing an application that is sampling different data on a
> regular periodic basis. It will be doing something like making K
> collections of different data, each containing N samples:
>
What are typical values of N and K?
> Now from this data, I need to calculate two statistical "results". The
> first is average Sum, and the second is the "Peak" Sum.
Do you need to evaluate Average and Peak just once, or do you want to
repeat this process many times? When repeating, how much information
will carry over from one "experiment" to another?
> The Peak Sum, however, is a problem. The "Peak" Sum is supposed to
> represent something like the 80th percentile value of the Sums in the
> sample population. (Actually, the number could be made higher, like
> 95%, if I can still get sufficient accuracy). So how do I calculate
> this from my data?
This is definitely a strange number. I would seriously ask why this
number is needed. How will it be used next? I suspect the answer is
something like "We need to design a process which will work for
everything but extreme cases, so we want to know the value we can
design for where it will only fail 20% (or maybe 5%) of the time."
In general, this is bad design and bad statistics. A much better
approach is to understand the cost of improving the design and the cost
of failure. Then use the statistics available to estimate the
probability of failure and thus the expected cost of the design.
Improve the design until the marginal cost of improvement equals the
marginal cost of failure.
The statistics I will give below will actually allow you to look at
either probability of failure or Peak Sum.
Start by computing the mean and variance for each of the K items.
Compute the mean and variance of the sum using
Mean of Sum     = SUM Mean( Item I )        I = 1 ... K
Variance of Sum = SUM Variance( Item I )    I = 1 ... K
The first is an equation you already had, and the second is a similar
equation, obtained by a similar proof. These equations require that
the Items be statistically independent. In practice, they sometimes
fail, particularly for very low probability events, because the Items
are slightly dependent on each other. You should look carefully at
your real problem to be sure the items are independent.
Now you would like to use the Central Limit Theorem, which says that
the probability distribution of the Sum will be approximately normal.
This will be a good approximation if K is large, say over 30. It may
also be a good approximation if K is small, but the probability
distributions of the items are well behaved: a single hump and rapidly
decreasing tails. Check this before you go further.
If the distribution of the sum is normal, then you know all about it
once you know its mean and variance. You can use a table of the normal
distribution to answer questions like
what is the probability of a sum greater than x?
what is the x such that the probability of a sum greater than x
is exactly 20%?
Given typical cost functions, you can also answer questions like
what is the probability of failure for this design?
what is the expected cost of failure for this design?
what is the marginal decrease in cost for improving this design?
If the distribution of the sum is not normal, then all is not lost, but
you need to obtain this distribution and work with it directly. Even
if it is not normal, it may have a recognized form, like Gamma or
Weibull. In this case, you may want to understand the real situation
producing the Item distributions, and use that understanding.
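To make the recipe concrete, here is a minimal C sketch of the whole
calculation, under the assumptions above (independent items, K large
enough for the normal approximation). The data layout and the routine
name are just illustrative; 0.8416 is the 80th percentile point of the
standard normal distribution.

    #include <math.h>

    /* data[i*k + j] holds item j of sample i; n samples, k items each. */
    double peak_sum_80(const double *data, int n, int k)
    {
        double mean_of_sum = 0.0, var_of_sum = 0.0;
        int i, j;

        for (j = 0; j < k; j++) {
            double mean = 0.0, var = 0.0;

            for (i = 0; i < n; i++)              /* mean of item j        */
                mean += data[i * k + j];
            mean /= n;

            for (i = 0; i < n; i++)              /* variance of item j    */
                var += (data[i * k + j] - mean) * (data[i * k + j] - mean);
            var /= (n - 1);

            mean_of_sum += mean;                 /* Mean of Sum           */
            var_of_sum  += var;                  /* Variance of Sum       */
        }

        /* x such that the probability of a sum greater than x is 20% */
        return mean_of_sum + 0.8416 * sqrt(var_of_sum);
    }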
|
1184.4 | there are N sums to compute | PULSAR::WALLY | Wally Neilsen-Steinhardt | Tue Jan 23 1990 12:47 | 9 |
| re: <<< Note 1184.1 by KOBAL::GILBERT "Ownership Obligates" >>>
> Compute the K sums in O(N*K) time. Find the 80th percentile by (say)
> sorting the K sums in O(K*logK) time, and count to the 80th percentile.
> That's certainly not as bad as O(N**K).
A minor error, if I understand you. There are N sums to compute, doing
it your way, because the sum runs from 1 to K, and there are N samples
available.
|
1184.5 | Answers & more questions... | DWOVAX::YOUNG | If it bleeds, we can kill it. | Tue Jan 23 1990 23:51 | 47 |
| Re .1:
Wally is right in .4, Peter. The individual component items are
different and cannot really be compared. It is the sum of those items
that is important, and the resultant sums that I was talking about
sorting.
Re .2:
Sounds interesting. So what is P²?
Re .3:
> This is definitely a strange number. I would seriously ask why this
> number is needed. How will it be used next? I suspect the answer is
> something like "We need to design a process which will work for
> everything but extreme cases, so we want to know the value we can
> design for where it will only fail 20% (or maybe 5%) of the time."
> In general, this is bad design and bad statistics. A much better
Sorry, Wally. It is not really anything like that. If it helps to
think of it like this, that's fine, in which case I would have to say
that the process is already cast in stone and we cannot change the
design. We do need to dynamically monitor what it is doing however.
To answer some of your other questions:
Typical values of N are 4 to 8.
Typical values of K are 20 to 60.
We need to evaluate the Average and Peak just once for each
sample population, but there are MANY sample populations, and more
being created all of the time, making speed important. No data
carries over from one sample population to the next.
The distribution of the samples within an individual Item is mostly,
but not 100%, independent of the other Items. Distribution of
samples within an Item is well-behaved (i.e. "Normal-Like").
If I remember correctly, Variance is just the square of the
Standard Deviation, right?
As for looking up the normal distribution points in a table, I would
rather use an approximation formula, can anyone suggest one?
-- Barry
|
1184.6 | P² in more detail | VMSDEV::HALLYB | The Smart Money was on Goliath | Wed Jan 24 1990 11:30 | 22 |
| > Sounds interesting. So what is P²?
Developed by Raj ERLANG::JAIN, it's an algorithm for estimating
percentiles (actually, histograms) in real time. You tell it
"OK, I've got N buckets. Here's observation I." for I = 1..K.
Note that only 2N longwords need be stored; N for the bucket
boundaries (or is it N+1) and N for the count of elements in
each bucket.
As each observation is encountered the algorithm modifies the
bucket boundaries so that the final result is close to what you
would get if you stored and sorted everything. It thus runs in
O(K) time.
Sounds almost hocus-pocus, but WAS published in CACM, attacked
and well-defended in letters to the editor.
PL/I source code is available from Raj. It's not long, and should
be easily recodable into FORTRAN, C, etc. Or take the MACRO output
from the PL/I compiler, if you have a really thrifty customer...
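If the PL/I source is not handy, here is a rough C sketch of the
five-marker, single-quantile form of the algorithm as I remember it
from the CACM paper (Jain and Chlamtac, 1985); the histogram version
generalizes it to more markers. Treat this as a sketch to check
against the published paper, not as Raj's code.

    #include <stdlib.h>

    typedef struct {
        double p;        /* quantile being tracked, e.g. 0.80        */
        double q[5];     /* marker heights                           */
        double n[5];     /* actual marker positions                  */
        double np[5];    /* desired marker positions                 */
        double dnp[5];   /* increments of the desired positions      */
        int    count;    /* observations seen so far                 */
    } p2_state;

    static int cmpd(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    void p2_init(p2_state *s, double p) { s->p = p; s->count = 0; }

    void p2_add(p2_state *s, double x)
    {
        int i, k;

        if (s->count < 5) {                      /* first five: just store */
            s->q[s->count++] = x;
            if (s->count == 5) {
                qsort(s->q, 5, sizeof(double), cmpd);
                for (i = 0; i < 5; i++) s->n[i] = i + 1;
                s->np[0] = 1;            s->dnp[0] = 0;
                s->np[1] = 1 + 2*s->p;   s->dnp[1] = s->p / 2;
                s->np[2] = 1 + 4*s->p;   s->dnp[2] = s->p;
                s->np[3] = 3 + 2*s->p;   s->dnp[3] = (1 + s->p) / 2;
                s->np[4] = 5;            s->dnp[4] = 1;
            }
            return;
        }

        /* find the cell holding x, widening the extreme markers if needed */
        if (x < s->q[0])       { s->q[0] = x; k = 0; }
        else if (x >= s->q[4]) { s->q[4] = x; k = 3; }
        else {
            for (k = 0; k < 3; k++)
                if (x < s->q[k + 1]) break;
        }

        for (i = k + 1; i < 5; i++) s->n[i] += 1;        /* shift positions */
        for (i = 0; i < 5; i++)     s->np[i] += s->dnp[i];

        /* nudge the three interior markers toward their desired positions */
        for (i = 1; i <= 3; i++) {
            double d = s->np[i] - s->n[i];
            if ((d >=  1 && s->n[i + 1] - s->n[i] >  1) ||
                (d <= -1 && s->n[i - 1] - s->n[i] < -1)) {
                int sgn = (d >= 0) ? 1 : -1;
                /* piecewise-parabolic estimate of the new marker height */
                double qp = s->q[i] + sgn / (s->n[i + 1] - s->n[i - 1]) *
                    ((s->n[i] - s->n[i - 1] + sgn) * (s->q[i + 1] - s->q[i]) /
                         (s->n[i + 1] - s->n[i]) +
                     (s->n[i + 1] - s->n[i] - sgn) * (s->q[i] - s->q[i - 1]) /
                         (s->n[i] - s->n[i - 1]));
                if (s->q[i - 1] < qp && qp < s->q[i + 1])
                    s->q[i] = qp;
                else                             /* fall back to linear   */
                    s->q[i] += sgn * (s->q[i + sgn] - s->q[i]) /
                                     (s->n[i + sgn] - s->n[i]);
                s->n[i] += sgn;
            }
        }
        s->count++;
    }

    /* running estimate of the p-quantile (needs at least 5 observations) */
    double p2_estimate(const p2_state *s) { return s->q[2]; }

After p2_init(&s, 0.80) and a stream of p2_add() calls, p2_estimate()
gives the running 80th-percentile estimate.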
John
|
1184.7 | Comments | CADSYS::COOPER | Topher Cooper | Wed Jan 24 1990 12:16 | 63 |
| RE: .1, .4, .5 (Cost of sort and count)
The point still stands. The cost of doing it this way should be
O(N*K + N*logN)
which is considerably less than the O(N^K) mentioned. Given the
typical sizes of N and K, this is a better result than the one
mentioned in .1.
This would be the correct method to use if you knew nothing about the
distribution of the N sums, i.e., if you could not guarantee their
conformance to normality or any other well-behaved distribution.
Given rough normality, Wally's method is both faster and more accurate.
RE: .5
> As for looking up the normal distribution points in a table, I would
> rather use an approximation formula...
That can only slow down your application. If you treat your value
as strictly normally distributed, then you only need a *single* value
for the application (two if you calculate *both* the 80% and the 95%
figure). Given the mean sum M, and the variance of the sums V, your
peak is simply:
P[80%] = M + SQRT(V)*0.84
or
P[95%] = M + SQRT(V)*1.645
(but see below)
> If I remember correctly, Variance is just the square of the Standard
> Deviation, right?
Basically, although technically it's the other way around: in general
the Standard Deviation is defined to be the square-root of the
Variance.
However:
The normal distribution should not be used here. It should only be
used when the standard deviation is known in advance, rather than
estimated by looking at the sample. What should be used is the
"t" distribution with N-1 degrees of freedom. The formual above
then becomes:
P[80%] = M + SQRT(V)*T80[N-1];
where T80 is an appropriately initialized vector. It is probably
simpler to type in the values needed than to include a formula,
especially if you can set a reasonable upper limit on N like 10.
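As a sketch of what that table lookup might look like (the routine
name and layout are mine; the t values are one-sided 80% points for 1
to 8 degrees of freedom copied from a standard t table, and worth
double-checking against your own table):

    #include <math.h>

    static const double T80[] = {       /* t such that P(T <= t) = 0.80  */
        1.376,  /* df = 1 */    1.061,  /* df = 2 */
        0.978,  /* df = 3 */    0.941,  /* df = 4 */
        0.920,  /* df = 5 */    0.906,  /* df = 6 */
        0.896,  /* df = 7 */    0.889   /* df = 8 */
    };

    /* Peak (80th percentile) estimate from n sums with mean m, variance v */
    double peak80(double m, double v, int n)
    {
        int df = n - 1;
        double t = (df >= 1 && df <= 8) ? T80[df - 1] : 0.8416;  /* normal limit */
        return m + sqrt(v) * t;
    }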
It might be better (depending on details that are not too clear to
me in your situation) to take each sum as representing an estimated
true sum, which would therefore contribute variance to the overall
estimate. I would have to think about how to implement that, however,
and since I don't know that it is necessary, I won't bother.
Topher
|
1184.8 | a few more answers and questions | PULSAR::WALLY | Wally Neilsen-Steinhardt | Wed Jan 24 1990 12:51 | 107 |
| Re: <<< Note 1184.5 by DWOVAX::YOUNG "If it bleeds, we can kill it." >>>
> Sorry, Wally. It is not really anything like that. If it helps to
> think of it like this, that's fine, in which case I would have to say
> that the process is already cast in stone and we cannot change the
> design. We do need to dynamically monitor what it is doing however.
No apology needed. I guessed based on my experience with people who
use percentiles, but I guessed wrong here. I am still not clear on
what you are doing and why you want this number. So let me take
another guess, and you can tell me where I go wrong. Note that this is
important. Nobody can give you good statistical advice unless they
understand the context.
You have a process of some kind which generates a batch of N samples,
each sample containing K items. You need to monitor the process in
real time and react to it somehow. You expect some large number M of
batches, where M is over 100. The statistic of most interest to you is
the sum of values of the K items within each sample. You are
concerned about one tail of this distribution.
Assuming I am right so far, I have a few questions.
Can you express "what it is doing" as a state of the process?
Can you define the actions you would take, based on the state of
the process: continue monitoring, discard output, shut it down,
adjust it, run like hell, and so forth?
Can you create a cost, payoff or loss table which shows the cost of
every action in every state of the process?
Can you relate your statistics, like Average Sum and Peak Sum, to
the state of the system?
Can you derive decision rules which say take this action if the
statistics fall within this range?
The above questions reflect the usual way of approaching problems like
this: Define the states, the actions, the payoff table, the relation
of samples and statistics to the states, and the decision rules. Even
if you don't want to be formal or detailed about this, it is a good
structure for understanding what you are doing.
> Typical values of N are 4 to 8.
This is rather a small number. If this reflects a significant
limitation on your sampling, like cost, then you want to substitute the
Student's t distribution for the normal distribution in all further
discussion of the sampling distribution. You usually won't have enough
information to make a good estimate of the mean and variance.
> Typical values of K are 20 to 60.
This is a nice large number. In combination with your later comment
about the niceness of the item distributions, this is probably enough
to ensure that the underlying distribution of Sum is approximately
normal.
> We need to evaluate the Average and Peak just once for each
> sample population, but there are MANY sample populations, and more
> being created all of the time, making speed important.
So, in my terms above, M is a large number?
> No data carries over from one sample population to the next.
You may want to look more closely at this. The question is what
knowledge carries over, not what data. Assuming that there is a single
physical process behind all this, then some knowledge is likely to
carry over. In many cases it can be shown that the form of the
probability distributions is constant. In some cases it can be shown
that the variance is relatively constant, and the mean varies from
sample to sample. In other cases, the mean is constant and the
variance changes with time. This is worth looking closely at, since
this carry-over knowledge is really the basis of whatever statistical
analysis you are doing.
> The distribution of the samples within an individual Item is mostly,
> but not 100%, independent of the other Items.
This may present a risk. By the rules of statistical analysis, you
should just look at the N sum-over-item values you get from each batch
of samples. You should do nothing at all with the individual item
values. And the rules for means and variances I gave in my earlier
reply do not apply here. If you apply those rules, you will get
different (probably smaller) tails than really exist. Since you are
looking at peak behavior, you want the tails to be correct.
If you can convince yourself that the dependence among the items is "small
enough" then you could still use the normal distribution.
> If I remember correctly, Variance is just the square of the
> Standard Deviation, right?
Right.
> As for looking up the normal distribution points in a table, I would
> rather use an approximation formula, can anyone suggest one?
There is another topic (maybe several) with pointers to approximations
to the normal aka gaussian distribution. Note that depending on what
you want to do, you may need only one point, which you can hard code
into your program: 5% of the area of a normal distribution lies above
the mean plus 1.645 times the standard deviation. Note that the normal
distribution may not be the one you need, based on the small size of N.
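For reference, one widely quoted rational approximation to the inverse
of the normal distribution (Abramowitz and Stegun, formula 26.2.23,
absolute error under about 5e-4) looks like this in C; the routine
name is mine. Plugging in 0.20 gives roughly 0.8416, and 0.05 gives
roughly 1.645.

    #include <math.h>

    /* Returns z such that the area of the standard normal above z is p,
     * valid for 0 < p <= 0.5. */
    double z_upper_tail(double p)
    {
        double t   = sqrt(-2.0 * log(p));
        double num = 2.515517 + t * (0.802853 + t * 0.010328);
        double den = 1.0 + t * (1.432788 + t * (0.189269 + t * 0.001308));
        return t - num / den;
    }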
|
1184.9 | | DWOVAX::YOUNG | If it bleeds, we can kill it. | Fri Jan 26 1990 00:47 | 31 |
| Re .7:
OK, I now understand what Peter and you are talking about. However, N
seems to be too small in my case for this to be a good approach; 4 to 6
points do not seem to be much to work with when searching for an 80th
percentile (let alone a 95th).
And, yes, after I wrote my last note I did figure out that I only
needed one point on the Normal Distribution Table to make the
translation.
Student's t distribution is new to me; I am going to look it over in my
Probability & Statistics books. I'll be back if I have any questions.
Re .8:
Hmmm... Well, I don't think that I can really answer most of your
questions Wally. But trying to answer them has given me a lot to think
about.
What do we do when our numbers get really bad? Basically we start
examining in more detail trying to find the cause and correct, if we
can. We never shut it down during regular operations, because we are
supplying "product" to other processes, and no matter how bad our
product is, it is still better than none at all.
Right now, I am leaning towards "Wally's Method", but I do have a lot
to think about here. I'll let you know when I come to a conclusion, or
have more questions.
-- Barry
|
1184.10 | two simple approaches | PULSAR::WALLY | Wally Neilsen-Steinhardt | Fri Jan 26 1990 12:21 | 37 |
| Re: <<< Note 1184.9 by DWOVAX::YOUNG "If it bleeds, we can kill it." >>>
> What do we do when our numbers get really bad? Basically we start
> examining in more detail trying to find the cause and correct, if we
> can. We never shut it down during regular operations, because we are
> supplying "product" to other processes, and no matter how bad our
> product is, it is still better than none at all.
This is beginning to sound a lot like it needs a traditional control
chart. Analyze a whole bunch of historical data to create control
charts for mean and standard deviation. You can use the standard rules
of thumb or your thinking about .8 to set the control limits. When you
get a sample, take your N sums, compute a mean and standard deviation
from those numbers, and plot them on the control charts. When the
charts tell you the process is out of control, start your detailed
examination. There is a lot of literature on this, usually involving
words like statistical quality control or industrial process control.
It generally does not assume much statistical knowledge in the reader.
Another approach is two-dimensional, and allows for the interaction of
mean and standard deviation. For each sample of N sums, plot its
mean on the x axis and its standard deviation (or variance) on the y
axis. Now divide the plane into green zone, yellow zone and red zone.
The dividing lines will probably run from upper left to lower right,
with green in lower left and red in upper right. You can draw the
lines based on your historical data, gut feel and/or the answers to
questions in .8. It is possible to calculate formally the position of
the zones, but the results are often no better than gut feel. When a
sample falls in the green zone, do nothing. When it falls in the
yellow zone, start to watch carefully. When it falls in the red zone,
start your detailed examination.
Why use the standard deviation instead of the Peak? There are rather
obscure arguments which suggest that mean and standard deviation
together do a better job of summarizing small samples drawn from
roughly normal distributions than alternatives, like peak, median or
range.
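A minimal sketch of the decision step, assuming the center line and
the warning/action limits have already been worked out from historical
data (all the names and the particular limit scheme here are
illustrative, not a standard):

    #include <math.h>

    typedef enum { GREEN, YELLOW, RED } zone;

    typedef struct {
        double center;                  /* historical mean of the sums       */
        double mean_warn, mean_act;     /* limits on |batch mean - center|   */
        double sd_warn,   sd_act;       /* limits on the batch std deviation */
    } limits;

    /* Classify one batch of n sums against the control limits. */
    zone check_batch(const double *sums, int n, const limits *lim)
    {
        double mean = 0.0, s = 0.0;
        int i;

        for (i = 0; i < n; i++) mean += sums[i];
        mean /= n;
        for (i = 0; i < n; i++) s += (sums[i] - mean) * (sums[i] - mean);
        s = sqrt(s / (n - 1));

        if (fabs(mean - lim->center) > lim->mean_act || s > lim->sd_act)
            return RED;                 /* start the detailed examination */
        if (fabs(mean - lim->center) > lim->mean_warn || s > lim->sd_warn)
            return YELLOW;              /* watch carefully                */
        return GREEN;                   /* do nothing                     */
    }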
|