T.R | Title | User | Personal Name | Date | Lines |
---|
869.1 | Close enough. | SNO78C::BRADLEY | G'day - Mark. | Tue May 03 1988 01:36 | 32 |
| You could store a time and a count of errors and approximate this result.
You just let the machine assume that the errors you have recorded
were evenly distributed between the recorded time and the time now.
Try this pseudocode:
    cnt  := 0
    time := 0

    start of error routine:
    check:
        if cnt = 0 then
            cnt  := 1
            time := now
        else if now - time > period of interest then
            time := time + (now - time) / cnt   ! age one error out of the window
            cnt  := cnt - 1
            go to check                         ! re-test with one fewer error
        else
            cnt := cnt + 1
        if cnt > number then error count exceeded
This would not give exact results, but it would use much less storage.
If you are storing 8-byte date/time stamps, the obvious method would
use 32 bytes; this way you use 8 bytes plus a counter (and the counter
need only be a few bits).
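For concreteness, here is roughly what that looks like in C. The names
(error_seen) and the constants (a 200-tic period, a limit of 5) are only
examples, not anything from a real driver:

    #define PERIOD 200L        /* "period of interest", in tics (example) */
    #define LIMIT  5           /* "number" of errors allowed (example)    */

    static long base_time = 0; /* "time": start of the assumed-even spread */
    static int  cnt       = 0; /* errors assumed spread since base_time    */

    /* Call once per error with the current time in tics.
       Returns 1 when the error count is exceeded.          */
    int error_seen(long now)
    {
        for (;;) {
            if (cnt == 0) {                    /* first error of a new run  */
                cnt = 1;
                base_time = now;
                break;
            }
            if (now - base_time > PERIOD) {
                base_time += (now - base_time) / cnt;  /* age one error out */
                cnt -= 1;                              /* ... and re-test   */
            } else {
                cnt += 1;                      /* error lands in the window */
                break;
            }
        }
        return cnt > LIMIT;
    }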
|
869.2 | Record (start-time, error-count) pairs | POOL::HALLYB | You have the right to remain silent. | Tue May 03 1988 11:47 | 40 |
| I'm trying to solve the same problem and have a couple of ideas.
Perhaps this is equivalent to .1, but I prefer to develop solutions
that acknowledge that errors tend to occur in bursts, not uniformly.
(1) Save ABSTIM tics (4 bytes) instead of a VMS timestamp. Cleaner,
and still convertible to absolute date/time if you need it.
(2) More to the point, let's say we observe a disk error at time
t = 123456 (tics). We keep only 1 entry for this entity, viz:
123456 1 (anywhere from 4 to 8 bytes, depending
on how you pack it).
When a second error occurs, we look to see if it is in the same
time-threshold, say 2 seconds or 200 tics. Suppose ABSTIM is now
124816. Since 124816 - 123456 = 1360 > 200, we "discard" the
prior data and save only the most recent:
124816 1 Only 1 error in this period of interest
Suppose an error occurs very soon after, at 124832. We now have:
124816 2 2 errors, no update to time base
And errors continue to show up at 124847, 124900, 124909 when
we have:
124816 5 5 errors, 124909 - 124816 = 93 < 200
At this point you have had 5 errors in the last 930 milliseconds,
which indicates a threshold has been crossed. Issue a CCCL
("Complain and Clear Counters Longword" -- try saying that 5 times
in 2 seconds!) and start over again.
What you are doing here is "starting" a 2-second (or whatever)
interval on the first new error. Now if you get say 4 errors
in .1 seconds, your error rate is 40/sec but this method won't
detect that unless you get a fifth error. This is a feature.
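A rough C sketch of this bookkeeping, with made-up names and the
200-tic / 5-error numbers from the example above:

    #define WINDOW    200L   /* "period of interest", e.g. 2 seconds of tics */
    #define MAX_ERRS  5      /* errors allowed inside one window             */

    static long win_start = 0;   /* tic of first error in current window */
    static int  win_count = 0;   /* errors seen since win_start          */

    /* Call on each error with the current ABSTIM tic count.
       Returns 1 when the caller should "CCCL" (complain and clear). */
    int note_error(long now)
    {
        if (win_count == 0 || now - win_start > WINDOW) {
            win_start = now;     /* start a fresh window at this error */
            win_count = 1;
        } else {
            win_count += 1;      /* same window: bump the count only   */
        }
        if (win_count >= MAX_ERRS) {
            win_count = 0;       /* clear counters; caller complains   */
            return 1;
        }
        return 0;
    }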
John
|
869.3 | | SSDEVO::LARY | | Tue May 03 1988 14:15 | 26 |
| One technique that uses very little storage is the "exponential decay"
threshold. You can do this in as little as one or two bytes per event type.
It goes something like this:
Initialize a variable V to zero.
Every time the event you are thresholding occurs, add a constant C to V.
Every second (or any other appropriate period) multiply V by an aging
factor 0 < A < 1.
If V ever exceeds some value T (T > C), declare a problem.
The action of this algorithm is very similar to a circuit breaker - it
will trip on a very high occurrence rate for a short time, or a somewhat lower
occurrence rate over a longer time. Below an occurrence rate of T(1-A)/C
events per period it will never trip.
The idea behind this is reasonable - events acquire less significance as
they recede into the past - but many people don't like this method because
it's hard to put the meaning of the threshold into English, and so you can
have trouble picking the values of A, C, and T that do what you want.
It also requires some compute time in the absence of any events to do
the aging, and to keep it numerically accurate you want to keep the period
small enough (and A large enough) that T >> C. I would only use it if saving
memory space were really important.
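A rough C sketch of this, with made-up values for A, C and T:

    #define DECAY_A   0.9    /* aging factor per period, 0 < A < 1 (example) */
    #define EVENT_C   1.0    /* added to V on each event (example)           */
    #define TRIP_T    5.0    /* trip level, T > C (example)                  */

    static double v = 0.0;

    /* Call once per aging period (e.g. from a once-a-second timer). */
    void age_counter(void)
    {
        v *= DECAY_A;
    }

    /* Call on each event; returns 1 when the "breaker" trips. */
    int count_event(void)
    {
        v += EVENT_C;
        return v > TRIP_T;
    }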
|
869.4 | | CLT::GILBERT | | Tue May 03 1988 15:11 | 28 |
| Replies 869.2 and 869.3 are actually fairly similar, especially
if you defer Richie's aging process until an event actually occurs.
For each kind of event E, you have two pieces of dynamic information:
E.v value or variable
E.t time associated with the value
and also some less dynamic information:
E.b a 'bound' or threshold
E.f a function, which incorporates the A and C constants in .3,
and/or the interval in .2.
When an event occurs, you compute:
E.v' <- 1 + E.f ( E.v, (current_time - E.t) )
E.t' <- current_time (actually 869.2 leaves this unchanged if E.v>0)
if E.v' > E.b then {report threshold exceeded}
The E.f function degrades the importance of the E.v value according
to the size of the second parameter, for example:
E.f(a,b) = a/(K*b),
or
E.f(a,b) = max(0, a-R*b),
or
E.f(a,b) = a*(D**-(K*b))
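For what it's worth, a rough C rendering of such a per-event record, with
the aging function passed in so that any of the example E.f forms above can
be plugged in. All the names and the rate constant are made up:

    struct event_rec {
        double v;                        /* E.v: current value        */
        long   t;                        /* E.t: time of last update  */
        double b;                        /* E.b: threshold            */
        double (*f)(double v, long dt);  /* E.f: aging function       */
    };

    /* One of the example forms above: E.f(a,b) = max(0, a - R*b). */
    static double decay_linear(double v, long dt)
    {
        const double r = 0.01;           /* assumed decay per tic */
        double aged = v - r * (double)dt;
        return (aged > 0.0) ? aged : 0.0;
    }

    /* Call on each occurrence of the event; returns 1 if E.b is exceeded. */
    int record_event(struct event_rec *e, long now)
    {
        e->v = 1.0 + e->f(e->v, now - e->t);
        e->t = now;
        return e->v > e->b;
    }

You would initialize one event_rec per event kind, pointing f at whichever
aging function (linear, exponential, or interval-based) suits the situation.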
|
869.5 | Some details on exponential decay. | PBSVAX::COOPER | Topher Cooper | Tue May 03 1988 16:37 | 81 |
| I got beaten to the punch but ...
Imagine that you had an unlimited amount of special memory with the
following property: when you put a (real) value in it, it decays
exponentially, i.e., if you put a 1.0 in it, after some fixed amount
of time HL, a read would get you .5, time HL later you would get .25
and so on. Now let's say that each time you get an error you set one
of these special decaying memories to a value, I, and add it to
the list of "active" values. Then you look at the sum of all the
values on the list, and if it exceeds a threshold value, T, then
you trigger an action.
The characteristics of this are that occasional errors would have
no effect -- their values would decay effectively to zero (relative
to the threshold) and be lost. Several errors in quick succession
would, on the other hand, quickly exceed the threshold and trigger
an action. A moderate rate of errors for a somewhat longer time
period would also trigger an action.
A method mathematically equivalent to this technique can be implemented
in a number of different ways.
The most general is to keep (for each kind of error) two values: LT
(the last time-stamp) and V (the current value). When an error occurs
do the following:
DT := time_now - LT;
LT := time_now;
V := V*C1^DT + I;
if V > T then ...ACTION...
"I" can be any real value, 1.0 being convenient. C1 is a constant
equal to 0.5^(1/HL). Here "^" means raise to the power. Note that
we are raising a real value (C1) to what is potentially an integer
power. There are some pretty fast routines for doing this (see
Knuth) directly, but it is not clear whether it is better (faster,
easier, more convenient) for you to do this that way or to convert
it to a floating point number and to use a general exponentiation
library routine. In the latter case, it is probably better to
replace the third line above with:
V := V*e^(DT*C2) + I;
Where C2 = log(C1) = -log(2)/HL, and e is the base of the natural
logarithms. The reason that this is preferred is that raising e to
an arbitrary power is usually faster than raising an arbitrary
value to an arbitrary power (the latter is generally done by taking
the logarithm-base-e of the value, multiplying by the power, then
raising e to the result. Here we've simply "precompiled" the first
part of this process).
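In C this boils down to something like the following; HL, I and T here
are just example values:

    #include <math.h>

    #define HALF_LIFE  100.0   /* HL, in tics (example) */
    #define INCREMENT  1.0     /* I                     */
    #define THRESHOLD  5.0     /* T                     */

    static double v  = 0.0;
    static long   lt = 0;      /* LT: last time-stamp, in tics */
    static const double c2 = -0.6931471805599453 / HALF_LIFE;  /* -ln(2)/HL */

    /* Call on each error with the current time in tics;
       returns 1 when the decayed sum exceeds T.             */
    int decaying_count(long now)
    {
        long dt = now - lt;
        lt = now;
        v = v * exp((double)dt * c2) + INCREMENT;
        return v > THRESHOLD;
    }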
If your program structure includes a basic loop which can be treated
as occurring in constant time much smaller than the expected time
between errors (for any particular kind of error), then a simpler
scheme can be used:
LT is no longer needed, only V for each kind of error. Each time
through the loop multiply every V by a constant W (whose value I
will discuss in a moment). When an error occurs, simply add 1.0
to the appropriate V and compare it to T. That's all.
Treat the average time through the loop as your unit of time, and
assume HL (the half-life) is expressed in terms of it, i.e., HL
is the number of times through the loop after which an error is
discounted 50% in effect. Now W is simply the same as C1 above,
i.e., 0.5^(1/HL).
How would you decide on HL and T? Experience is best, but you
can get a first cut by trying to answer the following questions.
If a series of the same error occurred in very rapid succession,
how many should occur before I trigger a response? The answer
is T. The second question is a bit trickier: Imagine that two
errors occur one after the other (assuming that T>2.0) and then
nothing happens for a while. Then T-1 errors occur in rapid
succession. For how long a time interval would you still be willing to
accept the initial blip as probably relevant/related to the later,
bigger burst of errors? The maximum such time is HL.
Hope this helps.
Topher
|
869.6 | How many & How often? | SNO78C::BRADLEY | G'day - Mark. | Tue May 03 1988 21:40 | 16 |
| Depending on how often the events occur and how much CPU you are willing to
use, you need only keep one time stamp with the decay method above. This
would force you to update all values every time you updated any one, so the
overhead would be extremely high for frequent errors.
At each invocation the error-checking code would decay all error values
by an amount proportional to the time since the last invocation, then
increment and check the appropriate one.
This would be reasonable for low error rates, or if the number of different
errors being tracked was small enough. But I suspect the latter is not the
case or you wouldn't have a memory problem.
Cheers,
Mark.
|
869.7 | Lowpass filter analogy | 16184::ROTH | If you plant ice you'll harvest wind | Wed May 04 1988 09:00 | 16 |
| An alternative useful point of view could be that of "low pass filtering"
of a sequence of impulses (events). The exponential decay method could
be represented as a simple RC lowpass filter; since the output of the
filter increases upon application of an impulse, it is only necessary
to update a simulation of the filtering when any event occurs.
Though nothing really new is added to the other replies by this analogy,
it may give another source of ideas on what an appropriate filtering
function may be to fit the situation. This is an old problem in the
design of communications gear, in the form of 'noise blankers' and the
like.
A fast table lookup (requiring memory, alas) could speed the function
evaluations.
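For instance, the decay factors could be precomputed once and indexed by
elapsed tics instead of calling exp() on every event (the table size, the
half-life and the names here are made up):

    #include <math.h>

    #define TABLE_SIZE 1024    /* beyond this, treat the value as fully decayed */
    #define HALF_LIFE  100.0   /* in tics (example)                             */

    static double decay_tab[TABLE_SIZE];

    void init_decay_table(void)
    {
        int dt;
        for (dt = 0; dt < TABLE_SIZE; dt++)
            decay_tab[dt] = pow(0.5, (double)dt / HALF_LIFE);
    }

    double decay_factor(long dt)
    {
        if (dt < 0) dt = 0;
        return (dt < TABLE_SIZE) ? decay_tab[dt] : 0.0;
    }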
- Jim
|
869.8 | a rule based on Bayes' Theorem | PULSAR::WALLY | Wally Neilsen-Steinhardt | Fri May 06 1988 10:36 | 205 |
| There are a lot of rules which can be put into the form given in .4.
I'd like to present one rule, give a general way of evaluating rules,
and then show why the one I have given is a good one, although not
necessarily the best.
The rule (using the notation of .4):
on startup, assign a constant to E.v, and current time (in tics)
to E.t
on a failure, subtract elapsed time from E.v, maximize with
zero, and add a constant. This means that E.f is of the
form:
E.f( a, b ) = max ( 0, a - b ) + c
if E.v is greater than a threshold, CCCL (the instruction
defined in .2)
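Before going on, here is a rough C sketch of that rule. The constants are
only placeholders; what they ought to mean comes out of the derivation below:

    #define V_START   0L     /* E.v at startup (scaled log of the startup odds) */
    #define FAIL_ADD  300L   /* added per failure (scaled log-likelihood ratio) */
    #define BOUND     1400L  /* threshold (scaled log of CGOOD/CBAD)            */

    static long ev;          /* E.v                     */
    static long et;          /* E.t, last failure, tics */

    void rule_init(long now)
    {
        ev = V_START;
        et = now;
    }

    /* Call on each failure; returns 1 when CCCL should be issued. */
    int rule_failure(long now)
    {
        ev -= (now - et);        /* each elapsed tic counts as one "success" */
        if (ev < 0) ev = 0;      /* bounded below at zero, per the rule      */
        ev += FAIL_ADD;
        et = now;
        return ev > BOUND;
    }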
Evaluating rules:
There are two general approaches to evaluating rules: craft tradition
and analysis. Craft tradition often gives excellent results, in which
case analysis is superfluous. But since the question is asked in .0,
I'll assume that craft tradition is inadequate here, and look to
analysis.
Decision theory is the basis for this analysis (_Statistics_, Winkler
and Hays, Holt, NY, 1975). We would like to minimize the costs of
making the decision, and also the cost of making errors. The goal in
this problem is to minimize the sum of the following costs, all measured
per unit time to keep them comparable:
computation
storage
delay while statistics accumulate
incorrect shutdown signals
incorrect stayup signals
In principle, we can evaluate any rule by determining its costs in the
categories above. Then the best rule is the one with the lowest
cost.
A good rule should have a few other characteristics:
It should have parameters which allow us to tailor it to
specific situations.
It should provide some guidance to setting the parameters based
on our understanding, to avoid pure guesswork.
It should be easy to understand and to explain.
In practice, it is often easier to look for a rule which meets
the criteria above and allows us to estimate the cost.
So I'll derive a rule from decision theory and show that it is almost
identical to that given above.
Assume that we are monitoring a unit which may exist in two states GOOD
and BAD. In state GOOD it shows a low but non-zero probability of
failure. In state BAD it shows a high but non-unit probability of
failure. This analysis can be extended to many states but it is messy.
Assume further that we can take two actions: NOP or CCCL. NOP just
means continue using the unit and monitoring, CCCL means take it off
line and hoist the repair flag.
We can tabulate the costs in the following table:
                 GOOD       BAD
      NOP          0        CBAD
      CCCL       CGOOD        0
CBAD is the cost of using the unit even though it is bad, and CGOOD
is the cost of not using the unit when it is still good. If P(S)
is the probability that the unit is in state S, and EC(A) is the
expected cost of the action A, then
EC(NOP) = 0 * P(GOOD) + CBAD * P(BAD)
EC(CCCL) = CGOOD * P(GOOD) + 0 * P(BAD)
At a given time, we want to take the action which has the lowest
expected cost. At the crossover point, the two expected costs are
equal, which leads to the inequality of ratios
P(BAD) CGOOD
------ > -----
P(GOOD) CBAD
and the decision rule that we take action NOP unless this inequality is
satisfied.
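(To make that concrete with made-up numbers: if taking a GOOD unit off
line costs CGOOD = 1 and running with a BAD one costs CBAD = 9, the
right-hand side is 1/9, so we issue CCCL only once the evidence pushes
P(BAD)/P(GOOD) above 1/9 -- that is, P(BAD) above 10%.)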
Now assume that we observe the system and that the outcome of our
observation has two states: SUCCESS and FAILURE. Again we can
generalize to more than two states, but I'll pass on that. Then the
observation can be used to modify the P(S) using Bayes' Theorem:
P(BAD|SUCCESS) P(SUCCESS|BAD) P(BAD)
-------------- = -------------- * ------
P(GOOD|SUCCESS) P(SUCCESS|GOOD) P(GOOD)
P(BAD|FAILURE) P(FAILURE|BAD) P(BAD)
-------------- = -------------- * ------
P(GOOD|FAILURE) P(FAILURE|GOOD) P(GOOD)
and these formulas can be used iteratively: every time you get a success
use the first formula to recompute the probabilities, and every time you
get a failure, use the second formula.
The iteration needs to be bounded, so that a long string of successes
does not convince us that the device is perfect. We put a limit on
P(BAD| long string of successes )
---------------------------------
P(GOOD| long string of successes )
to be the probability ratio indicating the unit will go BAD, even though
it has been GOOD up to now.
We start the iteration at
P(BAD| startup )
----------------
P(GOOD| startup )
which is the probability ratio indicating that the unit is BAD on startup.
We terminate the iteration when the probability ratio satisfies the
inequality above.
This gives us a good rule, except that it requires a lot of
multiplication and forces us to deal with some very small numbers.
We can convert to addition and deal with reasonable size numbers by
taking the logarithms of everything in sight. We can then take
advantage of the fact that our basic inequality is unaffected by a
linear transform
x' = d * x + e
of the logarithms. We choose e so that
P(BAD| long string of successes )
---------------------------------
P(GOOD| long string of successes )
has the value zero, and we choose d so that
P(SUCCESS|BAD)
--------------
P(SUCCESS|GOOD)
has the value of -1. This allows us to record a success by just
decrementing a counter. But we can save work by just saving the old
time in ticks and then when a failure occurs, subtracting the elapsed
ticks all at once.
Now observe that the rule at the top of this note is exactly the rule we
have derived. We can give a real meaning to each of the parameters in
the rule above. The threshold value is just the linear transform of the
logarithm of
CGOOD
-----
CBAD
The constant c is just the linear transform of the
logarithm of
P(FAILURE|BAD)
--------------
P(FAILURE|GOOD)
minus one, if I understand the notation of .4 correctly.
Now I will argue that the rule given above is a good one. It uses
little memory, about 8 bytes, although .2 uses 5 or 6. It uses little
computation: a bounded subtraction, an addition, and a compare per
failure. It has little delay while failures accumulate.
The size of the delay depends on the ratios above, but probably
corresponds to about five failures, in a realistic situation. By the
derivation, the cost of incorrect signals has been minimized. It has
parameters which can tailor it to a situation, and the parameters can be
chosen with a minimum of guesswork. It is fairly easy to explain and
understand.
I think there is a way to strengthen the argument above, by proving that
the statistic is a complete summary of the relevant information but I
don't remember how to do that. Anyway, I think it requires assuming a
Poisson process, and as .2 says, errors really come in bursts.
This rule is also closely related to the Sequential Probability Ratio
Test of Abraham Wald, although his reasoning process was different.
Finally, note that all these notes have assumed that failures per time
is a relevant statistic. In some cases, failures per trial (per packet
sent, per controller command, per memory reference, or whatever) may be
better. This approach can be applied to either situation, although
failures per trial requires that we decrement a counter bounded by zero,
once for each success.
|
869.9 | Periodic sampling vs. real-time processing | POOL::HALLYB | Lookit all the happy creatures dancin' on the lawn | Mon May 09 1988 14:51 | 39 |
| .8 looks like a really good approach. Thanks, Wally!
Looking things over, I note that there are two different ways one
might observe errors:
[1] By catching them on-the-fly, say within the device driver, and
running the .8 algorithm on the spot. Very important for certain
real-time devices.
[2] By periodic inspection of error counts, from a layered product.
I wonder if .8 is sensitive to which approach is used. It would
seem not to matter too much. From a practical standpoint, option
[2] is really the only one to use "systemwide", inasmuch as device
drivers are written by disparate groups and have disparate error
characteristics, often unknown when the driver is first written.
If one periodically inspects the error count (say every t1 units)
and observes 0 errors, then everything is fine ([1] == [2]).
If one observes N > 0 errors per interval, then arbitrarily assign
event times uniformly within the interval and evaluate E.v N times,
once for each error.
(a) If E.v is greater than the error threshold, CCCL.
(b) If E.v is less than the error threshold, and N > 1,
then reduce the sample interval t1 to something shorter.
The notion here is that you MAY have had a burst above
the acceptable values, but because of a long t1 you did not
get a chance to see it. This is the case where [1] is more
accurate than [2]. Reduce your sampling interval to obtain
a more accurate estimate of the time of each error.
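A rough C sketch of this wrapper, where note_error() stands in for
whichever per-error evaluator from the earlier replies is used, and all
the names and constants are made up:

    extern int note_error(long now);   /* returns 1 when the threshold trips */

    #define MIN_INTERVAL 10L           /* shortest sample interval, in tics  */

    /* Call every *interval tics with n = errors counted since last sample.
       Returns 1 if CCCL should be issued; may shrink *interval.             */
    int sample_errors(long now, long *interval, int n)
    {
        int i, tripped = 0;

        for (i = 1; i <= n; i++) {
            /* pretend the n errors were spread evenly over the interval */
            long t = now - *interval + (i * *interval) / (n + 1);
            if (note_error(t))
                tripped = 1;
        }
        if (!tripped && n > 1 && *interval > MIN_INTERVAL)
            *interval /= 2;    /* a burst may be hiding; sample more often */
        return tripped;
    }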
This leaves open the question of what the minimum sampling interval
should be, but if errors are THAT frequent you might want to cluster
errors into groups of M instead of trying to deal with them individually.
John
|
869.10 | fixed time sampling actually simplifies | PULSAR::WALLY | Wally Neilsen-Steinhardt | Tue May 10 1988 13:46 | 58 |
| re: < Note 869.9 by POOL::HALLYB "Lookit all the happy creatures dancin' on the lawn" >
-< Periodic sampling vs. real-time processing >-
| Approach [1] has one big advantage: the minimum delay time
between when the device goes BAD and when CCCL is executed. This
advantage might be enough to motivate DEC to find a way to exploit
it. For example, the device driver may respond to an error by calling
back to the exec which has the code and tables to compute .8. And
some facility could be provided to tune these tables as error behavior
became known.
If we use method 2, then the same ideas apply but the math is
different. It is roughly equivalent to that suggested in .9, but
somewhat simpler. And we have to face up to our ignorance of the
burstiness of the errors.
We need to replace the fractions in .8 with fractions like
P( k | BAD )
------------
P( k | GOOD )
where k is the error count in the interval, and is allowed to run
from 0 up to n. If the error count is greater than n, assume the
device is BAD and CCCL. The value of n can be computed from the
cost fraction and the lower bound on the fraction
P( BAD )
--------
P( GOOD )
So the program keeps a table of the scaled logarithms of
P( k | BAD )
------------
P( k | GOOD )
(some positive and some negative), uses the actual count to index
into the table, and adds the value as before. Now there is no
separate subtraction for time, because all time intervals are the
same, so it is built into the fraction above.
All the work I've seen calculates these fractions by assuming a
Poisson process, with no burstiness. If you really want to include
burstiness, you would have to observe the device for a while and
derive the fractions empirically. Assuming a Poisson process, the
program could calculate the fraction on the fly, but I'd guess that
storing a table is better unless n is large.
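A rough C sketch of building such a table under the Poisson assumption;
the rates, the scale factor and n are placeholders. Note that with
P(k|state) Poisson, log( P(k|BAD)/P(k|GOOD) ) = k*ln(rb/rg) - (rb - rg),
so no factorials are needed:

    #include <math.h>

    #define N_MAX     10       /* counts above this mean CCCL outright */
    #define RATE_GOOD 0.01     /* expected errors/interval when GOOD   */
    #define RATE_BAD  2.0      /* expected errors/interval when BAD    */
    #define SCALE     100.0    /* the linear transform's "d"           */

    static long score_tab[N_MAX + 1];

    void init_score_table(void)
    {
        int k;
        for (k = 0; k <= N_MAX; k++) {
            double llr = k * log(RATE_BAD / RATE_GOOD) - (RATE_BAD - RATE_GOOD);
            score_tab[k] = (long)(SCALE * llr);
        }
    }

    /* Add the score for this interval's count k to the running value v. */
    long update_score(long v, int k)
    {
        return (k > N_MAX) ? v + score_tab[N_MAX]   /* or CCCL immediately */
                           : v + score_tab[k];
    }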
> This leaves open the question of what the minimum sampling interval
> should be, but if errors are THAT frequent you might want to cluster
> errors into groups of M instead of trying to deal with them individually.
The interval is basically determined by the tradeoff between the
cost of computing versus the cost of running with a BAD device.
Rather than estimate these probably unknowable costs, I'd hazard
a guess that the sampling interval should be chosen so that the
whole error checking process took about 0.01% of CPU time.
|
869.11 | Fixed-time does seem workable | 19763::HALLYB | Lookit all the happy creatures dancin' on the lawn | Wed May 11 1988 13:37 | 26 |
| >> This leaves open the question of what the minimum sampling interval
>> should be, but if errors are THAT frequent you might want to cluster
>> errors into groups of M instead of trying to deal with them individually.
>
> The interval is basically determined by the tradeoff between the
> cost of computing versus the cost of running with a BAD device.
> Rather than estimate these probably unknowable costs, I'd hazard
> a guess that the sampling interval should be chosen so that the
> whole error checking process took about 0.01% of CPU time.
Yes, that seems reasonable. The notion behind the grouping suggestion
in .9 was that there are certain "errors" that actually should NOT be
handled individually. Various Ethernet events come to mind, as do
magtape read errors. These have a high cost of computing, owing to
their frequency of occurrence. On the other hand, errors on the
system disk probably deserve a lot of analysis (is it the same block
as some earlier error? Same read head? Critical file? etc.). Ditto
for correctable memory reads.
If you take the assumption that most sample intervals will typically
only require updating of a table entry, then you can calculate the
sampling interval needed to take some arbitrarily small fraction of
CPU time. I think that an interval of about 10 seconds meets the
0.01% figure from .10. At least for my local 8550.
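(As a sanity check on the arithmetic: 0.01% of a 10-second interval is
one millisecond, so the per-sample bookkeeping only has to average well
under a millisecond for that figure to hold.)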
John
|
869.12 | | CLT::GILBERT | | Wed May 11 1988 13:50 | 3 |
| I've either forgotten, or missed something. Instead of sampling
the numbers at regular time intervals, why not do the processing
(and possibly the CCCL) when an error actually occurs?
|
869.13 | We're not writing the driver here | POOL::HALLYB | Lookit all the happy creatures dancin' on the lawn | Wed May 11 1988 18:02 | 9 |
| > Why not do the processing (and possibly the CCCL) when an error actually occurs?
| Because THAT code was written by somebody else, all you have is a
binary device driver, and all you see is an error count occasionally.
Or perhaps your device handles errors on its own and doesn't bother
to tell you until you explicitly inquire about errors.
John
|
869.14 | Thresholding | ELWD2::CHINNASWAMY | | Wed May 18 1988 16:14 | 8 |
|
If the thresholding is used to identify a problem with the device,
then usage must also be taken into account along with time.
I presume that the preceding discussions assume constant usage of
the device.
For instance, if you don't use a disk, you may not see an error.
|
869.15 | hidden humor in .14 made explicit | CLT::GILBERT | | Wed May 18 1988 18:46 | 2 |
| Q: "My disk has a high error rate. What should I do?"
A: "Do disk operations less frequently."
|