
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

869.0. "thresholding" by VINO::JMUNZER () Mon May 02 1988 18:58

Suppose you want to threshold a computer error.  Every time you see the
error, you want to see if you've gotten five similar errors in the last
two seconds -- or some such numbers.  (If you've seen too many errors in
too short a time interval, you want to kick some hardware out of the system.)

One way to do it is to keep timestamps for the last four errors.  Then you
can always answer the thresholding question (see note.)  But if there are a
lot of different errors, and their thresholds require more than five repeated
errors, that could be a lot of timestamps.

Anybody got a way to save less information and get a pretty good approximation
of the desired thresholding method?

Thanks,
John

 **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  ** 

NOTE:	On error:	if (time_now - time_1) < 2_seconds
				YOU'RE_OVER_THRESHOLD
			else
				YOU'RE_UNDER_THRESHOLD
				time_1 = time_2
				time_2 = time_3
				time_3 = time_4
				time_4 = time_now
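
The exact method in the NOTE can be sketched like this (Python; the class name and defaults are illustrative, keeping the last four timestamps for a "5 errors in 2 seconds" rule; unlike the NOTE, this version records the new timestamp in both branches):

```python
from collections import deque

class ExactThreshold:
    """Exact thresholding: keep timestamps of the last N-1 errors and,
    on each new error, check whether N errors fall inside the window."""

    def __init__(self, threshold_count=5, window=2.0):
        self.window = window
        # timestamps of the previous threshold_count - 1 errors
        self.times = deque(maxlen=threshold_count - 1)

    def on_error(self, now):
        # full buffer + oldest stamp still inside the window => over threshold
        over = (len(self.times) == self.times.maxlen
                and now - self.times[0] < self.window)
        self.times.append(now)  # the time_1..time_4 shift in the NOTE
        return over
```

The `deque` with `maxlen` does the shifting automatically; storage is still one timestamp per error in the window, which is the cost the base note wants to avoid.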
869.1. "Close enough." by SNO78C::BRADLEY (G'day - Mark.) Tue May 03 1988 01:36
You could store a time and a count of errors and approximate this result.

You just let the machine assume that the number of errors you have recorded 
were evenly distributed between the recorded time and the time now.

Try this pseudocode:

cnt := 0
time := 0

start of error routine:

if cnt = 0 then    <----------------------------
    cnt := 1                                   |
    time := now                                |
else                                           |
    if now - time > period of interest then    |
        time := time + (now - time) / cnt;     |
        cnt := cnt - 1                         |
        go -------------------------------------
    else
        cnt := cnt + 1

if cnt > number then error count exceeded


This would not give exact results but would use much less storage.
If you are storing 8 byte date time stamps then the obvious method would 
use 32 bytes.

This way you use 8 bytes + counter (and the counter need only be a few 
bits).
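A runnable version of this pseudocode (Python; names are mine, and the "go" arrow back to the top becomes a loop):

```python
class ApproxThreshold:
    """One timestamp plus a count, assuming recorded errors were spread
    evenly between `time` and now (after the pseudocode above)."""

    def __init__(self, limit=5, period=2.0):
        self.limit = limit      # "number" in the pseudocode
        self.period = period    # period of interest
        self.cnt = 0
        self.time = 0.0

    def on_error(self, now):
        while True:             # the loop the "go" arrow draws
            if self.cnt == 0:
                self.cnt = 1
                self.time = now
                break
            if now - self.time > self.period:
                # age out one assumed-evenly-spaced error, then retry
                self.time += (now - self.time) / self.cnt
                self.cnt -= 1
            else:
                self.cnt += 1
                break
        return self.cnt > self.limit
```

Storage is one float and a small counter per error kind, as the reply says.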
869.2. "Record (start-time, error-count) pairs" by POOL::HALLYB (You have the right to remain silent.) Tue May 03 1988 11:47
    I'm trying to solve the same problem and have a couple ideas.
    Perhaps this is equivalent to .1, but I prefer to develop solutions
    that acknowledge that errors tend to occur in bursts, not uniformly.
    
    (1) Save ABSTIM tics (4 bytes) instead of a VMS timestamp.  Cleaner,
    	and still convertible to absolute date/time if you need it.
    
    (2) More to the point, let's say we observe a disk error at time
    	t = 123456 (tics).  We keep only 1 entry for this entity, viz:
    
    		123456  1	(anywhere from 4 to 8 bytes, depending
    				 on how you pack it).
    
        When a second error occurs, we look to see if it is in the same
    	time-threshold, say 2 seconds or 200 tics.  Suppose ABSTIM is now
    	124816.  Since 124816 - 123456 = 1360 > 200, we "discard" the
    	prior data and save only the most recent:
    
    		124816  1	Only 1 error in this period of interest
    
    	Suppose an error occurs very soon after, at 124832.  We now have:

    		124816	2	2 errors, no update to time base
    
    	And errors continue to show up at 124847, 124900, 124909 when
    	we have:
    
    		124816	5	5 errors, 124909 - 124816 = 93 < 200
    
    	At this point you have had 5 errors in the last 930 milliseconds,
    	which indicates a threshold has been crossed.  Issue a CCCL
	("Complain and clear counters longword", try saying that 5 times
    	in 2 seconds!) and start over again.
    
    	What you are doing here is "starting" a 2-second (or whatever)
    	interval on the first new error.  Now if you get say 4 errors
    	in .1 seconds, your error rate is 40/sec but this method won't
	detect that unless you get a fifth error.  This is a feature.
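
    A sketch of this scheme (Python rather than tics-and-longwords; the
    class name is mine, and I have it clear its counters on a trip, per
    the CCCL description):

```python
class BurstThreshold:
    """One (start_time, count) pair per entity: the window "starts" at
    the first error after a quiet spell (times in tics, 100 tics/sec)."""

    def __init__(self, limit=5, window=200):   # 200 tics = 2 seconds
        self.limit = limit
        self.window = window
        self.start = None
        self.count = 0

    def on_error(self, now):
        if self.start is None or now - self.start > self.window:
            self.start, self.count = now, 1    # "discard" the prior data
        else:
            self.count += 1                    # no update to time base
        if self.count >= self.limit:
            self.start, self.count = None, 0   # CCCL: complain and clear
            return True
        return False
```

    Run against the tic values in the example above, it trips on the
    fifth error at 124909, as described.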
    
          John
869.3. by SSDEVO::LARY () Tue May 03 1988 14:15
One technique that uses very little storage is the "exponential decay"
threshold. You can do this in as little as one or two bytes per event type.
It goes something like this:

Initialize a variable V to zero.

Every time the event you are thresholding occurs, add a constant C to V.

Every second (or any other appropriate period) multiply V by an aging
factor 0 < A < 1.

If V ever exceeds some value T (T > C), declare a problem.

The action of this algorithm is very similar to a circuit breaker - it
will trip on a very high occurrence rate for a short time, or on a somewhat
lower occurrence rate over a longer time. Below an occurrence rate of T(1-A)/C
events per period it will never trip.

The idea behind this is reasonable - events acquire less significance as
they recede into the past - but many people don't like this method because
it's hard to put the meaning of the threshold into English, so you can
have trouble picking the values of A, C, and T that do what you want.
It also requires some compute time in the absence of any events to do
the aging, and to keep it numerically accurate you want to keep the period
small enough (and A large enough) that T >> C. I would only use it if saving
memory space were really important.
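
The scheme above might look like this (Python sketch; the constants C=1,
A=0.5, T=4 are illustrative, not recommendations):

```python
class DecayThreshold:
    """Circuit-breaker style: add C per event, multiply by the aging
    factor A once per period, trip when the value exceeds T."""

    def __init__(self, C=1.0, A=0.5, T=4.0):
        self.C, self.A, self.T = C, A, T
        self.v = 0.0

    def on_event(self):
        self.v += self.C
        return self.v > self.T      # True => declare a problem

    def tick(self):
        """Call once per aging period (e.g. once a second)."""
        self.v *= self.A
```

With these constants the break-even rate T(1-A)/C is 2 events per period:
a steady 1 event/period never trips, while a burst of 5 in one period does.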
869.4. by CLT::GILBERT () Tue May 03 1988 15:11
    Replies 869.2 and 869.3 are actually fairly similar, especially
    if you defer Richie's aging process until an event actually occurs.

    For each kind of event E, you have two pieces of dynamic information:

	E.v	value or variable
	E.t	time associated with the value

    and also some less dynamic information:

	E.b	a 'bound' or threshold
	E.f	a function, which incorporates the A and C constants in .3,
		and/or the interval in .2.

    When an event occurs, you compute:

	E.v' <- 1 + E.f ( E.v, (current_time - E.t) )
	E.t' <- current_time	(actually 869.2 leaves this unchanged if E.v>0)
	if E.v' > E.b then {report threshold exceeded}

    The E.f function degrades the importance of the E.v value according
    to the size of second parameter, for example:

	E.f(a,b) = a/(K*b),
    or
	E.f(a,b) = max(0, a-R*b),
    or
	E.f(a,b) = a*(D**-(K*b))
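
    A sketch of this unified scheme (Python; the names are mine, the
    final comparison is read as being on the value E.v', and the third
    E.f variant with D=2, K=1 is picked purely for illustration):

```python
import math

def make_event(f, bound):
    """Per-event-kind state: value E.v, time E.t, function E.f, bound E.b."""
    return {"v": 0.0, "t": 0.0, "f": f, "b": bound}

def on_event(e, now):
    # E.v' <- 1 + E.f(E.v, current_time - E.t);  E.t' <- current_time
    e["v"] = 1.0 + e["f"](e["v"], now - e["t"])
    e["t"] = now
    return e["v"] > e["b"]          # report threshold exceeded

# One possible E.f: exponential decay, E.f(a,b) = a * D**-(K*b)
decay_f = lambda a, b: a * math.pow(2.0, -b)    # D = 2, K = 1
```

    Swapping in the other E.f variants (a/(K*b) or max(0, a-R*b)) changes
    the aging behavior without touching the rest of the machinery.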
869.5. "Some details on exponential decay." by PBSVAX::COOPER (Topher Cooper) Tue May 03 1988 16:37
    I got beaten to the punch but ...
    
    Imagine that you had an unlimited amount of special memory with the
    following property: when you put a (real) value in it, it decays
    exponentially, i.e., if you put a 1.0 in it, after some fixed amount
    of time HL, a read would get you .5, time HL later you would get .25
    and so on.  Now let's say that each time you get an error you set one
    of these special decaying memories to a value, I, and add it to
    the list of "active" values.  Then you look at the sum of all the
    values on the list and if it exceeds a threshold value, T, you
    trigger an action.

    The characteristics of this are that an occasional error would have
    no effect -- their values would decay effectively to zero (relative
    to the threshold) and be lost.  Several errors in quick succession
    would, on the other hand quickly exceed the threshold and trigger
    an action.  A moderate rate of errors for a somewhat longer time
    period would also trigger an action.

    A method mathematically equivalent to this technique can be implemented
    in a number of different ways.

    The most general is to keep (for each kind of error) two values: LT
    (the last time-stamp) and V (the current value).  When an error occurs
    do the following:

	    DT := time_now - LT;
	    LT := time_now;
	    V := V*C1^DT + I;
	    if V > T then ...ACTION...

    "I" can be any real value, 1.0 being convenient.  C1 is a constant
    equal to 0.5^(1/HL).  Here "^" means raise to the power.  Note that
    we are raising a real value (C1) to what is potentially an integer
    power.  There are some pretty fast routines for doing this (see
    Knuth) directly, but it is not clear whether it is better (faster,
    easier, more convenient) to do it that way or to convert the
    exponent to a floating point number and use a general exponentiation
    library routine.  In the latter case, it is probably better to
    replace the third line above with:

	    V := V*e^(DT*C2) + I;

    Where C2 = log(C1) = -log(2)/HL, and e is the base of the natural
    logarithms.  The reason that this is preferred is that raising e to
    an arbitrary power is usually faster than raising an arbitrary
    value to an arbitrary power (the latter is generally done by taking
    the logarithm-base-e of the value, multiplying by the power then
    raising e to the result.  Here we've simply "precompiled" the first
    part of this process).
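
    The timestamped variant, written out (Python sketch; HL, I and T
    values are illustrative, and e^(DT*C2) uses the precompiled C2 from
    above):

```python
import math

HL = 2.0                        # half-life in seconds
C2 = -math.log(2.0) / HL        # precompiled: C2 = log(C1) = -log(2)/HL
I, T = 1.0, 4.0                 # increment per error, trigger threshold

state = {"v": 0.0, "lt": 0.0}   # V and LT for one kind of error

def on_error(now):
    """DT := now - LT;  LT := now;  V := V*e^(DT*C2) + I;  compare to T."""
    dt = now - state["lt"]
    state["lt"] = now
    state["v"] = state["v"] * math.exp(dt * C2) + I
    return state["v"] > T       # True => ...ACTION...
```

    After exactly one half-life the decay factor is e^(HL*C2) = 0.5, as
    required.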

    If your program structure includes a basic loop which can be treated
    as occurring in constant time much smaller than the expected time
    between errors (for any particular kind of error), then a simpler
    scheme can be used:

    LT is no longer needed, only V for each kind of error.  Each time
    through the loop multiply every V by a constant W (whose value I
    will discuss in a moment).  When an error occurs, simply add 1.0
    to the appropriate V and compare it to T.  That's all.

    Treat the average time through the loop as your unit of time, and
    assume HL (the half-life) is expressed in terms of it, i.e., HL
    is the number of times through the loop after which an error is
    discounted 50% in effect.  Now W is simply the same as C1 above,
    i.e., 0.5^(1/HL).

    How would you decide on HL and T?  Experience is best, but you
    can get a first cut by trying to answer the following questions.
    If a series of the same error occurred in very rapid succession,
    how many should occur before I trigger a response?  The answer
    is T.  The second question is a bit trickier:  Imagine that two
    errors occur one after the other (assuming that T>2.0) and then
    nothing happens for a while.  Then T-1 errors occur in rapid
    succession.  How long a time interval would you be willing to
    accept the initial blip as probably relevant/related to the later
    bigger shot of errors?  The maximum such time is HL.

    Hope this helps.

				    Topher
869.6. "How many & How often?" by SNO78C::BRADLEY (G'day - Mark.) Tue May 03 1988 21:40
Depending on how often the events occur and how much CPU you are willing to 
use, you need only keep one time stamp (shared across all error kinds) with 
the decay method above.  This would force you to update all values every time 
you updated any one, so the overhead would be extremely high for frequent 
errors.

At each invocation the error checking code would decay all error values 
by an amount proportional to the time since the last invocation, then 
increment and check the appropriate one.

This would be reasonable for low error rates, or if the number of different 
errors being tracked was small enough. But I suspect the latter is not the 
case or you wouldn't have a memory problem.

Cheers,

Mark.
869.7. "Lowpass filter analogy" by 16184::ROTH (If you plant ice you'll harvest wind) Wed May 04 1988 09:00
    An alternative useful point of view could be that of "low pass filtering"
    of a sequence of impulses (events).  The exponential decay method could
    be represented as a simple RC lowpass filter; since the output of the
    filter increases upon application of an impulse, it is only necessary
    to update a simulation of the filtering when any event occurs.

    Though nothing really new is added to the other replies by this analogy,
    it may give another source of ideas on what an appropriate filtering
    function may be to fit the situation.  This is an old problem in the
    design of communications gear, in the form of 'noise blankers' and the
    like.

    A fast table lookup (requiring memory, alas) could speed the function
    evaluations.

    - Jim
869.8. "a rule based on Bayes' Theorem" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Fri May 06 1988 10:36
There are a lot of rules which can be put into the form given in .4.
I'd like to present one rule, give a general way of evaluating rules, 
and then show why the one I have given is a good one, although not 
necessarily the best.


The rule (using the notation of .4):

	on startup, assign a constant to E.v, and current time (in tics) 
		to E.t
	on a failure, subtract elapsed time from E.v, maximize with 
		zero, and add a constant.  This means that E.f is of the 
		form:

			E.f( a, b ) = max ( 0, a - b ) + c

	if E.v is greater than a threshold, CCCL (the instruction 
		defined in .2)


Evaluating rules:

There are two general approaches to evaluating rules: craft tradition 
and analysis.  Craft tradition often gives excellent results, in which 
case analysis is superfluous.  But since the question is asked in .0, 
I'll assume that craft tradition is inadequate here, and look to 
analysis.

Decision theory is the basis for this analysis (_Statistics_, Winkler 
and Hays, Holt, NY, 1975).  We would like to minimize the costs of 
making the decision, and also the cost of making errors.  The goal in 
this problem is to minimize the sum of the following costs, all measured 
per unit time to keep them comparable:

	computation
	storage
	delay while statistics accumulate
	incorrect shutdown signals
	incorrect stayup signals

In principle, we can evaluate any rule by determining its costs in the 
five categories above.  Then the best rule is the one with the lowest 
cost. 

A good rule should have a few other characteristics:

	It should have parameters which allow us to tailor it to 
	specific situations.

	It should provide some guidance to setting the parameters based 
	on our understanding, to avoid pure guesswork.

	It should be easy to understand and to explain.

In practice, it is often easier to look for a rule which meets 
the criteria above and allows us to estimate the cost.

So I'll derive a rule from decision theory and show that it is almost 
identical to that given above.

Assume that we are monitoring a unit which may exist in two states GOOD 
and BAD.  In state GOOD it shows a low but non-zero probability of 
failure.  In state BAD it shows a high but non-unit probability of 
failure.  This analysis can be extended to many states but it is messy.

Assume further that we can take two actions: NOP or CCCL.  NOP just 
means continue using the unit and monitoring, CCCL means take it off 
line and hoist the repair flag.

We can tabulate the costs in the following table:

			GOOD	BAD

		NOP	0	CBAD
		CCCL	CGOOD	0

CBAD is the cost of using the unit even though it is bad, and CGOOD 
is the cost of not using the unit when it is still good.  If P(S) 
is the probability that the unit is in state S, and EC(A) is the 
expected cost of the action A, then

	EC(NOP) = 0 * P(GOOD) + CBAD * P(BAD)
	EC(CCCL) = CGOOD * P(GOOD) + 0 * P(BAD)

At a given time, we want to take the action which has the lowest 
expected cost.  At the crossover point, the two expected costs are 
equal, which leads to the inequality of ratios

	P(BAD) 		CGOOD
	------	>	-----
	P(GOOD)		CBAD

and the decision rule that we take action NOP unless this inequality is 
satisfied.

Now assume that we observe the system and that the outcome of our 
observation has two states: SUCCESS and FAILURE.  Again we can 
generalize to more than two states, but I'll pass on that.  Then the 
observation can be used to modify the P(S) using Bayes' Theorem:


	P(BAD|SUCCESS) 		P(SUCCESS|BAD)		P(BAD) 
	--------------	=	--------------	*	------
	P(GOOD|SUCCESS) 	P(SUCCESS|GOOD)		P(GOOD) 


	P(BAD|FAILURE) 		P(FAILURE|BAD)		P(BAD) 
	--------------	=	--------------	*	------
	P(GOOD|FAILURE) 	P(FAILURE|GOOD)		P(GOOD) 


and these formulas can be used iteratively: every time you get a success 
use the first formula to recompute the probabilities, and every time you 
get a failure, use the second formula.

The iteration needs to be bounded, so that a long string of successes 
does not convince us that the device is perfect.  We put a limit on

	P(BAD| long string of successes )
	---------------------------------
	P(GOOD| long string of successes )

to be the probability ratio indicating the unit will go BAD, even though 
it has been GOOD up to now.

We start the iteration at 

	P(BAD| startup )
	----------------
	P(GOOD| startup )

which is the probability ratio indicating that the unit is BAD on startup.

We terminate the iteration when the probability ratio satisfies the 
inequality above.

This gives us a good rule, except that it requires a lot of 
multiplication and forces us to deal with some very small numbers.  
We can convert to addition and deal with reasonable-size numbers by 
taking the logarithms of everything in sight.  We can then take 
advantage of the fact that our basic inequality is unaffected by a 
linear transform

	x' = d * x + e

of the logarithms.  We choose e so that 

	P(BAD| long string of successes )
	---------------------------------
	P(GOOD| long string of successes )

has the value zero, and we choose d so that 

	P(SUCCESS|BAD)	
	--------------	
	P(SUCCESS|GOOD)	

has the value of -1.  This allows us to record a success by just 
decrementing a counter.  But we can save work by just saving the old 
time in ticks and then when a failure occurs, subtracting the elapsed 
ticks all at once.

Now observe that the rule at the top of this note is exactly the rule we 
have derived.  We can give a real meaning to each of the parameters in 
the rule above.  The threshold value is just the linear transform of the 
logarithm of 

		CGOOD
		-----
		CBAD

The constant c is just the linear transform of the 
logarithm of 

		P(FAILURE|BAD)	
		--------------	
		P(FAILURE|GOOD)		

minus one, if I understand the notation of .4 correctly.

Now I will argue that the rule given above is a good one.  It uses 
little memory, about 8 bytes, although .2 uses 5 or 6.  It uses little 
computation per failure, a bounded subtraction and an addition and a 
compare per failure.  It has little delay while failures accumulate.  
The size of the delay depends on the ratios above, but probably 
corresponds to about five failures, in a realistic situation.  By the 
derivation, the cost of incorrect signals has been minimized.  It has 
parameters which can tailor it to a situation, and the parameters can be 
chosen with a minimum of guesswork.  It is fairly easy to explain and 
understand.

I think there is a way to strengthen the argument above, by proving that 
the statistic is a complete summary of the relevant information but I 
don't remember how to do that.  Anyway, I think it requires assuming a 
Poisson process, and as .2 says, errors really come in bursts.

This rule is also closely related to the Sequential Probability Ratio 
Test of Abraham Wald, although his reasoning process was different.

Finally, note that all these notes have assumed that failures per time 
is a relevant statistic.  In some cases, failures per trial (per packet 
sent, per controller command, per memory reference, or whatever) may be 
better.  This approach can be applied to either situation, although 
failures per trial requires that we decrement a counter bounded by zero, 
once for each success.
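
The rule at the top of this note, written out (Python sketch; the values
of c and the threshold below are illustrative stand-ins for the scaled
logarithms derived above, not computed from real probabilities):

```python
class BayesThreshold:
    """Scaled log-odds rule: each elapsed tic of success subtracts 1,
    each failure adds c; floor at 0, CCCL above the threshold."""

    def __init__(self, c=50, threshold=200):
        # c ~ scaled log of P(FAILURE|BAD)/P(FAILURE|GOOD), minus one;
        # threshold ~ scaled log of CGOOD/CBAD; 0 is the floored value
        # of the ratio after a long string of successes.
        self.c = c
        self.threshold = threshold
        self.v = 0
        self.t = 0

    def on_failure(self, now_tics):
        # E.f(a, b) = max(0, a - b) + c, as in the rule above
        self.v = max(0, self.v - (now_tics - self.t)) + self.c
        self.t = now_tics
        return self.v > self.threshold      # True => CCCL
```

With these constants, failures every 10 tics accumulate evidence and trip
on the fifth, while failures 100 or more tics apart never do.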
869.9. "Periodic sampling vs. real-time processing" by POOL::HALLYB (Lookit all the happy creatures dancin' on the lawn) Mon May 09 1988 14:51
    .8 looks like a really good approach.  Thanks, Wally!
    
    Looking things over, I note that there are two different ways one
    might observe errors:
    
    [1] By catching them on-the-fly, say within the device driver, and
    	running the .8 algorithm on the spot.  Very important for certain
    	real-time devices.
    
    [2] By periodic inspection of error counts, from a layered product.
    
    I wonder if .8 is sensitive to which approach is used.  It would
    seem not to matter too much.  From a practical standpoint, option
    [2] is really the only one to use "systemwide", inasmuch as device
    drivers are written by disparate groups and have disparate error
    characteristics, often unknown when the driver is first written.

    If one periodically inspects the error count (say every t1 units)
    and observes 0 errors, then everything is fine ([1] == [2]).
    
    If one observes N > 0 errors per interval, then arbitrarily assign
    event times uniformly within the interval and evaluate E.v N times,
    once for each error. 
    
      (a) If E.v is greater than the error threshold, CCCL.
    
      (b) If E.v is less than the error threshold, and N > 1,
	  then reduce the sample interval t1 to something shorter.
    	  The notion here is that you MAY have had a burst above
    	  the acceptable values, but because of a long t1 you did not
    	  get a chance to see it.  This is the case where [1] is more
    	  accurate than [2].  Reduce your sampling interval to obtain
    	  a more accurate estimate of the time of each error.

      This leaves open the question of what the minimum sampling interval
      should be, but if errors are THAT frequent you might want to cluster
      errors into groups of M instead of trying to deal with them individually.
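
      A sketch of the adapter this describes (Python; `tracker` stands
      for any of the per-error schemes in earlier replies, reduced here
      to a hypothetical object with an on_error(now) method):

```python
def evaluate_sample(tracker, interval_start, interval_len, n_errors):
    """Periodic-sampling adapter: spread the N errors counted in this
    interval uniformly over it and feed them to a per-error tracker."""
    tripped = False
    for i in range(n_errors):
        # error i assigned an evenly spaced time within the interval
        t = interval_start + (i + 1) * interval_len / (n_errors + 1)
        tripped = tracker.on_error(t) or tripped
    return tripped              # True => some evaluation crossed threshold
```

      Shrinking the sampling interval when N > 1 but nothing trips, as
      suggested in (b), would be handled by the caller.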

        John
869.10. "fixed time sampling actually simplifies" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Tue May 10 1988 13:46
re: < Note 869.9 by POOL::HALLYB "Lookit all the happy creatures dancin' on the lawn" >
                -< Periodic sampling vs. real-time processing >-

    The approach [1] has one big advantage, the minimum delay time 
    between when the device goes BAD and when CCCL is executed.  This
    advantage might be enough to motivate DEC to find a way to exploit
    it.  For example, the device driver may respond to an error by calling
    back to the exec which has the code and tables to compute .8.  And
    some facility could be provided to tune these tables as error behavior
    became known.
    
    If we use method 2, then the same ideas apply but the math is
    different.  It is roughly equivalent to that suggested in .9, but
    somewhat simpler.  And we have to face up to our ignorance of the 
    burstiness of the errors.
    
    We need to replace the fractions in .8 with fractions like
    
    	P( k | BAD )
    	------------
    	P( k | GOOD )
    
    where k is the error count in the interval, and is allowed to run
    from 0 up to n.  If the error count is greater than n, assume the
    device is BAD and CCCL.  The value of n can be computed from the
    cost fraction and the lower bound on the fraction

    	P( BAD )
    	--------
    	P( GOOD )
    
    So the program keeps a table of the scaled logarithms of 

	P( k | BAD )
    	------------
    	P( k | GOOD )
    
    (some positive and some negative), uses the actual count to index
    into the table, and adds the value as before.  Now there is no 
    separate subtraction for time, because all time intervals are the 
    same, so it is built into the fraction above.
    
    All the work I've seen calculates these fractions by assuming a
    Poisson process, with no burstiness.  If you really want to include
    burstiness, you would have to observe the device for a while and
    derive the fractions empirically.  Assuming a Poisson process, the
    program could calculate the fraction on the fly, but I'd guess that
    storing a table is better unless n is large.
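
    A sketch of the table approach under the Poisson assumption (Python;
    the good/bad rates, floor, and threshold are illustrative, and the
    scaling transform from .8 is omitted in favor of raw logs):

```python
import math

def poisson_logratio_table(rate_good, rate_bad, n):
    """log of P(k|BAD)/P(k|GOOD) for k = 0..n, assuming the count of
    errors per sampling interval is Poisson with the given rates."""
    def log_p(k, lam):
        # Poisson log pmf: -lambda + k*log(lambda) - log(k!)
        return -lam + k * math.log(lam) - math.lgamma(k + 1)
    return [log_p(k, rate_bad) - log_p(k, rate_good) for k in range(n + 1)]

def on_sample(state, k, table, threshold, floor=0.0):
    """Add the table entry for this interval's count k; the floor keeps
    a long string of quiet intervals from building unlimited credit."""
    if k >= len(table):
        return True             # count beyond n: assume BAD, CCCL
    state["v"] = max(floor, state["v"] + table[k])
    return state["v"] > threshold
```

    Quiet intervals (k = 0) contribute a negative entry, busy intervals a
    positive one; no separate subtraction for elapsed time is needed
    since every interval is the same length.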

>      This leaves open the question of what the minimum sampling interval
>      should be, but if errors are THAT frequent you might want to cluster
>      errors into groups of M instead of trying to deal with them individually.

    The interval is basically determined by the tradeoff between the
    cost of computing versus the cost of running with a BAD device.
    Rather than estimate these probably unknowable costs, I'd hazard
    a guess that the sampling interval should be chosen so that the
    whole error checking process took about 0.01% of CPU time.
869.11. "Fixed-time does seem workable" by 19763::HALLYB (Lookit all the happy creatures dancin' on the lawn) Wed May 11 1988 13:37
>>      This leaves open the question of what the minimum sampling interval
>>      should be, but if errors are THAT frequent you might want to cluster
>>      errors into groups of M instead of trying to deal with them individually.
>
>    The interval is basically determined by the tradeoff between the
>    cost of computing versus the cost of running with a BAD device.
>    Rather than estimate these probably unknowable costs, I'd hazard
>    a guess that the sampling interval should be chosen so that the
>    whole error checking process took about 0.01% of CPU time.

    Yes, that seems reasonable.  The notion behind the grouping suggestion
    in .9 was that there are certain "errors" that actually should NOT be
    handled individually.  Various Ethernet events come to mind, as do
    magtape read errors.  These have a high cost of computing, owing to
    their frequency of occurrence.  On the other hand, errors on the
    system disk probably deserve a lot of analysis (is it the same block
    as some earlier error? Same read head? Critical file? etc.).  Ditto
    for correctable memory reads.
    
    If you take the assumption that most sample intervals will typically
    only require updating of a table entry, then you can calculate the
    sampling interval needed to take some arbitrarily small fraction of
    CPU time.  I think that an interval of about 10 seconds meets the
    0.01% figure from .10.  At least for my local 8550.

      John
869.12. by CLT::GILBERT () Wed May 11 1988 13:50
    I've either forgotten, or missed something.  Instead of sampling
    the numbers at regular time intervals, why not do the processing
    (and possibly the CCCL) when an error actually occurs?
869.13. "We're not writing the driver here" by POOL::HALLYB (Lookit all the happy creatures dancin' on the lawn) Wed May 11 1988 18:02
> Why not do the processing (and possibly the CCCL) when an error actually occurs?

    Because THAT code was written by somebody else, all you have is a
    binary device driver, and all you see is an error count occasionally.

    Or perhaps your device handles errors on its own and doesn't bother
    to tell you until you explicitly inquire about errors.

      John
869.14. "Thresholding" by ELWD2::CHINNASWAMY () Wed May 18 1988 16:14
    
    If the thresholding is used to identify a problem with the device,
    then usage, along with time, must also be taken into account.
    I presume that the preceding discussions assume constant usage of
    the device.
    
    For instance, if you don't use a disk, you may not see an error.
    
869.15. "hidden humor in .14 made explicit" by CLT::GILBERT () Wed May 18 1988 18:46
    Q:  "My disk has a high error rate.  What should I do?"
    A:  "Do disk operations less frequently."