T.R | Title | User | Personal Name | Date | Lines |
---|
869.1 | Close enough. | SNO78C::BRADLEY | G'day - Mark. | Tue May 03 1988 01:36 | 32 |
| You could store a time and a count of errors and approximate this result.
You just let the machine assume that the errors you have recorded
were evenly distributed between the recorded time and the time now.
Try this pseudocode:
    cnt  := 0
    time := 0

    start of error routine:
    check:
        if cnt = 0 then
            cnt  := 1
            time := now
        else if now - time > period of interest then
            time := time + (now - time) / cnt   ! age one error out of the window
            cnt  := cnt - 1
            go to check                         ! re-test with one fewer error
        else
            cnt := cnt + 1
        if cnt > number then error count exceeded
This would not give exact results, but it would use much less storage.
If you are storing 8-byte date/time stamps, the obvious method would
use 32 bytes; this way you use 8 bytes plus a counter (and the counter
need only be a few bits).
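For concreteness, here is roughly what that looks like in C. The names
(error_seen) and the constants (a 200-tic period, a limit of 5) are only
examples, not anything from a real driver:

    #define PERIOD 200L        /* "period of interest", in tics (example) */
    #define LIMIT  5           /* "number" of errors allowed (example)    */

    static long base_time = 0; /* "time": start of the assumed-even spread */
    static int  cnt       = 0; /* errors assumed spread since base_time    */

    /* Call once per error with the current time in tics.
       Returns 1 when the error count is exceeded.          */
    int error_seen(long now)
    {
        for (;;) {
            if (cnt == 0) {                    /* first error of a new run  */
                cnt = 1;
                base_time = now;
                break;
            }
            if (now - base_time > PERIOD) {
                base_time += (now - base_time) / cnt;  /* age one error out */
                cnt -= 1;                              /* ... and re-test   */
            } else {
                cnt += 1;                      /* error lands in the window */
                break;
            }
        }
        return cnt > LIMIT;
    }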
|
869.2 | Record (start-time, error-count) pairs | POOL::HALLYB | You have the right to remain silent. | Tue May 03 1988 11:47 | 40 |
| I'm trying to solve the same problem and have a couple of ideas.
Perhaps this is equivalent to .1, but I prefer to develop solutions
that acknowledge that errors tend to occur in bursts, not uniformly.
(1) Save ABSTIM tics (4 bytes) instead of a VMS timestamp. Cleaner,
and still convertible to absolute date/time if you need it.
(2) More to the point, let's say we observe a disk error at time
t = 123456 (tics). We keep only 1 entry for this entity, viz:
123456 1 (anywhere from 4 to 8 bytes, depending
on how you pack it).
When a second error occurs, we look to see if it is in the same
time-threshold, say 2 seconds or 200 tics. Suppose ABSTIM is now
124816. Since 124816 - 123456 = 1360 > 200, we "discard" the
prior data and save only the most recent:
124816 1 Only 1 error in this period of interest
Suppose an error occurs very soon after, at 124832. We now have:
124816 2 2 errors, no update to time base
And errors continue to show up at 124847, 124900, 124909 when
we have:
124816 5 5 errors, 124909 - 124816 = 93 < 200
At this point you have had 5 errors in the last 930 milliseconds,
which indicates a threshold has been crossed. Issue a CCCL
("Complain and Clear Counters Longword" -- try saying that 5 times
in 2 seconds!) and start over again.
What you are doing here is "starting" a 2-second (or whatever)
interval on the first new error. Now if you get say 4 errors
in .1 seconds, your error rate is 40/sec but this method won't
detect that unless you get a fifth error. This is a feature.
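A rough C sketch of this bookkeeping, with made-up names and the
200-tic / 5-error numbers from the example above:

    #define WINDOW    200L   /* "period of interest", e.g. 2 seconds of tics */
    #define MAX_ERRS  5      /* errors allowed inside one window             */

    static long win_start = 0;   /* tic of first error in current window */
    static int  win_count = 0;   /* errors seen since win_start          */

    /* Call on each error with the current ABSTIM tic count.
       Returns 1 when the caller should "CCCL" (complain and clear). */
    int note_error(long now)
    {
        if (win_count == 0 || now - win_start > WINDOW) {
            win_start = now;     /* start a fresh window at this error */
            win_count = 1;
        } else {
            win_count += 1;      /* same window: bump the count only   */
        }
        if (win_count >= MAX_ERRS) {
            win_count = 0;       /* clear counters; caller complains   */
            return 1;
        }
        return 0;
    }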
John
|
869.3 | | SSDEVO::LARY | | Tue May 03 1988 14:15 | 26 |
| One technique that uses very little storage is the "exponential decay"
threshold. You can do this in as little as one or two bytes per event type.
It goes something like this:
Initialize a variable V to zero.
Every time the event you are thresholding occurs, add a constant C to V.
Every second (or any other appropriate period) multiply V by an aging
factor 0 < A < 1.
If V ever exceeds some value T (T > C), declare a problem.
The action of this algorithm is very similar to a circuit breaker - it
will trip on a very high occurrence rate for a short time, or a somewhat lower
occurrence rate over a longer time. Below an occurrence rate of T(1-A)/C
events per period it will never trip.
The idea behind this is reasonable - events acquire less significance as
they recede into the past - but many people don't like this method because
it's hard to put the meaning of the threshold into English, and so you can
have trouble picking the values of A, C, and T that do what you want.
It also requires some compute time in the absence of any events to do
the aging, and to keep it numerically accurate you want to keep the period
small enough (and A large enough) that T >> C. I would only use it if saving
memory space were really important.
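A rough C sketch of this, with made-up values for A, C and T:

    #define DECAY_A   0.9    /* aging factor per period, 0 < A < 1 (example) */
    #define EVENT_C   1.0    /* added to V on each event (example)           */
    #define TRIP_T    5.0    /* trip level, T > C (example)                  */

    static double v = 0.0;

    /* Call once per aging period (e.g. from a once-a-second timer). */
    void age_counter(void)
    {
        v *= DECAY_A;
    }

    /* Call on each event; returns 1 when the "breaker" trips. */
    int count_event(void)
    {
        v += EVENT_C;
        return v > TRIP_T;
    }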
|
869.4 | | CLT::GILBERT | | Tue May 03 1988 15:11 | 28 |
| Replies 869.2 and 869.3 are actually fairly similar, especially
if you defer Richie's aging process until an event actually occurs.
For each kind of event E, you have two pieces of dynamic information:
E.v value or variable
E.t time associated with the value
and also some less dynamic information:
E.b a 'bound' or threshold
E.f a function, which incorporates the A and C constants in .3,
and/or the interval in .2.
When an event occurs, you compute:
E.v' <- 1 + E.f ( E.v, (current_time - E.t) )
E.t' <- current_time (actually 869.2 leaves this unchanged if E.v>0)
if E.v' > E.b then {report threshold exceeded}
The E.f function degrades the importance of the E.v value according
to the size of the second parameter, for example:
E.f(a,b) = a/(K*b),
or
E.f(a,b) = max(0, a-R*b),
or
E.f(a,b) = a*(D**-(K*b))
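For what it's worth, a rough C rendering of such a per-event record, with
the aging function passed in so that any of the example E.f forms above can
be plugged in. All the names and the rate constant are made up:

    struct event_rec {
        double v;                        /* E.v: current value        */
        long   t;                        /* E.t: time of last update  */
        double b;                        /* E.b: threshold            */
        double (*f)(double v, long dt);  /* E.f: aging function       */
    };

    /* One of the example forms above: E.f(a,b) = max(0, a - R*b). */
    static double decay_linear(double v, long dt)
    {
        const double r = 0.01;           /* assumed decay per tic */
        double aged = v - r * (double)dt;
        return (aged > 0.0) ? aged : 0.0;
    }

    /* Call on each occurrence of the event; returns 1 if E.b is exceeded. */
    int record_event(struct event_rec *e, long now)
    {
        e->v = 1.0 + e->f(e->v, now - e->t);
        e->t = now;
        return e->v > e->b;
    }

You would initialize one event_rec per event kind, pointing f at whichever
aging function (linear, exponential, or interval-based) suits the situation.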
|
869.5 | Some details on exponential decay. | PBSVAX::COOPER | Topher Cooper | Tue May 03 1988 16:37 | 81 |
| I got beaten to the punch but ...
Imagine that you had an unlimited amount of special memory with the
following property: when you put a (real) value in it, it decays
exponentially, i.e., if you put a 1.0 in it, after some fixed amount
of time HL, a read would get you .5, time HL later you would get .25
and so on. Now let's say that each time you get an error you set one
of these special decaying memories to a value, I, and add it to
the list of "active" values. Then you look at the sum of all the
values on the list, and if it exceeds a threshold value, T, then
you trigger an action.
The characteristics of this are that occasional errors would have
no effect -- their values would decay effectively to zero (relative
to the threshold) and be lost. Several errors in quick succession
would, on the other hand, quickly exceed the threshold and trigger
an action. A moderate rate of errors for a somewhat longer time
period would also trigger an action.
A method mathematically equivalent to this technique can be implemented
in a number of different ways.
The most general is to keep (for each kind of error) two values: LT
(the last time-stamp) and V (the current value). When an error occurs
do the following:
DT := time_now - LT;
LT := time_now;
V := V*C1^DT + I;
if V > T then ...ACTION...
"I" can be any real value, 1.0 being convenient. C1 is a constant
equal to 0.5^(1/HL). Here "^" means raise to the power. Note that
we are raising a real value (C1) to what is potentially an integer
power. There are some pretty fast routines for doing this (see
Knuth) directly, but it is not clear whether it is better (faster,
easier, more convenient) for you to do this that way or to convert
it to a floating point number and to use a general exponentiation
library routine. In the latter case, it is probably better to
replace the third line above with:
V := V*e^(DT*C2) + I;
Where C2 = log(C1) = -log(2)/HL, and e is the base of the natural
logarithms. The reason that this is preferred is that raising e to
an arbitrary power is usually faster than raising an arbitrary
value to an arbitrary power (the latter is generally done by taking
the logarithm-base-e of the value, multiplying by the power, then
raising e to the result. Here we've simply "precompiled" the first
part of this process).
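In C this boils down to something like the following; HL, I and T here
are just example values:

    #include <math.h>

    #define HALF_LIFE  100.0   /* HL, in tics (example) */
    #define INCREMENT  1.0     /* I                     */
    #define THRESHOLD  5.0     /* T                     */

    static double v  = 0.0;
    static long   lt = 0;      /* LT: last time-stamp, in tics */
    static const double c2 = -0.6931471805599453 / HALF_LIFE;  /* -ln(2)/HL */

    /* Call on each error with the current time in tics;
       returns 1 when the decayed sum exceeds T.             */
    int decaying_count(long now)
    {
        long dt = now - lt;
        lt = now;
        v = v * exp((double)dt * c2) + INCREMENT;
        return v > THRESHOLD;
    }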
If your program structure includes a basic loop which can be treated
as occurring in constant time much smaller than the expected time
between errors (for any particular kind of error), then a simpler
scheme can be used:
LT is no longer needed, only V for each kind of error. Each time
through the loop multiply every V by a constant W (whose value I
will discuss in a moment). When an error occurs, simply add 1.0
to the appropriate V and compare it to T. That's all.
Treat the average time through the loop as your unit of time, and
assume HL (the half-life) is expressed in terms of it, i.e., HL
is the number of times through the loop after which an error is
discounted 50% in effect. Now W is simply the same as C1 above,
i.e., 0.5^(1/HL).
How would you decide on HL and T? Experience is best, but you
can get a first cut by trying to answer the following questions.
If a series of the same error occurred in very rapid succession,
how many should occur before I trigger a response? The answer
is T. The second question is a bit trickier: Imagine that two
errors occur one after the other (assuming that T>2.0) and then
nothing happens for a while. Then T-1 errors occur in rapid
succession. For how long a time interval would you still be willing to
accept the initial blip as probably relevant/related to the later,
bigger burst of errors? The maximum such time is HL.
Hope this helps.
Topher
|
869.6 | How many & How often? | SNO78C::BRADLEY | G'day - Mark. | Tue May 03 1988 21:40 | 16 |
| Depending on how often the events occur and how much CPU you are willing to
use, you need only keep one time stamp with the decay method above. This
would force you to update all values every time you updated any one, so the
overhead would be extremely high for frequent errors.
At each invocation the error-checking code would decay all error values
by an amount proportional to the time since the last invocation, then
increment and check the appropriate one.
This would be reasonable for low error rates, or if the number of different
errors being tracked was small enough. But I suspect the latter is not the
case or you wouldn't have a memory problem.
Cheers,
Mark.
|
869.7 | Lowpass filter analogy | 16184::ROTH | If you plant ice you'll harvest wind | Wed May 04 1988 09:00 | 16 |
| An alternative useful point of view could be that of "low pass filtering"
of a sequence of impulses (events). The exponential decay method could
be represented as a simple RC lowpass filter; since the output of the
filter increases upon application of an impulse, it is only necessary
to update a simulation of the filtering when any event occurs.
Though nothing really new is added to the other replies by this analogy,
it may give another source of ideas on what an appropriate filtering
function may be to fit the situation. This is an old problem in the
design of communications gear, in the form of 'noise blankers' and the
like.
A fast table lookup (requiring memory, alas) could speed the function
evaluations.
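For instance, the decay factors could be precomputed once and indexed by
elapsed tics instead of calling exp() on every event (the table size, the
half-life and the names here are made up):

    #include <math.h>

    #define TABLE_SIZE 1024    /* beyond this, treat the value as fully decayed */
    #define HALF_LIFE  100.0   /* in tics (example)                             */

    static double decay_tab[TABLE_SIZE];

    void init_decay_table(void)
    {
        int dt;
        for (dt = 0; dt < TABLE_SIZE; dt++)
            decay_tab[dt] = pow(0.5, (double)dt / HALF_LIFE);
    }

    double decay_factor(long dt)
    {
        if (dt < 0) dt = 0;
        return (dt < TABLE_SIZE) ? decay_tab[dt] : 0.0;
    }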
- Jim
|
869.8 | a rule based on Bayes' Theorem | PULSAR::WALLY | Wally Neilsen-Steinhardt | Fri May 06 1988 10:36 | 205 |
| There are a lot of rules which can be put into the form given in .4.
I'd like to present one rule, give a general way of evaluating rules,
and then show why the one I have given is a good one, although not
necessarily the best.
The rule (using the notation of .4):
on startup, assign a constant to E.v, and current time (in tics)
to E.t
on a failure, subtract elapsed time from E.v, maximize with
zero, and add a constant. This means that E.f is of the
form:
E.f( a, b ) = max ( 0, a - b ) + c
if E.v is greater than a threshold, CCCL (the instruction
defined in .2)
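Before going on, here is a rough C sketch of that rule. The constants are
only placeholders; what they ought to mean comes out of the derivation below:

    #define V_START   0L     /* E.v at startup (scaled log of the startup odds) */
    #define FAIL_ADD  300L   /* added per failure (scaled log-likelihood ratio) */
    #define BOUND     1400L  /* threshold (scaled log of CGOOD/CBAD)            */

    static long ev;          /* E.v                     */
    static long et;          /* E.t, last failure, tics */

    void rule_init(long now)
    {
        ev = V_START;
        et = now;
    }

    /* Call on each failure; returns 1 when CCCL should be issued. */
    int rule_failure(long now)
    {
        ev -= (now - et);        /* each elapsed tic counts as one "success" */
        if (ev < 0) ev = 0;      /* bounded below at zero, per the rule      */
        ev += FAIL_ADD;
        et = now;
        return ev > BOUND;
    }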
Evaluating rules:
There are two general approaches to evaluating rules: craft tradition
and analysis. Craft tradition often gives excellent results, in which
case analysis is superfluous. But since the question is asked in .0,
I'll assume that craft tradition is inadequate here, and look to
analysis.
Decision theory is the basis for this analysis (_Statistics_, Winkler
and Hays, Holt, NY, 1975). We would like to minimize the costs of
making the decision, and also the cost of making errors. The goal in
this problem is to minimize the sum of the following costs, all measured
per unit time to keep them comparable:
computation
storage
delay while statistics accumulate
incorrect shutdown signals
incorrect stayup signals
In principle, we can evaluate any rule by determining its costs in the
categories above. Then the best rule is the one with the lowest
cost.
A good rule should have a few other characteristics:
It should have parameters which allow us to tailor it to
specific situations.
It should provide some guidance to setting the parameters based
on our understanding, to avoid pure guesswork.
It should be easy to understand and to explain.
In practice, it is often easier to look for a rule which meets
the criteria above and allows us to estimate the cost.
So I'll derive a rule from decision theory and show that it is almost
identical to that given above.
Assume that we are monitoring a unit which may exist in two states GOOD
and BAD. In state GOOD it shows a low but non-zero probability of
failure. In state BAD it shows a high but non-unit probability of
failure. This analysis can be extended to many states but it is messy.
Assume further that we can take two actions: NOP or CCCL. NOP just
means continue using the unit and monitoring, CCCL means take it off
line and hoist the repair flag.
We can tabulate the costs in the following table:
                 GOOD       BAD
      NOP          0        CBAD
      CCCL       CGOOD        0
CBAD is the cost of using the unit even though it is bad, and CGOOD
is the cost of not using the unit when it is still good. If P(S)
is the probability that the unit is in state S, and EC(A) is the
expected cost of the action A, then
EC(NOP) = 0 * P(GOOD) + CBAD * P(BAD)
EC(CCCL) = CGOOD * P(GOOD) + 0 * P(BAD)
At a given time, we want to take the action which has the lowest
expected cost. At the crossover point, the two expected costs are
equal, which leads to the inequality of ratios
P(BAD) CGOOD
------ > -----
P(GOOD) CBAD
and the decision rule that we take action NOP unless this inequality is
satisfied.
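(To make that concrete with made-up numbers: if taking a GOOD unit off
line costs CGOOD = 1 and running with a BAD one costs CBAD = 9, the
right-hand side is 1/9, so we issue CCCL only once the evidence pushes
P(BAD)/P(GOOD) above 1/9 -- that is, P(BAD) above 10%.)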
Now assume that we observe the system and that the outcome of our
observation has two states: SUCCESS and FAILURE. Again we can
generalize to more than two states, but I'll pass on that. Then the
observation can be used to modify the P(S) using Bayes' Theorem:
P(BAD|SUCCESS) P(SUCCESS|BAD) P(BAD)
-------------- = -------------- * ------
P(GOOD|SUCCESS) P(SUCCESS|GOOD) P(GOOD)
P(BAD|FAILURE) P(FAILURE|BAD) P(BAD)
-------------- = -------------- * ------
P(GOOD|FAILURE) P(FAILURE|GOOD) P(GOOD)
and these formulas can be used iteratively: every time you get a success
use the first formula to recompute the probabilities, and every time you
get a failure, use the second formula.
The iteration needs to be bounded, so that a long string of successes
does not convince us that the device is perfect. We put a limit on
P(BAD| long string of successes )
---------------------------------
P(GOOD| long string of successes )
to be the probability ratio indicating the unit will go BAD, even though
it has been GOOD up to now.
We start the iteration at
P(BAD| startup )
----------------
P(GOOD| startup )
which is the probability ratio indicating that the unit is BAD on startup.
We terminate the iteration when the probability ratio satisfies the
inequality above.
This gives us a good rule, except that it requires a lot of
multiplication and forces us to deal with some very small numbers.
We can convert to addition and deal with reasonable size numbers by
taking the logarithms of everything in sight. We can then take
advantage of the fact that our basic inequality is unaffected by a
linear transform
x' = d * x + e
of the logarithms. We choose e so that
P(BAD| long string of successes )
---------------------------------
P(GOOD| long string of successes )
has the value zero, and we choose d so that
P(SUCCESS|BAD)
--------------
P(SUCCESS|GOOD)
has the value of -1. This allows us to record a success by just
decrementing a counter. But we can save work by just saving the old
time in ticks and then when a failure occurs, subtracting the elapsed
ticks all at once.
Now observe that the rule at the top of this note is exactly the rule we
have derived. We can give a real meaning to each of the parameters in
the rule above. The threshold value is just the linear transform of the
logarithm of
CGOOD
-----
CBAD
The constant c is just the linear transform of the
logarithm of
P(FAILURE|BAD)
--------------
P(FAILURE|GOOD)
minus one, if I understand the notation of .4 correctly.
Now I will argue that the rule given above is a good one. It uses
little memory, about 8 bytes, although .2 uses 5 or 6. It uses little
computation: a bounded subtraction, an addition, and a compare per
failure. It has little delay while failures accumulate.
The size of the delay depends on the ratios above, but probably
corresponds to about five failures, in a realistic situation. By the
derivation, the cost of incorrect signals has been minimized. It has
parameters which can tailor it to a situation, and the parameters can be
chosen with a minimum of guesswork. It is fairly easy to explain and
understand.
I think there is a way to strengthen the argument above, by proving that
the statistic is a complete summary of the relevant information but I
don't remember how to do that. Anyway, I think it requires assuming a
Poisson process, and as .2 says, errors really come in bursts.
This rule is also closely related to the Sequential Probability Ratio
Test of Abraham Wald, although his reasoning process was different.
Finally, note that all these notes have assumed that failures per time
is a relevant statistic. In some cases, failures per trial (per packet
sent, per controller command, per memory reference, or whatever) may be
better. This approach can be applied to either situation, although
failures per trial requires that we decrement a counter bounded by zero,
once for each success.
|
869.9 | Periodic sampling vs. real-time processing | POOL::HALLYB | Lookit all the happy creatures dancin' on the lawn | Mon May 09 1988 14:51 | 39 |
| .8 looks like a really good approach. Thanks, Wally!
Looking things over, I note that there are two different ways one
might observe errors:
[1] By catching them on-the-fly, say within the device driver, and
running the .8 algorithm on the spot. Very important for certain
real-time devices.
[2] By periodic inspection of error counts, from a layered product.
I wonder if .8 is sensitive to which approach is used. It would
seem not to matter too much. From a practical standpoint, option
[2] is really the only one to use "systemwide", inasmuch as device
drivers are written by disparate groups and have disparate error
characteristics, often unknown when the driver is first written.
If one periodically inspects the error count (say every t1 units)
and observes 0 errors, then everything is fine ([1] == [2]).
If one observes N > 0 errors per interval, then arbitrarily assign
event times uniformly within the interval and evaluate E.v N times,
once for each error.
(a) If E.v is greater than the error threshold, CCCL.
(b) If E.v is less than the error threshold, and N > 1,
then reduce the sample interval t1 to something shorter.
The notion here is that you MAY have had a burst above
the acceptable values, but because of a long t1 you did not
get a chance to see it. This is the case where [1] is more
accurate than [2]. Reduce your sampling interval to obtain
a more accurate estimate of the time of each error.
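A rough C sketch of this wrapper, where note_error() stands in for
whichever per-error evaluator from the earlier replies is used, and all
the names and constants are made up:

    extern int note_error(long now);   /* returns 1 when the threshold trips */

    #define MIN_INTERVAL 10L           /* shortest sample interval, in tics  */

    /* Call every *interval tics with n = errors counted since last sample.
       Returns 1 if CCCL should be issued; may shrink *interval.             */
    int sample_errors(long now, long *interval, int n)
    {
        int i, tripped = 0;

        for (i = 1; i <= n; i++) {
            /* pretend the n errors were spread evenly over the interval */
            long t = now - *interval + (i * *interval) / (n + 1);
            if (note_error(t))
                tripped = 1;
        }
        if (!tripped && n > 1 && *interval > MIN_INTERVAL)
            *interval /= 2;    /* a burst may be hiding; sample more often */
        return tripped;
    }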
This leaves open the question of what the minimum sampling interval
should be, but if errors are THAT frequent you might want to cluster
errors into groups of M instead of trying to deal with them individually.
John
|
869.10 | fixed time sampling actually simplifies | PULSAR::WALLY | Wally Neilsen-Steinhardt | Tue May 10 1988 13:46 | 58 |
| re: < Note 869.9 by POOL::HALLYB "Lookit all the happy creatures dancin' on the lawn" >
-< Periodic sampling vs. real-time processing >-
| Approach [1] has one big advantage: the minimum delay time
between when the device goes BAD and when CCCL is executed. This
advantage might be enough to motivate DEC to find a way to exploit
it. For example, the device driver may respond to an error by calling
back to the exec which has the code and tables to compute .8. And
some facility could be provided to tune these tables as error behavior
became known.
If we use method 2, then the same ideas apply but the math is
different. It is roughly equivalent to that suggested in .9, but
somewhat simpler. And we have to face up to our ignorance of the
burstiness of the errors.
We need to replace the fractions in .8 with fractions like
P( k | BAD )
------------
P( k | GOOD )
where k is the error count in the interval, and is allowed to run
from 0 up to n. If the error count is greater than n, assume the
device is BAD and CCCL. The value of n can be computed from the
cost fraction and the lower bound on the fraction
P( BAD )
--------
P( GOOD )
So the program keeps a table of the scaled logarithms of
P( k | BAD )
------------
P( k | GOOD )
(some positive and some negative), uses the actual count to index
into the table, and adds the value as before. Now there is no
separate subtraction for time, because all time intervals are the
same, so it is built into the fraction above.
All the work I've seen calculates these fractions by assuming a
Poisson process, with no burstiness. If you really want to include
burstiness, you would have to observe the device for a while and
derive the fractions empirically. Assuming a Poisson process, the
program could calculate the fraction on the fly, but I'd guess that
storing a table is better unless n is large.
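A rough C sketch of building such a table under the Poisson assumption;
the rates, the scale factor and n are placeholders. Note that with
P(k|state) Poisson, log( P(k|BAD)/P(k|GOOD) ) = k*ln(rb/rg) - (rb - rg),
so no factorials are needed:

    #include <math.h>

    #define N_MAX     10       /* counts above this mean CCCL outright */
    #define RATE_GOOD 0.01     /* expected errors/interval when GOOD   */
    #define RATE_BAD  2.0      /* expected errors/interval when BAD    */
    #define SCALE     100.0    /* the linear transform's "d"           */

    static long score_tab[N_MAX + 1];

    void init_score_table(void)
    {
        int k;
        for (k = 0; k <= N_MAX; k++) {
            double llr = k * log(RATE_BAD / RATE_GOOD) - (RATE_BAD - RATE_GOOD);
            score_tab[k] = (long)(SCALE * llr);
        }
    }

    /* Add the score for this interval's count k to the running value v. */
    long update_score(long v, int k)
    {
        return (k > N_MAX) ? v + score_tab[N_MAX]   /* or CCCL immediately */
                           : v + score_tab[k];
    }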
> This leaves open the question of what the minimum sampling interval
> should be, but if errors are THAT frequent you might want to cluster
> errors into groups of M instead of trying to deal with them individually.
The interval is basically determined by the tradeoff between the
cost of computing versus the cost of running with a BAD device.
Rather than estimate these probably unknowable costs, I'd hazard
a guess that the sampling interval should be chosen so that the
whole error checking process took about 0.01% of CPU time.
|
869.11 | Fixed-time does seem workable | 19763::HALLYB | Lookit all the happy creatures dancin' on the lawn | Wed May 11 1988 13:37 | 26 |
| >> This leaves open the question of what the minimum sampling interval
>> should be, but if errors are THAT frequent you might want to cluster
>> errors into groups of M instead of trying to deal with them individually.
>
> The interval is basically determined by the tradeoff between the
> cost of computing versus the cost of running with a BAD device.
> Rather than estimate these probably unknowable costs, I'd hazard
> a guess that the sampling interval should be chosen so that the
> whole error checking process took about 0.01% of CPU time.
Yes, that seems reasonable. The notion behind the grouping suggestion
in .9 was that there are certain "errors" that actually should NOT be
handled individually. Various Ethernet events come to mind, as do
magtape read errors. These have a high cost of computing, owing to
their frequency of occurrence. On the other hand, errors on the
system disk probably deserve a lot of analysis (is it the same block
as some earlier error? Same read head? Critical file? etc.). Ditto
for correctable memory reads.
If you take the assumption that most sample intervals will typically
only require updating of a table entry, then you can calculate the
sampling interval needed to take some arbitrarily small fraction of
CPU time. I think that an interval of about 10 seconds meets the
0.01% figure from .10. At least for my local 8550.
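(As a sanity check on the arithmetic: 0.01% of a 10-second interval is
one millisecond, so the per-sample bookkeeping only has to average well
under a millisecond for that figure to hold.)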
John
|
869.12 | | CLT::GILBERT | | Wed May 11 1988 13:50 | 3 |
| I've either forgotten, or missed something. Instead of sampling
the numbers at regular time intervals, why not do the processing
(and possibly the CCCL) when an error actually occurs?
|
869.13 | We're not writing the driver here | POOL::HALLYB | Lookit all the happy creatures dancin' on the lawn | Wed May 11 1988 18:02 | 9 |
| > Why not do the processing (and possibly the CCCL) when an error actually occurs?
| Because THAT code was written by somebody else, all you have is a
binary device driver, and all you see is an error count occasionally.
Or perhaps your device handles errors on its own and doesn't bother
to tell you until you explicitly inquire about errors.
John
|
869.14 | Thresholding | ELWD2::CHINNASWAMY | | Wed May 18 1988 16:14 | 8 |
|
If the thresholding is used to identify a problem with the device,
then usage must also be taken into account along with time.
I presume that the preceding discussions assume constant usage of
the device.
For instance, if you don't use a disk, you may not see an error.
|
869.15 | hidden humor in .14 made explicit | CLT::GILBERT | | Wed May 18 1988 18:46 | 2 |
| Q: "My disk has a high error rate. What should I do?"
A: "Do disk operations less frequently."
|