Title: | Mathematics at DEC |
Moderator: | RUSURE::EDP |
Created: | Mon Feb 03 1986 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 2083 |
Total number of notes: | 14613 |
I am in search of some algorithms for thresholding and filtering the data found in VMS error logs. Let me first give you some history and then my thoughts on some algorithms. I solicit you to (1) comment on my suggestion of modeling the error stream from a device as an autoregressive process and (2) suggest some other models.

Currently FM CSSE supports the product VAXsimPLUS, which attempts to predict failures in various devices and isolate the FRU (field replaceable unit). VAXsimPLUS is sort of a prototype or proof of concept, which means that many aspects are suboptimal. Consequently we are investigating the implementation of a more definitive product.

VAXsimPLUS, as you might expect, is layered on VAXsim. VAXsim does straight thresholding, no analysis. It keeps a tally of hard, soft, and media errors for each device based on what passes through the error log mailbox. When a margin is exceeded in VAXsim, SPEAR (a pseudo expert system) is spawned to analyze the error log. Based on the results of the analysis and the cluster configuration, VAXsimPLUS might initiate a shadow copy to save the customer's data. It will also send a theory number to the FIELD account and a message to the system manager to call FS (Field Service).

The term "margin" has a specific meaning in the context of VAXsimPLUS. The actual threshold which triggers analysis via SPEAR is an arithmetic function of time, the error count, and the margin. Specifically, the margin is an integer, typically around 5 or 10. There is a hard, soft, and media margin for each device type. Initially, the threshold is equal to the margin. Each time a device crosses threshold and SPEAR analysis is triggered, the threshold is doubled. After 24 hours, the threshold is set back to the value of the margin. (A sketch of this scheme appears below.)

Well, well... What do you think of this two-step process where the second stage is a (pseudo) expert system? (I say pseudo because it's just a bunch of compiled BLISS statements - we cannot use something like PROLOG because we don't want the competition to reverse engineer our algorithms easily.)

We've been contemplating the notion of implementing the successor to SPEAR analysis as an automaton. (I wonder if we could use GALLILEO or YACC to generate this automaton?) I envision a state machine that just sits out there indefinitely monitoring the stream of errors that are en route to the error log. This would be considerably less compute (and I/O) intensive than occasionally invoking an expert system that would read the last 20 megatons of error log files to figure out what is wrong.

If we were to use an automaton, do we need some notion of time, which the VAXsim margining currently provides? To answer this question, we need to understand how devices fail. I believe VAXsimPLUS was implemented around the notions (1) that the error rate increases exponentially as a function of time once a device starts to fail and (2) that it is sufficient to perform notification (i.e., send mail to Field Service and the system manager) about once a day; a greater frequency is undesirable. My thought is that if we use a finite state automaton, we only perform notification when we enter a new failure state. Since the cost of invoking an expert system is no longer an issue, we won't have to worry about performing notification too often. The problem with using a finite state automaton is with history: we might have to preserve state across system crashes.
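As a concrete illustration, here is a minimal sketch (in Python, with invented names) of the margining scheme described above. It is not the VAXsim implementation; in particular, whether the error tally resets along with the threshold after 24 hours is an assumption, since only the threshold reset is described.

```python
# Minimal sketch of the margin/threshold scheme described above.  All names
# are invented for illustration; this is not the VAXsim implementation.
import time

class ErrorMargin:
    """One counter per error class (hard, soft, or media) per device."""

    RESET_SECONDS = 24 * 60 * 60           # after 24 hours the threshold returns to the margin

    def __init__(self, margin=5):           # margin is typically around 5 or 10
        self.margin = margin
        self.threshold = margin              # initially, threshold == margin
        self.count = 0
        self.window_start = time.time()

    def record_error(self, now=None):
        """Tally one error; return True when SPEAR-style analysis should be triggered."""
        now = time.time() if now is None else now
        if now - self.window_start >= self.RESET_SECONDS:
            self.threshold = self.margin     # daily reset of the threshold
            self.count = 0                   # assumption: the tally restarts with the window
            self.window_start = now
        self.count += 1
        if self.count >= self.threshold:
            self.threshold *= 2              # threshold doubles after each crossing
            return True
        return False
```

A caller would keep one of these per device and error class, and spawn the analysis step (and notification) whenever record_error returns True.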
We would also have to deal with the fact that any one VAX node in a cluster might not see all the errors any given device generates.

The other issue is (presumably noise) filtering, which is a related (perhaps redundant) question: how do we decide what information in the error log we want to look at and what we want to ignore? Currently I'm taking a class in adaptive filters, and I wonder if there are any possible applications of adaptive filter theory here. Can you consider the error stream (log) wide-sense stationary? Probably not, in light of the failure patterns of devices.

I wonder if you could model the error stream for any specific device as an AR (autoregressive) process. AR processes have the form:

    v(n) = a0 u(n) + a1 u(n-1) + a2 u(n-2) + ...

where v(n) is the white noise process, u is some signal, and "a" is the vector of AR coefficients. You can take a white noise sequence as input to an AR process and produce a certain signal "u", or run "u" through the corresponding inverse (whitening) filter to get a white noise sequence back. The latter is extremely useful because you can subtract the noise sequence output by the filter from the corrupted signal ("u") to get the original signal. If you use an adaptive filter (one that alters the "a" vector on the fly), we can better accommodate non-stationary signals.

If it is useful to model the stream of hard, soft and media errors for a device as an AR process, how do we (mathematically) define noise, and just what would the "original signal" represent?

This is also posted in the algorithms conference.
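To make the adaptive-filter idea a bit more concrete, here is a toy least-mean-squares (LMS) adaptive predictor, which is one standard way to adapt the "a" vector on the fly; it is not anything VAXsim-specific, and the input series, filter order, and step size below are arbitrary placeholders.

```python
# Toy LMS adaptive predictor: adapts the "a" vector on the fly and emits the
# prediction residual, which plays the role of the white-noise process v(n).
# Filter order and step size are arbitrary guesses for illustration.
import numpy as np

def lms_whiten(u, order=4, mu=0.01):
    """Return the residual of an adaptively fitted AR-style predictor of u."""
    u = np.asarray(u, dtype=float)           # e.g. per-interval error counts
    a = np.zeros(order)                      # adaptive coefficient vector
    residual = np.zeros(len(u))
    for n in range(order, len(u)):
        window = u[n - order:n][::-1]        # u(n-1), u(n-2), ..., u(n-order)
        prediction = a @ window
        e = u[n] - prediction                # innovation; ~ white noise if the model fits
        a += mu * e * window                 # LMS update of the coefficients
        residual[n] = e
    return residual

# One possible use: threshold on a sustained rise in |residual| rather than on
# the raw error counts, so slow "normal" drifts get absorbed by the filter.
```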
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
1400.1 | I'd suggest Bayesian Analysis | CSSE::NEILSEN | Wally Neilsen-Steinhardt | Fri Mar 22 1991 14:17 | 62 |
One preliminary word of warning: VAXsim and especially SPEAR were based on a lot of real-world knowledge about how devices fail. You should be sure that any replacement product does not lose all that real-world knowledge. What follows here, for example, is rather abstract and would have to be modified somewhat to make a better fit with the real world.

I seem to remember discussing this years ago with the SPEAR folks, or perhaps it was in this conference. Anyway, ...

Look at this as a pair (at least) of decisions to be made by your tool: should it ring an alarm bell or not? What FRU (or FRUs) should it advise the FSE to replace?

A simple and fairly general decision algorithm is based on Bayesian probability analysis. What follows is a simplified application of this analysis to your case.

Assume a device can be characterized by a single "quality" variable Q(t), which represents its ability to perform at time t. For convenience, scale it so that Q=1 means perfect success, and Q=0 means perfect failure. It would be nice to measure Q directly, but we will assume that that is impossible, and we are forced to infer Q from an error history.

Represent the error history E as a sequence of the n most recent successes and failures, and call a specific error history Ei. We want to compute Q from Ei. First we compute P( Ei | Q ), the probability of seeing sequence Ei given Q. Then we must assign P( Q ), the prior probability of Q. This is our estimate, before we look at Ei, of the probability that the device is in state Q. We can get this from previous experience with devices of this type or from the earlier history of this specific device. Then we can compute the probability of Q using Bayes' Theorem:

    P( Q | Ei ) = P( Ei | Q ) * P( Q ) / SUM ( P( Ei | Q ) * P( Q ) )

The denominator is a normalizing factor, and if Q is a continuous variable then it becomes an integral.

If we assume that Q is constant, that Ei is generated by a Bernoulli process, and that our prior probability can be represented using one of the family of beta functions, then the sum and quotient become very simple. P( Q | Ei ) must also be a beta function, and there is a simple relation between the parameters of the prior (before Ei) beta function and the posterior (after Ei) beta function.

Once you have the beta function, you can compute its mean, variance, and suitable confidence intervals about the mean. You can also combine it with a utility function, which measures the utility of various decisions by your tool, to generate a decision rule for the tool. This decision rule will probably be a simple counting rule, based on something like the count of failures in the last n tries. You will definitely need something like a count of tries or its surrogate, a time interval, to scale the count of failures.

Of course, your Q is by assumption not constant, so your Bayesian analysis would be more complex. One way to deal with this is to assume that Q changes in step functions, with a low probability of changing during the sample Ei. Then you compute the current Qi based on the current Ei and the previous Q(i-1), allowing for the small probability that Q has changed. Another way to deal with it is to put the empirical form of Q(t) into your calculations of P( Ei | Q ).

That was a very quick and sloppy overview of the direction I would recommend to you. Let me know if you want to follow up on it. I think you will end up with an automaton, but perhaps a very simple one, and perhaps the same one used by VAXsim.
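Here is a bare-bones sketch of the conjugate beta/Bernoulli update described in this reply. The prior parameters, the alarm rule, and all names are placeholders, not recommendations; scipy is used only for the beta distribution's CDF and mean.

```python
# Bare-bones beta/Bernoulli update as described above.  Prior parameters,
# the alarm rule, and all names are illustrative placeholders.
from scipy import stats

def update_quality(prior_alpha, prior_beta, successes, failures):
    """Conjugate update: beta prior on Q, Bernoulli error history Ei."""
    return prior_alpha + successes, prior_beta + failures

def should_alarm(a, b, q_floor=0.95, confidence=0.90):
    """Ring the bell if we are at least `confidence` sure that Q < q_floor."""
    return stats.beta.cdf(q_floor, a, b) >= confidence

# Example: a mildly optimistic prior, then an error history of 47 successes
# and 3 failures in the last 50 tries.
a, b = update_quality(prior_alpha=20, prior_beta=1, successes=47, failures=3)
print(stats.beta.mean(a, b))      # posterior mean of Q
print(should_alarm(a, b))         # whether to notify Field Service
```

The fixed confidence cutoff here stands in for the utility-function step; a real tool would presumably weigh the cost of a false alarm against the cost of a missed failing device.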