T.R | Title | User | Personal Name | Date | Lines |
---|
131.1 | We make up to ten attempts before giving up | WAKEME::ANIL | | Tue May 15 1990 22:27 | 36 |
|
>> How does Alarms handle the situation when it cannot evaluate a rule at
>> its scheduled interval?
>>
>> It looks as though the situation is ignored (at least there are no user
>> visible indications that a poll was missed), and polling continues at
>> the next scheduled polling time in the future. Is this a correct
>> conclusion?
>>
The algorithm used by Alarms is as follows:
The rule is enabled for its scheduled time. If the IM (working as a
scheduler in the MCC Kernel) indicates that the scheduled time has already
passed, we reschedule the call. This loop runs no more than ten
times. If after ten attempts we fail to schedule the call, the rule gets
disabled.
In plain English, all this translates into "do your best to schedule the
call. If you miss some (i.e. up to ten consecutive) attempts, that's OK!
But do give up after ten attempts".
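The loop described above can be sketched roughly as follows. This is a minimal Python illustration, not the Alarms code: the `Rule` class and `schedule_next` are hypothetical names, and two details are assumptions, namely that each retry moves the target forward by exactly one period and that a target equal to the current time counts as already passed.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 10  # the limit quoted in the note; fixed here, not a per-rule parameter


@dataclass
class Rule:
    period: float      # seconds between scheduled evaluations
    next_due: float    # time of the most recently scheduled evaluation
    enabled: bool = True


def schedule_next(rule: Rule, now: float, max_attempts: int = MAX_ATTEMPTS) -> int:
    """Try to schedule the next evaluation; return how many attempts were rejected."""
    rejected = 0
    for _ in range(max_attempts):
        rule.next_due += rule.period      # assumed: each retry targets the next slot
        if rule.next_due > now:           # slot is still in the future: scheduled
            return rejected
        rejected += 1                     # stand-in for a "time already passed" status
    rule.enabled = False                  # every attempt failed, so the rule is disabled
    return rejected
```

With a one-second period and an evaluation that started at time 0 and finished at time 2.5, this sketch rejects the slots at times 1 and 2 and schedules the slot at time 3. The returned count is the number an error counter could be bumped by.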
Now is this the best approach? Not really. Some day we would like to
make the number of attempts a MGMT parameter of each rule, with 10
as its default value.
Now the question is should alarm bump the error count if "time already
passed" status comes back? The answer is yes! We may not be able to get
this done before EFT update, but we would like to try it for V1.0.
Does this sound okay to you? Let us know if we are on the right track!
Thanks,
- Anil
|
131.2 | So, if I understand this correctly... | DSTEG1::MCCANN | | Wed May 16 1990 10:34 | 41 |
| Ok. Let me see if I've got this right.
I have a rule whose expression I wish to have evaluated once every second.
However, say it takes 2.5 seconds to evaluate the expression (due to the
time it takes to go to the network and obtain the info required).
Would the flow of events look something like this:
TIME ACTION
---- --------------------------------------------
0 Rule begins scheduled evaluation
1 Rule is still being evaluated, so we miss this scheduled evaluation.
2 Rule is still being evaluated, so we miss this scheduled evaluation.
2.5 Rule ends evaluation
Try to schedule rule for evaluation at time 1, but time has passed.
Try to schedule rule for evaluation at time 2, but time has passed.
Schedule rule for evaluation at time 3 succeeds, time 3 has not passed.
3 Rule begins scheduled evaluation.
Is the following statement true:
"If a rule is being evaluated, the same rule will not begin
its next evaluation until the first evaluation has completed."
This is as opposed to having multiple concurrent evaluations of the same rule.
> Now the question is should alarm bump the error count if "time already
> passed" status comes back? The answer is yes! We may not be able to get
> this done before EFT update, but we would like to try it for V1.0.
Bumping a counter (or logging an error) sounds like a good idea. Do you
plan to support a COUNTERS attribute group for the alarms rule entity?
Thanks again,
Jack
|
131.3 | Yes we have COUNTERS and STATUS too for Rule subentity | WAKEME::ANIL | | Wed May 16 1990 19:59 | 50 |
| > Is the following statement true:
>
> "If a rule is being evaluated, the same rule will not begin
> its next evaluation until the first evaluation has completed."
>
>This is as opposed to having multiple concurrent evaluations of the same rule.
Yes, it is true that a rule will not be rescheduled until the previous
evaluation is completed. Note that the time to evaluate is less than
.1 seconds. I have not been able to schedule anything faster than
.5 seconds.
>
>Bumping a counter (or logging an error) sounds like a good idea. Do you
>plan to support a COUNTERS attribute group for the alarms rule entity?
>
>
Yes, we do have counters! Enclosed is a sample output. The counter
and status attributes will be available in the EFT update kit.
MCC 0 ALARMS RULE DTM_101
ALL ATTRIBUTES
AT 16-MAY-1990 18:49:45
NAME = DTM_101
+----------------------------------------------------------------+
| Result of Last Evaluation = True |
NEW | State = Enabled | Status
| Substate = Running |
| Time of Last Evaluation = 16-MAY-1990 18:49:34.79 |
+----------------------------------------------------------------+
+----------------------------------------------------------------+
NEW | Evaluation Error = 0 |
| Evaluation False = 0 | Counter
| Evaluation True = 2 |
+----------------------------------------------------------------+
Category = "LOG file notification"
Exception Handler = SYS$COMMON:[MCC]MCC$ALARMS_LOG_EXCEPTI
ON.COM;3
Expression = (bridge 08-00-2b-07-b9-f3 Bad Hello Li
mit > 10, AT every 00:00:30)
Parameter = "BINDU$DUA0:[ANIL.LOG]DTM_101.LOG"
Procedure = SYS$COMMON:[MCC]MCC$ALARMS_LOG_ALARM.C
OM;3
|
131.4 | More on Evaluation Time | WAKEME::ANIL | | Wed May 16 1990 20:30 | 38 |
| RE: <<< Note 131.2 by DSTEG1::MCCANN >>>
+-------------------------------------------------------------------------------+
|Would the flow of events look something like this: |
| |
|TIME ACTION |
|---- -------------------------------------------- |
| 0 Rule begins scheduled evaluation |
| |
| 1 Rule is still being evaluated, so we miss this scheduled evaluation. |
| |
| 2 Rule is still being evaluated, so we miss this scheduled evaluation. |
| |
| 2.5 Rule ends evaluation |
| Try to schedule rule for evaluation at time 1, but time has passed. |
| Try to schedule rule for evaluation at time 2, but time has passed. |
| Schedule rule for evaluation at time 3 succeeds, time 3 has not passed.|
| |
| 3 Rule begins scheduled evaluation. |
+-------------------------------------------------------------------------------+
The table you have made is essentially accurate. I would just like to
define the evaluation time as the sum of the time to fetch the data and
the time to process it, i.e.
Time to evaluate = (Time to fetch the data) +
(Time to process the data)
Say Te = Tf + Tp
Performance calculations done so far indicate that Tp < .1 seconds,
whereas Tf
for DECNET Phase4 is ~ 15-20 seconds.
for Bridge is ~ 1-2 seconds
so you can see that most of the time is spent in the network I/O.
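To make the proportion concrete, here is a small Python sketch of Te = Tf + Tp. The function names are hypothetical, Tp is taken at its quoted upper bound of .1 seconds, and Tf is taken at the low end of each quoted range, which gives the least favourable share for the network.

```python
def evaluation_time(tf: float, tp: float) -> float:
    """Te = Tf + Tp, all values in seconds."""
    return tf + tp


def fetch_share(tf: float, tp: float) -> float:
    """Fraction of Te that is spent fetching the data."""
    return tf / evaluation_time(tf, tp)


TP_BOUND = 0.1  # Tp < .1 seconds, used here as an upper bound

# Low end of each Tf range quoted above, in seconds.
for name, tf in (("DECNET Phase4", 15.0), ("Bridge", 1.0)):
    te = evaluation_time(tf, TP_BOUND)
    share = fetch_share(tf, TP_BOUND)
    print(f"{name}: Te = {te:.1f} s, fetch share = {share:.1%}")
```

Even in the least favourable case here (Bridge, Tf = 1 second), more than 90% of Te is fetch time.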
|
131.5 | I agree - Te = Tf + Tp | DSTEG1::MCCANN | | Thu May 17 1990 15:37 | 29 |
|
Yes. When I used the term "evaluate a rule", I meant it to include both
fetching the data and checking the rule expression.
I like Te = Tf + Tp. It's a good definition.
Since you've been so helpful, here's a couple more questions. :-)
Assume we have a rule that fetches a circuit characteristic from a remote
node4. When the rule begins evaluation, it issues a request to the
DECnet IV AM. The DECnet IV AM fetches the data (in a separate thread?
while the rule thread waits?).
In the act of fetching that data, a DECnet session is established
between MCC on the local node and NML on the remote node. The data is
requested and returned using NICE, then the session terminates. Is any of
the returned data processed by MCC before the session terminates? Or
is the session terminated, then the data processed?
Perhaps I should define what I think I mean by "process the data". I mean
the DECnet IV AM parses the NICE message, extracts the desired circuit
characteristic, and returns that characteristic to the alarms FM. In turn,
the alarms FM checks the rule expression using the data it just received.
In this example, could we use the duration of the session as Tf?
Thanks again,
Jack
|
131.6 | More on Tf | WAKEME::ANIL | | Thu May 17 1990 21:59 | 28 |
| > Perhaps I should define what I think I mean by "process the data". I mean
> the DECnet IV AM parses the NICE message, extracts the desired circuit
> characteristic, and returns that characteristic to the alarms FM. In turn,
> the alarms FM checks the rule expression using the data it just received.
>
> In this example, could we use the duration of the session as Tf?
From the Alarms perspective, Tf is the time between making the call and the data
being returned to Alarms. If we were to draw a generic time line for any AM
returning data as the result of a show request, it would look as follows:
|<-------------------- Tf ----------------------->|
|
-----------+---+---------------------------------------------+--
<- Alarms-->| ^ |<-- Validate the call --------------------->|
code ^ | Set up the context
| ` Lookup translation tables
call` Get the data
to ` translate back to MCC call parameters
AM `
`
IM/Dispatcher
As you can see from the above picture, if the DECNET Phase4 AM has to incur the
overhead of establishing a session, it would still be part of Tf from the Alarms
viewpoint.
|
131.7 | That approach isn't very self limiting | CAPN::SYLOR | Architect = Buzzword Generator | Mon May 21 1990 18:31 | 31 |
| In an earlier approach to this problem we used a slightly different algorithm
for computing the "next" poll/evaluation. We took the "period" that the user
asked for (say 60 sec) and added it onto the time that we finished an
evaluation/poll and used that to determine the time of the start of the next
evaluation/poll. Example:
evaluation period 60 sec.
evaluation starts at 12:01:00
it takes 10 secs to gather data, evaluate it, etc. finishing at 12:01:10
start of next evaluation 12:02:10
Now in this example we actually take 70 seconds between evaluations.
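The arithmetic of that example can be checked with a few lines of Python. This is only a sketch: `next_start` is a hypothetical name, and the calendar date is arbitrary since only the clock times matter.

```python
from datetime import datetime, timedelta


def next_start(finish: datetime, period: timedelta) -> datetime:
    """The next evaluation starts one full period after the previous one finished."""
    return finish + period


period = timedelta(seconds=60)
start = datetime(1990, 1, 1, 12, 1, 0)        # arbitrary date; only the time matters
finish = start + timedelta(seconds=10)        # gathering and evaluating took 10 s
following = next_start(finish, period)

print(following.time())                        # 12:02:10
print((following - start).total_seconds())     # 70.0
```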
The advantage of this approach was we never completely "hog" the system.
In fact, the algorithm is self-stabilizing in that added load "slows the rate
of evaluation/polling", and lowers the workload causing the overload.
We never have to worry about "missed evaluations", in fact all you need do is
keep a counter of how many evaluations have been done. Over a long interval,
it is possible to compute the "actual" rate at which the evaluations are done.
If you *really* wanted to get fancy, you could compute that rate, compare
it against the "desired rate", and if they were more than X% slower,
raise an alarm!
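A rough Python sketch of that rate check follows. The function names and the reading of "X% slower" as a shortfall of the actual rate relative to the desired rate are assumptions made for illustration.

```python
def rate_shortfall(evaluations_done: int, elapsed: float, period: float) -> float:
    """Fraction by which the actual evaluation rate falls short of the desired rate."""
    desired_rate = 1.0 / period                 # evaluations per second asked for
    actual_rate = evaluations_done / elapsed    # evaluations per second achieved
    return 1.0 - actual_rate / desired_rate


def should_raise_alarm(evaluations_done: int, elapsed: float,
                       period: float, threshold_pct: float) -> bool:
    """True when the actual rate is more than threshold_pct percent below the desired rate."""
    return rate_shortfall(evaluations_done, elapsed, period) * 100.0 > threshold_pct
```

For instance, 60 evaluations completed in 4200 seconds against a 60 second period (that is, one evaluation every 70 seconds) is a shortfall of about 14%, so `should_raise_alarm(60, 4200.0, 60.0, 10.0)` is true while a 20% threshold would stay quiet.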
I'm generally concerned about the 10 times then give up rule. If the period
between evaluations is large (relative to the average length of a computation)
then it never detects a problem (even if a computation takes 10 times the
expected period). When you shorten the period (say for stress testing by a
group like DSTEG, or by a customer), you'll suddenly see the rules start
disabling themselves as they one by one give up. That won't look good on
a test.
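The concern can be put as a one-line condition. The sketch below assumes that each rescheduling attempt targets the next slot one period later, so ten attempts are exhausted only when a single evaluation lasts at least ten periods. `gives_up` is a hypothetical name and the durations are illustrative.

```python
def gives_up(te: float, period: float, max_attempts: int = 10) -> bool:
    """True if an evaluation lasting te seconds exhausts every rescheduling attempt."""
    # Attempt k targets start + k * period. All attempts fail when even the last
    # target is not later than the finish time, start + te.
    return max_attempts * period <= te


print(gives_up(te=20.0, period=60.0))    # False: a normal evaluation, long period
print(gives_up(te=200.0, period=60.0))   # False: ten times slower, still unnoticed
print(gives_up(te=20.0, period=1.0))     # True: the same evaluation, short period
```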
Mark
|