[Search for users]
[Overall Top Noters]
[List of all Conferences]
[Download this site]
Title: | DECmcc user notes file. Does not replace IPMT. |
Notice: | Use IPMT for problems. Newsletter location in note 6187 |
Moderator: | TAEC::BEROUD |
|
Created: | Mon Aug 21 1989 |
Last Modified: | Wed Jun 04 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 6497 |
Total number of notes: | 27359 |
4720.0. "Scheduling in an alarm rule went out in the blue..." by STKHLM::BERGGREN (Nils Berggren EIS/Project dpmt, Sweden DTN 876-8287) Fri Mar 19 1993 03:11
Hi,
Did some alarm-testing in batch this night and when I came in this morning
almost all processes on the system were suspended. Why, the batch log-file
had grown up to 65 Mb in size and the log file was on the system-disk which
had shrunk down to 220 free blocks. So the Audit server suspends execution
on the system...
Ok, my fault - I shuld have put the Batch-log file on a another disk, I know!
But, what happend?
I have a moderatly integrated script, called SPECIALS, which returns three
attributes. MSL-def like:
ATTRIBUTE Salary = 1 : Integer32
ACCESS = NONSETABLE,
DISPLAY = TRUE,
CATEGORIES = ( CONFIGURATION ),
UNITS = "Bananas"
END ATTRIBUTE Salary;
ATTRIBUTE A String = 2 : Latin1String
ACCESS = NONSETABLE,
DISPLAY = TRUE,
CATEGORIES = ( CONFIGURATION )
END ATTRIBUTE A String;
ATTRIBUTE IQ-value = 3 : REAL
ACCESS = NONSETABLE,
DISPLAY = TRUE,
CATEGORIES = ( CONFIGURATION ),
UNITS = "IQ-units"
END ATTRIBUTE IQ-value;
I've created a SPECIALS-entity which calls the TROSKEL.COM-file.
The output from the script-file is like:
TROSKEL 64 "NILS @MALEN 2" 20.64
I use the current system time to get the values '64' and '20.64'.
64 is the decimal part of the seconds and 20 is the seconds.
The script also writes a log-file with a time stamp and the values
that are written out to SY$OUTPUT.
I have created an alarm rule like:
create domain tvt-sverige rule troskel -
expression=(script NMS specials special salary > 30, -
at every 0:15:50.20), -
severity = Major
The command-procedure I submitted to batch is as follows:
$ set proc/priv=all
$ manage/enter
notify domain tvt-sverige
enable domain tvt-sverige rule troskel
show script NMS specials special, at start (+20:00:00)
I submitted it at 22:08 yesterday night and from the script's log-file
I can see that the it got called every 15:50 minutes.
This is the script's logfile where the 1:st column corresponds to 'Salary',
2:nd to 'IQ-value', 3:rd is time when called and last is the parent's
process name.
22 46.22 "1993-03-18 22:08:46.22" parent=BATCH_33
20 36.20 "1993-03-18 22:24:36.20" parent=BATCH_33
22 26.22 "1993-03-18 22:40:26.22" parent=BATCH_33
43 16.43 "1993-03-18 22:56:16.43" parent=BATCH_33
55 06.55 "1993-03-18 23:12:06.55" parent=BATCH_33
75 56.75 "1993-03-18 23:27:56.75" parent=BATCH_33
00 47.00 "1993-03-18 23:43:47.00" parent=BATCH_33
42 37.42 "1993-03-18 23:59:37.42" parent=BATCH_33
32 27.32 "1993-03-19 00:15:27.32" parent=BATCH_33
55 17.55 "1993-03-19 00:31:17.55" parent=BATCH_33
80 07.80 "1993-03-19 00:47:07.80" parent=BATCH_33
92 57.92 "1993-03-19 01:02:57.92" parent=BATCH_33
17 48.17 "1993-03-19 01:18:48.17" parent=BATCH_33
33 38.33 "1993-03-19 01:34:38.33" parent=BATCH_33
28 40.28 "1993-03-19 01:34:40.28" parent=BATCH_33
18 42.18 "1993-03-19 01:34:42.18" parent=BATCH_33
02 44.02 "1993-03-19 01:34:44.02" parent=BATCH_33
81 45.81 "1993-03-19 01:34:45.81" parent=BATCH_33
63 47.63 "1993-03-19 01:34:47.63" parent=BATCH_33
From the batch-logfile I see the the rule fires at 22:56:17 which is
the first time the salary passes 30. So far so good. Continuing in
the batch-log file I see that the rule fires and clears until 1:34 AM
where the script *** GETS CALLED EVERY 1 OR 2 SECONDS! ***
At 01:34:39 I get the following alarm which looks alright to me:
!!!!!!!!!!!!!! Alarm, 19-MAR-1993 01:34:39 !!!!!!!!!!!!!! [1]
Domain: MALEN_NS:.TVT-SVERIGE Severity: Major
Notification Entity: Script MALEN_NS:.NMS SPECIALS TROSKEL
Event Source: Domain MALEN_NS:.TVT-SVERIGE Rule troskel
Event: OSI Rule Fired
Event Type = QualityofServiceAlarm
Event Time = 19-MAR-1993 01:34:38.83
Probable Cause = Unknown
Additional Info = { (
significance = True,
information = "Rule fired: Script MALEN_NS:.NMS
SPECIALS TROSKEL Salary = 33
19-MAR-1993 01:34:38.83" ),
(
significance = True,
information = "(script NMS specials special
salary > 30, at eve
ry 0:15:50.20)" ) }
Managed Object = Script MALEN_NS:.NMS SPECIALS TROSKEL
Perceived Severity = Major
but after this, I get 3 alarms in a row within a few seconds and at 01:34:49
I get the following alarm:
!!!!!!!!!!!!!! Alarm, 19-MAR-1993 01:34:49 !!!!!!!!!!!!!! [1]
Domain: MALEN_NS:.TVT-SVERIGE Severity: Indeterminate
Notification Entity: MCC MALEN_NS:.MCC.MALEN_DIRECTOR ALARMS RULE troskel
Event Source: Domain MALEN_NS:.TVT-SVERIGE Rule troskel
Event: OSI Rule Exception
Event Type = QualityofServiceAlarm
Event Time = 19-MAR-1993 01:34:48.41
Probable Cause = Unknown
Additional Info = { (
significance = True,
information = "Software logic error
detected<EOS>%MCC-W-TIME_ALREADY_P
A, scheduled time has already
passed" ),
(
significance = True,
information = "(script NMS spe cials special
salary > 30, at every 0:15:50.20)" ) }
Managed Object = MCC 0 ALARMS RULE troskel
Perceived Severity = Indeterminate
Then the rule fires as fast as it can, 2-3 times/second, until the logfile
has covered the whole system-disk and the batch-process got suspended.
I don't know if this is reproducable but I thought I should tell you
about this. It's not to often when something goes wrong you can tell
*EXACTLY* what you did.
The system I'm running is VMS 5.2, DECmcc BMS T1.3.1.
regards,
Nils
just a thought, how many lines are there in the batch-logfile?
2.5 times/second * 60 secs * 60 min * 7 hours = 63,000 alarms
63,000 alarms * 24 lines/alarm = 1,512,000 lines......
T.R | Title | User | Personal Name | Date | Lines |
---|
4720.1 | Overlapping : AT EVERY vs TIMEOUT ?? | ZTOIS1::VISTA | Renato VISTA, SIS Strasbourg, France | Fri Mar 19 1993 04:17 | 15 |
| Nils,
I've not tried to reproduce your problem, but may I suggest you to
increase the Timeout attribute of the Script SPECIALS troskel instance,
maybe to grant more time enough to script execution sequence.
The Information given in the Exception Message Rule "... scheduled time
has already passed" send by ALARM_FM might be due to overlapping
between AT EVERY parameter used by ALARM FM and default TIMEOUT
attribute used by SCRIPT AM.
To be confirmed or infirmed...
Renato
|
4720.2 | The most interesting problems with the Script AM | MOLAR::ROBERTS | Keith Roberts - Network Management Applications | Fri Mar 19 1993 08:54 | 28 |
| Nils,
You find the most interesting problems with the Script AM !! 8)
I can't explain what is happening here -- but I can discuss what
goes on inside Alarms...
The only thing I can think of is something is going wrong with the
Scheduling time (at every 0:15:50.20) ... When Alarms calls the Script
AM with this scheduling time, the call acutally blocks in the information
manager (IM) until the 15 minutes or so has expired. Then the Script AM
is really called.
The only way you could get back the 'time already passed' was if the next
scheduled time (the next 15 minutes) went by BEFORE alarms got around
to calling the Script AM again -- BUT -- Alarms will allow 10 'time
already passed' errors before it returns the error. That would be
150+ minutes. That does not make any sense, as according to your logs
Alarms and the Script AM were working correctly up to the 'problem'.
Could you mail me your script (MOLAR::ROBERTS) ?
I'll try your test and see if I can make any sense of what is going
wrong.
/keith
|
4720.3 | Getting closer ... | MOLAR::ROBERTS | Keith Roberts - Network Management Applications | Fri Mar 19 1993 15:05 | 30 |
| Nils,
I ran your test .. and a few of my own. What I found is really weird.
It seems that the LIB$SPAWN is not always executing. Sometimes is just
sits there without spawing or returning an error. BUT - if you hit
Return at the MCC> prompt, the job spawns and executs fine.
If the Script AM can't spawn and gets hung, then after a while the
Information Manager will return the 'Time Already Passed' error to
Alarms, if the hang gets cleared up - but pressing Return.
I don't know why this is happening.
I also found out why the Script AM doesn't work with the FCL 'Forms'
Mode. The Script AM is returned an error that says:
CLI-E-TRMMBX terminal has associated mailbox - terminate
image then SPAWN
The Forms mode attaches a mailbox to the terminal to catch any
unsolicited I/O - which it can then control by writing to a specific
window. Guess the rules say you can't spawn under these conditions.
What is the solution? Not sure just yet. I'm going to look at
SYS$CREPRC and see if that works better than LIB$SPAWN.
/keith
|
4720.4 | Same error returned. | VCSESU::WADE | Bill Wade, VAXc Systems & Support Eng | Mon Mar 22 1993 12:38 | 53 |
|
I had the same error message returned to a rule. The log file contains
the following -
.
.
.
$!++
$! Run the User's Command Procedure
$!--
$ @SYS$COMMON:[MCC]MDFOMS$DEV_AVAIL_EXCEPTION.COM;2 "Domain
LOCAL_NS:.mdfoms$devicepolling
Rule OMS_BRDG_SITE2-620_AVAIL "- !rulename
"Bridge availability"- !category
""- !description
"(CHANGE_OF (BRIDGE bridge.site2-620 SECONDS
OPERATING, *,
*), AT EVERY=00:00:30)"- !expression
"22-MAR-1993 09:48:43.54"- !time
"Software logic error
detected<EOS>%MCC-W-TIME_ALREADY_PA, scheduled time
has already passed"- !dtcrtf or error
""- !notification params
"SYS$SCRATCH:MCC_ALARMS_DATA_09484354.DAT"
!file that contains more info about the rule
$!
.
.
.
The rule is defined as -
MCC> SHOW DOMA mdfoms$devicepolling RULE OMS_BRDG_SITE2-620_AVAIL ALL
CHAR
Domain LOCAL_NS:.mdfoms$devicepolling Rule OMS_BRDG_SITE2-620_AVAIL
AT 22-MAR-1993 12:35:55 Characteristics
Examination of attributes shows:
Alarm Fired Procedure =
SYS$COMMON:[MCC]MDFOMS$DEV_AVAIL_ALARM.COM;2
Alarm Exception Procedure =
SYS$COMMON:[MCC]MDFOMS$DEV_AVAIL_EXCEPTION.COM;2
Description = "Bridge availability"
Batch Queue = "MDFOMS$MCC_ALARM_BATCH"
Expression = (CHANGE_OF (BRIDGE
bridge.site2-620
SECONDS OPERATING, *,*), AT
EVERY=00:00:30)
Severity = Critical
Probable Cause = Unknown
|
4720.5 | TIMEOUT shouldn't be the problem... | STKHLM::BERGGREN | Nils Berggren EIS/Project dpmt, Sweden DTN 876-8287 | Mon Mar 22 1993 12:56 | 9 |
| re .1
Renato,
The TIMEOUT shouldn't be the problem since the script takes
less than a second to execute, including a SPWAN which the SCRIPT AM
does when executing the script.
/Nils
|
4720.6 | Probably not the same problem | MOLAR::ROBERTS | Keith Roberts - Network Management Applications | Mon Mar 22 1993 12:58 | 16 |
| The Bridge AM does not spawn a sub-process as does the Script AM, so this
problem is (probably) not the same .. however it IS a problem.
I would imagine that the Bridge AM is just not returning data to Alarms.
It should be timing out and returning 'Cannot Complete Operation' or some
such error.
What are the particulars? Software versions and hardware types (ethernet
controller) ...
If this error occurs again - can you try the 'Show' directive to get the
data from the Bridge AM directly. Alarms uses the Show directive. If the
Show never returns with data or and error, then there is something wrong
in the Bridge AM or MCC ethernet sub-system.
/keith
|
4720.7 | | VCSESU::WADE | Bill Wade, VAXc Systems & Support Eng | Mon Mar 22 1993 13:32 | 39 |
|
re .6
>
> What are the particulars? Software versions and hardware types (ethernet
> controller) ...
>
VMS 5.5-2
DECmcc 1.2
MCC> show mcc 0 bridge_am all att
MCC 0 BRIDGE_AM
AT 22-MAR-1993 13:33:41 All Attributes
Available Ports = { "EZA0:" }
Component Version = V1.2.1
Component Identification = "DECmcc Bridge AM"
Info on the bridge -
Hardware Type = DEFEB DECbridge 620
Software Version = "V9.9.0 FT"
ROM Version = V2.9.0
> If this error occurs again - can you try the 'Show' directive to get the
> data from the Bridge AM directly. Alarms uses the Show directive. If the
> Show never returns with data or and error, then there is something wrong
> in the Bridge AM or MCC ethernet sub-system.
I tried the show bridge command shortly after this error and
it was successful.
Thanks,
Bill
|