[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference azur::mcc

Title:DECmcc user notes file. Does not replace IPMT.
Notice:Use IPMT for problems. Newsletter location in note 6187
Moderator:TAEC::BEROUD
Created:Mon Aug 21 1989
Last Modified:Wed Jun 04 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:6497
Total number of notes:27359

4720.0. "Scheduling in an alarm rule went out in the blue..." by STKHLM::BERGGREN (Nils Berggren EIS/Project dpmt, Sweden DTN 876-8287) Fri Mar 19 1993 03:11

Hi,

Did some alarm-testing in batch this night and when I came in this morning
almost all processes on the system were suspended.  Why, the batch log-file 
had grown up to 65 Mb in size and the log file was on the system-disk which
had shrunk down to 220 free blocks. So the Audit server suspends execution 
on the system...

Ok, my fault - I shuld have put the Batch-log file on a another disk, I know!

But, what happend?

I have a moderatly integrated script, called SPECIALS, which returns three
attributes.   MSL-def like:

      ATTRIBUTE Salary = 1 : Integer32
        ACCESS     = NONSETABLE,
        DISPLAY    = TRUE,
        CATEGORIES = ( CONFIGURATION ),
        UNITS      = "Bananas"
      END ATTRIBUTE Salary;

      ATTRIBUTE A String = 2 : Latin1String
        ACCESS     = NONSETABLE,
        DISPLAY    = TRUE,
        CATEGORIES = ( CONFIGURATION )
      END ATTRIBUTE A String;

      ATTRIBUTE IQ-value = 3 : REAL
        ACCESS     = NONSETABLE,
        DISPLAY    = TRUE,
        CATEGORIES = ( CONFIGURATION ),
        UNITS      = "IQ-units"
      END ATTRIBUTE IQ-value;

I've created a SPECIALS-entity which calls the TROSKEL.COM-file.
The output from the script-file is like:

	TROSKEL 64  "NILS @MALEN 2" 20.64

I use the current system time to get the values '64' and '20.64'.
64 is the decimal part of the seconds and 20 is the seconds.
The script also writes a log-file with a time stamp and the values
that are written out to SY$OUTPUT.

I have created an alarm rule like:
	create domain tvt-sverige rule troskel -
	       expression=(script NMS specials special salary > 30, -
	       at every 0:15:50.20), -
	       severity = Major

 
The command-procedure I submitted to batch is as follows:
	$ set proc/priv=all
	$ manage/enter
	  notify domain tvt-sverige
	  enable domain tvt-sverige rule troskel
	  show script NMS specials special, at start (+20:00:00)

I submitted it at 22:08 yesterday night and from the script's log-file
I can see that the it got called every 15:50 minutes.

This is the script's logfile where the 1:st column corresponds to 'Salary',
2:nd to 'IQ-value', 3:rd is time when called and last is the parent's 
process name.

	22 46.22 "1993-03-18 22:08:46.22"  parent=BATCH_33
	20 36.20 "1993-03-18 22:24:36.20"  parent=BATCH_33
	22 26.22 "1993-03-18 22:40:26.22"  parent=BATCH_33
	43 16.43 "1993-03-18 22:56:16.43"  parent=BATCH_33
	55 06.55 "1993-03-18 23:12:06.55"  parent=BATCH_33
	75 56.75 "1993-03-18 23:27:56.75"  parent=BATCH_33
	00 47.00 "1993-03-18 23:43:47.00"  parent=BATCH_33
	42 37.42 "1993-03-18 23:59:37.42"  parent=BATCH_33
	32 27.32 "1993-03-19 00:15:27.32"  parent=BATCH_33
	55 17.55 "1993-03-19 00:31:17.55"  parent=BATCH_33
	80 07.80 "1993-03-19 00:47:07.80"  parent=BATCH_33
	92 57.92 "1993-03-19 01:02:57.92"  parent=BATCH_33
	17 48.17 "1993-03-19 01:18:48.17"  parent=BATCH_33
	33 38.33 "1993-03-19 01:34:38.33"  parent=BATCH_33
	28 40.28 "1993-03-19 01:34:40.28"  parent=BATCH_33
	18 42.18 "1993-03-19 01:34:42.18"  parent=BATCH_33
	02 44.02 "1993-03-19 01:34:44.02"  parent=BATCH_33
	81 45.81 "1993-03-19 01:34:45.81"  parent=BATCH_33
	63 47.63 "1993-03-19 01:34:47.63"  parent=BATCH_33

From the batch-logfile I see the the rule fires at 22:56:17 which is
the first time the salary passes 30.  So far so good.  Continuing in
the batch-log file I see that the rule fires and clears until 1:34 AM
where the script  *** GETS CALLED EVERY 1 OR 2 SECONDS! ***


At  01:34:39 I get the following alarm which looks alright to me:

!!!!!!!!!!!!!! Alarm, 19-MAR-1993 01:34:39 !!!!!!!!!!!!!! [1]
Domain: MALEN_NS:.TVT-SVERIGE                         Severity: Major
Notification Entity: Script MALEN_NS:.NMS SPECIALS TROSKEL 
Event Source: Domain MALEN_NS:.TVT-SVERIGE Rule troskel 
Event: OSI Rule Fired

                             Event Type = QualityofServiceAlarm
                             Event Time = 19-MAR-1993 01:34:38.83
                         Probable Cause = Unknown
                        Additional Info = { (
            
                   significance = True,
                                information = "Rule fired: Script MALEN_NS:.NMS 
                                              SPECIALS TROSKEL Salary = 33 
                                              19-MAR-1993 01:34:38.83" ),
                                            (
                               significance = True,
                                information = "(script NMS specials special 
                                              salary > 30, at eve
ry 0:15:50.20)" ) }
                         Managed Object = Script MALEN_NS:.NMS SPECIALS TROSKEL 
                     Perceived Severity = Major



but after this, I get 3 alarms in a row within a few seconds and at 01:34:49 
I get the following alarm:

!!!!!!!!!!!!!! Alarm, 19-MAR-1993 01:34:49 !!!!!!!!!!!!!! [1]
Domain: MALEN_NS:.TVT-SVERIGE                         Severity: Indeterminate
Notification Entity: MCC MALEN_NS:.MCC.MALEN_DIRECTOR ALARMS RULE troskel 
Event Source: Domain MALEN_NS:.TVT-SVERIGE Rule troskel 
Event: OSI Rule Exception

                             Event Type = QualityofServiceAlarm
                             Event Time = 19-MAR-1993 01:34:48.41
                         Probable Cause = Unknown
                        Additional Info = { (
                               significance = True,
                                information = "Software logic error 
                                              detected<EOS>%MCC-W-TIME_ALREADY_P
                                              A, scheduled time has already 
                                              passed" ),
                                            (
                               significance = True,
                                information = "(script NMS spe cials special 
                                              salary > 30, at every 0:15:50.20)" ) }
                         Managed Object = MCC 0 ALARMS RULE troskel 
                     Perceived Severity = Indeterminate



Then the rule fires as fast as it can, 2-3 times/second, until the logfile 
has covered the whole system-disk and the batch-process got suspended.

  
I don't know if this is reproducable but I thought I should tell you
about this.  It's not to often when something goes wrong you can tell
*EXACTLY* what you did.

The system I'm running is VMS 5.2, DECmcc BMS T1.3.1.


    regards,

	Nils


just a thought, how many lines are there in the batch-logfile?
  2.5 times/second * 60 secs * 60 min * 7 hours = 63,000 alarms 
  63,000 alarms * 24 lines/alarm = 1,512,000 lines......


T.RTitleUserPersonal
Name
DateLines
4720.1Overlapping : AT EVERY vs TIMEOUT ??ZTOIS1::VISTARenato VISTA, SIS Strasbourg, FranceFri Mar 19 1993 04:1715
    Nils,
    
    I've not tried to reproduce your problem, but may I suggest you to
    increase the Timeout attribute of the Script SPECIALS troskel instance,
    maybe to grant more time enough to script execution sequence.
    
    The Information given in the Exception Message Rule "... scheduled time
    has already passed" send by ALARM_FM might be due to overlapping
    between AT EVERY parameter used by ALARM FM and default TIMEOUT
    attribute used by SCRIPT AM.
    
    To be confirmed or infirmed...
    
    Renato
    
4720.2The most interesting problems with the Script AMMOLAR::ROBERTSKeith Roberts - Network Management ApplicationsFri Mar 19 1993 08:5428
  Nils,

  You find the most interesting problems with the Script AM !!  8)

  I can't explain what is happening here -- but I can discuss what
  goes on inside Alarms...

  The only thing I can think of is something is going wrong with the
  Scheduling time (at every 0:15:50.20) ... When Alarms calls the Script
  AM with this scheduling time, the call acutally blocks in the information
  manager (IM) until the 15 minutes or so has expired.  Then the Script AM
  is really called.

  The only way you could get back the 'time already passed' was if the next
  scheduled time (the next 15 minutes) went by BEFORE alarms got around
  to calling the Script AM again -- BUT -- Alarms will allow 10 'time
  already passed' errors before it returns the error.  That would be
  150+ minutes.  That does not make any sense, as according to your logs
  Alarms and the Script AM were working correctly up to the 'problem'.

  Could you mail me your script (MOLAR::ROBERTS) ?

  I'll try your test and see if I can make any sense of what is going
  wrong.

  /keith

  
4720.3Getting closer ...MOLAR::ROBERTSKeith Roberts - Network Management ApplicationsFri Mar 19 1993 15:0530
  Nils,

  I ran your test .. and a few of my own.   What I found is really weird.

  It seems that the LIB$SPAWN is not always executing.  Sometimes is just
  sits there without spawing or returning an error.  BUT - if you hit
  Return at the MCC> prompt, the job spawns and executs fine.

  If the Script AM can't spawn and gets hung, then after a while the 
  Information Manager will return the 'Time Already Passed' error to 
  Alarms, if the hang gets cleared up - but pressing Return.

  I don't know why this is happening.

  I also found out why the Script AM doesn't work with the FCL 'Forms'
  Mode.  The Script AM is returned an error that says:

	CLI-E-TRMMBX terminal has associated mailbox - terminate
	image then SPAWN

  The Forms mode attaches a mailbox to the terminal to catch any 
  unsolicited I/O - which it can then control by writing to a specific
  window.  Guess the rules say you can't spawn under these conditions.

  What is the solution?  Not sure just yet.  I'm going to look at
  SYS$CREPRC and see if that works better than LIB$SPAWN.

  /keith

  
4720.4Same error returned.VCSESU::WADEBill Wade, VAXc Systems &amp; Support EngMon Mar 22 1993 12:3853
    
    I had the same error message returned to a rule.  The log file contains
    the following -
     
    .
    .
    .
    $!++
    $! Run the User's Command Procedure
    $!--
    $ @SYS$COMMON:[MCC]MDFOMS$DEV_AVAIL_EXCEPTION.COM;2 "Domain
    LOCAL_NS:.mdfoms$devicepolling 
    		Rule OMS_BRDG_SITE2-620_AVAIL "-     !rulename
                          "Bridge availability"-    !category
                          ""-       !description
                          "(CHANGE_OF (BRIDGE bridge.site2-620 SECONDS
    				OPERATING, *,
    				*), AT  EVERY=00:00:30)"-       !expression
                          "22-MAR-1993 09:48:43.54"-        !time
                          "Software logic error
    			   detected<EOS>%MCC-W-TIME_ALREADY_PA, scheduled time 
    			   has already passed"-    !dtcrtf or error
                          ""-       !notification params
                          "SYS$SCRATCH:MCC_ALARMS_DATA_09484354.DAT"       
    				!file that contains more info about the rule
    $!
    .
    .
    .
    
    
    The rule is defined as -
    
    MCC> SHOW DOMA mdfoms$devicepolling RULE OMS_BRDG_SITE2-620_AVAIL ALL
    CHAR
    
    Domain LOCAL_NS:.mdfoms$devicepolling Rule OMS_BRDG_SITE2-620_AVAIL
    AT 22-MAR-1993 12:35:55 Characteristics
    
    Examination of attributes shows:
                      Alarm Fired Procedure =
    SYS$COMMON:[MCC]MDFOMS$DEV_AVAIL_ALARM.COM;2
                  Alarm Exception Procedure =
    SYS$COMMON:[MCC]MDFOMS$DEV_AVAIL_EXCEPTION.COM;2
                                Description = "Bridge availability"
                                Batch Queue = "MDFOMS$MCC_ALARM_BATCH"
                                 Expression = (CHANGE_OF (BRIDGE
    bridge.site2-620
                                              SECONDS OPERATING, *,*), AT
                                              EVERY=00:00:30)
                                   Severity = Critical
                             Probable Cause = Unknown
         
4720.5TIMEOUT shouldn't be the problem...STKHLM::BERGGRENNils Berggren EIS/Project dpmt, Sweden DTN 876-8287Mon Mar 22 1993 12:569
    re .1
    
    Renato, 
    
    The TIMEOUT shouldn't be the problem since the script takes 
    less than a second to execute, including a SPWAN which the SCRIPT AM
    does when executing the script.
    
      /Nils
4720.6Probably not the same problemMOLAR::ROBERTSKeith Roberts - Network Management ApplicationsMon Mar 22 1993 12:5816
The Bridge AM does not spawn a sub-process as does the Script AM, so this
problem is (probably) not the same .. however it IS a problem.

I would imagine that the Bridge AM is just not returning data to Alarms.
It should be timing out and returning 'Cannot Complete Operation' or some
such error.

What are the particulars?  Software versions and hardware types (ethernet
controller) ...

If this error occurs again - can you try the 'Show' directive to get the
data from the Bridge AM directly.  Alarms uses the Show directive.  If the
Show never returns with data or and error, then there is something wrong
in the Bridge AM or MCC ethernet sub-system.

/keith
4720.7VCSESU::WADEBill Wade, VAXc Systems &amp; Support EngMon Mar 22 1993 13:3239
       
    re .6

   >      
   > What are the particulars?  Software versions and hardware types (ethernet
   > controller) ...
   >      

	VMS 5.5-2
	DECmcc 1.2 

	MCC> show mcc 0 bridge_am     all att

	MCC 0 BRIDGE_AM
	AT 22-MAR-1993 13:33:41 All Attributes

                        Available Ports = { "EZA0:" }
                      Component Version = V1.2.1
               Component Identification = "DECmcc Bridge AM"


 	Info on the bridge -

	                          Hardware Type = DEFEB DECbridge 620
                       Software Version = "V9.9.0 FT"
                            ROM Version = V2.9.0
 
   > If this error occurs again - can you try the 'Show' directive to get the
   > data from the Bridge AM directly.  Alarms uses the Show directive.  If the
   > Show never returns with data or and error, then there is something wrong
   > in the Bridge AM or MCC ethernet sub-system.
 
	I tried the show bridge command shortly after this error and
	it was successful.
     

	Thanks,
	Bill