[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference humane::scheduler

Title:	SCHEDULER
Notice:	Welcome to the Scheduler Conference on node HUMANEril
Moderator:	RUMOR::FALEK

Created:	Sat Mar 20 1993
Last Modified:	Tue Jun 03 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1240
Total number of notes:	5017

1069.0. "Scheduler hang(?) in SCHEDULER$DOO_COMMAND" by KERNEL::TITCOMBER () Tue Apr 16 1996 05:27

    Help!  Can anybody help to shed some light on this problem?

    I have a customer who has experienced a problem with Scheduler hanging
    a couple of times in the last week or so.  When I say hanging, I
    believe that to be the case, but am open to other suggestions.

    The customer notices the problem when jobs don't seem to complete at
    the expected time, or don't repeat themselves at the specified
    interval.  In the particular case of the last occurrence, one batch job
    had a total elapsed time from start to finish of 2 hours (from SCHED
    SHOW HISTORY/RECORDED_RUNS).  The job normally takes 20 seconds or so! 
    Another detached job is scheduled to run every 3 minutes, and yet SCHED
    SHOW HISTORY/RECORDED_RUNS shows a period of time where the job didn't
    run for 40 mins or so!

    Unfortunately(?), the condition eventually clears itself and as a
    result we have been unable to look at the system while in the problem
    state.  However, I have some output that shows some useful information
    about the batch job in particular.

    I have attached the output with more detailed explanations below.

    Any help or ideas would be greatly appreciated.

    Rich



    The customer is running Scheduler V2.1B-7 on a cluster of VAX systems
    running a mix of V5.5-2 and V6.1

    Here are the job details:

Job Name             Entry    User_name    State      Next Run Time
--------             -----    ---------    -----      -------------
SUN_FX_LA_FXI-LOGON  360      FXOPLA       Scheduled  21-APR-1996 13:00 [NXT]
VMS_Command : @NSCHED$fxn:WIN_FXN_LOGON.COM LA
Group : FXNET                              Type : SUNSTART
Comment : INFORM OPS TO START FXI LINKS
Last Start Time   : 14-APR-1996 21:03
Last Finish Time  : 14-APR-1996 21:03      Last Exit Status : SUCCESS
Schedule Interval : D 13:00:00             Mode   : Batch
Mail to           : FXOPLA (No Mail)
Days              : (SUN)
Output File       : NSCHED$FXN:SCH_FXN_LA_LOGON.LOG
Cluster_CPU       : <Ignored>              Notify user upon completion
Submit Queue      : FXN$LA$BATCH
CPULimit (x100ms) : 0                      QPriority : 100
Max_Time Warning  : None                   Job Always retained
Stall Notify      : 0 02:00:00.00          No Retry on Error
Success Count     : 11                     Failure Count : 0
Owner UIC         : [462,2]                Restart on Crash
Read Identifier   : SCH$READ
Execute Identifier: SCH$EXECUTE
Write Identifier  : SCH$WRITE
Send Opcom Completion Message
No Pre or Post Function for this job
This job has 1 local job(s) that depend upon it:
(SUN_FX_LA_START)
This job has no Dependencies on other jobs
Job Restricted to run NOT_ON Special Days, Action is to SKIP
Job Restricted by Special Days Classes:
      (FXNET_START)



    The last start time and finish time suggest that the job completed in a
    short time at 21:03.  However, if we look at the history of recorded
    runs for this job we get the following (an extract of two entries):


--------------------------------------------------------------------------------
  Job :          360 - SUN_FX_LA_FXI-LOGON                Owner : FXOPLA
  Server : Local                                          Node : PORTOS
  Login Time :       7-APR-1996 13:00:02.73               Process Id : 278004DF
  Completion Time :  7-APR-1996 13:00:25.00               Exit Status : SUCCESS

  Accounting information:
  Buffered I/0 count:           89      Peak working set size:  1077
  Direct I/O count:            138      Peak page file size:       0
  Page faults:                 903      Mounted volumes:           0
  Charged CPU time:     0 00:00:00.59   Elapsed time:     0 00:00:22.27
--------------------------------------------------------------------------------
  Job :          360 - SUN_FX_LA_FXI-LOGON                Owner : FXOPLA
  Server : Local                                          Node : PORTOS
  Login Time :      14-APR-1996 18:56:25.72               Process Id : 27C00E6B
  Completion Time : 14-APR-1996 21:03:44.88               Exit Status : SUCCESS

  Accounting information:
  Buffered I/0 count:           78      Peak working set size:  1074
  Direct I/O count:             91      Peak page file size:       0
  Page faults:                 900      Mounted volumes:           0
  Charged CPU time:     0 00:00:00.68   Elapsed time:     0 02:07:19.16
--------------------------------------------------------------------------------


    From second entry, can see that the job from "Login Time" to
    "Completion Time" was much longer than is normally expected (see first
    entry) - instead of 20 seconds, it took over 2 hours!  Not only that,
    but the login time and completion time don't match the last start and
    finish times in the SHOW JOB output. I therfore checked the Accounting
    information for the job on the node on which it ran, and this agrees
    with the job history information:


BATCH Process Termination
-------------------------

Username:          FXOPLA            UIC:               [6282FXN,FXOPLA]
Account:           6690FXN           Finish time:       14-APR-1996 21:03:47.98
Process ID:        27C00E6B          Start time:        14-APR-1996 18:56:25.72
Owner ID:                            Elapsed time:                0 02:07:22.26
Terminal name:                       Processor time:              0 00:00:00.78
Remote node addr:                    Priority:          4
Remote node name:                    Privilege <31-00>: 8114952C
Remote ID:                           Privilege <63-32>: 00000004
Queue entry:       937               Final status code: 10000001
Queue name:        FXN$LA$BATCH
Job name:          SCHEDULER_SUN_FX_LA_FXI-LOGON
Final status text: %SYSTEM-S-NORMAL, normal successful completion

Page faults:             1106        Direct IO:                116
Page fault reads:          78        Buffered IO:              100
Peak working set:        1074        Volumes mounted:            0
Peak page file:          5997        Images executed:            6

    
    The output in the logfile (tail end) is as follows:

$! SCHEDULER$SHELL.COM  -- shell for executing Scheduler jobs
$! POLYCENTER Scheduler V2.1
$GOTO Start ! Avoid writing copyright all over the place
$Start:
$!
$ nsched_saved_verify = f$verify(0)
$       if f$mode() .eqs. "BATCH" then nsched_batch_job_number = p1
$       run nsched$:scheduler$doo_command
$! SCHEDULER$SHELL.COM  -- shell for executing Scheduler jobs
$! POLYCENTER Scheduler V2.1
$GOTO Start ! Avoid writing copyright all over the place
$Start:
$!
$ nsched_saved_verify = f$verify(0)
%DCL-I-SUPERSEDE, previous value of TT has been superseded
%DCL-I-SUPERSEDE, previous value of TT has been superseded

Executing job command : @NSCHED$fxn:WIN_FXN_LOGON.COM LA, output follows:
-----------------------------------------------------------------
$       @NSCHED$fxn:WIN_FXN_LOGON.COM LA
$ !
$ Request/To=CENTRAL "AUTO-I-FXI  Please Log LA into FXI"
$ !
$ exit
$       nsched_status == $status
$       nsched_verify = f$verify(0)
-----------------------------------------------------------------
14-APR-1996 21:03:43.10  Job completion status %X10000001
-----------------------------------------------------------------
14-APR-1996 21:03:43.24  Job command completion status %X10000001
SCHEDULER$SHELL.COM: normal job termination
  FXOPLA       job terminated at 14-APR-1996 21:03:46.93

  Accounting information:
  Buffered I/O count:              96         Peak working set size:    1074
  Direct I/O count:               113         Peak page file size:      5997
  Page faults:                   1101         Mounted volumes:             0
  Charged CPU time:           0 00:00:00.75   Elapsed time:     0 02:07:21.83


    At the time of the problem (when the job appeared to be hanging), the
    customer looked at the logfile and the last line in it was the line:

$       run nsched$:scheduler$doo_command


    I then searched OPERATOR.LOG for the expected OPCOM message:

%%%%%%%%%%%  OPCOM  14-APR-1996 21:03:42.71  %%%%%%%%%%%
Message from user SYSTEM on PORTOS
Scheduler Job Started in Batch Queue, Name: SUN_FX_LA_FXI-LOGON (Job #360)
Entry 937 Started in FXN$LA$BATCH

%%%%%%%%%%%  OPCOM  14-APR-1996 21:03:42.86  %%%%%%%%%%%
Message from user FXOPLA on PORTOS
BATCH_937, AUTO-I-FXI  Please Log LA into FXI

    This suggests that although the job was initially started at
    18:56:25.72 (from history record and batch accounting record),
    Scheduler did not start the job in the batch queue until 21:03
    naccording to the OPCOM notification.

    During the "hang" condition, this particular job was in a "Queued"
    state, with a requested state of "Release", while the batch job itself
    was "Executing".

    Does this indicate communication problems between Scheduler processes
    in the cluster?  This is all I can think it could be.  I don't have the
    information on the other detached scheduler job to hand, but know that
    it runs on the same node (PORTOS) as this batch job.  I do not know
    which node was the default Scheduler at the time, although PORTOS is
    currently the default node and the customer believes that it was at the
    time of the problem.

    Unfortunately, the customer does not have debug turned on by default,
    so I have no useful information from that source.  Is turning on debug
    the only way forward?

T.R	Title	User	Date	Lines
1069.1		MUZICK::RCOLLIER	`Tue Apr 16 1996 07:46`	12
	Have you tried: SCHED SHOW DELETES - if over 100 then database needs compressing since this could cause performance problems. dir/size the HISTORY LOG - if over 3000 blocks then need to close it and open a new log file if neither one of these work, please also check the install guide and make sure that the system parameters are set to the recommended values.
1069.2	Log file size?	KERNEL::TITCOMBER	`Wed Apr 17 1996 05:26`	30
	Thanks for the reply. SCHED SHOW DELETES looks OK: >> Number of Job deletions since last compress = 44 Looks like the history log needs to be closed and a new one opened: >> VERMONT_CREAMERY.LOG;1 19700/20200 I have checked the SYSGEN parameters and the UAF quotas, and all are as recommended in the installation guide with the exception of the SYSGEN SRP paramters: Parameter Name Current Default Min. Max. Unit Dynamic -------------- ------- ------- ------- ------- ---- ------- SRPCOUNT 16242 120 0 4194304 Packets SRPCOUNTV 103710 250 0 4194304 Packets However, there is no SRP pool expansion on the system, so don't believe this to be a problem. Therefore, the only issue that might be relevant is the history log file size. The customer will address this issue, and monitor the system closely for a week or so before getting back to me. Thanks for your assistance.