[Search for users]
[Overall Top Noters]
[List of all Conferences]
[Download this site]
Title: | SCHEDULER |
Notice: | Welcome to the Scheduler Conference on node HUMANE ril |
Moderator: | RUMOR::FALEK |
|
Created: | Sat Mar 20 1993 |
Last Modified: | Tue Jun 03 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 1240 |
Total number of notes: | 5017 |
1069.0. "Scheduler hang(?) in SCHEDULER$DOO_COMMAND" by KERNEL::TITCOMBER () Tue Apr 16 1996 06:27
Help! Can anybody help to shed some light on this problem?
I have a customer who has experienced a problem with Scheduler hanging
a couple of times in the last week or so. When I say hanging, I
believe that to be the case, but am open to other suggestions.
The customer notices the problem when jobs don't seem to complete at
the expected time, or don't repeat themselves at the specified
interval. In the particular case of the last occurrence, one batch job
had a total elapsed time from start to finish of 2 hours (from SCHED
SHOW HISTORY/RECORDED_RUNS). The job normally takes 20 seconds or so!
Another detached job is scheduled to run every 3 minutes, and yet SCHED
SHOW HISTORY/RECORDED_RUNS shows a period of time where the job didn't
run for 40 mins or so!
Unfortunately(?), the condition eventually clears itself and as a
result we have been unable to look at the system while in the problem
state. However, I have some output that shows some useful information
about the batch job in particular.
I have attached the output with more detailed explanations below.
Any help or ideas would be greatly appreciated.
Rich
The customer is running Scheduler V2.1B-7 on a cluster of VAX systems
running a mix of V5.5-2 and V6.1
Here are the job details:
Job Name Entry User_name State Next Run Time
-------- ----- --------- ----- -------------
SUN_FX_LA_FXI-LOGON 360 FXOPLA Scheduled 21-APR-1996 13:00 [NXT]
VMS_Command : @NSCHED$fxn:WIN_FXN_LOGON.COM LA
Group : FXNET Type : SUNSTART
Comment : INFORM OPS TO START FXI LINKS
Last Start Time : 14-APR-1996 21:03
Last Finish Time : 14-APR-1996 21:03 Last Exit Status : SUCCESS
Schedule Interval : D 13:00:00 Mode : Batch
Mail to : FXOPLA (No Mail)
Days : (SUN)
Output File : NSCHED$FXN:SCH_FXN_LA_LOGON.LOG
Cluster_CPU : <Ignored> Notify user upon completion
Submit Queue : FXN$LA$BATCH
CPULimit (x100ms) : 0 QPriority : 100
Max_Time Warning : None Job Always retained
Stall Notify : 0 02:00:00.00 No Retry on Error
Success Count : 11 Failure Count : 0
Owner UIC : [462,2] Restart on Crash
Read Identifier : SCH$READ
Execute Identifier: SCH$EXECUTE
Write Identifier : SCH$WRITE
Send Opcom Completion Message
No Pre or Post Function for this job
This job has 1 local job(s) that depend upon it:
(SUN_FX_LA_START)
This job has no Dependencies on other jobs
Job Restricted to run NOT_ON Special Days, Action is to SKIP
Job Restricted by Special Days Classes:
(FXNET_START)
The last start time and finish time suggest that the job completed in a
short time at 21:03. However, if we look at the history of recorded
runs for this job we get the following (an extract of two entries):
--------------------------------------------------------------------------------
Job : 360 - SUN_FX_LA_FXI-LOGON Owner : FXOPLA
Server : Local Node : PORTOS
Login Time : 7-APR-1996 13:00:02.73 Process Id : 278004DF
Completion Time : 7-APR-1996 13:00:25.00 Exit Status : SUCCESS
Accounting information:
Buffered I/0 count: 89 Peak working set size: 1077
Direct I/O count: 138 Peak page file size: 0
Page faults: 903 Mounted volumes: 0
Charged CPU time: 0 00:00:00.59 Elapsed time: 0 00:00:22.27
--------------------------------------------------------------------------------
Job : 360 - SUN_FX_LA_FXI-LOGON Owner : FXOPLA
Server : Local Node : PORTOS
Login Time : 14-APR-1996 18:56:25.72 Process Id : 27C00E6B
Completion Time : 14-APR-1996 21:03:44.88 Exit Status : SUCCESS
Accounting information:
Buffered I/0 count: 78 Peak working set size: 1074
Direct I/O count: 91 Peak page file size: 0
Page faults: 900 Mounted volumes: 0
Charged CPU time: 0 00:00:00.68 Elapsed time: 0 02:07:19.16
--------------------------------------------------------------------------------
From second entry, can see that the job from "Login Time" to
"Completion Time" was much longer than is normally expected (see first
entry) - instead of 20 seconds, it took over 2 hours! Not only that,
but the login time and completion time don't match the last start and
finish times in the SHOW JOB output. I therfore checked the Accounting
information for the job on the node on which it ran, and this agrees
with the job history information:
BATCH Process Termination
-------------------------
Username: FXOPLA UIC: [6282FXN,FXOPLA]
Account: 6690FXN Finish time: 14-APR-1996 21:03:47.98
Process ID: 27C00E6B Start time: 14-APR-1996 18:56:25.72
Owner ID: Elapsed time: 0 02:07:22.26
Terminal name: Processor time: 0 00:00:00.78
Remote node addr: Priority: 4
Remote node name: Privilege <31-00>: 8114952C
Remote ID: Privilege <63-32>: 00000004
Queue entry: 937 Final status code: 10000001
Queue name: FXN$LA$BATCH
Job name: SCHEDULER_SUN_FX_LA_FXI-LOGON
Final status text: %SYSTEM-S-NORMAL, normal successful completion
Page faults: 1106 Direct IO: 116
Page fault reads: 78 Buffered IO: 100
Peak working set: 1074 Volumes mounted: 0
Peak page file: 5997 Images executed: 6
The output in the logfile (tail end) is as follows:
$! SCHEDULER$SHELL.COM -- shell for executing Scheduler jobs
$! POLYCENTER Scheduler V2.1
$GOTO Start ! Avoid writing copyright all over the place
$Start:
$!
$ nsched_saved_verify = f$verify(0)
$ if f$mode() .eqs. "BATCH" then nsched_batch_job_number = p1
$ run nsched$:scheduler$doo_command
$! SCHEDULER$SHELL.COM -- shell for executing Scheduler jobs
$! POLYCENTER Scheduler V2.1
$GOTO Start ! Avoid writing copyright all over the place
$Start:
$!
$ nsched_saved_verify = f$verify(0)
%DCL-I-SUPERSEDE, previous value of TT has been superseded
%DCL-I-SUPERSEDE, previous value of TT has been superseded
Executing job command : @NSCHED$fxn:WIN_FXN_LOGON.COM LA, output follows:
-----------------------------------------------------------------
$ @NSCHED$fxn:WIN_FXN_LOGON.COM LA
$ !
$ Request/To=CENTRAL "AUTO-I-FXI Please Log LA into FXI"
$ !
$ exit
$ nsched_status == $status
$ nsched_verify = f$verify(0)
-----------------------------------------------------------------
14-APR-1996 21:03:43.10 Job completion status %X10000001
-----------------------------------------------------------------
14-APR-1996 21:03:43.24 Job command completion status %X10000001
SCHEDULER$SHELL.COM: normal job termination
FXOPLA job terminated at 14-APR-1996 21:03:46.93
Accounting information:
Buffered I/O count: 96 Peak working set size: 1074
Direct I/O count: 113 Peak page file size: 5997
Page faults: 1101 Mounted volumes: 0
Charged CPU time: 0 00:00:00.75 Elapsed time: 0 02:07:21.83
At the time of the problem (when the job appeared to be hanging), the
customer looked at the logfile and the last line in it was the line:
$ run nsched$:scheduler$doo_command
I then searched OPERATOR.LOG for the expected OPCOM message:
%%%%%%%%%%% OPCOM 14-APR-1996 21:03:42.71 %%%%%%%%%%%
Message from user SYSTEM on PORTOS
Scheduler Job Started in Batch Queue, Name: SUN_FX_LA_FXI-LOGON (Job #360)
Entry 937 Started in FXN$LA$BATCH
%%%%%%%%%%% OPCOM 14-APR-1996 21:03:42.86 %%%%%%%%%%%
Message from user FXOPLA on PORTOS
BATCH_937, AUTO-I-FXI Please Log LA into FXI
This suggests that although the job was initially started at
18:56:25.72 (from history record and batch accounting record),
Scheduler did not start the job in the batch queue until 21:03
naccording to the OPCOM notification.
During the "hang" condition, this particular job was in a "Queued"
state, with a requested state of "Release", while the batch job itself
was "Executing".
Does this indicate communication problems between Scheduler processes
in the cluster? This is all I can think it could be. I don't have the
information on the other detached scheduler job to hand, but know that
it runs on the same node (PORTOS) as this batch job. I do not know
which node was the default Scheduler at the time, although PORTOS is
currently the default node and the customer believes that it was at the
time of the problem.
Unfortunately, the customer does not have debug turned on by default,
so I have no useful information from that source. Is turning on debug
the only way forward?
T.R | Title | User | Personal Name | Date | Lines |
---|
1069.1 | | MUZICK::RCOLLIER | | Tue Apr 16 1996 08:46 | 12 |
| Have you tried:
SCHED SHOW DELETES - if over 100 then database needs compressing since this
could cause performance problems.
dir/size the HISTORY LOG - if over 3000 blocks then need to close it
and open a new log file
if neither one of these work, please also check the install guide and
make sure that the system parameters are set to the recommended values.
|
1069.2 | Log file size? | KERNEL::TITCOMBER | | Wed Apr 17 1996 06:26 | 30 |
|
Thanks for the reply.
SCHED SHOW DELETES looks OK:
>> Number of Job deletions since last compress = 44
Looks like the history log needs to be closed and a new one opened:
>> VERMONT_CREAMERY.LOG;1 19700/20200
I have checked the SYSGEN parameters and the UAF quotas, and all are as
recommended in the installation guide with the exception of the SYSGEN
SRP paramters:
Parameter Name Current Default Min. Max. Unit Dynamic
-------------- ------- ------- ------- ------- ---- -------
SRPCOUNT 16242 120 0 4194304 Packets
SRPCOUNTV 103710 250 0 4194304 Packets
However, there is no SRP pool expansion on the system, so don't believe
this to be a problem.
Therefore, the only issue that might be relevant is the history log
file size. The customer will address this issue, and monitor the
system closely for a week or so before getting back to me.
Thanks for your assistance.
|