| Title: | SCHEDULER |
| Notice: | Welcome to the Scheduler Conference on node HUMANE |
| Moderator: | RUMOR::FALEK |
| Created: | Sat Mar 20 1993 |
| Last Modified: | Tue Jun 03 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 1240 |
| Total number of notes: | 5017 |
I have a customer with a problem as detailed below. Any help would be
greatly appreciated.
OpenVMS V6.1
Scheduler V2.1B-1
The customer has a site-written procedure NSCHED_DUMP.COM which does
the following:
(a) shuts down the Scheduler
(b) submits a batch job to restart the Scheduler 10 minutes later
(c) copies all the Scheduler files off to another directory for
subsequent backup
(d) waits for 15 minutes until after the Scheduler has restarted
(e) closes the Scheduler log file (only some days of the week)
(f) exits.
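For illustration only, a minimal DCL sketch of such a procedure might look
like the following. Every specific in it (the restart procedure name, the
logical names, the day tested in step (e)) is an assumption, not the
customer's actual code:

    $! NSCHED_DUMP.COM - hypothetical sketch of steps (a)-(f) above
    $ SET NOON
    $! (a) shut down the Scheduler
    $ SCHED STOP
    $! (b) batch job to restart the Scheduler 10 minutes later
    $!     (the startup file name is an assumption)
    $ SUBMIT /AFTER="+0-00:10" SYS$STARTUP:NSCHED$STARTUP.COM
    $! (c) copy the Scheduler files off for subsequent backup
    $!     (both logical names are assumptions)
    $ COPY NSCHED$DATA:*.* BACKUP$DIR:*.*
    $! (d) wait until well after the Scheduler has restarted
    $ WAIT 00:15:00.00
    $! (e) close the Scheduler log file on some days only
    $!     (the site-specific close command is omitted here)
    $ IF F$CVTIME(,,"WEEKDAY") .NES. "Friday" THEN GOTO SKIP_CLOSE
    $!     ... command to close the Scheduler log file goes here ...
    $ SKIP_CLOSE:
    $! (f) exit
    $ EXIT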
This procedure runs each day on their VAXcluster running VMS 6.1 with
no problems.
However, they have transferred the same job to an Alpha (standalone),
where the job goes into a NotRunning state every time it is executed:
GOOFY>sched sh j 9
Job Name      Entry  User_name  State       Next Run Time
--------      -----  ---------  -----       -------------
NSCHED_DUMP       9  SYSTEM     NotRunning  5-JUN-1996 18:00
VMS_Command : @SITE_COM:NSCHED_DUMP.COM
Group : (none) Type : (none)
Comment : Daily job to dump the scheduler files
Last Start Time : 4-JUN-1996 18:00
Last Finish Time : 4-JUN-1996 09:13 Last Exit Status : 0C9681EC(ERROR)
Error : %NSCHED-F-JOBABORT, Job was aborted
Schedule Interval : D 18:00 Mode : Detached
Mail to : TCRUSE (on ERROR)
Days : (MON,TUE,WED,THU,FRI)
Output File : sys$common:[sysmgr]nsched_dump.log
Cluster_CPU : Default User not notified on completion
It appears that when the Scheduler restarts it does not recognise the
fact that the job is still running and so puts it into the NotRunning
state.
The customer set DEBUG ON in the Scheduler. The following extract is from
the debug log when the Scheduler is restarted; it contains the only
reference to NSCHED_DUMP (entry 9).
Job 8 is scheduled for the past - check pre-requisites
All Deps must have completed with success later than
4-JUN-1996 17:56:05.98
calling RUN_TASK to run job 8
vss$get_next_start_time: 1 cstat= 211191443 next=
4-JUN-1996 18:15:09.00
Running Job 8 PID=0000316B Count= 1 Priority= 4
06:10 PM processing record # 9 status= R request=
06:10 PM updated record # 9 status= R request=
06:10 PM processing record # 10 status= S request=
Now= 4-JUN-1996 18:10:09.61 job_sched_time= 5-JUN-1996 00:05:00.00
job 10 is scheduled for the future
06:10 PM updated record # 10 status= S request=
Both the VAX and the Alpha are running Scheduler V2.1B-1.
So why does this Scheduler job go into the "NotRunning" state? Is it
expected behaviour?
Thanks in advance,
Rich
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 1114.1 | Backup scheduler with script command | CMGOP2::meod22dgp4.gen.meo.dec.com::mckenzie | --> dangling pointer | Wed Jun 05 1996 20:48 | 7 |
You can get the same effect (the ability to recreate the database) without
closing down the scheduler, with schedule script job/all. FWIW.
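For what it's worth, that would presumably be invoked along these lines
(the command spelling is taken verbatim from the note above; check HELP
SCHEDULE for the exact syntax at your site):

    $ SCHEDULE SCRIPT job/ALL   ! emit a DCL script that can recreate the job database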
| 1114.2 | ...but why "NotRunning"? | KERNEL::TITCOMBER | | Fri Jun 07 1996 07:33 | 6 |
Thanks for that, but what I really need to know is why the job goes
into the "Not Running" state. Is it expected, or is it a problem?
Rich
| 1114.3 | NotRunning hardly ever pops up on a normal system | HLFS00::ERIC_S | Eric Sonneveld MCS - B.O. IS Holland | Sat Jun 08 1996 04:23 | 13 |
Not Running means the Scheduler didn't get the PID details (yet). This
shouldn't take more than a few seconds, so most of the time you don't see
it. It also happens at image rundown, when the PID is gone and the
Scheduler has to update the status in the scheduler database.

I have no sources or anything, but from experience I remember having seen
this state at startup/rundown of a scheduler job. In a normal situation it
should hardly ever be seen; if it is, that indicates poor performance on
the system and/or the Scheduler itself.

Eric
| 1114.4 | NotRunning - more explanation | RUMOR::FALEK | ex-TU58 King | Mon Jun 10 1996 14:53 | 19 |
The Scheduler DCL interface checks the PID associated with scheduler
jobs that are marked as currently "running" in the scheduler's
database (disk file), and if the PID isn't found, it shows the job as
"NotRunning" rather than "Running".
If you stop the scheduler while jobs are running, jobs continue to run
and complete normally (unless you did $ sched stop /abort) but since the
scheduler isn't running, the disk database isn't updated. The user
interface would then indicate "NotRunning" for the job process. When the
scheduler is restarted, it checks its database and cleans up any jobs
whose PIDs aren't really there (it will probably report the job
completion status as the failure %NSCHED-F-JOBABORT, since it can't
tell whether the completion was normal, having missed the mailbox
termination message).
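For reference, the two ways of stopping mentioned above (commands as used
in this conference):

    $ SCHED STOP          ! scheduler exits; running jobs continue and complete
    $ SCHED STOP /ABORT   ! scheduler exits and aborts currently running jobs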
In a multi-node cluster this won't happen if the scheduler is running on
another node and the job is not restricted to running on a node that is
down, since the "default" scheduler will take over responsibility for
tracking jobs and updating the status on disk.
| 1114.5 | | RUMOR::FALEK | ex-TU58 King | Mon Jun 10 1996 14:57 | 6 |
PS: if jobs stay in the NotRunning state (as reported by the GUI) after the
scheduler is restarted on the node the job was running on (or restarted on
any node of a VMScluster, if the job is not restricted to a particular
node), and the scheduler has had time to read through the database and
reach a more-or-less steady state, then something is wrong, since the
scheduler should have updated the job's status.
| 1114.6 | Fixed with V2.1B-7 | KERNEL::TITCOMBER | | Fri Jul 05 1996 08:35 | 9 |
Thanks for all the quality explanations and help. The problem was
resolved by upgrading to V2.1B-7; it has not recurred despite many
re-runs of the job.
Thanks again,
Rich