| Title: | SCHEDULER |
| Notice: | Welcome to the Scheduler Conference on node HUMANE |
| Moderator: | RUMOR::FALEK |
| Created: | Sat Mar 20 1993 |
| Last Modified: | Tue Jun 03 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 1240 |
| Total number of notes: | 5017 |
I have a customer with a problem as detailed below. Any help would be
greatly appreciated.
OpenVMS V6.1
Scheduler V2.1B-1
The customer has a site-written procedure NSCHED_DUMP.COM which does
the following:
(a) shuts down the Scheduler
(b) submits a batch job to restart the Scheduler 10 minutes later
(c) copies all the Scheduler files off to another directory for
subsequent backup
(d) waits for 15 minutes until after the Scheduler has restarted
(e) closes the Scheduler log file (only some days of the week)
(f) exits.
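For illustration only, a minimal DCL sketch of such a procedure might look
like the following. Every specific in it (the restart procedure name, the
logical names, the day tested in step (e)) is an assumption, not the
customer's actual code:

    $! NSCHED_DUMP.COM - hypothetical sketch of steps (a)-(f) above
    $ SET NOON
    $! (a) shut down the Scheduler
    $ SCHED STOP
    $! (b) batch job to restart the Scheduler 10 minutes later
    $!     (the startup file name is an assumption)
    $ SUBMIT /AFTER="+0-00:10" SYS$STARTUP:NSCHED$STARTUP.COM
    $! (c) copy the Scheduler files off for subsequent backup
    $!     (both logical names are assumptions)
    $ COPY NSCHED$DATA:*.* BACKUP$DIR:*.*
    $! (d) wait until well after the Scheduler has restarted
    $ WAIT 00:15:00.00
    $! (e) close the Scheduler log file on some days only
    $!     (the site-specific close command is omitted here)
    $ IF F$CVTIME(,,"WEEKDAY") .NES. "Friday" THEN GOTO SKIP_CLOSE
    $!     ... command to close the Scheduler log file goes here ...
    $ SKIP_CLOSE:
    $! (f) exit
    $ EXIT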
This procedure runs each day on their VAXcluster running VMS 6.1 with
no problems.
However, they have transferred the same job to an Alpha (standalone),
where the job goes into a NotRunning state every time it is executed:
GOOFY>sched sh j 9
Job Name      Entry  User_name  State       Next Run Time
--------      -----  ---------  -----       -------------
NSCHED_DUMP       9  SYSTEM     NotRunning  5-JUN-1996 18:00
VMS_Command : @SITE_COM:NSCHED_DUMP.COM
Group : (none) Type : (none)
Comment : Daily job to dump the scheduler files
Last Start Time : 4-JUN-1996 18:00
Last Finish Time : 4-JUN-1996 09:13 Last Exit Status : 0C9681EC(ERROR)
Error : %NSCHED-F-JOBABORT, Job was aborted
Schedule Interval : D 18:00 Mode : Detached
Mail to : TCRUSE (on ERROR)
Days : (MON,TUE,WED,THU,FRI)
Output File : sys$common:[sysmgr]nsched_dump.log
Cluster_CPU : Default User not notified on completion
It appears that when the Scheduler restarts it does not recognise the
fact that the job is still running and so puts it into the NotRunning
state.
The customer set DEBUG ON in the Scheduler. The following extract is from
the debug log when the Scheduler is restarted; it contains the only
reference to NSCHED_DUMP (entry 9).
Job 8 is scheduled for the past - check pre-requisites
All Deps must have completed with success later than
4-JUN-1996 17:56:05.98
calling RUN_TASK to run job 8
vss$get_next_start_time: 1 cstat= 211191443 next=
4-JUN-1996 18:15:09.00
Running Job 8 PID=0000316B Count= 1 Priority= 4
06:10 PM processing record # 9 status= R request=
06:10 PM updated record # 9 status= R request=
06:10 PM processing record # 10 status= S request=
Now= 4-JUN-1996 18:10:09.61 job_sched_time= 5-JUN-1996 00:05:00.00
job 10 is scheduled for the future
06:10 PM updated record # 10 status= S request=
Both the VAX and the Alpha are running Scheduler V2.1B-1.
So why does this Scheduler job go into the "NotRunning" state? Is it
expected behaviour?
Thanks in advance,
Rich
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 1114.1 | Backup scheduler with script command | CMGOP2::meod22dgp4.gen.meo.dec.com::mckenzie | --> dangling pointer | Wed Jun 05 1996 20:48 | 7 |
You can get the same effect (the ability to recreate the database) without
closing down the scheduler, with schedule script job/all. FWIW.
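For what it's worth, that would presumably be invoked along these lines
(the command spelling is taken verbatim from the note above; check HELP
SCHEDULE for the exact syntax at your site):

    $ SCHEDULE SCRIPT job/ALL   ! emit a DCL script that can recreate the job database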
| 1114.2 | ...but why "NotRunning"? | KERNEL::TITCOMBER | | Fri Jun 07 1996 07:33 | 6 |
Thanks for that, but what I really need to know is why the job goes
into the "Not Running" state. Is it expected, or is it a problem?
Rich
| 1114.3 | NotRunning hardly ever pops up on a normal system | HLFS00::ERIC_S | Eric Sonneveld MCS - B.O. IS Holland | Sat Jun 08 1996 04:23 | 13 |
Not Running means the Scheduler didn't get the PID details (yet). This
shouldn't take more than a few seconds, so most of the time you don't see
it. It also happens at image rundown, when the PID is gone and the
Scheduler has to update the status in the scheduler database.

I have no sources or anything, but from experience I remember having seen
this state at startup/rundown of a scheduler job. In a normal situation it
should hardly ever be seen; if it is, that indicates poor performance on
the system and/or the Scheduler itself.

Eric
| 1114.4 | NotRunning - more explanation | RUMOR::FALEK | ex-TU58 King | Mon Jun 10 1996 14:53 | 19 |
The Scheduler DCL interface checks the PID associated with scheduler
jobs that are marked as currently "running" in the scheduler's
database (disk file), and if the PID isn't found, it shows the job as
"NotRunning" rather than "Running".
If you stop the scheduler while jobs are running, jobs continue to run
and complete normally (unless you did $ sched stop /abort) but since the
scheduler isn't running, the disk database isn't updated. The user
interface would then indicate "NotRunning" for the job process. When the
scheduler is restarted, it checks its database and cleans up any jobs
whose PIDs aren't really there (it will probably report the job
completion status as the failure %NSCHED-F-JOBABORT, since it can't
tell whether the completion was normal, having missed the mailbox
termination message).
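For reference, the two ways of stopping mentioned above (commands as used
in this conference):

    $ SCHED STOP          ! scheduler exits; running jobs continue and complete
    $ SCHED STOP /ABORT   ! scheduler exits and aborts currently running jobs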
In a multi-node cluster this won't happen if the scheduler is running on
another node and the job is not restricted to running on a node that is
down, since the "default" scheduler will take over responsibility for
tracking jobs and updating the status on disk.
| 1114.5 | | RUMOR::FALEK | ex-TU58 King | Mon Jun 10 1996 14:57 | 6 |
PS: if jobs stay in the NotRunning state (as reported by the GUI) after the
scheduler is restarted on the node the job was running on (or restarted on
any node of a VMScluster, if the job is not restricted to a particular
node), and the scheduler has had time to read through the database and
reach a more-or-less steady state, then something is wrong, since the
scheduler should have updated the job's status.
| 1114.6 | Fixed with V2.1B-7 | KERNEL::TITCOMBER | | Fri Jul 05 1996 08:35 | 9 |
Thanks for all the quality explanations and help. The problem was
resolved by upgrading to V2.1B-7; it has not recurred despite many
re-runs of the job.
Thanks again,
Rich