
Conference humane::scheduler

Title:SCHEDULER
Notice:Welcome to the Scheduler Conference on node HUMANEril
Moderator:RUMOR::FALEK
Created:Sat Mar 20 1993
Last Modified:Tue Jun 03 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1240
Total number of notes:5017

1114.0. "Job in "NotRunning" state after restart of Scheduler" by KERNEL::TITCOMBER () Wed Jun 05 1996 12:54

    I have a customer with a problem as detailed below.  Any help would be
    greatly appreciated.


    OpenVMS V6.1
    Scheduler V2.1B-1

    The customer has a site-written procedure NSCHED_DUMP.COM which does
    the following:

    (a) shuts down the Scheduler

    (b) submits a batch job to restart the Scheduler 10 minutes later

    (c) copies all the Scheduler files off to another directory for
    subsequent backup

    (d) waits for 15 minutes until after the Scheduler has restarted

    (e) closes the Scheduler log file (only some days of the week)

    (f) exits.

    This procedure runs each day on their VAXcluster running VMS 6.1 with
    no problems.
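
    The steps above might look roughly like this in DCL. This is a sketch
    only: the file names, directory logicals, and the log-close command are
    not given in the note and are assumptions, not the customer's actual
    procedure.

$ !  Sketch of NSCHED_DUMP.COM -- names below are assumed
$ SCHEDULE STOP                              ! (a) shut down the Scheduler
$ SUBMIT /AFTER="+0:10" SITE_COM:NSCHED_RESTART.COM  ! (b) restart 10 min later
$ COPY NSCHED$DATA:*.* BACKUP_DIR:*.*        ! (c) copy files for backup
$ WAIT 00:15:00                              ! (d) wait past the restart
$ day = F$CVTIME(,,"WEEKDAY")
$ IF day .EQS. "Friday" THEN SCHEDULE CLOSE  ! (e) command name assumed
$ EXIT                                       ! (f)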

    However, they have transferred the same job to an Alpha (standalone),
    where the job goes into a NotRunning state every time it is executed: 

GOOFY>sched sh j 9

Job Name             Entry    User_name    State      Next Run Time
--------             -----    ---------    -----      -------------
NSCHED_DUMP          9        SYSTEM       NotRunning  5-JUN-1996 18:00
VMS_Command :  @SITE_COM:NSCHED_DUMP.COM
Group : (none)                             Type : (none)
Comment : Daily job to dump the scheduler files
Last Start Time   :  4-JUN-1996 18:00
Last Finish Time  :  4-JUN-1996 09:13      Last Exit Status : 0C9681EC(ERROR)
Error : %NSCHED-F-JOBABORT, Job was aborted
Schedule Interval : D 18:00                Mode   : Detached
Mail to           : TCRUSE (on ERROR)
Days              : (MON,TUE,WED,THU,FRI)
Output File       : sys$common:[sysmgr]nsched_dump.log
Cluster_CPU       : Default                User not notified on completion

    It appears that when the Scheduler restarts it does not recognise the
    fact that the job is still running and so puts it into the NotRunning
    state.

    The customer set DEBUG ON in Scheduler. The following extract is from
    the  debug log when the Scheduler is re-started and contains the only
    reference to NSCHED_DUMP (9). 


Job  8  is scheduled for the past - check pre-requisites

All Deps must have completed with success later than
 4-JUN-1996 17:56:05.98
calling RUN_TASK to run job  8
vss$get_next_start_time: 1  cstat= 211191443  next=
 4-JUN-1996 18:15:09.00
Running Job  8  PID=0000316B Count= 1  Priority= 4
06:10 PM  processing record #  9  status= R   request=
06:10 PM  updated    record #  9  status= R    request=
06:10 PM  processing record #  10  status= S   request=
 Now= 4-JUN-1996 18:10:09.61   job_sched_time= 5-JUN-1996 00:05:00.00
job  10  is scheduled for the future
06:10 PM  updated    record #  10  status= S    request=

    Both the VAX and the Alpha are running Scheduler V2.1B-1.

    So, why does this Scheduler job go into the Not Running state?  Is it
    expected behaviour?

    Thanks in advance,

    Rich

    
1114.1. "Backup scheduler with script command" by CMGOP2::meod22dgp4.gen.meo.dec.com::mckenzie (--> dangling pointer) Wed Jun 05 1996 21:48
You can get the same effect (ability to recreate the 
database) without closing down the scheduler with

schedule script job/all

FWIW
1114.2. "...but why "notRunning"" by KERNEL::TITCOMBER () Fri Jun 07 1996 08:33
    
    Thanks for that, but what I really need to know is why the job goes
    into the "Not Running" state.  Is it expected, or is it a problem?
    
    Rich
    
1114.3. "not running pops up hardly on a normal system" by HLFS00::ERIC_S (Eric Sonneveld MCS - B.O. IS Holland) Sat Jun 08 1996 05:23
NotRunning means the Scheduler has not (yet) obtained the PID details for the
job. This should take no more than a few seconds, so most of the time you
never see the state at all. It also happens at image rundown, when the PID is
gone and the Scheduler still has to update the status in the scheduler
database.

I have no sources or anything, but from experience I remember seeing this
state at startup/rundown of a scheduler job.

In a normal situation (no performance problems on the system) it should
hardly ever be seen; if it is, that indicates poor performance on the system
and/or in the Scheduler itself.

Eric
1114.4. "NotRunning - more explanation" by RUMOR::FALEK (ex-TU58 King) Mon Jun 10 1996 15:53
    The Scheduler DCL interface checks the PID associated with scheduler
    jobs that are marked as currently "running" in the scheduler's
    database (disk file), and if the PID isn't found, it shows the job as
    "NotRunning" rather than "Running".
     
    If you stop the scheduler while jobs are running, jobs continue to run
    and complete normally (unless you did $ sched stop /abort) but since the
    scheduler isn't running, the disk database isn't updated. The user
    interface would then indicate "NotRunning" for the job process. When the
    scheduler is restarted, it checks its database and cleans up any jobs
    whose PIDs aren't really there (it would probably report the job
    completion status as failure "NSCHED-F-job was aborted" since it can't
    tell if the completion was normal, as it missed the mailbox termination
    message). 
    
    In a multi-node cluster this won't happen if the scheduler is running on
    another node and the job is not restricted to a node that is down, since
    the "default" scheduler will take over responsibility for tracking jobs
    and updating the status on disk.
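
    The PID check described in .4 can be illustrated with a small DCL
    fragment. This is a sketch of the idea only, not the Scheduler's actual
    code; the PID value is the one shown in the debug log in the base note.

$ !  Sketch: does the PID recorded in the database still exist?
$ SET NOON                            ! don't exit if the lookup fails
$ pid = "0000316B"                    ! PID recorded for the job
$ prcnam = F$GETJPI(pid, "PRCNAM")    ! fails (%SYSTEM-W-NONEXPR) if gone
$ ok = $STATUS
$ IF ok THEN WRITE SYS$OUTPUT "Running"
$ IF .NOT. ok THEN WRITE SYS$OUTPUT "NotRunning"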
1114.5. by RUMOR::FALEK (ex-TU58 King) Mon Jun 10 1996 15:57
    PS: if jobs stay in the NotRunning state (as reported by the GUI) after
    the scheduler is restarted on the node the job was running on (or started
    on any node of a VMScluster, if the job is not restricted to a particular
    node), and the scheduler has had time to read through the database and
    reach a more-or-less steady state, then something is wrong, since the
    scheduler should have updated the job's status.
1114.6. "Fixed with V2.1B-7" by KERNEL::TITCOMBER () Fri Jul 05 1996 09:35
    
    Thanks for all the quality explanations and help.  The problem was
    resolved by upgrading to V2.1B-7; it has not recurred, despite many
    re-runs of the job.
    
    Thanks again,
    
    Rich