Title: | SCHEDULER |
Notice: | Welcome to the Scheduler Conference on node HUMANE ril |
Moderator: | RUMOR::FALEK |
|
Created: | Sat Mar 20 1993 |
Last Modified: | Tue Jun 03 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 1240 |
Total number of notes: | 5017 |
1219.0. "SCHEDULER goes into DEP_WAIT state" by REFDV1::DAVIES () Tue Feb 18 1997 07:49
I logged the following call with the hotline, but thought someone in here
might have some insight. Ever since we upgraded CASPRO to SEPS97,
we have been encountering the following scheduler problem on a daily basis.
Anyone have any ideas?
tks,
Judy
From: REFDV1::DAVIES 12-FEB-1997 16:46:20.33
To: NIOPS::FLYNN,PENUTS::EMOTTOLO
CC: DAVIES
Subj: DSPS scheduler problem
Bob/Evelyn
I wanted to follow up on the SCHEDULER problem that occurred last night with
DSPS jobs going into a DEP_WAIT state. Bob, in our phone conversation you
mentioned logging a call to colorado to see if they have heard of the
problem. I wanted to document exactly what happened so you can tell them
or add it to the current open target call. I don't have the log #.
Evelyn, I am copying you on this mail message to make you aware of the problem
also. Until these problems with the DECscheduler are fixed, it is critical
that you (and everyone you work with) ensure these jobs finish.
The details of the problem...
As Susan mentioned in the attached mail message, after PP61_DSPSPROD
completed successfully it went into a DEP_WAIT state. Because it was in
a DEP_WAIT state it didn't kick off PP70_DSPSPROD. When John Healey
and I looked at the scheduler entries, PP61_DSPSPROD didn't have any
local jobs depending on it. (It should have had PP70_DSPSPROD).
The dependency link appeared to be broken.
Below is the state it was in prior to us "RESYNCHing" the 2 jobs.
casv05::ref_support> sched sh job/full pp60*=dspsprod
Job Name Entry User_name State Next Run Time
-------- ----- --------- ----- -------------
PP61_DSPSPROD 166 DSPSPROD Scheduled 12-FEB-1997 19:20
VMS_Command : @DSPS$COMMAND:PP61_DSPSPROD
Group : DSPSPROD Type : DAILY
Comment : PRIO 1..REFERENCE SUPPORT..;CALL SUPPORT IF JOB RUNNING @11PM..Table file
Last Start Time : 11-FEB-1997 19:30
Last Finish Time : 11-FEB-1997 19:37 Last Exit Status : SUCCESS
Schedule Interval : None Mode : Detached
Mail to : CASV05::DSPSPROD (Always)
Days : None
Output File : DSPS$LOGS:PP61_DSPSPROD.LOG
Cluster_CPU : Default Notify user upon completion
Run Priority : Default
Max_Time Warning : None Job Always retained
Stall Notify : None No Retry on Error
Success Count : 941 Failure Count : 250
Owner UIC : [24,1] No Restart on Crash
Send Opcom Completion Message
No Pre or Post Function for this job
No local jobs depend upon this job.
All dependencies must successfully complete after: 12-FEB-1997 14:25:18.32
Job Dependencies: (APR_ACMS_STOP)
casv05::ref_support> sched sh job/full pp70*=dspsprod
Job Name Entry User_name State Next Run Time
-------- ----- --------- ----- -------------
PP70_DSPSPROD 174 DSPSPROD Scheduled 12-FEB-1997 19:30
VMS_Command : @DSPS$COMMAND:PP70_DSPSPROD.COM
Group : DSPSPROD Type : DAILY
Comment : PRIO 1..REFERENCE SUPPORT..DSPS Extract/Copy
Last Start Time : 12-FEB-1997 12:12
Last Finish Time : 12-FEB-1997 12:13 Last Exit Status : SUCCESS
Schedule Interval : None Mode : Detached
Mail to : CASV05::DSPSPROD (Always)
Days : None
Output File : DSPS$LOGS:PP70_DSPSPROD.LOG
Cluster_CPU : Default Notify user upon completion
Run Priority : Default
Max_Time Warning : None Job Always retained
Stall Notify : None No Retry on Error
Success Count : 942 Failure Count : 0
Owner UIC : [24,1] No Restart on Crash
Send Opcom Completion Message
No Pre or Post Function for this job
No local jobs depend upon this job.
All dependencies must successfully complete after: 12-FEB-1997 14:25:20.82
Job Dependencies: (PP61_DSPSPROD)
We then decided to try to add back the dependency and it didn't work. So
we took off the dependency and added it back on. Interestingly enough,
we got the warning below:
casv05::ref_support> sched mod/synch=(pp61_dspsprod=dspsprod) pp70_dspsprod=dspsprod
%NSCHED-I-NOMODS, Job PP70_DSPSPROD - No fields modified
casv05::ref_support> sched mod/nosynch pp70_dspsprod=dspsprod
%NSCHED-I-RQSTSUCCSS, Job PP70_DSPSPROD - Modified
%NSCHED-W-NOSCHED, No scheduler available to service request
But it put the dependency back on anyway. So there seem to be some
strange things going on with DECSCHEDULER post SEPS97. Hopefully it
will be o.k. for tonight's processing.
This is the way PP61_DSPSPROD now looks...
DSPS_Judy> sched sho job pp61_dspsprod/full
Job Name Entry User_name State Next Run Time
-------- ----- --------- ----- -------------
PP61_DSPSPROD 166 DSPSPROD Scheduled 12-FEB-1997 19:20
VMS_Command : @DSPS$COMMAND:PP61_DSPSPROD
Group : DSPSPROD Type : DAILY
Comment : PRIO 1..REFERENCE SUPPORT..;CALL SUPPORT IF JOB RUNNING @11PM..Table file
Last Start Time : 11-FEB-1997 19:30
Last Finish Time : 11-FEB-1997 19:37 Last Exit Status : SUCCESS
Schedule Interval : None Mode : Detached
Mail to : CASV05::DSPSPROD (Always)
Days : None
Output File : DSPS$LOGS:PP61_DSPSPROD.LOG
Cluster_CPU : Default Notify user upon completion
Run Priority : Default
Max_Time Warning : None Job Always retained
Stall Notify : None No Retry on Error
Success Count : 941 Failure Count : 250
Owner UIC : [24,1] No Restart on Crash
Send Opcom Completion Message
No Pre or Post Function for this job
This job has 1 local job(s) that depend upon it:
(PP70_DSPSPROD)
All dependencies must successfully complete after: 12-FEB-1997 14:25:18.32
Job Dependencies: (APR_ACMS_STOP)
DSPS_Judy>
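The broken link in the listings above (PP70_DSPSPROD names PP61_DSPSPROD as a dependency, yet PP61_DSPSPROD reports "No local jobs depend upon this job.") is the kind of mismatch that could be checked mechanically from captured output. The following is only a rough sketch in Python, not a DECscheduler facility: it assumes the "sched sh job/full" text has been saved to a string and looks the way it does in this note. The function names parse_jobs and broken_links are made up for illustration.

```python
import re

# Patterns based only on the listing layout shown in this note (an assumption).
HEADER = re.compile(r"^([A-Z0-9_$]+)\s+\d+\s+\S+\s+\S+")   # name, entry, user, state
DEPENDS = re.compile(r"^Job Dependencies:\s*\((.*)\)")
NAMES = re.compile(r"[A-Z0-9_$]+")


def parse_jobs(listing):
    """Map each job name to the jobs it depends on and the jobs that depend on it."""
    jobs = {}
    current = None
    reading_dependents = False
    for raw in listing.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = jobs.setdefault(
                header.group(1), {"depends_on": set(), "dependents": set()}
            )
            reading_dependents = False
            continue
        if current is None:
            continue
        if line.startswith("This job has") and "depend upon it" in line:
            # The dependent job names follow on the next line, in parentheses.
            reading_dependents = True
            continue
        if reading_dependents and line.startswith("("):
            current["dependents"].update(NAMES.findall(line))
            reading_dependents = False
            continue
        reading_dependents = False
        deps = DEPENDS.match(line)
        if deps:
            current["depends_on"].update(NAMES.findall(deps.group(1)))
    return jobs


def broken_links(jobs):
    """Return (job, dependency) pairs where the dependency does not list the job back.

    Dependencies that are not part of the listing are skipped, since there is
    nothing to compare them against.
    """
    broken = []
    for name, info in sorted(jobs.items()):
        for dep in sorted(info["depends_on"]):
            if dep in jobs and name not in jobs[dep]["dependents"]:
                broken.append((name, dep))
    return broken
```

Fed the two listings taken before the resynch, broken_links(parse_jobs(text)) would report the pair (PP70_DSPSPROD, PP61_DSPSPROD). Fed listings for both jobs taken after the resynch, where PP61_DSPSPROD shows PP70_DSPSPROD as a dependent, it would report nothing.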
Tks,
Judy
From: REFDV1::VINCENT "mach nicht" 12-FEB-1997 10:28:00.79
To: USOPS::MPR
CC: DAVIES,MURPHY,HEALEY,VINCENT
Subj: Please log *URGENT* call for problem with DECScheduler on CASPRO
Hello,
Refer to attached mail message for description of events that occurred
last night with production on cluster CASPRO. This is apparently not the first
time since the upgrade on Thursday that sched jobs have gone into the black
hole of DEP_WAIT state. At a minimum, data center nightly support should
begin looking for / anticipating this problem and should page reference support
asap if this happens. Someone should also be looking into why the scheduler
is doing this. Please have assigned person call me as soon as possible.
Susan Vincent, 227-3776
From: REFDV1::VINCENT "mach nicht" 12-FEB-1997 08:56:18.29
To: DASSS1::NORTON
CC: VINCENT
Subj: bunch of dsps jobs didn't run last night, they are in dep_wait state
Hi Joy,
Fyi, half of dsps's nightly production didn't run last night. All of the
jobs that did run, ran successfully. I've checked out the dependencies based
on the schedule produced yesterday and everything is right. pp70_dspsprod
was the next thing that was supposed to run, but it is in dep_wait, even though
the job it was dependent upon (pp61_dspsprod) completed successfully. I was
on beeper last night and didn't get paged, but I guess no one would page me
since no jobs that actually ran failed. I'm working with pricing ops to do
what is necessary to reschedule jobs and run jobs that need to run today, but
I'm not sure what to do about the real problem -- why are these jobs still in
dep_wait? Any ideas?
Susan
T.R | Title | User | Personal Name | Date | Lines |
---|
1219.1 | corrupt DEPENDENCIES.DAT file | REFDV1::DAVIES | | Wed Feb 19 1997 11:36 | 26 |
| I'll answer my own note: Colorado found that a corrupt file was causing
our scheduler problem.
From: NIOPS::FLYNN "Bob Flynn - CCS Platform Management - DTN 264-7632" 18-FEB-1997 14:39:32.24
To: REFER3::VINCENT,REFER3::MURPHY
CC: REFER3::DAVIES,REFDV1::HEALEY
Subj: DECSCHEDULER on CASPRO
Hi Folks,
I finally was contacted by the CSC regarding the Scheduler
problem. The rep believes that the problem is a corrupted
"dependencies.dat" file.
The solution to that is to shut down Scheduler, rename the file
and restart Scheduler on one node. That startup looks for the file and
if it doesn't find it, it uses the VSS.DAT file to rebuild it.
I did that at approx. 2:30 this afternoon. Scheduler is back
up on both nodes and the file is brand new. Let's see what
tonight brings us.
Thanks,
Bob
|
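The rename step in the procedure above can be sketched as follows. This is only an illustration in Python, under loud assumptions: the directory holding the file is passed in by the caller (the thread does not give its location), the "_old" suffix and the function name set_aside are made up, and shutting down and restarting Scheduler are deliberately left out because the thread does not show the commands used. Renaming rather than deleting keeps the suspect file around for later inspection.

```python
from pathlib import Path


def set_aside(data_dir, name="dependencies.dat", suffix="_old"):
    """Rename data_dir/name to data_dir/name+suffix.

    Returns the new path, or None if the file is not there (in which case
    there is nothing to set aside). Refuses to overwrite an earlier copy.
    Scheduler is assumed to be shut down before this is called.
    """
    source = Path(data_dir) / name
    if not source.exists():
        return None
    target = source.with_name(name + suffix)
    if target.exists():
        raise FileExistsError(f"{target} already exists; move it away first")
    source.rename(target)
    return target
```

After the rename, the next startup would find no dependencies file and, per the mail above, rebuild it from VSS.DAT.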
1219.2 | make sure network dependencies are ok ! | RUMOR::FALEK | ex-TU58 King | Fri Feb 21 1997 14:48 | 10 |
| Be careful - when you delete the dependencies.dat file and then have the
scheduler create a new one, some of the network dependency information
is lost. If you have network job dependencies, you will have to
carefully hand-create them again. However, all the local job
dependency info is preserved.
By the way, in the $ sched show job/full display, the user interface
shows the dependencies that are already satisfied in [ ], so if
there are multiple jobs that a job depends on, you can see which
specific ones it is still waiting for.
|
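The bracket convention described in .2 lends itself to a small helper that separates satisfied dependencies from those a job is still waiting for. The sketch below is in Python and rests on an assumption the thread does not confirm: that each satisfied name is individually wrapped in [ ] inside a comma-separated "Job Dependencies:" list, e.g. ([APR_ACMS_STOP],PP61_DSPSPROD). That sample line and the function name split_dependencies are hypothetical.

```python
import re


def split_dependencies(line):
    """Return (satisfied, pending) job names from a 'Job Dependencies:' line.

    Assumes satisfied names appear as [NAME] and pending names as NAME,
    separated by commas inside one pair of parentheses.
    """
    match = re.search(r"Job Dependencies:\s*\((.*)\)\s*$", line)
    if not match:
        return [], []
    satisfied, pending = [], []
    for item in match.group(1).split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("[") and item.endswith("]"):
            satisfied.append(item[1:-1].strip())
        else:
            pending.append(item)
    return satisfied, pending
```

For the line "Job Dependencies: (APR_ACMS_STOP)" from the base note, this returns no satisfied names and APR_ACMS_STOP as pending.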