Some of the most frequently occurring problems are described in several
TIMA STARS articles. Here is one of them:
[DECsched] Response Time Is Slow To DECscheduler Commands
COPYRIGHT (c) 1988, 1993 by Digital Equipment Corporation.
ALL RIGHTS RESERVED. No distribution except as provided under contract.
PRODUCT: DECscheduler for OpenVMS Versions 1.1A, 2.0, 2.0A
OP/SYS: OpenVMS VAX Versions 5.0, 5.0-1, 5.0-2, 5.1, 5.1-B,
5.1-1, 5.1-2, 5.2, 5.2-1, 5.3,
5.3-1, 5.3-2, 5.4, 5.4-1, 5.4-2,
5.4-3, 5.5, 5.5-1, 5.5-2
SOURCE: Digital Customer Support Center
SYMPTOM:
DECscheduler response time is slow when issuing normal DECscheduler
commands.
ANALYSIS/SOLUTION:
The following are possible causes and suggested solutions.
o  System resources are fully used. In this case other system
   processes will also be slow. Troubleshoot this as a system resource
   problem, not a DECscheduler problem. Examine the system
   parameters to identify where the bottleneck might be.
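   For example, standard OpenVMS monitoring commands can start the
   hunt (a generic sketch, not part of the original article):

      $ MONITOR SYSTEM            ! overall CPU, memory and I/O activity
      $ SHOW MEMORY/FULL          ! pool, page file and physical memory use
      $ MONITOR PROCESSES/TOPCPU  ! which processes are consuming the CPU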
o  Debugger logfile (NSCHED$:NODENAME.LOG) is larger than 500 blocks.
   To verify whether debugging is turned on for your node (it is on by
   default), issue a SCHED SHOW STATUS command and look for a "D" next
   to the rating.
$ SCHED SHOW STATUS
Node Version Started Jobs Jmax Log Pri Rating
Node V2.0A 1-MAR-1993 16:22:37 0 6 5 4 3917 D
   DECscheduler debugger logging can be turned off completely with the
   command "$ SCHED SET DEBUG OFF", which may fix the problem.
o  The history log file is greater than 2500 blocks. The history
   log file is pointed to by the logical name NSCHED$LOGFILE; if that
   logical is undefined, the default is NSCHED$:VERMONT_CREAMERY.LOG.
   Issue the command "$ SCHED CLOSE LOG/SUMMARIZE" to close the current
   file and create a new history log file. If you are operating in a
   clustered environment and are using a separate log file for each
   node, issue the command "$ SCHED CLOSE LOG_FILE/SERVER=node" to open
   a new log file on each node.
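   To find and size the history log before closing it (a sketch using
   standard DCL only):

      $ SHOW LOGICAL NSCHED$LOGFILE                   ! where the log points, if defined
      $ DIRECTORY/SIZE NSCHED$:VERMONT_CREAMERY.LOG   ! the default location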
o DECscheduler logging may need to be reduced. If logging is turned
on, typically all job events and all abnormal events are written
to the log file. On a system that has hundreds of DECscheduler
jobs running throughout the day, DECscheduler can be slowed down
by the volume of events it logs. A system's DECscheduler logging
settings can be checked by issuing the command:
$ SCHED SHOW LOGGING/ALL
If JOB_EVENT and ABNORMAL_EVENT are set, you may want to turn off
some of the event logging. This decision may need to be reached by
consulting other DECscheduler users on the system.
   To test if this is the problem, turn off all DECscheduler logging
   with the command "$ SCHED SET LOGGING NONE", then retry the slow
   commands to see whether DECscheduler performance has improved.
If it is determined that, for example, job event logging should be
turned off to save I/O to the file, issue the following command:
$ SCHED SET LOGGING ABNORMAL, NOJOB_EVENT/ALL
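   Put together as a test sequence (the commands are those given above;
   the middle step is whatever slow command you were retrying):

      $ SCHED SET LOGGING NONE                        ! disable all event logging
      $ ! ... retry the slow DECscheduler commands ...
      $ SCHED SET LOGGING ABNORMAL, NOJOB_EVENT/ALL   ! keep abnormal events only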
o  The DECscheduler database (VSS.DAT) may be fragmented if the
   number of deletes is greater than 200.
   Issue the command "$ SCHED SHOW DELETES" to get the number of
   deletes in the database. If this number is greater than 200, stop
DECscheduler and rebuild the database with the following commands:
$ SCHED STOP/ALL
$ SET DEF NSCHED$
$ RUN DB_UTILITY
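   A before-and-after size check will confirm whether the rebuild
   compacted the database (a sketch; DIRECTORY/SIZE is standard DCL):

      $ DIRECTORY/SIZE NSCHED$:VSS.DAT   ! note the size, then rebuild as above
      $ DIRECTORY/SIZE NSCHED$:VSS.DAT   ! should be smaller after DB_UTILITY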
o  The DECscheduler default node may be set to a slower system.
   Issue the command "$ SCHED SHOW STATUS", which produces output like
   the following:
Node Version Started Jobs Jmax Log Pri Rating
NODE1 V2.0A <date_stamp> 0 6 5 4 20 D <--
NODE2 V2.0A <date_stamp> 0 6 5 4 3917
NODE3 V2.0A <date_stamp> 0 6 5 4 79
NODE4 V2.0A <date_stamp> 0 6 5 4 305
   Note that in this cluster the rating for the default (D) node, NODE1,
   is quite a bit lower than NODE2's. Issue the command
   "$ SCHED SET DEFAULT NODE2". This changes the default system to
   NODE2, the faster system in the cluster, as demonstrated:
Node Version Started Jobs Jmax Log Pri Rating
NODE2 V2.0A <date_stamp> 0 6 5 4 3917 D <--
NODE1 V2.0A <date_stamp> 0 6 5 4 20
NODE3 V2.0A <date_stamp> 0 6 5 4 79
NODE4 V2.0A <date_stamp> 0 6 5 4 305
o  The NSCHED priority may have been lowered. In the SCHED SHOW STATUS
   example above, the "Pri" field shows the default priority at which
   all jobs will run. If this number is less than four, consider
   increasing the priority.
o  A new node is currently bringing up DECscheduler. This may cause
   temporary slowness in writing information to the database.
o  Review cluster node system times. If you're operating in a clustered
   environment, DECscheduler's performance can be greatly affected if
   the system clocks on the various nodes don't agree.
To see the system time on all nodes issue the following SYSMAN
commands:
$ MCR SYSMAN
SYSMAN> SET ENVIRONMENT/CLUSTER
SYSMAN> SET TIMEOUT 00:00:10
SYSMAN> DO SHOW TIME
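   If the clocks do disagree, SYSMAN can also set them consistently
   from the same session (a sketch only - verify the exact syntax
   against your OpenVMS version):

      SYSMAN> CONFIGURATION SET TIME 1-MAR-1993:16:22:37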
REFERENCES:
"DECscheduler for OpenVMS User Information ", August 1992,
(AA-PBJ3C-TE), Page(s)3-12,3-19.
|
| Highly unusual... I wonder what it could be doing...
To gather information -
Check what files the NSCHED process has open ($ sho device/files) and
see if there's anything other than the normal scheduler files in
NSCHED$
Look at the scheduler's 2 mailboxes and see if there's a high I/O rate
($ show dev mba)
Look at the NSCHED process with SDA ($ analyze/system) - anything
strange? Is it doing a lot of locking, and if so, on what resource?
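A few SDA commands that cover those checks (a sketch; these are
standard SDA commands):

$ analyze/system
SDA> set process NSCHED       ! select the scheduler process
SDA> show process             ! state, event flags, I/O counts
SDA> show process/locks       ! locks held - look for one hot resource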
Easiest:
If this happens right away whenever the scheduler is started on that
node, kill that NSCHED process; first try $ sched stop/node= but if
that doesn't work, do a $ stop/id on it, then on a terminal do
$ run nsched$:nsched and watch the output for a clue.
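Pulled together as one sequence (same commands as above; take the PID
from $ show system):

$ sched stop/node=       ! preferred - fill in the node name
$ stop/id=               ! fallback if the scheduler won't stop
$ run nsched$:nsched     ! restart on your terminal and watch the output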
|
| Hi,
Referring to the previous note:
> Check what files that NSCHED process has open ($ sho device/files) and
> see if there's anything other than the normal scheduler files in
> NSCHED$
Only the normal scheduler files are open.
> Look at the scheduler's 2 mailboxes and see if there's a high I/O rate
> ($ show dev mba)
Ed from CSC walked me through SDA to look at the mailbox statistics. Below is
the output. MBA1518 is the NSCHED$MAILBOX. The NSCHED$TERM_MAILBOX (MBA1523)
has a high "Operation count" number. Ed assumed this was the thing to look
at. Can you give me/us more information about this?
> Look at the NSCHED process with SDA ($ analyze/system) - anything
> strange? is it doing a lot of locking, if so, on what resource?
Doesn't appear to be doing anything abnormal.
> $ run nsched$:nsched and watch the output for a clue
I didn't see anything unusual here either.
Lew - Digital Semiconductor - Hudson, MA
------------------------------------------------------------------------------
Other info:
Logging is turned off completely.
--------------------------------
FMCSB> sched show status
Node Version Started Jobs Jmax Log Pri Rating
FMCSA V2.1B-9 15-FEB-1997 03:07:28 0 6 0 4 3984 <-- Default
FMCSB V2.1B-9 27-FEB-1997 12:51:29 0 6 0 4 4070
---------------------------------------------------------------------------
Here's the "$ SHOW SYS" output of the "sched" processes:
044084CA NSCHED LEF 8 3786523 0 01:09:04.90 5627 5384
044084CB SCHED_REMOTE LEF 6 69 0 00:00:00.27 647 1138
(Running about 14 days).
------------------------------------------------------------------------------
I/O data structures
-------------------
MBA1518 MBX UCB address: 85217700
Device status: 00000010 online
Characteristics: 0C150001 rec,shr,avl,mbx,idv,odv
00000200 nnm
Owner UIC [000001,000004] Operation count 100 ORB address 84E19580
PID 00000000 Error count 0 DDB address 83536AF8
Class/Type A0/01 Reference count 1 DDT address 83430F08
Def. buf. size 25 BOFF 0000 CRB address 8353728C
DEVDEPEND 00000000 Byte count 0000 LNM address 84B10E30
DEVDEPND2 00000000 SVAPTE 00000000 I/O wait queue empty
FLCK index 28 DEVSTS 0002
DLCK address 83531800
Charge PID 004200CA
*** I/O request queue is empty ***
-------------------------------------------------------------------------------
I/O data structures
-------------------
MBA1523 MBX UCB address: 85210100
Device status: 00000010 online
Characteristics: 0C150001 rec,shr,avl,mbx,idv,odv
00000200 nnm
Owner UIC [000001,000004] Operation count 39972 ORB address 851731C0
PID 00000000 Error count 0 DDB address 83536AF8
Class/Type A0/01 Reference count 1 DDT address 83430F08
Def. buf. size 90 BOFF 0000 CRB address 8353728C
DEVDEPEND 00000000 Byte count 0000 LNM address 84B11030
DEVDEPND2 00000000 SVAPTE 00000000 I/O wait queue empty
FLCK index 28 DEVSTS 0002
DLCK address 83531800
Charge PID 004200CA
*** I/O request queue is empty ***
|
| My guess (based on conjecture only) is that when this happens,
job processing has gotten into some sort of loop - and since
runaway I/Os are happening on a scheduler that is not the "default"
scheduler, the circumstances are unusual. For jobs that have no node
restriction, the default scheduler does the work of figuring out when jobs
should run. It sets a bit in the job's database record and sends a
mailbox message to the selected scheduler telling it to run the job.
(That's why I asked about mailbox I/O - there are 2 mailboxes per
NSCHED; one is used mostly for process termination messages from
VMS, and the other for commands.) Also, each scheduler handles scheduling
for the jobs on its own node that are restricted to run there.
I suspect that when job processing hits a particular job (either on the
default scheduler or the individual scheduler(s) that have this
problem) something happens that causes a loop. If a job is constantly
aborted and re-tried, the culprit will have a very high failure count in
its job record (since that will increment on each attempt). There could
also be an error or loop in job dependency checking for that job,
though I haven't ever seen this. Another possibility is that there is
nothing wrong with the scheduling at all, but it's something external -
for example, a $ SCHED CHECK command makes the scheduler read the whole
job database and verify every record - if you put this in a loop you
will bring the scheduler to its knees. These things get logged to the
event log - have you run NSCHED_REPORTS (or whatever it's called now)
for a scheduler event report to see if there's anything suspicious?
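For example (a hedged guess at the invocation - as the note says, the
report utility's name has changed over versions, and its location in
NSCHED$ is my assumption):

$ run nsched$:nsched_reports   ! name and directory are assumptions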
If you were running NSCHED on a terminal as per .-1, and didn't see
anything suspicious, was that WHILE THE I/O PROBLEM WAS OCCURRING?
When you start the NSCHED process, it should read through the whole job
database, process each record, and then say "Sleeping..."
It wakes up when it gets messages in the job termination mailbox or the
command mailbox. If it is sleeping, it should be in LEF state - NO
I/O. If the mailbox is continually flooded with messages or it never
gets through job processing to the "Sleeping..." message there is a
problem.
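One way to check that from DCL (a sketch using only standard commands):
snapshot $ show system twice, a minute apart, and compare NSCHED's state
and I/O count - a healthy idle scheduler stays in LEF and the count
barely moves:

$ show system/output=sys1.tmp
$ wait 00:01:00
$ show system/output=sys2.tmp
$ search sys1.tmp,sys2.tmp "NSCHED"   ! compare state and I/O counts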
I suspect when you were looking at it, it wasn't in trouble.
Alternatively, you can
1. Stop all the schedulers for a moment, delete all the
nodename.log files in nsched$ then restart the schedulers (or you can
skip this step; it's not necessary if the nodename.log files aren't very
big).
2. Do $ sched set debug on /all
This will make the NSCHEDs log everything to the nodename.log files.
WARNING - they will become very big quickly and that will hurt
performance - don't operate in this mode regularly - just watch
carefully until the problem recurs.
3. Look at the last entries in the nodename.log files for the default
scheduler and the one that has the I/O problem. That will give enough
info to narrow down the problem cause.
4. Sched set debug off/all (and delete the nodename.log files if they
are big - it's bad to have the scheduler have to append to the end of
them when they are large).
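The whole cycle as one DCL sketch of the steps above (nodename.log is
per-node; your site's scheduler restart procedure is not shown here):

$ sched stop/all                  ! step 1 - only needed if the logs are big
$ delete nsched$:nodename.log;*   ! one per node, then restart the schedulers
$ sched set debug on /all         ! step 2 - full tracing; use only briefly
$ ! ... wait for the problem to recur, then read the end of the logs ...
$ sched set debug off/all         ! step 4 - and trim the logs again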
|