
Conference humane::scheduler

Title:	SCHEDULER
Notice:	Welcome to the Scheduler Conference on node HUMANE
Moderator:	RUMOR::FALEK

Created:	Sat Mar 20 1993
Last Modified:	Tue Jun 03 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1240
Total number of notes:	5017

1209.0. "V2.1B-5 takes an hour to HOLD a job" by GIDDAY::CAMERON (And there shall come FORTH (Isaiah 11:1)) Thu Feb 06 1997 18:43

    A customer of mine is experiencing difficulties with DECscheduler
    V2.1B-5.  We have not found any match on this problem in COMET nor in
    this notes conference during a brief search.  Does anyone recognise it?
    
    In short, massive slow-down of the product.
    
    Here is what the customer wrote:
    
    "We have a three host VAXcluster with two VAX 7800s and a VAX 6650.
    Each node in this cluster is running DECscheduler.  Every so often the
    schedulers on all machines start taking a long time to process
    commands. For example, it may take over an hour to process a SCHED HOLD
    command. The requested state field on the job changes within about 10
    seconds... it just takes about an hour to action the state change.
    
    There is another problem in that it loses track of which jobs have been
    processed, in that when you issue the command
    
    	SCHED SHOW JOB /STAT=RUN
    
    it displays information about jobs that have completed a long time ago.
    And because the other jobs appear to be running, no new jobs can start.

    We have previously found that starting a new scheduler log file (called
    nsched$:vermont_creamery.log) when the old log file reached 25000
    blocks improved the response times. This no longer has any effect.
    
    Similarly making the scheduler data files contiguous used to work but
    not any longer.
    
    To fix the last scheduler slowness problem we ended up having to reboot
    the cluster."
    
    Ref: CSC STL K20013

T.R	Title	User	Personal Name	Date	Lines
1209.1	see elsewhere	HLFS00::ERIC_S	Eric Sonneveld MCS - B.O. IS Holland	Fri Feb 07 1997 05:11	8
	This looks like either a busy system or a badly tuned scheduler
	environment. Look at # 1206.1

	Perform a $ scheduler check/all
	Perform a $ sched sh delete

	Eric
1209.2		GIDDAY::CAMERON	And there shall come FORTH (Isaiah 11:1)	Mon Feb 10 1997 18:42	105
    Re: Note 1209.1 by HLFS00::ERIC_S

    > This looks like either a busy system or a badly tuned scheduler
    > environment. Look at # 1206.1

    Thanks much for that, Eric.  I relayed the article to the customer and
    asked them to confirm each point in return.  They did so, and their
    reply follows.

    In short, the command execution time is down to about 15 seconds, and
    they don't say that they did anything to cause that.  They had a couple
    of freezes recently.

    I'm guessing there might be another reason for the freeze, though the
    size of the creamery log file appears to be within the range of trouble.

    James Cameron
    Sydney CSC.

    From:   SMTP%"LinscottS@..." 11-FEB-1997 08:49:09.78
    To:     <[email protected]>
    Subj:   RE: K20013 - Response Time Is Slow To DECscheduler Commands

    I have added comments (preceded with *) in the following document.

    Sam

    [...]

    * We have had two problems in the last week with the scheduler freezing.
      At the moment it seems to be ok. It is taking about 15 seconds to
      action state change requests (such as SCHED HOLD commands).

    >[DECsched] Response Time Is Slow To DECscheduler Commands
    >SOURCE: Digital Customer Support Center

    [snip...]

    >o System resources are fully used. In this case other system
    >  processes will also be slow. Troubleshoot as a system resource
    >  problem and not a DECscheduler problem. Check out system
    >  parameters to identify where the bottleneck might be.

    * There are no significant resource problems. The scheduler database is
      on a shadowed disk (spread over 2 HSJs). Users notice no problems with
      response times. On one occasion the NSCHED$ disk filled up, and we
      experienced major scheduler problems. However even with a million
      blocks free space on this disk we still have problems.

    >o Debugger logfile (NSCHED$:NODENAME.LOG) is larger than 500 blocks.

    * The Debugger log files average about 20 blocks for each node in the
      cluster.

    >o The history log file is greater than 2500 blocks. The history
    >  logfile is pointed to by the logical NSCHED$LOGFILE or if
    >  undefined is NSCHED$:VERMONT_CREAMERY.LOG by default.

    * We have had problems with vermont_creamery.log before, and have
      previously been creating a new version of the log file when it
      reaches 25000 blocks. Currently the log file is 4200 blocks. We have
      noticed lately that starting a new version of the log file does not
      improve response times.

    >o DECscheduler logging may need to be reduced.

    * We log 5 events (1 job and 4 abnormal) on each node in the cluster.
      We guess that this information is used by the SCHED SHO HIST command.

    >o The DECscheduler database (VSS.DAT) may be fragmented from the
    >  number of deletes being greater than 200.

    * The number of deletes is 46

    >o The DECscheduler default node transition may be set to a slower
    >  system. Issue the command "$ SCHED SHOW STATUS" for the following
    >  output:

    * The default node transition was a vax 6600. I have just moved it to a
      vax 7800.

    >o The NSCHED priority may have been lowered. In the example
    >  above (#5), the "Pri" field is the default priority that all jobs
    >  will run at. Is this number less than four? If so, consider
    >  increasing this priority.

    * The default priority has always been 4 for all nodes.

    >o A new node is currently bringing up DECscheduler. This may cause a
    >  temporary slowness in writing information to the database.

    * All nodes have been up for several days.

    >o Review cluster node system times. If you're operating in a clustered
    >  environment, DECscheduler's performance can greatly be affected if
    >  the system clocks on the various nodes don't agree.

    *** All nodes have the same time. NTP is used to keep the times
        synchronised.

    [end]
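
    As a rough cross-check of the numeric points in the article above, the
    customer's reported figures can be compared against the quoted limits
    (debugger log over 500 blocks, history log over 2500 blocks, more than
    200 deletes, priority below 4). The sketch below is only an
    illustration in Python: the field names and the check() helper are
    made up for this example and are not part of DECscheduler, and the
    values are typed in by hand from the customer's reply.

```python
# Limits as quoted in the CSC article; the dictionary keys are hypothetical
# labels chosen for this sketch, not DECscheduler field names.
THRESHOLDS = {
    "debugger_log_blocks": ("max", 500),   # NSCHED$:NODENAME.LOG
    "history_log_blocks": ("max", 2500),   # NSCHED$LOGFILE / VERMONT_CREAMERY.LOG
    "database_deletes": ("max", 200),      # deletes in VSS.DAT
    "default_priority": ("min", 4),        # "Pri" field from SCHED SHOW STATUS
}


def check(values):
    """Return (name, value, limit) for each value outside its quoted limit."""
    findings = []
    for name, (kind, limit) in THRESHOLDS.items():
        if name not in values:
            continue  # nothing reported for this item
        value = values[name]
        too_high = kind == "max" and value > limit
        too_low = kind == "min" and value < limit
        if too_high or too_low:
            findings.append((name, value, limit))
    return findings


# Figures taken from the customer's replies in this note.
reported = {
    "debugger_log_blocks": 20,
    "history_log_blocks": 4200,
    "database_deletes": 46,
    "default_priority": 4,
}

for name, value, limit in check(reported):
    print(f"{name}: reported {value}, article limit {limit}")
```

    On these figures only the history log (4200 blocks against the 2500
    quoted) falls outside the article's limits, which matches the remark
    above that the creamery log is within the range of trouble.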