
Conference humane::scheduler

Title:SCHEDULER
Notice:Welcome to the Scheduler Conference on node HUMANE
Moderator:RUMOR::FALEK
Created:Sat Mar 20 1993
Last Modified:Tue Jun 03 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1240
Total number of notes:5017

1209.0. "V2.1B-5 takes an hour to HOLD a job" by GIDDAY::CAMERON (And there shall come FORTH (Isaiah 11:1)) Thu Feb 06 1997 18:43

    A customer of mine is experiencing difficulties with DECscheduler
    V2.1B-5.  We have not found any match on this problem in COMET nor in
    this notes conference during a brief search.  Does anyone recognise it?
    
    In short, massive slow-down of the product.
    
    Here is what the customer wrote:
    
    "We have a three host VAXcluster with two VAX 7800s and a VAX 6650.
    Each node in this cluster is running DECscheduler.  Every so often the
    schedulers on all machines start taking a long time to process
    commands. For example, it may take over an hour to process a SCHED HOLD
    command. The requested state field on the job changes within about 10
    seconds... it just takes about an hour to action the state change.
    
    There is another problem in that it loses track of which jobs have been
    processed, in that when you issue the command
    
    	SCHED SHOW JOB /STAT=RUN
    
    it displays information about jobs that have completed a long time ago.
    And because the other jobs appear to be running, no new jobs can start.

    We have previously found that starting a new scheduler log file (called
    nsched$:vermont_creamery.log) when the old log file reached 25000
    blocks improved the response times. This no longer has any effect.
    
    Similarly making the scheduler data files contiguous used to work but
    not any longer.
    
    To fix the last scheduler slowness problem we ended up having to reboot
    the cluster."
    
    Ref: CSC STL K20013
T.R    Title    User    Personal Name    Date    Lines
1209.1    see elsewhere    HLFS00::ERIC_S    Eric Sonneveld MCS - B.O. IS Holland    Fri Feb 07 1997 05:11    8
    This looks like either a busy system or a badly tuned scheduler
    environment.
    Look at # 1206.1
    
    Perform a $ scheduler check/all
    Perform a $ sched sh delete
    
    Eric
1209.2    GIDDAY::CAMERON    And there shall come FORTH (Isaiah 11:1)    Mon Feb 10 1997 18:42    105
    Re: Note 1209.1 by HLFS00::ERIC_S
    
>   This looks like either a busy system or a badly tuned scheduler
>   environment.  Look at # 1206.1
    
    Thanks much for that, Eric.  I relayed the article to the customer and
    asked them to confirm each point in return.  They did so, and their
    reply follows.
    
    In short, the command execution time is down to about 15 seconds, and
    they don't say that they did anything to cause that.  They had a couple
    of freezes recently.
    
    I'm guessing there might be another reason for the freeze, though the
    size of the creamery log file appears to be within the range of
    trouble.
    
    James Cameron
    Sydney CSC.
    
    
    From:	SMTP%"LinscottS@..." 11-FEB-1997 08:49:09.78
    To:		<[email protected]>
    Subj:	RE: K20013 - Response Time Is Slow To DECscheduler Commands

    I have added comments (preceded with ***) in the following document.

    Sam

    [...]
    
    *** We have had two problems in the last week with the scheduler
    freezing. At the moment it seems to be ok. It is taking about 15
    seconds to action state change requests (such as SCHED HOLD commands).

>[DECsched] Response Time Is Slow To DECscheduler Commands
>SOURCE:     Digital Customer Support Center

    [snip...]

>o  System resources are fully used.  In this case other system
>   processes will also be slow.  Troubleshoot as a system resource
>   problem and not a DECscheduler problem.  Check out system
>   parameters to identify where the bottleneck might be.

    *** There are no significant resource problems. The scheduler database
    is on a shadowed disk (spread over 2 HSJs). Users notice no problems
    with response times.

    On one occasion the NSCHED$ disk filled up, and we experienced major
    scheduler problems. However even with a million blocks free space on
    this disk we still have problems.

>o  Debugger logfile (NSCHED$:NODENAME.LOG) is larger than 500 blocks.

    *** The Debugger log files average about 20 blocks for each node in the
    cluster.

>o  The history log file is greater than 2500 blocks.  The history
>   logfile is pointed to by the logical NSCHED$LOGFILE or if
>   undefined is NSCHED$:VERMONT_CREAMERY.LOG by default.

    *** We have had problems with vermont_creamery.log before, and have
    previously been creating a new version of the log file when it reaches
    25000 blocks. Currently the log file is 4200 blocks. We have noticed
    lately that starting a new version of the log file does not improve
    response times.

>o  DECscheduler logging may need to be reduced.

    *** We log 5 events (1 job and 4 abnormal) on each node in the cluster.
    We guess that this information is used by the SCHED SHO HIST command.

>o  The DECscheduler database (VSS.DAT) may be fragmented from the
>   number of deletes being greater than 200.

    *** The number of deletes is 46

>o  The DECscheduler default node transition may be set to a slower
>   system.  Issue the command "$ SCHED SHOW STATUS" for the following
>   output:

    *** The default node transition was a vax 6600. I have just moved it to
    a vax 7800.

>o  The NSCHED priority may have been lowered.  In the example
>   above(#5), the "Pri" field is the default priority that all jobs
>   will run at.  Is this number less than four?  If so, consider
>   increasing this priority.

    *** The default priority has always been 4 for all nodes.

>o  A new node is currently bringing up DECscheduler. This may cause a
>   temporary slowness in writing information to the database.

    *** All nodes have been up for several days.

>o  Review cluster node system times. If you're operating in a clustered
>   environment, DECscheduler's performance can be greatly affected if
>   the system clocks on the various nodes don't agree.

    *** All nodes have the same time. NTP is used to keep the times
    synchronised.

    [end]
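
    The numeric checks in the article above can be lined up against the
    figures the customer reported. What follows is only a rough sketch in
    Python, not anything supplied with DECscheduler: the SchedulerReadings
    container and the flag_checklist function are hypothetical names, and
    the readings are assumed to have been collected by hand. The thresholds
    (500 blocks, 2500 blocks, 200 deletes, priority 4) and the customer's
    values are taken from this note.

```python
from dataclasses import dataclass

# Thresholds quoted in the support article in this note.
DEBUG_LOG_MAX_BLOCKS = 500     # NSCHED$:NODENAME.LOG
HISTORY_LOG_MAX_BLOCKS = 2500  # NSCHED$LOGFILE / VERMONT_CREAMERY.LOG
MAX_DELETES = 200              # deletes in VSS.DAT
MIN_PRIORITY = 4               # "Pri" field from SCHED SHOW STATUS


@dataclass
class SchedulerReadings:
    """Hand-collected values (hypothetical container, not a DECscheduler API)."""
    debug_log_blocks: int
    history_log_blocks: int
    deletes: int
    default_priority: int


def flag_checklist(r: SchedulerReadings) -> list[str]:
    """Return the checklist items whose threshold is crossed."""
    flags = []
    if r.debug_log_blocks > DEBUG_LOG_MAX_BLOCKS:
        flags.append("debugger log larger than 500 blocks")
    if r.history_log_blocks > HISTORY_LOG_MAX_BLOCKS:
        flags.append("history log greater than 2500 blocks")
    if r.deletes > MAX_DELETES:
        flags.append("more than 200 deletes in VSS.DAT")
    if r.default_priority < MIN_PRIORITY:
        flags.append("default priority less than four")
    return flags


# Figures reported by the customer in this note.
customer = SchedulerReadings(
    debug_log_blocks=20,
    history_log_blocks=4200,
    deletes=46,
    default_priority=4,
)
print(flag_checklist(customer))
```

    On these figures only the history log item is flagged, which agrees
    with the remark earlier in this reply that the creamery log file is
    within the range of trouble. The remaining items in the article
    (system resources, logging level, default node, node start-up, clock
    agreement) carry no numeric threshold and are left out of the sketch.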