
Conference humane::scheduler

Title:	SCHEDULER
Notice:	Welcome to the Scheduler Conference on node HUMANE
Moderator:	RUMOR::FALEK

Created:	Sat Mar 20 1993
Last Modified:	Tue Jun 03 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1240
Total number of notes:	5017

1179.0. "Scheduler hung in mixed architecture cluster" by GIDDAY::PARSONS (Punicenter Support) Tue Nov 05 1996 00:56

    
    We've got a large Scheduler customer here with a weird problem.
    
    They have a mixed architecture cluster of 3 VAXes and 3 Alphas
    (which are all 8400s) running various versions of VMS 6.2.
    
    This customer currently has 2.1B-1 running on all 6 nodes and they
    have load balancing enabled.  They have 2000 (yes two thousand!)
    scheduler jobs, all of which are batch jobs.  Since the weekend when
    they rebooted several of the systems for regular maintenance they
    have been unable to get Scheduler working.
    
    All the NSCHED processes run but they seem to get "stuck" and stop
    processing anything.  Even with debug turned on nothing is reported
    to the log file initially.  Killing the default NSCHED process gets the
    new default node to execute some more jobs and then it gets stuck and
    the cycle is repeated.  It looks as if Scheduler is having problems
    communicating with the nodes in its own cluster.
    
    To make matters worse, VSS.DAT ended up corrupted and was recreated
    from a backup (with a new dependency.dat).  The additional problem is
    that Scheduler seems unable to cope with loading 2000 new jobs; it
    consistently gets stuck loading the 1000th job, although no process
    quotas are being exhausted.
    
    Anyone got any suggestions?
    
    Tonight we're:
    
    1. Loading 2.1B-9
    2. Performing a cluster reboot
    3. Rebuilding the database but only loading 20 jobs - if that works
       this will be increased to 500 jobs.
    
    Any other suggestions would be appreciated!!
    
    Thanks,
    
    Tony Parsons,
    Sydney CSC

T.R	Title	User	Personal Name	Date	Lines
1179.1	exactly 1000 jobs ?	RUMOR::FALEK	ex-TU58 King	Tue Nov 05 1996 13:40	30

    Did this use to work with this many batch-mode jobs submitted in queues
    simultaneously, and has it only broken now that you've rebooted some
    nodes?
    
    If that's the case, I'd suspect the ENQLM quota for the NSCHED process.
    Make it real big (doesn't cost extra if it's not used, since it's just
    a limit). Also BYTLM, though that's less likely, and ASTLM.
    
    Is it consistently hanging at exactly 1000 pending jobs submitted to
    the batch queues?   That is VERY suspicious.  I wonder if a resource
    name used for locks associated with batch mode jobs has a length limit
    (3 ascii characters as part of name?) on the bad assumption that there
    would never be more than 999 jobs simultaneously in queues?
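
    To make that guess concrete, here is a minimal Python sketch of how a
    fixed 3-character job field would run out at exactly job 1000.  It is
    purely illustrative: format_job_field and first_unrepresentable are
    hypothetical names, and nothing here is taken from the Scheduler
    source, which I have not seen.

```python
def format_job_field(job_number: int, width: int = 3) -> str:
    """Render a job number into a fixed-width decimal field.

    Hypothetical helper: models a name component that was sized on the
    assumption that job numbers never need more than `width` digits.
    """
    digits = str(job_number)
    if len(digits) > width:
        # A real fixed-size buffer might truncate or misbehave here;
        # the sketch raises so the boundary is easy to see.
        raise OverflowError(f"job {job_number} does not fit in {width} characters")
    return digits.zfill(width)


def first_unrepresentable(width: int = 3) -> int:
    """Return the first job number that no longer fits in the field."""
    job_number = 1
    while True:
        try:
            format_job_field(job_number, width)
        except OverflowError:
            return job_number
        job_number += 1


if __name__ == "__main__":
    print(format_job_field(999))
    print(first_unrepresentable(3))
```

    If the hang really does track a limit like this, it should show up at
    the same job count on every attempt, regardless of quota settings.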
    
    What state is the NSCHED process in on the "Default" node?   Does it
    have any ASTs pending?
    
    Are all the NSCHEDs hung?  For example, if you have a detached mode job
    that is restricted to a node other than the default node, can you manually
    do $ SCHED RUN job     and does it start?
      
    I would guess that the problem is not due to the mixed
    architecture-ness of the cluster.   If it seems to always get in
    trouble submitting job number 1000 to the queues, it could be
    a problem (bug) with lock resource names, but if it used to work with
    2000 jobs and just now broke on the cluster re-boot, it is probably
    a quota-related problem and not a resource name length bug.
    
    I don't have access to the source code, so I can't check.