Title: | SCHEDULER |
Notice: | Welcome to the Scheduler Conference on node HUMANE |
Moderator: | RUMOR::FALEK |
Created: | Sat Mar 20 1993 |
Last Modified: | Tue Jun 03 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 1240 |
Total number of notes: | 5017 |
We've got a large Scheduler customer here with a weird problem. They have a mixed-architecture cluster of 3 VAXes and 3 Alphas (all 8400s) running various versions of VMS 6.2. The customer currently has Scheduler 2.1B-1 running on all 6 nodes with load balancing enabled, and they have 2000 (yes, two thousand!) scheduler jobs, all of which are batch jobs.

Since the weekend, when they rebooted several of the systems for regular maintenance, they have been unable to get Scheduler working. All the NSCHED processes run, but they seem to get "stuck" and stop processing anything. Even with debug turned on, nothing is reported to the log file initially. Killing the default NSCHED process gets the new default node to execute some more jobs, then it gets stuck and the cycle repeats. It looks as if Scheduler is having problems communicating with the nodes in its own cluster.

To make matters worse, VSS.DAT ended up corrupted and was recreated from a backup (with a new dependency.dat). The additional problem is that Scheduler seems unable to cope with loading 2000 new jobs; it consistently gets stuck loading the 1000th job, although no process quotas are being exhausted.

Anyone got any suggestions? Tonight we're:

1. Loading 2.1B-9
2. Performing a cluster reboot
3. Rebuilding the database but only loading 20 jobs - if that works, this will be increased to 500 jobs.

Any other suggestions would be appreciated!! Thanks,

Tony Parsons, Sydney CSC
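Before tonight's reboot it might be worth capturing the state of the stuck NSCHED processes. A minimal DCL sketch (the PID is a placeholder to be taken from the SHOW SYSTEM display, and the privilege assumptions are mine, not something from this customer's configuration):

```
$! Sketch only - substitute the real PID; examining another user's process
$! needs WORLD privilege.
$ SHOW SYSTEM/CLUSTER              ! find the NSCHED process on each node and its state (LEF, HIB, RWAST, ...)
$ SHOW PROCESS/ID=xxxxxxxx/QUOTAS  ! remaining quotas - a counter sitting at 0 points at the exhausted limit
$ ANALYZE/SYSTEM                   ! drop into SDA on the running system
SDA> SET PROCESS/ID=xxxxxxxx
SDA> SHOW PROCESS/LOCKS            ! locks held or waited on, useful if the nodes aren't talking to each other
SDA> EXIT
```

If one of the remaining-quota counters is at zero while the process is stuck, that narrows things down quickly; the lock display is the thing to grab before the reboot wipes the evidence.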
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
1179.1 | exactly 1000 jobs ? | RUMOR::FALEK | ex-TU58 King | Tue Nov 05 1996 13:40 | 30 |
Did this used to work with this many batch-mode jobs submitted to queues simultaneously, and has it only broken now that you've rebooted some nodes? If that's the case, I'd suspect the ENQLM (enqueue) quota for the NSCHED process. Make it real big (it doesn't cost extra if it's not used, since it's just a limit). Also BYTLM, though that's less likely, and ASTLM.

Is it consistently hanging at exactly 1000 pending jobs submitted to the batch queues? That is VERY suspicious. I wonder if a resource name used for locks associated with batch-mode jobs has a length limit (3 ASCII characters as part of the name?) on the bad assumption that there would never be more than 999 jobs simultaneously in the queues.

What state is the NSCHED process in on the "Default" node? Does it have any ASTs pending? Are all the NSCHEDs hung? (For example, if you have a detached-mode job that is restricted to a node other than the default node, can you manually do $ SCHED RUN job, and does it start?)

I would guess that the problem is not due to the mixed-architecture-ness of the cluster. If it seems to always get in trouble submitting job number 1000 to the queues, it could be a problem (bug) with lock resource names, but if it used to work with 2000 jobs and just now broke on the cluster reboot, it is probably a quota-related problem and not a resource-name-length bug. I don't have access to the source code, so I can't check.
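If it does turn out to be a quota problem, the limits come from the UAF record of the account the NSCHED processes run under (and possibly from whatever the Scheduler startup procedure sets, which I can't check from here). A hedged sketch, assuming the account is literally called NSCHED and using illustrative values - substitute whatever the site really uses:

```
$! Assumption: the scheduler processes run under a UAF account named NSCHED;
$! the values below are examples, not recommendations.
$ SET DEFAULT SYS$SYSTEM
$ RUN AUTHORIZE
UAF> SHOW NSCHED                                   ! current ENQLM / ASTLM / BYTLM
UAF> MODIFY NSCHED/ENQLM=4000/ASTLM=1000/BYTLM=200000
UAF> EXIT
$! New limits only apply to processes created after the change, e.g. once the
$! NSCHEDs are restarted or the cluster is rebooted.
```

As noted above, raising ENQLM well past the job count is cheap - it's only a limit, not a reservation.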