
Conference humane::scheduler

Title:	SCHEDULER
Notice:	Welcome to the Scheduler Conference on node HUMANE
Moderator:	RUMOR::FALEK

Created:	Sat Mar 20 1993
Last Modified:	Tue Jun 03 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1240
Total number of notes:	5017

1234.0. "Default node question" by TAV02::GALIA (Galia Reznik, Israel Software Support) Wed Apr 02 1997 09:12

    Hi,
    
    Our customer has the following configuration:
    Cluster 2 * VAX V6.1 SCHED V2.1A  +  Alpha V6.1 SCHED V2.1B-1.
    The database for the mixed cluster is configured OK.
    
    When a job is RUN to an Alpha queue, an entry is created, and 
    $ SHOW PROC/CONT/ID=xxx shows that it executes SCHEDULER$DO_COMMAND.EXE
    and the process is in LEF state. It never ends. Scheduler, though, 
    thinks it ended.
    The above situation happens when the Alpha is NOT the default node.
    When they set the Alpha to be the default node, the job runs OK.
    But then the VAX's jobs don't run.
    
    1. Is the above a supported configuration?
    2. If so, what should they do?
    
    Thanks,
    Galia Reznik,
    MCS, Israel.

T.R	Title	User	Personal Name	Date	Lines
1234.1	Are all batch queues accessible ?	RUMOR::FALEK	ex-TU58 King	`Wed Apr 02 1997 15:53`	8
	Are you talking here about "batch mode" jobs (i.e. using batch queues), or "detached" mode? If batch mode, make sure that the batch queues on both VAX and Alpha are properly visible to the scheduler. Are detached mode jobs (no batch queues) OK? As far as I know, this is supposed to work.
1234.2		TAV02::GALIA	Galia Reznik, Israel Software Support	`Mon Apr 07 1997 05:06`	45
	Hi,

	I am talking about batch queues here. They don't have detached mode jobs.
	The queues are visible (in SCHEDULER) from VAX to Alpha and from Alpha to VAX.
	As I mentioned in .0, they CAN send a job to a queue, and it starts executing, but it never ends. It ends when the Alpha is the default node.

	I want to attach part of the log file in DEBUG mode, which may help to trace the problem. This job, 3912, was sent from VAX to an Alpha queue.
	Please note that in each place where the Alpha's name should appear, they got an empty field.
	Of course, the Alpha's nodename in VMS is defined OK, in SCS and in NCP. Where should it be defined in SCHEDULER when the node is not the default one?
	And even though the SCHEDULER claims the entry ended, it never ends in the queue; it hangs executing an image (please see .0).

	Thanks,
	Galia.

	DEBUG log-file:
	----------------
	we woke up!
	got mbx msg 'BS3912 '
	Job start message for job# 3912
	queue_job lock returned 1 lock id=1800728C
	CLUSTER_BROADCAST: node= msg=B+ <----------
	told to update count <----------
	got term mbx msg 'BE-785'
	job end message for pid FFFFFCEF
	CLUSTER_BROADCAST: node= msg=B- <----------
	told to adjust count <----------
	job status of ended batch job is Q
	job end status= 196609
	DEQ_JOB_LOCK returned 1
	job # 3912 finished.... count= 1
	0 remote nodes care about job 3912
	10:18 AM processing record # 3912 status= S request=
	Now= 2-Apr-1997 10:18:49.88 job_sched_time= 2-Apr-1997 14:18:45.81
	job 3912 is scheduled for the future
	10:18 AM updated record # 3912 status= S request=
	Found 0 local jobs depending on :: 3912
	timer flag was clear
	timer not expired. No earlier event to set.
	sleeping
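A debug log like the one in .2 can be scanned mechanically for the two things that matter in this thread: which mailbox messages arrived, and whether each CLUSTER_BROADCAST carried a node name. The following is only a minimal Python sketch. The line shapes are taken from the excerpt above, the real NSCHED debug format may differ, and the names `summarize`, `BROADCAST` and `MAILBOX` are made up for illustration.

```python
import re

# Patterns are assumptions based solely on the log excerpt quoted in .2.
BROADCAST = re.compile(r"CLUSTER_BROADCAST:\s*node=\s*(\S*?)\s*msg=(\S+)")
MAILBOX = re.compile(r"got (?:term )?mbx msg '([^']*)'")


def summarize(log_text):
    """Return (broadcasts, mailbox) found in a debug log.

    broadcasts: list of (node, msg) pairs; node is "" when none was logged.
    mailbox:    list of mailbox message strings with padding stripped.
    """
    broadcasts = []
    mailbox = []
    for line in log_text.splitlines():
        match = BROADCAST.search(line)
        if match:
            broadcasts.append((match.group(1), match.group(2)))
        match = MAILBOX.search(line)
        if match:
            mailbox.append(match.group(1).strip())
    return broadcasts, mailbox


SAMPLE = """\
got mbx msg 'BS3912 '
CLUSTER_BROADCAST: node= msg=B+ <----------
got term mbx msg 'BE-785'
CLUSTER_BROADCAST: node= msg=B- <----------
"""

broadcasts, mailbox = summarize(SAMPLE)
empty_node_count = sum(1 for node, _ in broadcasts if node == "")
```

For the sample lines, both broadcasts come back with an empty node field, which is the symptom reported in .2.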
1234.3	a problem with logical SYS$NODE ?	RUMOR::FALEK	ex-TU58 King	`Mon Apr 07 1997 20:52`	16
	Aha! The missing nodename is certainly related to this problem! The message telling the "default" scheduler that the job ended is getting lost.

	When NSCHED.EXE starts, it finds out its nodename by translating the logical SYS$NODE.
	If you do a
	    $ sched show status
	on that cluster, do the nodenames appear in the display?

	Make sure that the logical SYS$NODE is properly defined on both machines. There is probably a logical name (or else NSCHED wouldn't start), but it may have the wrong stuff in it. If this is DECnet Phase V, make sure the logical has the Phase IV alias (6 characters maximum). Then stop and restart the schedulers on both nodes.

	Is the problem still occurring?
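The "6 characters maximum" check from .3 can be expressed as a small validation routine. This is a hedged Python sketch, not anything NSCHED does: it assumes the translated value of SYS$NODE may carry a leading underscore and trailing colons (for example "ALPHA1::"), and it also applies the usual Phase IV rule, as I understand it, that a name is alphanumeric with at least one letter. The function name `phase_iv_node_name` is hypothetical.

```python
import re

# 1 to 6 alphanumeric characters, at least one of them a letter.
# The 6-character limit is from .3; the rest is an assumption.
PHASE_IV_NAME = re.compile(r"^(?=.*[A-Za-z])[A-Za-z0-9]{1,6}$")


def phase_iv_node_name(sys_node_value):
    """Return the bare node name if it looks like a Phase IV name, else None."""
    # Assumed decoration: optional leading "_" and trailing ":" characters.
    name = sys_node_value.strip().lstrip("_").rstrip(":")
    if PHASE_IV_NAME.match(name):
        return name.upper()
    return None


examples = {
    value: phase_iv_node_name(value)
    for value in ["ALPHA1::", "_VAX01::", "", "LONGNODENAME::", "123456"]
}
```

An empty or over-long value yields `None`, which is the kind of "wrong stuff" in the logical that .3 suggests looking for.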
1234.4		TAV02::GODOVNIK	Haim Godovnik	`Thu Apr 10 1997 08:52`	18
	Hi, I am stepping in for Galia as she is on vacation. The scheduler starts after DECnet, and the logical name SYS$NODE is defined correctly on all nodes. They do not use DECnet Phase V. As asked, they have restarted the scheduler on all nodes, but nothing changed. They also defined the scheduler object in NCP. What else should we check? Did the scheduler database change between version 2.1A and 2.1B? Thanks for your help, Haim G.
1234.5	one additional question	RUMOR::FALEK	ex-TU58 King	`Thu Apr 10 1997 16:51`	3
	If you do a
	    $ sched show status
	do you see a good display (i.e. all the nodenames are shown)?
1234.6		TAV02::GODOVNIK	Haim Godovnik	`Sun Apr 13 1997 02:36`	12
	Hi, on the SCHED SHOW STATUS display he sees both Alpha and VAX nodes. He added another Alpha to the cluster, and between the Alphas everything works fine. The problems are between VAX and Alpha. Thanks, Haim G.
1234.7	plan of attack	RUMOR::FALEK	ex-TU58 King	`Mon Apr 14 1997 18:33`	45
	OK, I'm running out of ideas... Let's summarize what we know.

	The problem occurs when the "default" scheduler is running on an Alpha, but the batch execution queue is on a VAX. The job actually runs and completes, but the scheduler system never detects that fact, so it "thinks" it is still running.

	The debug info shows that the "batch end" (BE) message is being broadcast to a scheduler with a null node name. On reflection, I now think this might be normal, since the "default" scheduler hears all the messages and is supposed to react to ones where no nodename is specified: no node means the "default". So we may have been barking up the wrong tree with the "no node name" thing.

	We know a "batch end" message is getting sent when the job completes. So the question is: does the default scheduler actually GET this message, and if so, what does it do with it?

	To answer this question you could:

	1. Put all scheduler jobs that are likely to run accidentally during the experiment on hold. Preferably, wait for all running jobs to finish.

	2. Stop all schedulers in the cluster:
	       $ sched stop/all

	3. On a hardcopy terminal or a screen where you can watch the output, on the Alpha system:
	       $ run nsched$:nsched.exe
	   It will print a lot of stuff as it reads through all the jobs, and then it will print "Sleeping...".

	4. Start the scheduler on the VAX. (The Alpha NSCHED will notice it started; you will see some output, and then it will print "Sleeping..." again.)

	5. Run a Scheduler batch-mode job on the VAX. You will see some stuff print on the Alpha when the job starts. Then the Alpha will print "Sleeping...". When the job completes on the VAX, watch CAREFULLY what (if anything) the scheduler on the Alpha prints. If it doesn't print anything at all, then the BE message isn't being processed (valuable information). If it does print something, then that will tell exactly what's going on. WHAT DOES IT SAY?
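The routing rule described in .7 (every scheduler hears the broadcast, and a message with no node name is meant for the "default" scheduler) can be written as a toy model. This Python sketch is purely illustrative and is not NSCHED's code: the function names are hypothetical, and the branch for a named node (only the matching node reacts) is an assumption the thread does not confirm.

```python
def should_handle(msg_node, my_node, i_am_default):
    """Decide whether a scheduler acts on a broadcast message.

    msg_node:     node name carried by the message ("" when none is given)
    my_node:      this scheduler's own node name
    i_am_default: True if this scheduler is the cluster's "default"
    """
    if msg_node == "":
        # Per .7: no node name means the message is for the default scheduler.
        return i_am_default
    # Assumption: a named message is handled only by the matching node.
    return msg_node.upper() == my_node.upper()


def handlers(msg_node, cluster):
    """List the nodes in `cluster` (name -> is_default) that would react."""
    return [
        name
        for name, is_default in cluster.items()
        if should_handle(msg_node, name, is_default)
    ]


# Hypothetical two-node cluster with the Alpha as the default scheduler.
cluster = {"ALPHA1": True, "VAX01": False}
be_handlers = handlers("", cluster)
```

Under this model a BE message with an empty node field should be picked up by exactly one scheduler, the default one. The experiment in .7 tests whether that scheduler actually receives and processes it.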
1234.8		TAV02::GODOVNIK	Haim Godovnik	`Tue Apr 15 1997 07:27`	14
	Hi, thank you for your help. I asked the customer to do the tests you described in .7. After the job completes, he gets nothing on the screen after the last "Sleeping..." message, which means that the BE message is not being processed. He also tried DETACHED mode jobs and everything worked fine. The problem seems to be only in BATCH mode. Haim G.
1234.9	likely a bug that must be escalated	RUMOR::FALEK	ex-TU58 King	`Tue Apr 15 1997 14:07`	14
	It's probably a bug then, and most likely not a known one, though I can't be sure. Batch jobs in heterogeneous VMSclusters are supposed to work! You've already gathered information that shows approximately what step in the job processing mechanism is failing. I hoped it would be something simple, like a queue file or logical name problem that I could suggest a fix for. I suspect this might be a bug that requires a patch. Unfortunately, I'm not a member of product engineering. However, the information you've already gathered should be very useful to them. You need to escalate this through official support channels. They need to search through their database to see if this is a known problem. They (the support org.) need to figure out exactly why this is broken at your site and supply a fix!
1234.10		ZEKE::BURTON	Jim Burton, DTN 381-6470	`Tue Apr 15 1997 15:51`	5
	If you need to know the official escalation channel in your area, please contact Curtis Chase @OGO.

	Jim
	Scheduler Product Manager
1234.11	Problem solved	TAV02::GODOVNIK	Haim Godovnik	`Thu Apr 17 1997 05:23`	16
	Hi, before escalating the problem I asked the customer to upgrade the VAX from 2.1A to 2.1B. After the installation everything works fine. I do not know whether something related to this was fixed in 2.1B or the reinstall simply solved it. Thank you very much for all your help, Haim Godovnik, CSC Israel