
Conference humane::scheduler

Title:	SCHEDULER
Notice:	Welcome to the Scheduler Conference on node HUMANE
Moderator:	RUMOR::FALEK

Created:	Sat Mar 20 1993
Last Modified:	Tue Jun 03 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1240
Total number of notes:	5017

1222.0. "Job lost sync. in batch mode" by HTSC04::WILLIAMCHAN () Thu Mar 06 1997 05:20

    Hello,
    
    	    A customer is using Polycenter Scheduler V2.1-B in a
            homogeneous VAXcluster environment (V5.5-2).
            According to the customer's report, when he submits jobs in
            batch mode from node A through the generic queue and the
            execution queue is on node B (i.e. node B's batch queue
            picks up the job and executes it), node A does not learn
            that the job has completed. Node A then re-executes the
            job a few times until the job happens to be executed by the
            execution queue on node A (the job issuer's node); only then
            does the acknowledgement get back to node A, and node A stops.
    
            This seems quite strange, and I cannot reproduce the problem
            because of resource restrictions. Are there any settings I
            should be aware of, or any log files (debug mode) I should
            collect from the customer?
    
    	    Thanks for any help.
    
    
    Best regards,
    William

T.R	Title	User	Personal Name	Date	Lines
1222.1		ZEKE::BURTON	Jim Burton, DTN 381-6470	`Thu Mar 06 1997 07:00`	9
	William, I would suggest elevating this to the CSC in your area or the one in Colorado Springs. They are trained to provide this support and they have second and third level support available directly from the engineers at CA. If the CSC system with regards to POLYCENTER is broken in your area, please let us know and we'll fix it. Jim
1222.2	Thanks	HTSC04::WILLIAMCHAN		`Fri Mar 07 1997 04:29`	9
	Thanks Jim, I actually work in the HK CSC and am helping to troubleshoot the customer's problem. I don't know which party is the appropriate one to assist me in this case. Is it CA or DEC? Thanks for any further hints. Best regards, William
1222.3		ZEKE::BURTON	Jim Burton, DTN 381-6470	`Fri Mar 07 1997 19:42`	4
	Please get in touch with Curtis Chase @OGO to define an escalation path for you. Jim
1222.4	Is everything shared ???	HLFS00::ERIC_S	Eric Sonneveld MCS - B.O. IS Holland	`Mon Mar 10 1997 02:07`	22
	> It seems quite strange and I can't simulate the problem due
	> to the resources restriction. Is there any setting should
	> be aware or any logfiles (debug mode) I should take from the
	> customer?

	I have never seen it happen in this way. There is, however, one thing to keep in mind: the case where Scheduler does not know about the job on node B. When a job steps into another phase, it sends a message to the Scheduler mailbox. If there is no mailbox (= no bsched process on that system), Scheduler will never know what happened to the job. It might consider the job failed and, if retry is specified, it might restart it.

	So check that Scheduler is started and running on node B, and that the Scheduler database is shared by all nodes in the cluster. I've seen customers with a cluster that did not share the Scheduler database and had similar problems, e.g. a development system with its own Scheduler database running in the same cluster. Scheduler will really malfunction in such situations because the Scheduler processes in the cluster 'talk' to each other.

	Eric
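	The failure mode suggested in 1222.4 (a completion message that never reaches Scheduler, followed by a retry) can be illustrated with a toy model. This is only a sketch of the hypothesis, not of how POLYCENTER Scheduler is implemented: the class `ToyScheduler`, its `nodes_with_mailbox` and `max_retries` fields, and the node names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    # Hypothetical toy model; not how POLYCENTER Scheduler is implemented.
    nodes_with_mailbox: set          # nodes whose completion messages arrive
    max_retries: int = 3             # assumed retry setting on the job
    log: list = field(default_factory=list)

    def run_job(self, job, execution_nodes):
        """Run `job`; execution_nodes[i] is the node that picks up attempt i+1.

        Returns True once a completion message is received, False if all
        attempts (first try plus max_retries) end without one.
        """
        attempts = execution_nodes[: self.max_retries + 1]
        for number, node in enumerate(attempts, start=1):
            self.log.append(f"{job} attempt {number} ran on {node}")
            if node in self.nodes_with_mailbox:
                self.log.append(f"{job} completion received from {node}")
                return True
            # The job did run, but the scheduler never hears about it.
            self.log.append(f"{job} completion from {node} never arrived")
        return False


if __name__ == "__main__":
    # Only node A can deliver completion messages in this scenario.
    sched = ToyScheduler(nodes_with_mailbox={"A"})
    finished = sched.run_job("JOB1", ["B", "B", "A"])
    print("\n".join(sched.log))
    print("finished:", finished)
```

	In this model the job really runs on every attempt; it is only re-run because the completion is never heard, which matches the symptom in 1222.0 of the job repeating until it happens to execute on the issuing node.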
1222.5		HTSC04::WILLIAMCHAN		`Sun Mar 23 1997 22:30`	11
	Thanks Eric, I just checked with the customer: Scheduler is running on both nodes, and the database (VSS.dat) in the Nsched directory is shared by both. Is there anything else I should check? Thanks for any further help. Best regards, William
1222.6	Can scheduler listen to the termination message ?	HLFS00::ERIC_S	Eric Sonneveld MCS - B.O. IS Holland	`Wed Mar 26 1997 02:04`	4
	Does the (system) logical nschedmail exist ? Does the related mba device exist ? Eric
1222.7	Further on-site findings	HTSC04::WILLIAMCHAN		`Wed Apr 09 1997 07:53`	203
	Thanks Eric,

	After the on-site investigation, we have a more precise picture from the customer:

	Homogeneous VAXcluster, OpenVMS/VAX V5.5-2; Polycenter Scheduler V2.1B-9.

	There are two nodes, named 1. Magna and 2. Jena. A clusterwide generic batch queue called ba$batch points to the node-specific sys$batch queues: ba$batch_magna on Magna and ba$batch_jena on Jena.

	The problem is as follows. Suppose we stop one of the node-specific sys$batch queues, say BA$batch_magna, at the very beginning (before any job executes). When we then start a batch job chain with dependencies (either from the GUI or from the SCHED> prompt), all jobs are, reasonably, managed by sys$batch and executed by BA$batch_jena. If at some point in the chain (the job chain is A to B to C to D to E to F), for example at C, we $start/queue BA$batch_magna, Scheduler recognises that a "new" queue has appeared and starts to send jobs, seemingly B or C at random, to BA$batch_magna. These jobs then keep looping and re-executing themselves UNTIL we $stop/que/next BA$batch_magna; then job B falls back to BA$batch_jena, the remaining jobs in the chain execute (i.e. B,C,D,E,F,G), and the chain stops and waits for the next schedule.

	No problem is found if the "stopped" queue is not resumed, i.e. if there is only one execution batch queue throughout the execution. It is also fine if no execution queue is suspended or resumed during job execution.

	It looks as if Scheduler (or maybe the lock manager?) is able to detect the queue changes and do the load balancing, but the job status is not returned correctly, which leads to jobs being re-executed without control.

	P.S. It makes no difference which node initiates the job.

	The customer's issue is that their live environment has many more dependent jobs than this test environment. They found that when the job limit of one execution queue is reached partway through a chain, e.g. at A,B,C, then D,E,F "switch" to the other execution queue and the chain never completes.

	The only guaranteed method is for the user to make sure that the job chain runs on a single execution batch queue from start to end. In the test above, $start/$stop queue only simulates the situation where a queue becomes unavailable (for example, because it has reached its job limit).

	Below is an extract of the log files from the customer. This job chain contains WL_TEST1 to WL_TEST6. The job starts on BA$Batch_Jena while BA$Batch_Magna is stopped.

	Job started at BA$Batch_Jena and looks okay for WL_TEST1:

	    Scheduler job #646 (Name: WL_TEST1) Queued on BA$BATCH Entry 108
	    Scheduler job #646 (Name: WL_TEST1) Started in Batch Queue on node JENA Entry 108
	    Scheduler Job #646 (Name: WL_TEST1) Completed on node JENA

	Also normal for WL_TEST2:

	    Scheduler job #647 (Name: WL_TEST2) Queued on BA$BATCH Entry 109
	    Scheduler job #647 (Name: WL_TEST2) Started in Batch Queue on node JENA Entry 109
	    Scheduler Job #647 (Name: WL_TEST2) Completed on node JENA

	WL_TEST3 looks good:

	    Scheduler job #648 (Name: WL_TEST3) Queued on BA$BATCH Entry 110
	    Scheduler job #648 (Name: WL_TEST3) Started in Batch Queue on node JENA Entry 110
	    Scheduler Job #648 (Name: WL_TEST3) Completed on node JENA

	WL_TEST4 is handled by BA$batch_magna because this queue has been started:

	    Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 111
	    Scheduler job #653 (Name: WL_TEST4) Started in Batch Queue on node MAGNA Entry 111
	    Scheduler Job #653 (Name: WL_TEST4) Completed on node MAGNA

	WL_TEST4 begins to requeue itself:

	    Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 113
	    Scheduler job #654 (Name: WL_TEST5) Queued on BA$BATCH Entry 114
	    Scheduler job #654 (Name: WL_TEST5) Started in Batch Queue on node JENA Entry 114

	WL_TEST4 started again:

	    Scheduler job #653 (Name: WL_TEST4) Started in Batch Queue on node MAGNA Entry 113
	    Scheduler Job #654 (Name: WL_TEST5) Completed on node JENA
	    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 115

	WL_TEST4 keeps looping:

	    Scheduler Job #653 (Name: WL_TEST4) Completed on node MAGNA
	    Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 116
	    Scheduler job #653 (Name: WL_TEST4) Started in Batch Queue on node JENA Entry 116
	    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node MAGNA Entry 115
	    Scheduler Job #653 (Name: WL_TEST4) Completed on node JENA
	    Scheduler job #654 (Name: WL_TEST5) Queued on BA$BATCH Entry 117
	    Scheduler job #654 (Name: WL_TEST5) Started in Batch Queue on node JENA Entry 117
	    Scheduler Job #654 (Name: WL_TEST5) Completed on node JENA
	    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA

	WL_TEST6 begins looping:

	    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 118
	    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node MAGNA Entry 118
	    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA

	Looping:

	    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 119
	    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node MAGNA Entry 119
	    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA

	Looping again:

	    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 120
	    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node MAGNA Entry 120
	    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA

	Looping again and again:

	    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 122
	    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node MAGNA Entry 122
	    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA

	Again, again and again:

	    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 123
	    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node MAGNA Entry 123
	    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA

	Again, again...

	    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 125
	    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node MAGNA Entry 125
	    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA
	    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 126
	    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node MAGNA Entry 126
	    $ lo
	    SLSUSER logged out at 9-APR-1997 15:23:34.75

	Any hints on this? Please feel free to contact me for more information.

	Thanks & best regards,
	William
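	For a longer extract than the one in 1222.7, the looping jobs could be picked out mechanically rather than by eye. The following is a minimal Python sketch, assuming the log lines have exactly the three shapes quoted in the note ("Queued on ... Entry n", "Started in Batch Queue on node ... Entry n", "Completed on node ..."). The function name `requeued_jobs` and the `SAMPLE` text are illustrative only and are not part of any Scheduler tooling.

```python
import re
from collections import defaultdict

# Matches the three event shapes quoted in note 1222.7. The log writes both
# "Scheduler job" and "Scheduler Job", hence IGNORECASE.
EVENT = re.compile(
    r"Scheduler job #(?P<num>\d+) \(Name: (?P<name>\w+)\) "
    r"(?P<event>Queued|Started|Completed)"
    r"(?: in Batch Queue)? on (?:node )?(?P<where>\S+)"
    r"(?: Entry (?P<entry>\d+))?",
    re.IGNORECASE,
)


def requeued_jobs(log_text):
    """Return {job name: [queue entry numbers]} for jobs queued more than once."""
    entries = defaultdict(list)
    for match in EVENT.finditer(log_text):
        if match["event"].lower() == "queued":
            entries[match["name"]].append(int(match["entry"]))
    return {name: nums for name, nums in entries.items() if len(nums) > 1}


# A few lines copied from the extract in 1222.7.
SAMPLE = """
Scheduler job #648 (Name: WL_TEST3) Queued on BA$BATCH Entry 110
Scheduler job #648 (Name: WL_TEST3) Started in Batch Queue on node JENA Entry 110
Scheduler Job #648 (Name: WL_TEST3) Completed on node JENA
Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 111
Scheduler job #653 (Name: WL_TEST4) Started in Batch Queue on node MAGNA Entry 111
Scheduler Job #653 (Name: WL_TEST4) Completed on node MAGNA
Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 113
"""

if __name__ == "__main__":
    for name, nums in requeued_jobs(SAMPLE).items():
        print(f"{name} was queued {len(nums)} times (entries {nums})")
```

	Because the pattern is applied with `finditer` over the whole text, it also works on an extract where the line breaks have been lost. A job that appears in the result was queued more than once within the extract, which in a single run of the chain is the looping symptom described above.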