
Conference humane::scheduler

Title:SCHEDULER
Notice:Welcome to the Scheduler Conference on node HUMANEril
Moderator:RUMOR::FALEK
Created:Sat Mar 20 1993
Last Modified:Tue Jun 03 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1240
Total number of notes:5017

1222.0. "Job lost sync. in batch mode" by HTSC04::WILLIAMCHAN () Thu Mar 06 1997 05:20

    Hello,
    
            The customer is running Polycenter Scheduler V2.1-B in a
            homogeneous VAXcluster environment (V5.5-2).
            According to the customer's report, when he submits jobs in
            batch mode from node A through the generic queue and the
            execution queue is on node B (i.e. node B's batch queue
            takes the job and executes it), node A is not notified when
            the job completes. Node A then re-executes the job several
            times until the job happens to be run by the execution queue
            on node A (the job issuer's node); only then does the
            acknowledgement get back to node A, and node A stops.
    
            It seems quite strange, and I can't reproduce the problem
            because of resource restrictions. Are there any settings I
            should be aware of, or any log files (debug mode) I should
            collect from the customer?
    
    	    Thanks for any help.
    
    
    Best regards,
    William
    
T.R  Title  User  Personal Name  Date  Lines
1222.1  ZEKE::BURTON  Jim Burton, DTN 381-6470  Thu Mar 06 1997 07:00  9 lines
William,

I would suggest elevating this to the CSC in your area or the one in
Colorado Springs.  They are trained to provide this support and they have
second and third level support available directly from the engineers at CA.  If
the CSC system with regards to POLYCENTER is broken in your area, please let us
know and we'll fix it. 

Jim 
1222.2  Thanks  HTSC04::WILLIAMCHAN  Fri Mar 07 1997 04:29  9 lines
Thanks Jim,

        I work in the HK CSC and am helping to troubleshoot the
        customer's problem. I don't know which party is the appropriate
        one to assist me in this case: CA or DEC? Thanks for any
        further hints.

Best regards,
William
1222.3  ZEKE::BURTON  Jim Burton, DTN 381-6470  Fri Mar 07 1997 19:42  4 lines
Please get in touch with Curtis Chase @OGO to define an escalation path for
you.

Jim
1222.4  Is everything shared ???  HLFS00::ERIC_S  Eric Sonneveld MCS - B.O. IS Holland  Mon Mar 10 1997 02:07  22 lines
>    
>            It seems quite strange and I can't simulate the problem due
>            to the resources restriction. Is there any setting should 
>            be aware or any logfiles (debug mode) I should take from the
>            customer?
>    
I have never seen it happen this way. There is, however, one thing to keep in
mind: the case where Scheduler does not know about the job on node B. When a job
moves into another phase, it sends a message to the Scheduler mailbox. If there is
no mailbox (= no bsched process on that system), then Scheduler will never know
what happened to the job.
It might consider the job failed and, if retry is specified, restart it.

So check that Scheduler is started and running on node B,
and that the Scheduler database is shared by all nodes in the cluster.

I've seen customers with a cluster that did not share the Scheduler database
have similar problems, e.g. a development system with its own sched db
running in the same cluster. Scheduler will really malfunction in such
situations because the Scheduler processes in the cluster 'talk' to each other.

Eric
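
The failure mode described in this reply can be sketched as a toy model. This is only an illustration, assuming the behaviour is as described above: a job is resubmitted whenever its completion is not reported, and completion is only reported from a node that has a Scheduler mailbox. The class `ToyScheduler` and its fields are hypothetical names invented for this sketch; none of this is POLYCENTER Scheduler code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy model only: NOT how POLYCENTER Scheduler is implemented."""

    nodes_with_mailbox: set               # nodes where a Scheduler mailbox exists
    max_retries: int = 3                  # hypothetical retry setting
    history: list = field(default_factory=list)

    def run(self, job, exec_nodes):
        """exec_nodes[i] is the node whose execution queue runs attempt i."""
        for attempt, node in enumerate(exec_nodes[: self.max_retries + 1]):
            self.history.append((job, attempt, node))
            if node in self.nodes_with_mailbox:
                # The termination message reaches a mailbox, so the
                # scheduler learns that the job completed.
                return "completed"
            # No mailbox on this node: completion is never reported, so
            # the job is assumed to have failed and is submitted again.
        return "gave up"


if __name__ == "__main__":
    sched = ToyScheduler(nodes_with_mailbox={"A"})
    print(sched.run("JOB", ["B", "B", "A"]))
    print(sched.history)
```

In this model the job is rerun until an attempt happens to land on a node with a mailbox, which matches the symptom reported in 1222.0. It says nothing about whether that is the actual cause at the customer site.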
1222.5  HTSC04::WILLIAMCHAN  Sun Mar 23 1997 22:30  11 lines
Thanks Eric,

        I just checked with the customer: Scheduler is running on both nodes,
        and the database, VSS.dat, is shared in the Nsched directory.

        Is there anything else I should check?

	Thanks for further help.

Best regards,
William
1222.6  Can scheduler listen to the termination message ?  HLFS00::ERIC_S  Eric Sonneveld MCS - B.O. IS Holland  Wed Mar 26 1997 02:04  4 lines
    Does the (system) logical nsched*mail* exist ? 
    Does the related mba device exist ?
    
    Eric
1222.7  Further on-site findings  HTSC04::WILLIAMCHAN  Wed Apr 09 1997 08:53  203 lines
    Thanks Eric,
    
        After the on-site investigation, we have a more precise picture
        from the customer:
    
        Homogeneous VAXcluster, OpenVMS/VAX V5.5-2; Polycenter
        Scheduler V2.1B-9.
    
        There are two nodes, named Magna and Jena. A clusterwide
        generic batch queue called ba$batch points to the node-specific
        sys$batch queues: ba$batch_magna on Magna and ba$batch_jena on Jena.
    
    
        The problem is:
    
        Suppose we stop one of the node-specific sys$batch queues, say
        BA$batch_magna, at the very beginning (before any job executes).
        When we then start the batch job with dependencies (either from
        the GUI or from the SCHED> prompt), all jobs are, reasonably, managed
        by sys$batch and executed by BA$batch_jena.
    
        If at some point in the chain (the job chain is A to B to C to D to
        E to F), for example at C, we $start/queue BA$batch_magna, Scheduler
        recognises that a "new" queue is available and starts sending jobs
        (B or C, seemingly at random) to BA$batch_magna. These jobs then
        keep looping and re-executing UNTIL we $stop/que/next
        BA$batch_magna; job B then falls back to BA$batch_jena, which
        executes the remaining jobs in the chain (i.e. B,C,D,E,F,G), and the
        chain stops and waits for the next schedule.
    
        No problem is found if the "stopped" queue is not resumed,
        i.e. if there is only one execution batch queue throughout the
        execution. It is also okay if no execution queue is suspended or
        resumed during the job execution. It looks as if Scheduler (or maybe
        the lock manager?) is able to detect the queue changes and do load
        balancing, but the job status is not returned correctly, which
        leads to jobs being re-executed without control. P.S. It makes no
        difference which node initiates the job.
    
        The customer's issue is that their live environment has many more
        dependent jobs than this test environment. They found that when
        the job limit of one execution queue is reached partway through a
        chain, e.g. after A,B,C, then D,E,F "switch" to other execution
        queues and never complete. The only guaranteed method is for the
        user to make sure that the job chain runs on a single execution
        batch queue from start to end. In the test above, $start/stop
        queue only simulates the situation in which a queue is not
        available (for example, because it has reached its job limit).
    
        Below are the log file extracts from the customer:
    
    
        This job chain contains WL_TEST1 to WL_TEST6.
    
    
        The job started on BA$Batch_Jena while BA$Batch_Magna was stopped.
    
   
        Job started on BA$Batch_Jena and looks okay for WL_TEST1
    
    Scheduler job #646 (Name: WL_TEST1) Queued on BA$BATCH Entry 108
    Scheduler job #646 (Name: WL_TEST1) Started in Batch Queue on node JENA 
    Entry 108
    Scheduler Job #646 (Name: WL_TEST1) Completed on node JENA  
    
    
        WL_TEST2 is also normal
    
    Scheduler job #647 (Name: WL_TEST2) Queued on BA$BATCH Entry 109
    Scheduler job #647 (Name: WL_TEST2) Started in Batch Queue on node JENA 
    Entry 109
    Scheduler Job #647 (Name: WL_TEST2) Completed on node JENA  
    
    	
    	WL_test3 looks good	
    
    Scheduler job #648 (Name: WL_TEST3) Queued on BA$BATCH Entry 110
    Scheduler job #648 (Name: WL_TEST3) Started in Batch Queue on node JENA 
    Entry 110
    Scheduler Job #648 (Name: WL_TEST3) Completed on node JENA  
    
    
    	WLtest4 handled by BA$batch_magna because this queue has been started
    
    Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 111
    Scheduler job #653 (Name: WL_TEST4) Started in Batch Queue on node
    MAGNA  Entry 111
    Scheduler Job #653 (Name: WL_TEST4) Completed on node MAGNA 
    
    
        WL_TEST4 begins to requeue itself
    
    Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 113
    
    Scheduler job #654 (Name: WL_TEST5) Queued on BA$BATCH Entry 114
    Scheduler job #654 (Name: WL_TEST5) Started in Batch Queue on node JENA 
    Entry 114
    
    
    	WLtest4 started again
    
    Scheduler job #653 (Name: WL_TEST4) Started in Batch Queue on node
    MAGNA  Entry 113
    
    
    Scheduler Job #654 (Name: WL_TEST5) Completed on node JENA  
    
    
    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 115
    
    
        WL_TEST4 keeps looping
    
    Scheduler Job #653 (Name: WL_TEST4) Completed on node MAGNA 
    Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 116
    Scheduler job #653 (Name: WL_TEST4) Started in Batch Queue on node JENA 
    Entry 116
    
    
    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
    MAGNA  Entry 115
    
    
    Scheduler Job #653 (Name: WL_TEST4) Completed on node JENA  
    
    
    Scheduler job #654 (Name: WL_TEST5) Queued on BA$BATCH Entry 117
    Scheduler job #654 (Name: WL_TEST5) Started in Batch Queue on node JENA 
    Entry 117
    Scheduler Job #654 (Name: WL_TEST5) Completed on node JENA  
    
    
    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA 
    
    
        WL_TEST6 begins looping
    
    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 118
    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
    MAGNA  Entry 118
    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA 
    
    
    	Looping
    
    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 119
    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
    MAGNA  Entry 119
    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA 
    
    
    	Looping again
    
    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 120
    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
    MAGNA  Entry 120
    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA 
    
    
    
    	Looping again and again
    
    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 122
    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
    MAGNA  Entry 122
    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA 
    
    
    	Again again and again
    
    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 123
    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
    MAGNA  Entry 123
    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA 
    
    
    	Again again........................
    
    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 125
    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
    MAGNA  Entry 125
    Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA 
    
    
    Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 126
    Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
    MAGNA  Entry 126
     $ lo
      SLSUSER      logged out at  9-APR-1997 15:23:34.75
    
    
    
    
    	Any hints on that?
    
        Please feel free to contact me for more information.
    
    
    Thanks & best regards,
    William
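
To make the looping in the log extract above easier to see at a glance, a small script can count how often each job was queued and on which nodes it started. This is a minimal sketch that assumes the log lines have exactly the shape quoted in this note. The function names `summarize` and `looping_jobs` are made up for this example and are not part of any Scheduler tool.

```python
import re
from collections import Counter, defaultdict

# Patterns for the log lines quoted in this note. \s+ lets a "Started"
# record match even when the node name or entry number wraps onto the
# next line, as it does in the extract.
QUEUED = re.compile(
    r"Scheduler job #(\d+) \(Name: (\w+)\) Queued on \S+ Entry (\d+)",
    re.IGNORECASE,
)
STARTED = re.compile(
    r"Scheduler job #(\d+) \(Name: (\w+)\) Started in Batch Queue on node"
    r"\s+(\w+)\s+Entry\s+(\d+)",
    re.IGNORECASE,
)


def summarize(log_text):
    """Return (times each job name was queued, nodes each job started on)."""
    queued = Counter()
    nodes = defaultdict(list)
    for match in QUEUED.finditer(log_text):
        queued[match.group(2)] += 1
    for match in STARTED.finditer(log_text):
        nodes[match.group(2)].append(match.group(3))
    return queued, dict(nodes)


def looping_jobs(queued, threshold=1):
    """Names of jobs queued more than `threshold` times, sorted."""
    return sorted(name for name, count in queued.items() if count > threshold)


if __name__ == "__main__":
    sample = """
    Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 111
    Scheduler job #653 (Name: WL_TEST4) Started in Batch Queue on node
    MAGNA  Entry 111
    Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 113
    """
    counts, started_on = summarize(sample)
    print(counts, started_on, looping_jobs(counts))
```

Counting the "Queued" lines in the extract by hand gives WL_TEST1 to WL_TEST3 once each, WL_TEST4 three times, WL_TEST5 twice and WL_TEST6 eight times, so the jobs that ran on MAGNA are the ones that requeue most.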