| William,
I would suggest elevating this to the CSC in your area or the one in
Colorado Springs. They are trained to provide this support and they have
second and third level support available directly from the engineers at CA. If
the CSC system with regards to POLYCENTER is broken in your area, please let us
know and we'll fix it.
Jim
|
| Thanks Eric,
After the on-site investigation, we have a more precise picture from
the customer:
Homogeneous VAXcluster OpenVMS/VAX V5.5-2; Polycenter
Scheduler V2.1B-9.
There are two nodes, named MAGNA and JENA. A clusterwide generic
batch queue called BA$BATCH points to the node-specific sys$batch
queues: BA$BATCH_MAGNA on MAGNA and BA$BATCH_JENA on JENA.
The problem is:
If we stop one of the node-specific sys$batch queues at the very
beginning (before any job executes), say BA$BATCH_MAGNA, and then
start the batch jobs with dependencies (from either the GUI or the
SCHED> prompt), then, reasonably, all jobs are managed by sys$batch
and executed by BA$BATCH_JENA.
Suppose the job chain is A to B to C to D to E to F. If at some point,
for example at C, we $start/queue BA$BATCH_MAGNA, the Scheduler
recognises that a "new" queue has appeared and starts to send a job
(B or C, apparently at random) to BA$BATCH_MAGNA. These jobs then keep
looping and re-executing themselves UNTIL we $stop/que/next
BA$BATCH_MAGNA; job B then falls back to BA$BATCH_JENA, the remaining
jobs in the chain are executed (i.e. B,C,D,E,F,G), and the chain stops
and waits for the next schedule.
No problem is found if the "stopped" queue is not resumed, i.e. there
is only one execution batch queue throughout the execution. It is also
okay if no execution queue is suspended or resumed during job
execution. It looks as though the Scheduler (or maybe the lock
manager?) is able to detect the queue change and does load balancing,
but the job status is not returned correctly, which leads to the jobs
re-executing without control. P.S. It makes no difference which node
initiates the job.
The customer's issue is that their live environment has many more
dependent jobs than this test environment. They found that when the
job limit of one execution queue is reached partway through a chain,
e.g. after A,B,C, then D,E,F "switch" to the other execution queue and
the chain never completes. The only guaranteed method is for the user
to make sure that the job chain runs on a single execution batch queue
from start to end. In the test above, $start/$stop queue only
simulates the situation where a queue is not available (such as when
it reaches its job limit).
Below is the log file extract from the customer:
This job chain contains WL_TEST1 to WL_TEST6:
The chain starts on BA$BATCH_JENA while BA$BATCH_MAGNA is stopped:
WL_TEST1 starts on BA$BATCH_JENA and looks okay
Scheduler job #646 (Name: WL_TEST1) Queued on BA$BATCH Entry 108
Scheduler job #646 (Name: WL_TEST1) Started in Batch Queue on node JENA
Entry 108
Scheduler Job #646 (Name: WL_TEST1) Completed on node JENA
WL_TEST2 is also normal
Scheduler job #647 (Name: WL_TEST2) Queued on BA$BATCH Entry 109
Scheduler job #647 (Name: WL_TEST2) Started in Batch Queue on node JENA
Entry 109
Scheduler Job #647 (Name: WL_TEST2) Completed on node JENA
WL_TEST3 looks good
Scheduler job #648 (Name: WL_TEST3) Queued on BA$BATCH Entry 110
Scheduler job #648 (Name: WL_TEST3) Started in Batch Queue on node JENA
Entry 110
Scheduler Job #648 (Name: WL_TEST3) Completed on node JENA
WL_TEST4 is handled by BA$BATCH_MAGNA because this queue has now been started
Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 111
Scheduler job #653 (Name: WL_TEST4) Started in Batch Queue on node
MAGNA Entry 111
Scheduler Job #653 (Name: WL_TEST4) Completed on node MAGNA
WL_TEST4 begins to requeue itself
Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 113
Scheduler job #654 (Name: WL_TEST5) Queued on BA$BATCH Entry 114
Scheduler job #654 (Name: WL_TEST5) Started in Batch Queue on node JENA
Entry 114
WL_TEST4 starts again
Scheduler job #653 (Name: WL_TEST4) Started in Batch Queue on node
MAGNA Entry 113
Scheduler Job #654 (Name: WL_TEST5) Completed on node JENA
Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 115
WL_TEST4 keeps looping
Scheduler Job #653 (Name: WL_TEST4) Completed on node MAGNA
Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 116
Scheduler job #653 (Name: WL_TEST4) Started in Batch Queue on node JENA
Entry 116
Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
MAGNA Entry 115
Scheduler Job #653 (Name: WL_TEST4) Completed on node JENA
Scheduler job #654 (Name: WL_TEST5) Queued on BA$BATCH Entry 117
Scheduler job #654 (Name: WL_TEST5) Started in Batch Queue on node JENA
Entry 117
Scheduler Job #654 (Name: WL_TEST5) Completed on node JENA
Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA
WL_TEST6 begins looping
Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 118
Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
MAGNA Entry 118
Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA
Looping
Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 119
Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
MAGNA Entry 119
Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA
Looping again
Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 120
Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
MAGNA Entry 120
Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA
Looping again and again
Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 122
Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
MAGNA Entry 122
Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA
Again again and again
Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 123
Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
MAGNA Entry 123
Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA
Again again........................
Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 125
Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
MAGNA Entry 125
Scheduler Job #655 (Name: WL_TEST6) Completed on node MAGNA
Scheduler job #655 (Name: WL_TEST6) Queued on BA$BATCH Entry 126
Scheduler job #655 (Name: WL_TEST6) Started in Batch Queue on node
MAGNA Entry 126
$ lo
SLSUSER logged out at 9-APR-1997 15:23:34.75
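To make the looping easier to spot in a longer log, the extract above
can be summarised by counting how often each Scheduler job number is
queued. The following is only a rough sketch, not a Scheduler tool: it
assumes the log lines have exactly the wording shown above, and the
names queue_counts and requeued_jobs are made up for illustration.
Wrapped lines (where the node name or "Entry nnn" falls on the next
line) do not matter, because only the start of each message is matched.

```python
import re
from collections import Counter

# Matches the start of each log message shown in the extract, e.g.
#   Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 111
# The log mixes "job" and "Job", so the match is case-insensitive.
EVENT = re.compile(
    r"Scheduler job #(\d+)\s+\(Name:\s+(\w+)\)\s+(Queued|Started|Completed)",
    re.IGNORECASE,
)


def queue_counts(log_text):
    """Count how many times each (job number, job name) was queued."""
    counts = Counter()
    for number, name, event in EVENT.findall(log_text):
        if event.lower() == "queued":
            counts[(int(number), name)] += 1
    return counts


def requeued_jobs(log_text):
    """Return only the jobs that were queued more than once."""
    return {job: n for job, n in queue_counts(log_text).items() if n > 1}


# A few lines copied from the extract above.
sample = """\
Scheduler job #648 (Name: WL_TEST3) Queued on BA$BATCH Entry 110
Scheduler job #648 (Name: WL_TEST3) Started in Batch Queue on node JENA
Entry 110
Scheduler Job #648 (Name: WL_TEST3) Completed on node JENA
Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 111
Scheduler job #653 (Name: WL_TEST4) Started in Batch Queue on node
MAGNA Entry 111
Scheduler Job #653 (Name: WL_TEST4) Completed on node MAGNA
Scheduler job #653 (Name: WL_TEST4) Queued on BA$BATCH Entry 113
"""

for (number, name), times in requeued_jobs(sample).items():
    print(f"job #{number} ({name}) was queued {times} times")
```

In a single pass of the chain each job should be queued once, so any
job reported here was re-executed.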
Any hints on that?
Please feel free to contact me for more information.
Thanks & best regards,
William
|