Title: | SCHEDULER |
Notice: | Welcome to the Scheduler Conference on node HUMANE ril |
Moderator: | RUMOR::FALEK |
Created: | Sat Mar 20 1993 |
Last Modified: | Tue Jun 03 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 1240 |
Total number of notes: | 5017 |
Hallo, I got following message from a customer: -------------------------------------------------------------------------------- We encountered a problem with Polycenter Scheduler for VMS version V2.1B-7: Every evening we start a chain of batch-jobs, about 110 batch jobs divided in 30 groups, between every group we have a 'start_next_group_nr' job where all dependencies of the previous group are coming together, when all jobs of a group are completed successfully a 'start_next_group_nr' job will start the next group of batch jobs. This to explain to way we work. So our problem last Tuesday was : A GROUP OF JOBS STARTED BUT ONE JOB OF THE PREVIOUS GROUP WAS STILL RUNNING !!!! We are working with the Scheduler each day for almost a year now , but this problem didn't encounter before. After Tuesday the problem didn't happened again. You can understand that a problem like this has to be avoided in all cases !! The exact working of dependencies is one of the basics for us, without that we can not use the Scheduler. I investigated the dependencies, verified all messages with VSS_REPORTS, checked if the job was ran twice, checked if a collegue has deleted a dependency but I didn't found what went wrong. Everything seems to be OK. Is there a known problem with the dependencies of this soft version ? Is there something else that we can check ? Next_phase_4 started before arch_chk_begin was finished Job Name Entry User_name State Next Run Time -------- ----- --------- ----- ------------- NEXT_PHASE_4 527 ARCHMGR DEP Wait 28-MAR-1996 18:04 VMS_Command : request/to='class' "NEXT_PHASE_4 has started" Group : ARCH_BATCH Type : (none) Last Start Time : 28-MAR-1996 18:04 Last Finish Time : 28-MAR-1996 18:04 Last Exit Status : SUCCESS Schedule Interval : 0 Mode : Batch Mail to : OPERDU,OPERVD,OPERGVH (No Mail) Days : ALL Output File : None Cluster_CPU : <Ignored> Notify user upon completion Submit Queue : ARCH$BATCH CPULimit (x100ms) : 0 QPriority : 101 Max_Time Warning : None Job Always retained Stall Notify : None No Retry on Error Success Count : 99 Failure Count : 0 Owner UIC : [100,1] No Restart on Crash Execute Identifier: SCHED_EXECUTE Write Identifier : SCHED_WRITE Send Opcom Completion Message Pre Function : class==f$trnlnm("ARCH$OPCLASS"), Last Exit Status : SUCCESS Post Function : (none) This job has 3 local job(s) that depend upon it: (ARCH_UPD_PRICES, ARCH_CVT_ACCT, ARCH_RESOURCE) All dependencies must successfully complete after: 28-MAR-1996 18:04:56.11 Job Dependencies: (ARCH_CHK_BEGIN, ARCH_COL_PR_CHK, ARCH_AUDIT, ARCH_SUS_CHK, ARCH_PR_CHK) DATE JOB NUMBER EVENT-TYPE NODE PID EXIT-STATUS INFO ---------------------- --------------------- ----------- ------ ------ - ------------ ------------------------------- 25-MAR-1996 05:30:01 QTRMON_ELN02 866 JOB QUEUED ARCH01 entry=11 queue=ARCH02$BATCH 25-MAR-1996 05:30:03 QTRMON_ELN01 867 JOB QUEUED ARCH01 entry=12 queue=ARCH01$BATCH 25-MAR-1996 05:30:05 863 JOB QUEUED ARCH01 entry=13 queue=ARCH01$BATCH 25-MAR-1996 05:30:07 862 JOB QUEUED ARCH01 entry=14 queue=ARCH02$BATCH 25-MAR-1996 05:30:08 QTRMON_ELN02 866 JOB START ARCH01 2041F77 2 25-MAR-1996 05:30:09 862 JOB START ARCH01 2041E27 3 25-MAR-1996 05:30:11 QTRMON_ELN02 866 [1mJOB FINISH [m ARCH02 2041F772 [1mSuccess [m 25-MAR-1996 05:30:12 QTRMON_ELN01 867 JOB START ARCH01 2021EF7 B 25-MAR-1996 05:30:12 862 [1mJOB FINISH [m ARCH02 2041E273 [1mSuccess [m 25-MAR-1996 05:30:13 863 JOB START ARCH01 2021EC7 E 25-MAR-1996 05:30:17 QTRMON_ELN01 867 [1mJOB FINISH [m ARCH01 2021EF7B [1mSuccess [m 25-MAR-1996 05:30:18 863 [1mJOB FINISH [m ARCH01 2021EC7E [1mSuccess [m 25-MAR-1996 17:06:41 ARCH_SER_CREATE 666 JOB QUEUED ARCH01 entry=547 queue=LOWPRIO$BATCH 25-MAR-1996 17:06:48 ARCH_SER_CREATE 666 JOB START ARCH01 2021F15 7 25-MAR-1996 17:14:40 ARCH_SER_CREATE 666 [1mJOB FINISH [m ARCH01 2021F157 [1mSuccess [m 25-MAR-1996 17:16:37 ARCH_PRE_CHK 665 JOB QUEUED ARCH01 entry=551 queue=ARCH02$BATCH 25-MAR-1996 17:16:49 ARCH_PRE_CHK 665 JOB START ARCH01 2041DB4 F 25-MAR-1996 17:19:09 ATS_SHUTDOWN_ELN02 506 JOB QUEUED ARCH01 entry=557 queue=ARCH02$BATCH 25-MAR-1996 17:19:23 ATS_SHUTDOWN_ELN02 506 JOB START ARCH01 2041FB5 E 25-MAR-1996 17:19:43 ARCH_PRE_CHK 665 [1mJOB FINISH [m ARCH02 2041DB4F [1mSuccess [m 25-MAR-1996 17:20:17 ATS_SHUTDOWN_ELN01 507 JOB QUEUED ARCH01 entry=560 queue=ARCH01$BATCH 25-MAR-1996 17:20:20 ATS_SHUTDOWN_ELN01 507 JOB START ARCH01 2021FCA 5 25-MAR-1996 17:28:19 ATS_SHUTDOWN_ELN02 506 [1mJOB FINISH [m ARCH02 2041FB5E [1mSuccess [m 25-MAR-1996 17:28:24 ATS_HOUSEKEEP_ELN02 510 JOB QUEUED ARCH01 entry=570 queue=ARCH02$BATCH 25-MAR-1996 17:28:26 ATS_HOUSEKEEP_ELN02 510 JOB START ARCH01 2041FBA A 25-MAR-1996 17:30:11 START_SHUTMEM 505 JOB QUEUED ARCH01 entry=573 queue=ARCH$BATCH 25-MAR-1996 17:30:17 START_SHUTMEM 505 JOB START ARCH01 2021FA3 B 25-MAR-1996 17:30:18 START_SHUTMEM 505 [1mJOB FINISH [m ARCH01 2021FA3B [1mSuccess [m 25-MAR-1996 17:30:24 ARCH_SHUTMEM_01 508 JOB QUEUED ARCH01 entry=576 queue=ARCH01$BATCH 25-MAR-1996 17:30:26 ARCH_SHUTMEM_02 509 JOB QUEUED ARCH01 entry=578 queue=ARCH02$BATCH 25-MAR-1996 17:30:33 ARCH_SHUTMEM_01 508 JOB START ARCH01 2021F84 0 25-MAR-1996 17:30:34 ARCH_SHUTMEM_02 509 JOB START ARCH01 2041FAB 8 25-MAR-1996 17:30:38 ARCH_SHUTMEM_02 509 [1mJOB FINISH [m ARCH02 2041FAB8 [1mSuccess [m 25-MAR-1996 17:31:54 ARCH_SHUTMEM_01 508 [1mJOB FINISH [m ARCH01 2021F840 [1mSuccess [m 25-MAR-1996 17:32:09 STARTBATCH 504 JOB QUEUED ARCH01 entry=582 queue=ARCH$BATCH 25-MAR-1996 17:32:19 STARTBATCH 504 JOB START ARCH01 2021F75 9 25-MAR-1996 17:32:20 STARTBATCH 504 [1mJOB FINISH [m ARCH01 2021F759 [1mSuccess [m 25-MAR-1996 17:32:24 NEXT_PHASE_2 512 JOB QUEUED ARCH01 entry=583 queue=ARCH$BATCH 25-MAR-1996 17:32:26 ATS_SHUTDOWN_ELN01 507 [1mJOB FINISH [m ARCH01 2021FCA5 [1mSuccess [m 25-MAR-1996 17:32:30 ATS_HOUSEKEEP_ELN01 511 JOB QUEUED ARCH01 entry=584 queue=ARCH01$BATCH 25-MAR-1996 17:32:32 NEXT_PHASE_2 512 JOB START ARCH01 2021705 A 25-MAR-1996 17:32:34 ATS_HOUSEKEEP_ELN01 511 JOB START ARCH01 20219B5 C 25-MAR-1996 17:32:35 NEXT_PHASE_2 512 [1mJOB FINISH [m ARCH01 2021705A [1mSuccess [m 25-MAR-1996 17:32:41 ARCH_SHUTDWN_02 515 JOB QUEUED ARCH01 entry=585 queue=ARCH02$BATCH 25-MAR-1996 17:32:44 ARCH_SHUTDWN_01 516 JOB QUEUED ARCH01 entry=586 queue=ARCH01$BATCH 25-MAR-1996 17:32:50 ARCH_SHUTDWN_02 515 JOB START ARCH01 204200B D 25-MAR-1996 17:32:52 ARCH_SHUTDWN_02 515 [1mJOB FINISH [m ARCH02 204200BD [1mSuccess [m 25-MAR-1996 17:32:53 ARCH_SHUTDWN_01 516 JOB START ARCH01 2021F05 E 25-MAR-1996 17:33:02 ARCH_SHUTDWN_01 516 [1mJOB FINISH [m ARCH01 2021F05E [1mSuccess [m 25-MAR-1996 17:33:09 NEXT_PHASE_3 519 JOB QUEUED ARCH01 entry=587 queue=ARCH$BATCH 25-MAR-1996 17:33:24 NEXT_PHASE_3 519 JOB START ARCH01 2021E96 7 25-MAR-1996 17:33:33 NEXT_PHASE_3 519 [1mJOB FINISH [m ARCH01 2021E967 [1mSuccess [m 25-MAR-1996 17:33:38 ARCH_CHK_BEGIN 522 JOB QUEUED ARCH01 entry=588 queue=ARCH$BATCH 25-MAR-1996 17:33:41 ARCH_COL_PR_CHK 523 JOB QUEUED ARCH01 entry=589 queue=ARCH$BATCH 25-MAR-1996 17:33:44 ARCH_AUDIT 524 JOB QUEUED ARCH01 entry=590 queue=ARCH$BATCH 25-MAR-1996 17:33:48 ARCH_SUS_CHK 525 JOB QUEUED ARCH01 entry=591 queue=ARCH$BATCH 25-MAR-1996 17:33:51 ARCH_PR_CHK 526 JOB QUEUED ARCH01 entry=592 queue=ARCH$BATCH 25-MAR-1996 17:33:52 ARCH_COL_PR_CHK 523 JOB START ARCH01 2041FF0 3 25-MAR-1996 17:34:06 ARCH_CHK_BEGIN 522 JOB START ARCH01 2021D06 E 25-MAR-1996 17:34:14 ARCH_COL_PR_CHK 523 [1mJOB FINISH [m ARCH02 2041FF03 [1mSuccess [m 25-MAR-1996 17:34:14 ARCH_SUS_CHK 525 JOB START ARCH01 2042020 5 25-MAR-1996 17:34:17 ATS_HOUSEKEEP_ELN02 510 [1mJOB FINISH [m ARCH02 2041FBAA [1mSuccess [m 25-MAR-1996 17:34:25 ATS_OPENORD_ELN02 513 JOB QUEUED ARCH01 entry=593 queue=ARCH02$BATCH 25-MAR-1996 17:34:25 ARCH_PR_CHK 526 JOB START ARCH01 2021FD7 2 25-MAR-1996 17:34:26 ARCH_AUDIT 524 JOB START ARCH01 2021E57 1 25-MAR-1996 17:34:28 ARCH_SUS_CHK 525 [1mJOB FINISH [m ARCH02 20420205 [1mSuccess [m 25-MAR-1996 17:34:34 ATS_OPENORD_ELN02 513 JOB START ARCH01 2042002 2 25-MAR-1996 17:35:08 ATS_OPENORD_ELN02 513 [1mJOB FINISH [m ARCH02 20420022 [1mSuccess [m 25-MAR-1996 17:35:14 ATS_ORDERLIM_ELN02 853 JOB QUEUED ARCH01 entry=594 queue=ARCH02$BATCH 25-MAR-1996 17:35:16 ATS_ORDERLIM_ELN02 853 JOB START ARCH01 2041FC2 4 25-MAR-1996 17:35:39 ATS_ORDERLIM_ELN02 853 [1mJOB FINISH [m ARCH02 2041FC24 [1mSuccess [m 25-MAR-1996 17:35:43 ATS_END_OF_DAY_ELN02 517 JOB QUEUED ARCH01 entry=596 queue=ARCH02$BATCH 25-MAR-1996 17:35:45 ATS_END_OF_DAY_ELN02 517 JOB START ARCH01 2042023 1 25-MAR-1996 17:36:12 ATS_END_OF_DAY_ELN02 517 [1mJOB FINISH [m ARCH02 20420231 [1mSuccess [m 25-MAR-1996 17:36:17 ATS_CREDIR_ELN02 520 JOB QUEUED ARCH01 entry=597 queue=ARCH02$BATCH 25-MAR-1996 17:36:20 ATS_CREDIR_ELN02 520 JOB START ARCH01 2041FD3 A 25-MAR-1996 17:36:31 ATS_CREDIR_ELN02 520 [1mJOB FINISH [m ARCH02 2041FD3A [1mSuccess [m 25-MAR-1996 17:38:37 ARCH_AUDIT 524 [1mJOB FINISH [m ARCH01 2021E571 [1mSuccess [m >>>>25-MAR-1996 17:42:28 ARCH_PR_CHK 526 [1mJOB FINISH [m ARCH 01 2021FD72 [1mSuccess [m >>>>25-MAR-1996 17:42:33 NEXT_PHASE_4 527 JOB QUEUED ARCH01 entry=603 queue=ARCH$BATCH >>>>25-MAR-1996 17:42:39 NEXT_PHASE_4 527 JOB START ARCH01 204 1F86D >>>>25-MAR-1996 17:42:41 NEXT_PHASE_4 527 [1mJOB FINISH [m ARCH 02 2041F86D [1mSuccess [m 25-MAR-1996 17:42:46 ARCH_UPD_PRICES 528 JOB QUEUED ARCH01 entry=604 queue=ARCH$BATCH 25-MAR-1996 17:42:49 ARCH_CVT_ACCT 529 JOB QUEUED ARCH01 entry=605 queue=ARCH$BATCH 25-MAR-1996 17:42:51 ARCH_RESOURCE 530 JOB QUEUED ARCH01 entry=606 queue=ARCH$BATCH 25-MAR-1996 17:42:57 ARCH_UPD_PRICES 528 JOB START ARCH01 2041FA6 F 25-MAR-1996 17:43:01 ARCH_RESOURCE 530 JOB START ARCH01 2021F8B 9 25-MAR-1996 17:43:12 ARCH_CVT_ACCT 529 JOB START ARCH01 2042027 1 25-MAR-1996 17:43:36 ARCH_UPD_PRICES 528 [1mJOB FINISH [m ARCH02 2041FA6F [1mSuccess [m 25-MAR-1996 17:43:38 ARCH_CVT_ACCT 529 [1mJOB FINISH [m ARCH02 20420271 [1mSuccess [m 25-MAR-1996 17:45:22 ATS_HOUSEKEEP_ELN01 511 [1mJOB FINISH [m ARCH01 20219B5C [1mSuccess [m 25-MAR-1996 17:45:27 ATS_OPENORD_ELN01 514 JOB QUEUED ARCH01 entry=608 queue=ARCH01$BATCH 25-MAR-1996 17:45:30 ATS_OPENORD_ELN01 514 JOB START ARCH01 2021F82 0 25-MAR-1996 17:49:12 ARCH_RESOURCE 530 [1mJOB FINISH [m ARCH01 2021F8B9 [1mSuccess [m 25-MAR-1996 18:02:59 ATS_OPENORD_ELN01 514 [1mJOB FINISH [m ARCH01 2021F820 [1mSuccess [m 25-MAR-1996 18:03:05 ATS_ORDERLIM_ELN01 852 JOB QUEUED ARCH01 entry=631 queue=ARCH01$BATCH 25-MAR-1996 18:03:11 ATS_ORDERLIM_ELN01 852 JOB START ARCH01 2021F3A A >>>>25-MAR-1996 18:04:17 ARCH_CHK_BEGIN 522 [1mJOB FINISH [m ARCH01 2021D06E [1mSuccess [m 25-MAR-1996 18:05:53 ATS_ORDERLIM_ELN01 852 [1mJOB FINISH [m ARCH01 2021F3AA [1mSuccess [m 25-MAR-1996 18:05:59 ATS_END_OF_DAY_ELN01 518 JOB QUEUED ARCH01 entry=633 queue=ARCH01$BATCH 25-MAR-1996 18:06:02 ATS_END_OF_DAY_ELN01 518 JOB START ARCH01 2021320 5 25-MAR-1996 18:06:39 ATS_END_OF_DAY_ELN01 518 [1mJOB FINISH [m ARCH01 20213205 [1mSuccess [m 25-MAR-1996 18:06:44 ATS_CREDIR_ELN01 521 JOB QUEUED ARCH01 entry=635 queue=ARCH01$BATCH 25-MAR-1996 18:06:47 ATS_CREDIR_ELN01 521 JOB START ARCH01 2021F11 C 25-MAR-1996 18:07:01 ATS_CREDIR_ELN01 521 [1mJOB FINISH [m ARCH01 2021F11C [1mSuccess [m 25-MAR-1996 18:14:40 NEXT_PHASE_5 531 JOB QUEUED ARCH01 entry=638 queue=ARCH$BATCH 25-MAR-1996 18:14:47 NEXT_PHASE_5 531 JOB START ARCH01 2022016 1 25-MAR-1996 18:14:49 NEXT_PHASE_5 531 [1mJOB FINISH [m ARCH01 20220161 [1mSuccess [m 25-MAR-1996 18:14:53 ARCH_CKPDB1 532 JOB QUEUED ARCH01 entry=639 queue=ARCH02$BATCH 25-MAR-1996 18:15:08 ARCH_CKPDB1 532 JOB START ARCH01 2042041 D 25-MAR-1996 18:40:57 ARCH_CKPDB1 532 [1mJOB FINISH [m ARCH02 2042041D [1mSuccess [m 25-MAR-1996 18:41:04 ARCH_CKP1 533 JOB QUEUED ARCH01 entry=641 queue=ARCH$BATCH 25-MAR-1996 18:41:07 NEXT_PHASE_6 534 JOB QUEUED ARCH01 entry=642 queue=ARCH$BATCH 25-MAR-1996 18:41:12 NEXT_PHASE_6 534 JOB START ARCH01 2041EF3 0 25-MAR-1996 18:41:14 NEXT_PHASE_6 534 [1mJOB FINISH [m ARCH02 2041EF30 [1mSuccess [m 25-MAR-1996 18:41:18 ARCH_IN_MONEY 535 JOB QUEUED ARCH01 entry=643 queue=ARCH$BATCH 25-MAR-1996 18:41:21 ARCH_SUS_FEES 536 JOB QUEUED ARCH01 entry=644 queue=ARCH$BATCH 25-MAR-1996 18:41:22 ARCH_CKP1 533 JOB START ARCH01 2021F58 7 25-MAR-1996 18:41:25 ARCH_IN_MONEY 535 JOB START ARCH01 2042053 1 25-MAR-1996 18:41:30 ARCH_SUS_FEES 536 JOB START ARCH01 2021F98 8 25-MAR-1996 18:41:34 ARCH_IN_MONEY 535 [1mJOB FINISH [m ARCH02 20420531 [1mSuccess [m 25-MAR-1996 18:41:37 ARCH_SUS_FEES 536 [1mJOB FINISH [m ARCH01 2021F988 [1mSuccess [m -------------------------------------------------------------------------------- Unfortunately we have no debugging information. Johan.
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
1080.1 | MUZICK::KUNG | Wed Apr 24 1996 10:51 | 21 | ||
Johan: It would be helpful to have an extract of the predecessor job as well to see what its "completion" time was. I would also like to know how the customer knew that the dependent jobs started running WHILE a predecessor job was still running, did they see this through the graphical user interface or through the history log? Reason I ask is because, in simple terms, the Scheduler prioritizes its tasks, and a task such as triggering a job to run comes in at a higher priority than, let's say, updating the user interface displays and real-time displays. So, if there are a large number of jobs running in a dependency pattern, sometimes the user will actually "see" the dependent jobs go into run mode before the predecessor jobs have completed, on the graphical user interface (I have seen this). This doesn't mean that the jobs ran out of order, it's just that the Scheduler is more keen on making sure that jobs are actually started, and that updating the real-time displays for job completion falls in at a lower priority. Hope this helps... | |||||
1080.2 | list ef events | BACHUS::WILLEMS | Johan Willems @BRO DTN 856-8739 | Thu Apr 25 1996 03:59 | 27 |
When you look to the log file (It is too late for the predecessor job information, because this stream runs every day), you will see the following: 17:33:38 ARCH_CHK_BEGIN 522 JOB QUEUED 17:33:41 ARCH_COL_PR_CHK 523 JOB QUEUED 17:33:44 ARCH_AUDIT 524 JOB QUEUED 17:33:48 ARCH_SUS_CHK 525 JOB QUEUED 17:33:51 ARCH_PR_CHK 526 JOB QUEUED 17:33:52 ARCH_COL_PR_CHK 523 JOB START 17:34:06 ARCH_CHK_BEGIN 522 JOB START 17:34:14 ARCH_COL_PR_CHK 523 JOB FINISH 17:34:14 ARCH_SUS_CHK 525 JOB START 17:34:25 ARCH_PR_CHK 526 JOB START 17:34:26 ARCH_AUDIT 524 JOB START 17:34:28 ARCH_SUS_CHK 525 JOB FINISH 17:38:37 ARCH_AUDIT 524 JOB FINISH 17:42:28 ARCH_PR_CHK 526 JOB FINISCH 17:42:33 NEXT_PHASE_4 527 JOB QUEUED <<<<<<< 17:42:39 NEXT_PHASE_4 527 JOB START <<<<<<< 17:42:41 NEXT_PHASE_4 527 JOB FINISH <<<<<<< 18:04:17 ARCH_CHK_BEGIN 522 JOB FINISH <<<<<<< Is this any cleared???? Johan | |||||
1080.3 | MUZICK::RCOLLIER | Wed May 01 1996 14:34 | 21 | ||
I can think of two things that could have happened here: (1) The Scheduler was running on a VERY slow systems and it didn't update the job record in time. 20 minutes is way too long though. (2) Job 522 already ran before and set the flag in the dependent job as having completed. Doing a SCHED SHOW JOBS/FULL will show the job and its dependencies, and if a dependency has been satisfied, then there will be square brackets shown around the job name. In the few cases that we have seen this reported, it has been because of either scenarios listed above, or a corrupted dependency.dat database, which could have lost the dependency relationship link between the two jobs. If this has not happened again, then the database is probably in good shape. If the user wanted to be absolutely sure about the link being in place, they can delete and then re-establish the dependency link (SCHED MOD JOB/SYNC= is the command) between the two impacted jobs. |