T.R | Title | User | Personal Name | Date | Lines |
---|
1234.1 | Are all batch queues accessible ? | RUMOR::FALEK | ex-TU58 King | Wed Apr 02 1997 16:53 | 8 |
| Are you talking here about "batch mode" (ie using batch queues) jobs ?
Or "detached" mode ?
If batch mode, make sure that the batch queues on both VAX and Alpha
are properly visible to the scheduler. Are detached mode jobs (no batch
queues) OK?
So far as I know, this is supposed to work.
|
1234.2 | | TAV02::GALIA | Galia Reznik, Israel Software Support | Mon Apr 07 1997 06:06 | 45 |
| Hi,
I am talking about batch queues here. They don't have detached mode jobs.
The queues are visible (in SCHEDULER) from VAX to Alpha and from Alpha
to VAX. As I mentioned in .0, they CAN send a job to a queue, and it
starts executing, but it never ends. It does end when the Alpha is the
default node.
I want to attach part of the log file in DEBUG mode, which may help
to trace the problem. This job, 3912, was sent from the VAX to an Alpha
queue. Please note that wherever the Alpha's name should appear, the
field is empty. Of course, the Alpha's nodename in VMS is
defined correctly, in SCS and in NCP. Where should it be defined in SCHEDULER
when the node is not the default one?
And even though the SCHEDULER claims the entry ended, it never ends in
the queue; it hangs executing an image (please see .0).
Thanks,
Galia.
DEBUG log-file:
----------------
we woke up!
got mbx msg 'BS3912 '
Job start message for job# 3912
queue_job lock returned 1 lock id=1800728C
CLUSTER_BROADCAST: node= msg=B+ <----------
told to update count <----------
got term mbx msg 'BE-785'
job end message for pid FFFFFCEF
CLUSTER_BROADCAST: node= msg=B- <----------
told to adjust count <----------
job status of ended batch job is Q
job end status= 196609
DEQ_JOB_LOCK returned 1
job # 3912 finished.... count= 1
0 remote nodes care about job 3912
10:18 AM processing record # 3912 status= S request=
Now= 2-Apr-1997 10:18:49.88 job_sched_time= 2-Apr-1997 14:18:45.81
job 3912 is scheduled for the future
10:18 AM updated record # 3912 status= S request=
Found 0 local jobs depending on :: 3912
timer flag was clear
timer not expired. No earlier event to set.
sleeping
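
For a longer DEBUG log, the broadcasts with an empty node field can be
picked out mechanically. The following is a minimal Python sketch, not part
of the Scheduler product; the function name is hypothetical, and the form of
a line with a non-empty node name is an assumption, since the log above only
shows the empty case.

```python
import re

# Matches the broadcast lines seen in the DEBUG log above.
# Assumption: a populated node field would look like "node=NAME msg=...".
BROADCAST = re.compile(r"CLUSTER_BROADCAST:\s*node=(.*?)\s*msg=(\S+)")


def empty_node_broadcasts(log_text):
    """Return (line_number, msg) for each broadcast whose node field is empty."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        match = BROADCAST.search(line)
        if match and match.group(1).strip() == "":
            hits.append((number, match.group(2)))
    return hits
```

Calling empty_node_broadcasts() on the text of a saved log file returns one
entry per suspicious line, which makes it easier to compare logs from the
VAX and the Alpha.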
|
1234.3 | a problem with logical SYS$NODE ? | RUMOR::FALEK | ex-TU58 King | Mon Apr 07 1997 21:52 | 16 |
| Aha ! The nodename being missing is certainly related to this problem!
The message telling the "default" scheduler that the job ended is
getting lost.
When NSCHED.EXE starts, it finds out its nodename by translating the
logical SYS$NODE. If you do a $ sched show status
on that cluster, do the nodenames appear in the display ?
Make sure that logical SYS$NODE is properly defined on both machines.
There is probably a logical name (or else NSCHED wouldn't start) but it
may have the wrong stuff in it.
If this is DECnet Phase V, make sure the logical has the Phase IV alias
(6 characters, maximum).
Then stop and restart the schedulers on both nodes.
Is the problem still occurring ?
|
1234.4 | | TAV02::GODOVNIK | Haim Godovnik | Thu Apr 10 1997 09:52 | 18 |
|
Hi,
I am stepping in for Galia as she is on vacation.
The scheduler starts after DECNET and the logical name SYS$NODE is defined
correctly on all nodes.
They do not use DECNET Phase V. They have restarted the scheduler on all
nodes but nothing changed. They also defined the scheduler object in NCP.
What else should we check? Did the scheduler database change between version
2.1A and 2.1B?
Thanks for Your help,
Haim G.
|
1234.5 | one additional question | RUMOR::FALEK | ex-TU58 King | Thu Apr 10 1997 17:51 | 3 |
| If you do a $ sched show status
do you see a good display (i.e. all the nodenames are shown) ?
|
1234.6 | | TAV02::GODOVNIK | Haim Godovnik | Sun Apr 13 1997 03:36 | 12 |
|
Hi,
On the sched sho stat display he sees both ALPHA and VAX nodes. He added
another ALPHA to the cluster, and between the ALPHAs everything works fine.
The problems are between VAX and ALPHA.
Thanks,
Haim G.
|
1234.7 | plan of attack | RUMOR::FALEK | ex-TU58 King | Mon Apr 14 1997 19:33 | 45 |
| Ok, I'm running out of ideas... Let's summarize what we know:
The problem occurs when the "default" scheduler is running on an Alpha,
but the batch execution queue is on a VAX
The job actually runs and completes, but the scheduler system never
detects that fact - so it "thinks" it is still running.
The debug info shows that the "batch end" (BE) message is being
broadcast to a scheduler with a null node name. On reflection, I now
think this might be normal, since the "default"
scheduler hears all the messages, and is supposed to react to ones
where no nodename is specified - no node means the "default". So we may
have been barking up the wrong tree with the "no node name" thing.
We know a "batch end" message is getting sent when the job completes...
So the question is, does the default scheduler actually GET this
message, and if so, what does it do with it ?
To answer this question you could
1. Put all scheduler jobs that are likely to run accidentally during the
experiment on hold. Preferably, wait for all running jobs to finish.
2. Stop all schedulers in the cluster $ sched stop/all
3. On a hardcopy terminal or a screen where you can watch the output,
on the Alpha system, $ run nsched$:nsched.exe It will print a lot of
stuff as it reads through all the jobs and then it will print
"Sleeping..."
4. Start the scheduler on the VAX
(The Alpha NSCHED will notice it started, you will see some output
and then it will print "Sleeping..." again)
5. Run a Scheduler batch-mode job on the VAX. You will see some stuff
print on the Alpha when the job starts. Then the Alpha will print
"Sleeping...". When the job completes on the VAX, watch CAREFULLY what
(if anything) the scheduler on the Alpha prints.
If it doesn't print anything at all, then the BE message isn't being
processed (valuable information). If it does print something then that
will tell us exactly what's going on - WHAT DOES IT SAY ?
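
The check in step 5 can also be run against captured output (for example a
saved terminal session). This is a minimal Python sketch, assuming the Alpha
prints the same "Job start message for job# N" and "job end message" strings
that appear in the DEBUG log in .2; the function name is hypothetical.

```python
import re


def saw_job_end(captured, job_number):
    """Return True if a job end message follows the start message for job_number.

    Assumption: the message strings match those in the DEBUG log in .2.
    """
    start = re.compile(rf"job start message for job#\s*{job_number}\b")
    started = False
    for line in captured.splitlines():
        text = line.strip().lower()
        if not started:
            # Ignore everything until the start message for this job appears.
            started = bool(start.search(text))
        elif "job end message" in text:
            return True
    return False
```

A False result for a job that has already completed on the VAX corresponds
to the "doesn't print anything at all" case above.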
|
1234.8 | | TAV02::GODOVNIK | Haim Godovnik | Tue Apr 15 1997 08:27 | 14 |
|
Hi,
Thank You for Your help.
I have asked the customer to do the tests you asked for in .7.
After the job completes he gets nothing on the screen after the last
"Sleeping..." message, which means that the BE message is not being processed.
He also tried DETACHED mode jobs and everything worked fine. The problem seems
to be only in BATCH mode.
Haim G.
|
1234.9 | likely a bug that must be escalated | RUMOR::FALEK | ex-TU58 King | Tue Apr 15 1997 15:07 | 14 |
| It's probably a bug then, and most likely not a known one, though I can't
be sure. Batch jobs in heterogeneous VMSclusters are supposed to work!
You've already gathered information that shows approximately what step
in the job processing mechanism is failing. I hoped it would be
something simple, like a queue file or logical name problem that I could
suggest a fix for. I suspect this might be a bug that requires a patch.
Unfortunately, I'm not a member of product engineering. However, the
information you've already gathered should be very useful to them.
You need to escalate this through official support channels.
They need to search through their database to see if this is
a known problem. They (the support org.) need to figure out
exactly why this is broken at your site and supply a fix !
|
1234.10 | | ZEKE::BURTON | Jim Burton, DTN 381-6470 | Tue Apr 15 1997 16:51 | 5 |
| If you need to know the official escalation channel in your area, please
contact Curtis Chase @OGO.
Jim
Scheduler Product Manager
|
1234.11 | Problem solved | TAV02::GODOVNIK | Haim Godovnik | Thu Apr 17 1997 06:23 | 16 |
|
Hi,
Before escalating the problem I asked the customer to upgrade the VAX
from 2.1A to 2.1B. After the installation everything works fine. I do not
know whether something related to this was fixed in 2.1B or whether the
reinstall itself solved it.
Thank You very much for all Your help,
Haim Godovnik,
CSC Israel
|