Title: | SCHEDULER |
Notice: | Welcome to the Scheduler Conference on node HUMANE ril |
Moderator: | RUMOR::FALEK |
|
Created: | Sat Mar 20 1993 |
Last Modified: | Tue Jun 03 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 1240 |
Total number of notes: | 5017 |
1058.0. "Alpha server communicating with Alpha agent, job always in a running state." by BSS::FLANNERY () Wed Apr 03 1996 15:17
We now have two customers who are experiencing the issue of jobs
remaining in a running state with an Alpha server and Alpha agents.
We have two AXPs: an AXP agent and an AXP server. The agent is running
2.1B-5 and the server is running 2.1B-7. The other customer's system is
running an older version of Scheduler and the agents.
If we run a job with debug on, we see the job stuck in a running state.
We see the PID on the agent with a SCHED SHO JOB/FU, and we see the log
file of the job finished on the agent with logout stats; the job completed
correctly.
If the agent is down and we try to run the job, we immediately get back
an error; if we do not have the proxy set up correctly, we also get an
error immediately.
If the job runs but the VMS command is invalid (e.g. "asdf"), however, the
job errors out on the agent, yet it remains in a running state. Below is
the log with debug turned on and set at six (too high, I know).
The log file below shows where we attempt to send a termination
message.
The IP address is correct for each node.
Thanks
Ed Flannery
538971437: gp: forwarding to port: 5482 address: 144.60.199.120
538971437: in ueu_send_to_address
538971437: connect: 1, shutdown: 1, close: 1
538971437: in ueu_add_id_to_packet
538971437: adding packet id: 828522196 to the packet
538971437: adding pid: 538971437 to the packet
put packet
item:
code: 29
length: 4
current length: 260
item:
code: 30
length: 4
current length: 280
538971437: in ueu_send_one_message
538971437: connect: 1, shutdown: 1, close: 1
538971437: send socket is 0; init'ing a temporary socket
538971437: UEU INIT
538971437: opened requested socket 4
538971437: errno: 61 <connection refused >
538971437: retry 0 of 3
538971437: error on connect to socket 4
538971437: errno: 22 <invalid argument>
538971437: unplanned errno: 22
538971437: retry 1 of 3
538971437: error on connect to socket 4
538971437: errno: 22 <invalid argument>
538971437: unplanned errno: 22
538971437: retry 2 of 3
538971437: error on connect to socket 4
538971437: errno: 22 <invalid argument>
538971437: unplanned errno: 22
538971437: retry 3 of 3
538971437: final error on connect to socket 4
538971437: s0
538971437: job 80 node DC7A1 (pid 538971442) terminated with status 1
538971437: failed to send termination message to sched node
538971437: storing job rec
in modify job rec, key=1
in read job rec
looking for job 80, pid 538971442 matching
l_found: Num:80 Pid:538971442 Active:1 Node:DC7A1
MATCHED: PID=538971442 JOB#=80 NODE=DC7A1
record updated: 65537
538971437: gp: in process_pending; adding node to pending que
538971437: gp: que head: 2659800
538971437: gp: que tail: 2659800
538971437: gp: pend node: 2662712
538971437: gp: que head: 2659800
538971437: gp: que tail: 2662712
538971437: gp: status from ues_process_packet: 0
538971437: gp: in dispatch_pending
538971437: gp: next pending node has not yet timed out
538971437: gp: status from ues_dispatch_pending: 0
538971437: i_rcv_socket: 3
538971437: UEU RCV: waiting for IO on socket 3
538971437: ueu_wait_for: starting
538971437: ueu_wait_for - timeout: 28/0
538971437: ueu_wait_for: calling select
538971437: ueu_wait_for: select done
538971437: ueu_wait_for: io received; returning
538971437: received connection on new socket 4 from node 144.60.199.120 port 11
538971437: ueu_receive_one_message: starting
538971437: ueu_receive_one_message: block SIGCLD
538971437: ueu_receive_one_message: recv on socket 4
538971437: ueu_receive_one_message: reset signals
538971437: 140 bytes read from socket 4 port: 1121 address: 144.60.199
538971437: ueu_receive_one_message: about to shutdown socket
538971437: ueu_receive_one_message: about to close socket 4
538971437: ueu_receive_one_message: returning
538971437: Dump Packet:
T.R | Title | User | Personal Name | Date | Lines |
---|
1058.1 | How about a pointer on this call? Is there any additional information that I can supply? | BSS::FLANNERY | | Fri Apr 12 1996 10:24 | 11 |
| I know you are all busy, but we do have a serious issue here.
Perhaps I did not supply all the information necessary to turn on a
"light bulb" on this one. Is there any additional information I can supply
that will assist in solving this issue?
Perhaps an IPMT case is needed, since this is quite an impact on these customers?
Thanks
Ed Flannery
|
1058.2 | | MRBASS::PUISHYS | Project Leader Scheduler V3.0 for Digital UNIX | Fri Apr 12 1996 12:42 | 1 |
| File an IPMT case, please.
|
1058.3 | I THINK WE FOUND AN ANSWER, THE JOB always in a RUNNING STATE | BSS::FLANNERY | | Wed Apr 24 1996 18:36 | 82 |
|
FOLKS:
I think we found the problem.....
We have the following configuration
    ________________________ FDDI ________________________
       |                                    |
     SERVER                               AGENT
    ___|____________________________________|___ Ethernet ___
We can communicate from the server to the agent using the FDDI address,
but we cannot communicate from the agent back to the server using the FDDI
address, tested with the command
$ telnet <FDDI address> to port 5481 or 5482
What we believe we found is that Scheduler looks only at the
address in the logical UCX$INET_HOSTADDR. In the above configuration,
there are two addresses, UCX$INET_HOSTADDR and UCX$INET_HOSTADDR2.
In our case, on the agent:
"UCX$INET_HOSTADDR"  = "206.101.128.196"  (FDDI address)
"UCX$INET_HOSTADDR2" = "206.101.129.196"  (Ethernet address)
Yet on the server the following was true:
"UCX$INET_HOSTADDR"  = "206.101.129.197"  (Ethernet)
"UCX$INET_HOSTADDR2" = "206.101.128.197"  (FDDI)
We simply changed these logicals so that the FDDI address came first;
that is, on the server we now have:
"UCX$INET_HOSTADDR"  = "206.101.128.197"  (FDDI)
"UCX$INET_HOSTADDR2" = "206.101.129.197"  (Ethernet)
I am wondering if the agent or Scheduler looks only at the logical
UCX$INET_HOSTADDR and does not take UCX$INET_HOSTADDR2 into consideration,
thus never attempting to communicate back via Ethernet.
The following output from the debug log file is quite a concern.
What is error 22? Why "invalid argument", and what argument?
538973883: connect flag TRUE; connecting socket 4 to port 0 at node 206.101.128.197
538973883: error on connect to socket 4
538973883: opened requested socket 4
538973883: port requested is 0; assuming sending port: no bind
538973883: connect flag TRUE
538973883: connect flag TRUE; connecting socket 4 to port 0 at node 206.101.128.197
538973883: error on connect to socket 4
538973883: errno: 61 <connection refused >
538973883: retry 0 of 3
538973883: error on connect to socket 4
538973883: errno: 22 <invalid argument>
538973883: unplanned errno: 22
538973883: retry 1 of 3
538973883: error on connect to socket 4
538973883: errno: 22 <invalid argument>
538973883: unplanned errno: 22
538973883: retry 2 of 3
538973883: error on connect to socket 4
538973883: errno: 22 <invalid argument>
538973883: unplanned errno: 22
538973883: retry 3 of 3
538973883: final error on connect to socket 4
538973883: send failed
538973883: pending: sending of termination message failed
538973883: pending: placing pending node back into queue with new timeout:
538973883: gp: que head: 0
538973883: gp: que tail: 0
538973883: gp: pend node: 2250528
538973883: gp: que head: 2250528
538973883: gp: que tail: 2250528
538973883: gp: status from ues_dispatch_pending: 0
538973883: i_rcv_socket: 3
538973883: UEU RCV: waiting for IO on socket 3
Thanks
Ed Flannery
|