
Conference humane::scheduler

Title:	SCHEDULER
Notice:	Welcome to the Scheduler Conference on node HUMANE
Moderator:	RUMOR::FALEK

Created:	Sat Mar 20 1993
Last Modified:	Tue Jun 03 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1240
Total number of notes:	5017

1058.0. "Alpha server communicating with Alpha agent, job always in a running state." by BSS::FLANNERY () Wed Apr 03 1996 15:17

We now have two customers who are seeing jobs remain in a running
state on an Alpha server with Alpha agents.

The configuration is two AXP systems: an AXP agent and an AXP server. The
agent is running 2.1b-5 and the server is running 2.1b-7.  The other system
is running an older version of Scheduler and the agents.

If we run a job with debug on, we see the job in a running state and
we see the pid on the agent with a sched sho job/fu. We also see the job's
log file on the agent, finished, with logout stats. The job completed
correctly.

If the agent is down and we try to run the job, we immediately get back
an error. If the proxy is not set up correctly, we also get an error
immediately.

However, if the job runs and the VMS command is asdf, the job errors
out on the agent but remains in a running state.  Here is the debug
log, with debug set at six.  Too high, I know.

The following is from the log file, at the point where we attempt to send
a termination message.

The IP address is correct for each node.
Thanks 
Ed Flannery


538971437: gp: forwarding to port: 5482  address: 144.60.199.120
538971437: in ueu_send_to_address
538971437: connect: 1,   shutdown: 1,   close: 1
538971437: in ueu_add_id_to_packet
538971437: adding packet id: 828522196 to the packet
538971437: adding pid: 538971437 to the packet
put packet
item:
        code:   29
        length: 4

        current length: 260
item:
        code:   30
        length: 4

        current length: 280
538971437: in ueu_send_one_message
538971437: connect: 1,   shutdown: 1,   close: 1
538971437: send socket is 0; init'ing a temporary socket
538971437: UEU INIT
538971437: opened requested socket 4
538971437:      errno: 61  <connection refused  >
538971437: retry 0 of 3
538971437: error on connect to socket 4
538971437:      errno: 22  <invalid argument>
538971437: unplanned errno: 22
538971437: retry 1 of 3
538971437: error on connect to socket 4
538971437:      errno: 22  <invalid argument>
538971437: unplanned errno: 22
538971437: retry 2 of 3
538971437: error on connect to socket 4
538971437:      errno: 22  <invalid argument>
538971437: unplanned errno: 22
538971437: retry 3 of 3
538971437: final error on connect to socket 4
538971437: s0
538971437: job 80 node DC7A1 (pid 538971442) terminated with status 1
538971437: failed to send termination message to sched node
538971437: storing job rec
in modify job rec, key=1
in read job rec
looking for job 80, pid 538971442 matching
l_found: Num:80 Pid:538971442 Active:1 Node:DC7A1
MATCHED: PID=538971442   JOB#=80   NODE=DC7A1
record updated: 65537
538971437:      gp: in process_pending; adding node to pending que
538971437:      gp: que head:  2659800
538971437:      gp: que tail:  2659800
538971437:      gp: pend node: 2662712
538971437:      gp: que head:  2659800
538971437:      gp: que tail:  2662712
538971437:    gp: status from ues_process_packet: 0
538971437:      gp: in dispatch_pending
538971437:    gp: next pending node has not yet timed out
538971437:    gp: status from ues_dispatch_pending: 0
538971437: i_rcv_socket: 3
538971437: UEU RCV: waiting for IO on socket 3
538971437: ueu_wait_for: starting
538971437: ueu_wait_for - timeout: 28/0
538971437: ueu_wait_for: calling select
538971437: ueu_wait_for: select done
538971437: ueu_wait_for: io received; returning
538971437: received connection on new socket 4 from node 144.60.199.120 port 11
538971437: ueu_receive_one_message: starting
538971437: ueu_receive_one_message: block SIGCLD
538971437: ueu_receive_one_message: recv on socket 4
538971437: ueu_receive_one_message: reset signals
538971437: 140 bytes read from socket 4  port: 1121  address: 144.60.199
538971437: ueu_receive_one_message: about to shutdown socket
538971437: ueu_receive_one_message: about to close socket 4
538971437: ueu_receive_one_message: returning
538971437: Dump Packet:
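
One possible reading of the connect sequence in the log above (errno 61,
connection refused, on the first attempt, then errno 22, invalid argument, on
every retry against the same socket 4) is that the retries reuse a socket whose
first connect() already failed. POSIX leaves the state of a socket unspecified
after a failed connect() and recommends closing it and creating a new one
before retrying. Whether the agent actually behaves this way is an assumption,
not something the log confirms. The Python sketch below is not Scheduler code;
the name connect_with_retry and its parameters are hypothetical, and it only
illustrates the retry pattern with a fresh socket per attempt.

```python
import socket
import time


def connect_with_retry(host, port, retries=3, delay=0.0, timeout=5.0):
    """Attempt a TCP connect, using a brand-new socket for every attempt.

    Returns (sock, errnos): sock is the connected socket, or None if all
    attempts failed; errnos lists the errno of each failed attempt.
    """
    errnos = []
    for attempt in range(retries + 1):
        # Create a new socket each time; a socket whose connect() failed
        # is never reused.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            return sock, errnos
        except OSError as exc:
            errnos.append(exc.errno)
            sock.close()
            if attempt < retries:
                time.sleep(delay)
    return None, errnos
```

With this pattern, a peer that is not listening should report the same
"connection refused" error on every attempt, so the real cause stays visible in
the log instead of being replaced by "invalid argument".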

T.R	Title	User	Personal Name	Date	Lines
1058.1	How about a pointer on this call? Is there any additional information that I can supply?	BSS::FLANNERY		`Fri Apr 12 1996 10:24`	11
	I know you are all busy, but we do have a serious issue here. Perhaps I did not supply all the information needed to turn on a "light bulb" on this one. Is there any additional information I can supply that would help solve this issue? Perhaps an IPMT case is needed, since this has quite an impact on these customers. Thanks, Ed Flannery
1058.2		MRBASS::PUISHYS	Project Leader Scheduler V3.0 for Digital UNIX	`Fri Apr 12 1996 12:42`	1
	Please file an IPMT case.
1058.3	I THINK WE FOUND AN ANSWER, THE JOB always in a RUNNING STATE	BSS::FLANNERY		`Wed Apr 24 1996 18:36`	82
	FOLKS:

	I think we found the problem.....

	We have the following configuration:

	_______________________FDDI_______________________
	        |               |
	     0 SERVER        0 AGENT
	________|_______________|____________Ethernet____________

	We can communicate from the server to the agent using the FDDI address,
	but we cannot communicate from the agent back to the server using the
	FDDI address. We tested this with the command "$ telnet" to the FDDI
	agent address, to port 5481 or 5481.

	What we believe we have found is that Scheduler looks only at the address
	in the logical ucx$inet_hostaddr. In the above configuration there must be
	two addresses, ucx$inet_hostaddr and ucx$inet_hostaddr2.

	In our case, on the agent:

	"UCX$INET_HOSTADDR" = "206.101.128.196"    (FDDI address)
	"UCX$INET_HOSTADDR2" = "206.101.129.196"   (Ethernet address)

	Yet on the server the following was true:

	ucx$inet_hostaddr = 206.101.129.197      (Ethernet)
	ucx$inet_hostaddr2 = "206.101.128.197"   (FDDI)

	We simply changed these logicals on the server so that the FDDI address
	comes first. The server now has the following:

	ucx$inet_hostaddr = 206.101.128.197      (FDDI)
	ucx$inet_hostaddr2 = "206.101.129.197"   (Ethernet)

	I am wondering whether the agent or the scheduler looks only for the
	logical ucx$inet_hostaddr and does not take the logical ucx$inet_hostaddr2
	into consideration, and so never attempts to communicate back via Ethernet.

	The following crazy output from the debug log file is quite a concern.
	What is the error 22? Why the invalid argument, and what argument?

	538973883: connect flag TRUE; connecting socket 4 to port 0 at node 206.101.128.197
	538973883: error on connect to socket 4
	538973883: opened requested socket 4
	538973883: port requested is 0; assuming sending port: no bind
	538973883: connect flag TRUE
	538973883: connect flag TRUE; connecting socket 4 to port 0 at node 206.101.128.197
	538973883: error on connect to socket 4
	538973883: errno: 61 <connection refused >
	538973883: retry 0 of 3
	538973883: error on connect to socket 4
	538973883: errno: 22 <invalid argument>
	538973883: unplanned errno: 22
	538973883: retry 1 of 3
	538973883: error on connect to socket 4
	538973883: errno: 22 <invalid argument>
	538973883: unplanned errno: 22
	538973883: retry 2 of 3
	538973883: error on connect to socket 4
	538973883: errno: 22 <invalid argument>
	538973883: unplanned errno: 22
	538973883: retry 3 of 3
	538973883: final error on connect to socket 4
	538973883: send failed
	538973883: pending: sending of termination message failed
	538973883: pending: placing pending node back into queue with new timeout:
	538973883: gp: que head: 0
	538973883: gp: que tail: 0
	538973883: gp: pend node: 2250528
	538973883: gp: que head: 2250528
	538973883: gp: que tail: 2250528
	538973883: gp: status from ues_dispatch_pending: 0
	538973883: i_rcv_socket: 3
	538973883: UEU RCV: waiting for IO on socket 3

	Thanks
	Ed Flannery
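
The telnet check described in this reply (can each interface address of the
peer be reached on the Scheduler port?) can be sketched as a small script for
anyone who wants to repeat it on a multi-homed configuration. This is a
minimal sketch in Python, not part of Scheduler; the function names
probe_addresses and first_reachable are hypothetical, and the addresses and
port in the usage note are simply the ones quoted in this note.

```python
import socket


def probe_addresses(addresses, port, timeout=3.0):
    """Try a TCP connect to each address on the given port.

    Returns a dict mapping each address to None if the connect
    succeeded, or to the error text if it failed.
    """
    results = {}
    for address in addresses:
        try:
            # The connection is closed again as soon as it is established.
            with socket.create_connection((address, port), timeout=timeout):
                results[address] = None
        except OSError as exc:
            results[address] = str(exc)
    return results


def first_reachable(addresses, port, timeout=3.0):
    """Return the first address that accepts a connection, or None."""
    results = probe_addresses(addresses, port, timeout)
    for address in addresses:
        if results[address] is None:
            return address
    return None
```

Run from the agent side, a call such as
probe_addresses(["206.101.128.197", "206.101.129.197"], 5481) would show
whether the server's FDDI address, its Ethernet address, or both accept
connections, which is the same information the telnet test gives one address
at a time.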