
Conference humane::scheduler

Title:SCHEDULER
Notice:Welcome to the Scheduler Conference on node HUMANE
Moderator:RUMOR::FALEK
Created:Sat Mar 20 1993
Last Modified:Tue Jun 03 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1240
Total number of notes:5017

1058.0. "Alpha server communicating with Alpha agent, job always in a running state." by BSS::FLANNERY () Wed Apr 03 1996 15:17

We now have two customers whose jobs remain in a running state on an
Alpha server with Alpha agents.

The configuration is two AXP systems: an AXP agent and an AXP server. The
agent is running 2.1b-5 and the server 2.1b-7.  The other system is running
an older version of Scheduler and the agents.

If we run a job with debug on, we see the job in a running state, and
we see the PID on the agent with a sched sho job/fu. We also see the job's
log file finished on the agent, with logout stats.  The job completed
correctly.

If the agent is down and we try to run the job, we immediately get back
an error. If the proxy is not set up correctly, we also get an error
immediately.

However, if the job runs and the VMS command is invalid (asdf), the job
errors out on the agent but remains in a running state.  Below is
the debug log, with debug set at six.  Too high, I know.

Where in the log file is the information showing our attempt to send
a termination message?

The IP address is correct for each node.
Thanks 
Ed Flannery


538971437: gp: forwarding to port: 5482  address: 144.60.199.120
538971437: in ueu_send_to_address
538971437: connect: 1,   shutdown: 1,   close: 1
538971437: in ueu_add_id_to_packet
538971437: adding packet id: 828522196 to the packet
538971437: adding pid: 538971437 to the packet
put packet
item:
        code:   29
        length: 4

        current length: 260
item:
        code:   30
        length: 4

        current length: 280
538971437: in ueu_send_one_message
538971437: connect: 1,   shutdown: 1,   close: 1
538971437: send socket is 0; init'ing a temporary socket
538971437: UEU INIT
538971437: opened requested socket 4
538971437:      errno: 61  <connection refused  >
538971437: retry 0 of 3
538971437: error on connect to socket 4
538971437:      errno: 22  <invalid argument>
538971437: unplanned errno: 22
538971437: retry 1 of 3
538971437: error on connect to socket 4
538971437:      errno: 22  <invalid argument>
538971437: unplanned errno: 22
538971437: retry 2 of 3
538971437: error on connect to socket 4
538971437:      errno: 22  <invalid argument>
538971437: unplanned errno: 22
538971437: retry 3 of 3
538971437: final error on connect to socket 4
538971437: s0
538971437: job 80 node DC7A1 (pid 538971442) terminated with status 1
538971437: failed to send termination message to sched node
538971437: storing job rec
in modify job rec, key=1
in read job rec
looking for job 80, pid 538971442 matching
l_found: Num:80 Pid:538971442 Active:1 Node:DC7A1
MATCHED: PID=538971442   JOB#=80   NODE=DC7A1
record updated: 65537
538971437:      gp: in process_pending; adding node to pending que
538971437:      gp: que head:  2659800
538971437:      gp: que tail:  2659800
538971437:      gp: pend node: 2662712
538971437:      gp: que head:  2659800
538971437:      gp: que tail:  2662712
538971437:    gp: status from ues_process_packet: 0
538971437:      gp: in dispatch_pending
538971437:    gp: next pending node has not yet timed out
538971437:    gp: status from ues_dispatch_pending: 0
538971437: i_rcv_socket: 3
538971437: UEU RCV: waiting for IO on socket 3
538971437: ueu_wait_for: starting
538971437: ueu_wait_for - timeout: 28/0
538971437: ueu_wait_for: calling select
538971437: ueu_wait_for: select done
538971437: ueu_wait_for: io received; returning
538971437: received connection on new socket 4 from node 144.60.199.120 port 11
538971437: ueu_receive_one_message: starting
538971437: ueu_receive_one_message: block SIGCLD
538971437: ueu_receive_one_message: recv on socket 4
538971437: ueu_receive_one_message: reset signals
538971437: 140 bytes read from socket 4  port: 1121  address: 144.60.199
538971437: ueu_receive_one_message: about to shutdown socket
538971437: ueu_receive_one_message: about to close socket 4
538971437: ueu_receive_one_message: returning
538971437: Dump Packet:

1058.1. "How about a pointer on this call is there any additional information that I can supply ?" by BSS::FLANNERY () Fri Apr 12 1996 10:24

I know you are all busy, but we do have a serious issue here.

Perhaps I did not supply all the information necessary to turn on a "light bulb" on this one.
Is there any additional information that I can supply that will assist in solving
this issue?

Perhaps an IPMT case is needed, since this has quite an impact on these customers?

Thanks
Ed Flannery

1058.2. "" by MRBASS::PUISHYS (Project Leader Scheduler V3.0 for Digital UNIX) Fri Apr 12 1996 12:42

Please file an IPMT case.

1058.3. "I THINK WE FOUND AN ANSWER, THE JOB always in a RUNNING STATE" by BSS::FLANNERY () Wed Apr 24 1996 18:36

	FOLKS:
	I think we found the problem.....

	We have the following  configuration

	_______________________FDDI________________________
	        |		|
		0SERVER		0   AGENT
	________|_______________|____________Ethernet____________

	We can communicate from the server to the agent using the FDDI address,
	but we cannot communicate from the agent to the server using the FDDI
	address, as tested with the command
	
		$ telnet   FDDI agent address  to port 5481 or 5482
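
The telnet test above is simply a check of whether a TCP connection to the Scheduler port can be opened. A minimal Python sketch of the same probe follows. The helper name port_open is hypothetical and is not part of Scheduler or UCX; the address and ports in the commented-out usage lines are the ones that appear in this note and its debug logs.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection opens a new socket, connects, and raises OSError
        # (refused, unreachable, timed out) if the connection cannot be made.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (addresses and ports taken from this note; results will
# depend on the network the probe is run from):
# print(port_open("206.101.128.197", 5481))
# print(port_open("206.101.128.197", 5482))
```

Running the probe in both directions, and against both the FDDI and the Ethernet address of each node, would show which paths are actually usable.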

	What we believe we have found is that Scheduler looks only at the
	address in the logical UCX$INET_HOSTADDR. In the above configuration,
	there must be two addresses, UCX$INET_HOSTADDR and UCX$INET_HOSTADDR2.

	In our case, on the agent:
			"UCX$INET_HOSTADDR" = "206.101.128.196"	(FDDI address)
			"UCX$INET_HOSTADDR2" = "206.101.129.196"	(Ethernet address)

	Yet on the server the following was true:
			"UCX$INET_HOSTADDR" = "206.101.129.197"	(Ethernet)
			"UCX$INET_HOSTADDR2" = "206.101.128.197"	(FDDI)
	
	We simply swapped these logicals on the server so that the FDDI address
	comes first. The server now has the following:

			"UCX$INET_HOSTADDR" = "206.101.128.197"	(FDDI)
			"UCX$INET_HOSTADDR2" = "206.101.129.197"	(Ethernet)

	I am wondering whether the agent or Scheduler looks only at the logical
	UCX$INET_HOSTADDR and ignores the logical UCX$INET_HOSTADDR2,
	thus never attempting to communicate back via Ethernet.

	The following crazy output from the debug log file is quite a concern.

	What is error 22? Why "invalid argument", and which argument?

538973883: connect flag TRUE; connecting socket 4  to port 0  at node 206.101.128.197
538973883: error on connect to socket 4
538973883: opened requested socket 4
538973883: port requested is 0; assuming sending port: no bind
538973883: connect flag TRUE
538973883: connect flag TRUE; connecting socket 4  to port 0  at node 206.101.128.197
538973883: error on connect to socket 4
538973883:      errno: 61  <connection refused  >
538973883: retry 0 of 3
538973883: error on connect to socket 4
538973883:      errno: 22  <invalid argument>
538973883: unplanned errno: 22
538973883: retry 1 of 3
538973883: error on connect to socket 4
538973883:      errno: 22  <invalid argument>
538973883: unplanned errno: 22
538973883: retry 2 of 3
538973883: error on connect to socket 4
538973883:      errno: 22  <invalid argument>
538973883: unplanned errno: 22
538973883: retry 3 of 3
538973883: final error on connect to socket 4
538973883: send failed
538973883:    pending:  sending of termination message failed
538973883:    pending:  placing pending node back into queue with new timeout: 
538973883:      gp: que head:  0
538973883:      gp: que tail:  0
538973883:      gp: pend node: 2250528
538973883:      gp: que head:  2250528
538973883:      gp: que tail:  2250528
538973883:    gp: status from ues_dispatch_pending: 0
538973883: i_rcv_socket: 3
538973883: UEU RCV: waiting for IO on socket 3
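
On the errno 22 question: the log shows errno 61 (connection refused) on the first attempt and errno 22 (invalid argument) on every retry, all against socket 4. One possible explanation is that the retries reuse the descriptor whose connect() has already failed. This is an assumption; nothing in the log or in the Scheduler sources confirms it. POSIX leaves the state of a socket unspecified after a failed connect(), and BSD-derived TCP/IP stacks commonly return EINVAL on the second attempt. The Python sketch below shows the usual remedy of opening a fresh socket for each attempt. The name connect_with_retry is hypothetical and this is not the Scheduler's code.

```python
import socket
import time


def connect_with_retry(host, port, retries=3, delay=0.0, timeout=5.0):
    """Connect to (host, port), using a fresh socket for every attempt.

    Returns (sock, errors): sock is a connected socket or None, and errors
    lists the errno value of each failed attempt (None for a timeout).
    """
    errors = []
    # One initial attempt plus `retries` retries, like "retry 0 of 3" .. "retry 3 of 3".
    for _attempt in range(retries + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            return sock, errors
        except OSError as exc:
            errors.append(exc.errno)
            sock.close()  # discard the failed descriptor instead of reusing it
            time.sleep(delay)
    return None, errors
```

With a fresh socket per attempt, each retry reports the real network error (for example connection refused) rather than an error caused by the state of the old descriptor.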


Thanks
Ed Flannery