
Conference clusta::acms

Title:	ACMS comments and questions
Notice:	This is not an official software support channel. Kits 5.*
Moderator:	CLUSTA::HALLAN

Created:	Mon Feb 17 1986
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	4179
Total number of notes:	15091

4130.0. "%ACMSMSS-I-ERRLNKREAD, -SYSTEM-F-FILNOTACC" by PRSSOS::MAGENC () Thu Mar 20 1997 12:38


				Hello !


	A customer of ours (the most important stock exchange company
	in France) runs a user-written ACMS/NSP application on a VAX
	(OpenVMS 6.1, DECnet/OSI 6.2 ECO 8, ACMS V 4.0.2).
	This customer cannot upgrade to VMS 6.2 and DECnet/OSI 6.3 ECO 6
	(upgrading to the latest versions takes them a very long time
	because they have to test their application very carefully
	before they can be sure there is no problem with it).

	Since this VAX was upgraded from VMS 5.5-2 with DECnet Phase IV to
	VMS 6.1 with DECnet/OSI, 3 months ago, ACMS (SWLUP) has logged
	hundreds of errors per day, but the application seems to work
	normally and the users don't complain.

	The errors reported by swlup are :

-------------------------------------------------------------------------------
          SWLUP Event Log Output Listing                  19-MAR-1997 13:38:09
	  ACMS SWLUP V4.0-0                   Page 1

Event logged at: 19-MAR-1997 11:50:13.38

PID = 404006DF    PC = 000906FE    PSL = 03C00000

Process UIC  : [030,024]
Process name : ACMS01EXC003000
Username     : ACMS_EXC
Image file   : DSA0:[SYS1.SYSCOMMON.][SYSEXE]ACMSEXC.EXE;5

%ACMSMSS-I-ERRLNKREAD, Error reading network link
-SYSTEM-F-FILNOTACC, file not accessed on channel

-------------------------------------------------------------------------------
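	To count such events per process before comparing them with network
	traces, a SWLUP listing can be parsed with a short script. This is a
	minimal sketch that assumes the listing layout shown above (an
	"Event logged at:" line starting each event, followed by the process
	details and the message lines). The function names are made up for
	illustration and are not part of ACMS.

```python
import re
from collections import Counter

# One event, copied from the SWLUP listing above.
SAMPLE = """\
Event logged at: 19-MAR-1997 11:50:13.38

PID = 404006DF    PC = 000906FE    PSL = 03C00000

Process UIC  : [030,024]
Process name : ACMS01EXC003000
Username     : ACMS_EXC
Image file   : DSA0:[SYS1.SYSCOMMON.][SYSEXE]ACMSEXC.EXE;5

%ACMSMSS-I-ERRLNKREAD, Error reading network link
-SYSTEM-F-FILNOTACC, file not accessed on channel
"""

# Matches "%FACILITY-S-IDENT," or "-FACILITY-S-IDENT," message lines.
MESSAGE_RE = re.compile(r"[%-][A-Z0-9$_]+-[A-Z]-[A-Z0-9_]+,")


def parse_events(text):
    """Split a SWLUP listing into events, one per 'Event logged at:' line."""
    events = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        started = re.match(r"Event logged at:\s*(.+)", line)
        if started:
            current = {"time": started.group(1), "process": None, "messages": []}
            events.append(current)
        elif current is None:
            continue  # page header lines before the first event
        elif line.startswith("Process name"):
            current["process"] = line.split(":", 1)[1].strip()
        elif MESSAGE_RE.match(line):
            # Keep only FACILITY-SEVERITY-IDENT, without the message text.
            current["messages"].append(line.split(",", 1)[0].lstrip("%-"))
    return events


def count_by_process(events, ident="ERRLNKREAD"):
    """Count events whose message list contains the given ident, per process."""
    return Counter(
        e["process"]
        for e in events
        if any(m.endswith("-" + ident) for m in e["messages"])
    )


events = parse_events(SAMPLE)
print(events[0]["time"], events[0]["messages"])
print(count_by_process(events))
```

	The timestamps collected this way are the only key available for
	matching against a trace, since the listing carries no link
	identification.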


	In fact, this application runs on 2 different clusters, 1 of
	them being a "mirror backup" of the other. The SWLUP logs
	have been identical on both clusters since the upgrade.

	The customer wants to know the cause of this "problem", because
	he intends to increase the network load (today, the ACMS server
	permanently manages about 350 NSP links), and an application
	failure would have a catastrophic impact on his company and
	Digital.

	Despite the fact that the VAX is perfectly tuned, as well as ACMS
	(no pool/buffer/quotas exhausted), such entries are reported daily
	by SWLUP. Note: DECnet runs over an Ethernet circuit, on a LAN
	that seems heavily loaded according to the Ethernet counters:
	no Ethernet errors (except a very few excessive collisions),
	but on average, for every 3 frames transmitted, 1 cannot be
	transmitted immediately (either initially deferred or
	single/multiple collision).
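	The "1 frame in 3" figure is the number of delayed frames divided
	by the number of frames sent. A minimal sketch of that arithmetic
	follows. The parameter names are hypothetical, and the sample values
	are invented to illustrate the ratio; they are not the customer's
	counters.

```python
def delayed_fraction(frames_sent, initially_deferred,
                     single_collision, multiple_collision):
    """Fraction of transmitted frames that could not be sent immediately."""
    if frames_sent <= 0:
        raise ValueError("frames_sent must be positive")
    delayed = initially_deferred + single_collision + multiple_collision
    return delayed / frames_sent


# Invented values: 1000 delayed frames out of 3000 sent, i.e. 1 in 3.
ratio = delayed_fraction(frames_sent=3000, initially_deferred=600,
                         single_collision=300, multiple_collision=100)
print(f"{ratio:.2f}")
```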

	I'd like an ACMS specialist to tell me :

	- are such problems already known ?
	- could this problem easily be reproduced ?
        - is there a swlup option to get more info logged ? 
          (I'd like to have the internal identification
           of the link on which the read fails)
	- do you think this problem could be due to the erroneous behaviour
	  of DECnet/OSI described below ?

	Thanks in advance for any advice/comment ,
	
	Best regards , Michele Magenc , TSC France, Network Group.
	
--------------------------------------------------------------------------------

	More info :

	I think DECnet/OSI could be the cause of this problem : 

	IPMT case CFS_38060 (solved November 96, fix integrated into
	DECnet/OSI V6.3 ECO 6), entitled

	<<DECnet/OSI returns wrong status codes on link disconnect>>

	shows that if an established link is disconnected,
	when an application performs an I/O operation on the channel
	associated with this link, DECnet/OSI erroneously reports
	"filnotacc" status instead of "linkdiscon" status.

	Having traced (with CTF at NSP level) on the ACMS server system
	while swlup logged such messages , I noticed lots of links
	disconnected by the ACMS client , reason = abort .

	I'm unable to decide which one of these links fails from
	the SWLUP point of view, because SWLUP doesn't log the
	link identification, and the traces are incomplete (records
	were lost, and so many packets are transferred that I don't have
	the trace for every link from its establishment up
	to its disconnection).

	But from these traces, I get the feeling that none of these
	links are disconnected because of network problems.
	In fact, I'm under the impression that all these links
	are "normally" disconnected by the application, because
	each time I observe the same scenario: the link request
	comes from the client, is acknowledged and confirmed by the server,
	then 5 data messages are exchanged; the fifth message always
	comes from the client and is always 21 bytes long.
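	With an incomplete trace, it can help to separate the links that were
	captured from connect to disconnect from those with lost records, and
	to check the complete ones against the scenario above. The sketch
	below assumes the trace has already been reduced to simple tuples of
	(link id, sender, kind, length). That record shape and the kind labels
	CONNECT, CONFIRM, DATA and DISCONNECT are hypothetical; this is not
	the CTF output format.

```python
from collections import defaultdict


def group_by_link(records):
    """Group (link_id, sender, kind, length) tuples by link id, in order."""
    links = defaultdict(list)
    for link_id, sender, kind, length in records:
        links[link_id].append((sender, kind, length))
    return links


def classify(link_records):
    """Label one link's records as 'incomplete', 'usual scenario' or 'other'."""
    kinds = [kind for _sender, kind, _length in link_records]
    # A link is only usable if both its start and its end were captured.
    if not kinds or kinds[0] != "CONNECT" or "DISCONNECT" not in kinds:
        return "incomplete"
    data = [(sender, length) for sender, kind, length in link_records
            if kind == "DATA"]
    # Scenario from the note: 5 data messages, the last one a 21-byte
    # message sent by the client.
    if len(data) == 5 and data[-1] == ("client", 21):
        return "usual scenario"
    return "other"


# Invented records for illustration only.
records = [
    ("L1", "client", "CONNECT", 0),
    ("L1", "server", "CONFIRM", 0),
    ("L1", "client", "DATA", 40),
    ("L1", "server", "DATA", 60),
    ("L1", "client", "DATA", 35),
    ("L1", "server", "DATA", 1450),
    ("L1", "client", "DATA", 21),
    ("L1", "client", "DISCONNECT", 0),
    ("L2", "client", "DATA", 21),        # start of this link was lost
    ("L2", "client", "DISCONNECT", 0),
]

for link_id, recs in group_by_link(records).items():
    print(link_id, classify(recs))
```

	Only links classified as complete can be compared with the SWLUP
	timestamps with any confidence.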

T.R	Title	User	Personal Name	Date	Lines
4130.1	Start with DECnet/OSI	OHMARY::HALL	Bill Hall - ACMS Engineering - ZKO2-2	`Thu Mar 20 1997 13:34`	10
	Since the errors started to occur when you went from Phase IV to
	DECnet/OSI, I'd start with the DECnet/OSI folks. ACMS is not aware of
	which 'phase' of DECnet is running; in fact, we do not support
	DECnet/OSI except with Phase IV routing.

	The text of the error, FILNOTACC, does not make sense to me, but it
	might to a DECnet/OSI person.

	Bill
4130.2	Some more info/questions	PRSSOS::MAGENC		`Fri Mar 21 1997 04:58`	42
	Hi !

	Thanks for answering. I am a DECnet/OSI person, so FILNOTACC does
	make sense to me.

	As I intend to escalate this problem through IPMT channels, I suppose
	DECnet/OSI engineering will need some detailed info from SWLUP, in
	particular the link id associated with the reported read failure.
	If we don't have any information about how ACMS interfaces with
	DECnet, and if we're unable to associate NSP traces with SWLUP logs,
	I'm afraid we won't be able to go further !

	Although the FILNOTACC status does not make sense to you, you might
	already have experienced such problems: a colleague of mine here in
	TSC, who is an ACMS specialist, once had the same SWLUP logs on his
	system, and got rid of them just with a "good" ACMS tuning.
	Unfortunately, it didn't work for this customer ...

	Now, I do agree that the DECnet "phase" is not ACMS's problem, but
	anyway, the ACMS server code is responsible for detecting, reporting
	and, if possible, surviving network errors. So I think FILNOTACC
	should have (and certainly does have) a meaning for ACMS.

	Please help !

	A few words about FILNOTACC (on DECnet network operations):

	It normally means that an application tries an I/O operation (usually
	a read or write) on a network link which is not yet established.

	It happens that some versions of DECnet/OSI incorrectly return this
	status to DECnet applications, and this may confuse them. In this
	erroneous case, DECnet/OSI reports FILNOTACC when a link has
	been established, then disconnected.

	Thanks again, and best regards, Michele.
4130.3		IDFOUR::HALL	Bill Hall - ACMS Engineering - ZKO2-2	`Fri Mar 21 1997 06:57`	21
	As you can see, the errors are handled by ACMS. The error returned
	from the service was a -F- but we reported it in SWL as a -I-. When a
	link breaks, MSS (our message-based interconnect) reports the error
	and then tries to re-establish the link. In this case, the customer is
	not reporting any user problems, so the links are re-established and
	the operation continues successfully.

	One thing about MSS is that it multiplexes the operations on a link.
	The CP process is the user interface, serving probably around 20
	users. If each user is talking to the same application on the same
	remote node, there are only 2 links: one for executing the task, the
	other for performing the necessary presentation services. All 20
	users use the same 2 links.

	DECnet/OSI is just now starting to cause problems for us and we'd
	appreciate some assistance in understanding how it works differently
	from DECnet Phase IV.

	Bill
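	The multiplexing described in this reply can be illustrated with a
	small count. This is only a sketch: the reply states the case of one
	application on one remote node (2 links for all users), and the
	generalisation to 2 links per distinct (application, node) pair is an
	assumption made here, not something confirmed for MSS. All names in
	the code are hypothetical.

```python
# One link for executing the task, one for presentation services,
# as stated in the reply.
LINKS_PER_PAIR = 2


def count_links(sessions):
    """Estimate links used by one CP process.

    sessions: iterable of (user, application, remote_node) tuples.
    Assumption: links are shared per (application, remote_node) pair.
    """
    pairs = {(application, node) for _user, application, node in sessions}
    return LINKS_PER_PAIR * len(pairs)


# 20 users talking to the same application on the same remote node.
sessions = [(f"user{i}", "APPL", "NODEA") for i in range(20)]
print(count_links(sessions))
```

	One consequence of such sharing is that a single failing link read
	may concern many users at once, and the number of SWLUP entries says
	little about the number of users affected.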
4130.4	Thanks + decnet/osi info	PRSSOS::MAGENC		`Fri Mar 21 1997 12:20`	114
	Hi !

	Thanks for this info: this is good news ...

	The NSP traces (CTF) I took while such logs were being reported by
	SWLUP show that:

	1) lots of links are established
	2) lots of links are disconnected (many more than reported by SWLUP)
	3) these links are always disconnected with "reason = abort", most of
	   the time (I saw 1 exception in 30 links) by the ACMS client
	4) on these links, the same scenario occurs each time:
	   the link is established (connect request sent by the ACMS client),
	   then exactly 5 NSP data messages are exchanged between the server
	   and the client; the data messages sent by the client are "short";
	   the fourth data message is always sent by the server, and is "long"
	   (over 1400 bytes, thus segmented into 2 parts); the fifth message
	   is always sent by the client, and is 21 bytes long.
	   After the 5th message has been sent by the client, the client sends
	   a disconnect request (DISI, reason = abort), and the server
	   confirms the disconnection (DISC).

	I must say that there's so much traffic that in 2 minutes I get 2000
	binary blocks and some trace records are lost; so I'm unable to
	"follow" each link from its establishment up to its disconnection,
	and I'm unable to decide whether any of the link disconnections kept
	in the trace file are related to the SWLUP FILNOTACC records.

	Now, about the difference in DECnet/OSI behaviour (from DECnet
	Phase IV), please see below some extracts from CFS #38060:

------------------------------------------------------------------------

	DECnet/OSI returns wrong status codes on link disconnect which may
	be 'breaking' BACKUP and DFS

	DECnet/OSI is returning the wrong status on a number of occasions
	when links are disconnected. This appears to involve NSP, OSI
	Transport, and Session Control, ...

	As stated above, there are actually a number of problems, but the
	most critical at this instant in time is that when links are
	disconnected DECnet/OSI returns "%SYSTEM-F-FILNOTACC, file not
	accessed on channel" as the error status in the IOSB, and this
	appears to be confusing both BACKUP and DFS to the point where they
	'hang'.

	I can easily reproduce this problem with NSP Transport, but not with
	OSI Transport. I am not sure why, except that OSI Transport status
	returns seem to be so badly broken that the confusing status may not
	be passed up the stack

	---

	This is what I expect from "%SYSTEM-F-FILNOTACC, file not accessed on
	channel" error ...

	I have always thought that this error status would be returned in R0
	(not the IOSB) if I attempted to do a file operation (eg read/write
	virtual to a file device) when I did not have a file open. In
	network terms I would expect this status in R0 if I did a read/write
	virtual to a NET device when I have not done an IO$_ACCESS to set up
	a logical link.

	Once I have got past the R0 check I do not expect to get this error
	-- ie I have never expected this error in the IOSB. If something
	goes severely wrong after the IO has been issued but before it
	completes then I expect a SS$_ABORT or other type of error indicating
	why it failed.

	What appears to happen with DECnet/OSI is that once a link is
	established you can issue read/write etc, but if the link is
	disconnected some of the outstanding IO at that time are completed
	with FILNOTACC in the IOSB. I have seen IO$_WRITEVBLK and
	IO$_DEACCESS with this error, but not IO$_READVBLK (which may or may
	not be significant).

	I believe that the IO$_WRITEVBLK should fail with PATHLOST,
	UNREACHABLE, or THIRDPARTY (depending upon the type of disconnect),
	and this would be compatible with PhaseIV behaviour. The
	IO$_DEACCESS should return success.

	When an application is notified of a disconnect the application must
	do an IO$_DEACCESS on the channel to free its end of the link. If
	the application does not do a IO$_DEACCESS then the Session Control
	port is not deleted and it hangs around, using up resources. The
	IO$_DEACCESS should clean up and return success.

	..................................

	Event Type: SOLUTION
	Date & Time: 1-Oct-1996
	Actor: BADGE\96563

	Mike Dyer has concluded his changes for $QIO. These fixes have been
	tested and will be included in the next ECO kit for DNVOSI V6.3 ECO6.

	Fix for Filnotaccess,Unkresult and Remoteshutdown.
	Check the QIO UCB for a local disconnect and add the mappings for
	I-DISCDATATRUN and REMOTESHUTDOWN

	Modify; QIO_EXECUTE.B32, QIO_COMPLETION.B32, QIO_MAPERR.B32 and
	QIO_STRUCTURES.SDL

	Directory HELP""::ABBYRD$DKA100:[KANSAS.KITS.VAX]

	NET$DRIVER.EXE       52  30-SEP-1996 21:21:52.00
	NET$DRIVER.STB        6  30-SEP-1996 21:21:59.00
	NET$OSDRIVER.EXE     74  30-SEP-1996 21:22:03.00
	NET$OSDRIVER.STB      7  30-SEP-1996 21:22:10.00

	copy to sys$common:[sys$ldr]

------------------------------------------------------------

	Personal addendum:

	The DECnet/OSI ECO 6 release notes do not clearly describe these
	changes (no details for FILNOTACC). The NET$DRIVER and NET$OSDRIVER
	images that come with ECO 6 are dated 15-NOV-1996, which makes me
	think other problems have been fixed as well.
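	The expectations stated in the CFS extract can be written down as a
	small lookup, which may help when reviewing IOSB statuses collected
	from a test program. This is a sketch only: the statuses are plain
	strings taken from the extract (no numeric condition values are
	assumed), "SUCCESS" stands for whatever success status is returned,
	and the function name is hypothetical.

```python
# Statuses the CFS author expects in the IOSB after a link disconnect,
# per I/O function. Only the two functions discussed in the extract.
EXPECTED_AFTER_DISCONNECT = {
    "IO$_WRITEVBLK": {"PATHLOST", "UNREACHABLE", "THIRDPARTY"},
    "IO$_DEACCESS": {"SUCCESS"},
}


def review_status(function, iosb_status):
    """Compare an observed IOSB status with the CFS author's expectation."""
    expected = EXPECTED_AFTER_DISCONNECT.get(function)
    if expected is None:
        return "no expectation stated in the CFS extract"
    if iosb_status in expected:
        return "as expected"
    if iosb_status == "FILNOTACC":
        return "matches the behaviour reported in CFS #38060"
    return "unexpected"


print(review_status("IO$_WRITEVBLK", "FILNOTACC"))
print(review_status("IO$_DEACCESS", "SUCCESS"))
```

	Note that the extract reports FILNOTACC on IO$_WRITEVBLK and
	IO$_DEACCESS only, while SWLUP reports an error reading the link, so
	the read case remains an open question.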