
Conference clusta::acms

Title:ACMS comments and questions
Notice:This is not an official software support channel. Kits 5.*
Moderator:CLUSTA::HALLAN
Created:Mon Feb 17 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:4179
Total number of notes:15091

4130.0. "%ACMSMSS-I-ERRLNKREAD, -SYSTEM-F-FILNOTACC" by PRSSOS::MAGENC () Thu Mar 20 1997 12:38


				Hello !


	A customer of ours (the most important stock exchange company 
	in France) runs a user-written ACMS/NSP application on a VAX 
	(OpenVMS 6.1, DECnet/OSI 6.2 ECO 8, ACMS V 4.0.2).
	This customer cannot upgrade to VMS 6.2 and DECnet/OSI 6.3 ECO 6
	(upgrading to the latest versions takes a very long time
	 because they have to test their application very carefully
	 before they can be sure there's no problem with it).

	Since this VAX has been upgraded from VMS 5.5-2 decnet IV to
	VMS 6.1 decnet/osi , 3 months ago, ACMS  (swlup) logs hundreds
	of errors per day , but the application seems to work normally and
	the users don't complain.

	The errors reported by swlup are :

-------------------------------------------------------------------------------
          SWLUP Event Log Output Listing                  19-MAR-1997 13:38:09
	  ACMS SWLUP V4.0-0                   Page 1

Event logged at: 19-MAR-1997 11:50:13.38

PID = 404006DF    PC = 000906FE    PSL = 03C00000

Process UIC  : [030,024]
Process name : ACMS01EXC003000
Username     : ACMS_EXC
Image file   : DSA0:[SYS1.SYSCOMMON.][SYSEXE]ACMSEXC.EXE;5

%ACMSMSS-I-ERRLNKREAD, Error reading network link
-SYSTEM-F-FILNOTACC, file not accessed on channel

-------------------------------------------------------------------------------


	In fact , this application runs on 2 different clusters , 1 of
	them being a "mirror backup" of the other one . The swlup logs
	are identical on both clusters since the upgrade.

	The customer wants to know the cause of this "problem", because
	he intends to increase the network load (today ,the ACMS server
	permanently manages about 350 NSP links) , and an application
	failure would cause a catastrophic impact for his company and
	Digital. 

	Although the VAX is perfectly tuned, as well as ACMS
	(no pool/buffer/quotas exhausted), such entries are reported daily
	by swlup.  Note : DECnet runs over an Ethernet circuit, on a LAN
	that seems heavily loaded according to the Ethernet counters : 
	no Ethernet errors (except a very few excessive collisions),
	but on average, for every 3 frames transmitted, 1 cannot be transmitted 
	immediately (either initially deferred or single/multiple collision). 

	I'd like an ACMS specialist to tell me :

	- are such problems already known ?
	- could this problem easily be reproduced ?
        - is there a swlup option to get more info logged ? 
          (I'd like to have the internal identification
           of the link on which the read fails)
	- do you think this problem could be due to erroneous behaviour
	  of DECnet/OSI ?

	Thanks in advance for any advice/comment ,
	
	Best regards , Michele Magenc , TSC France, Network Group.
	
--------------------------------------------------------------------------------

	More info :

	I think DECnet/OSI could be the cause of this problem : 

	IPMT case CFS_38060 (solved November 96, fix integrated into
	DECnet/OSI V6.3 ECO 6) entitled 

	<<DECnet/OSI returns wrong status codes on link disconnect>>

	shows that if an established link is disconnected,
	when an application performs an I/O operation on the channel 
	associated with this link, DECnet/OSI erroneously reports 
	"filnotacc" status instead of "linkdiscon" status.

	Having traced (with CTF at NSP level) on the ACMS server system
	while swlup logged such messages , I noticed lots of links
	disconnected by the ACMS client , reason = abort .

	I'm unable to decide which one of these links fails from
	the swlup point of view, because swlup doesn't log the
	link identification, and the traces are incomplete (records
	are lost, and so many packets are transferred that I don't have
	the trace for every link, from its establishment up
	to its disconnection).

	But from these traces , I get the feeling that none of these
	links are disconnected because of network problems .
	In fact , I'm under the impression that all these links
	are "normally" disconnected by the application , because
	each time , I observe the same scenario : the link request
	comes from the client , is acknowledged and confirmed by the server,
	then 5 data messages are exchanged ; the fifth message always
	comes from the client, is always 21 bytes long .    
	 	
	
	
T.R  Title  User  Personal Name  Date  Lines

4130.1  Start with DECnet/OSI  OHMARY::HALL  Bill Hall - ACMS Engineering - ZKO2-2  Thu Mar 20 1997 13:34  10
    
    	Since the errors started to occur when you went from Phase IV
    	to DECnet/OSI, I'd start with the DECnet/OSI folks.  ACMS is
    	not aware of which 'phase' of DECnet is running, in fact, we
    	do not support DECnet/OSI except with Phase IV routing.  The
    	text of the error, FILNOTACC, does not make sense to me, but
    	it might to a DECnet/OSI person.
    
    	Bill
    
4130.2  Some more info/questions  PRSSOS::MAGENC  Fri Mar 21 1997 04:58  42
    
    
                                    Hi !
    
            Thanks for answering .
    
            I am a DECnet/OSI person, so FILNOTACC does make sense to me.
            As I intend to escalate this problem through IPMT channels,
            I suppose DECnet/OSI engineering will need some
            detailed info from swlup : in particular the link id 
            associated with the reported read failure. If we don't
            have any information about how ACMS interfaces with DECnet
            and if we're unable to associate NSP traces with SWLUP logs,
            I'm afraid we won't be able to go further !
    
            Even though the FILNOTACC status does not make sense to you,
            you may already have experienced such problems : 
            a colleague of mine, here in TSC, who is an ACMS specialist,
            once had the same swlup logs on his system, and got rid
            of them just with a "good" ACMS tuning. Unfortunately,
            it didn't work for this customer ...
    
            Now, I do agree that the DECnet "phase" is not ACMS's problem,
            but anyway, the ACMS server code is responsible for detecting,
            reporting and, where possible, surviving network errors.
            So I think FILNOTACC should have (and certainly does have) a
            meaning for ACMS. Please help !
    
            A few words about FILNOTACC (on decnet network operations):
    
            It normally means that an application tries an I/O operation
            (usually read or write) on a network link which is not yet 
            established .
            It happens that some versions of DECnet/OSI incorrectly return
            this status to decnet applications, and this may confuse them.
            In this erroneous circumstance , DECnet/OSI reports FILNOTACC
            when a link has been established , then disconnected .
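
            The two cases above can be sketched as follows. This is only
            an illustration in Python, not VMS system code and not the way
            ACMS handles the status : the function name and the
            link_was_established flag are hypothetical, and the status is
            a plain string, not a real condition value.

```python
def interpret_status(status: str, link_was_established: bool) -> str:
    """Classify a status returned for an I/O on a network channel (illustrative only)."""
    if status != "FILNOTACC":
        return "other"
    if link_was_established:
        # Affected DECnet/OSI versions report FILNOTACC here
        # where a disconnect status would be expected.
        return "treat as link disconnect"
    # Normal meaning: the I/O was tried before the link was set up.
    return "I/O on a link that is not yet established"
```

            An application that tracks the state of its own links could
            use such a distinction to handle the erroneous FILNOTACC the
            same way as a disconnect.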
    
            
                    Thanks again , and best regards , Michele.
    
    
4130.3  IDFOUR::HALL  Bill Hall - ACMS Engineering - ZKO2-2  Fri Mar 21 1997 06:57  21
    
    	As you can see, the errors are handled by ACMS.  The error returned
    from the service was a -F- but we reported it in SWL as a -I-.
    
    	When a link breaks, MSS (our message-based interconnect) reports
    the error and then tries to re-establish the link.  In this case, the
    customer is not reporting any user problems, so the links are
    re-established and the operation continues successfully.  One thing
    about MSS is that it multiplexes the operations on a link.  The CP
    process is the user-interface, serving probably around 20 users.
    If each user is talking to the same application on the same remote
    node, there are only 2 links, one for executing the task, the other
    for performing the necessary presentation services.  All 20 users
    use the same 2 links.
    
    	DECnet/OSI is just now starting to cause problems for us and we'd
    appreciate some assistance in understanding how it works differently
    from DECnet Phase IV.
    
    	Bill
    
4130.4  Thanks + decnet/osi info  PRSSOS::MAGENC  Fri Mar 21 1997 12:20  114
    
    
    			Hi !
    
    	Thanks for that info : this is good news ...
    	The NSP traces (CTF) I took while such logs were being reported
    	by swlup show that :
    	1) lots of links are established
    	2) lots of links are disconnected (many more than reported
           by swlup)
    	3) these links are always disconnected with "reason = abort",
           most of the time (I saw 1 exception in 30 links) by the
    	   ACMS client.
    	4) on these links, the same scenario occurs each time :
           the link is established (connect request sent by the ACMS client),
           then exactly 5 NSP data messages are exchanged between the
           server and the client ; the data messages sent by the client 
           are "short" ; the fourth data message is always sent by 
           the server, and is "long" (over 1400 bytes, thus segmented
           into 2 parts) ; the fifth message is always sent by the
           client, and is 21 bytes long.
    	   After the 5th message has been sent by the client, the client
           sends a disconnect request (DISI, reason = abort), and the
           server confirms the disconnection (DISC).
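
    	The scenario in point 4 could be checked mechanically with
    	something like the sketch below. It assumes the trace has
    	already been decoded into one record per message (with the
    	segmented fourth message reassembled) ; the NspMsg record, its
    	field names and the "DATA"/"DISI"/"DISC" labels are hypothetical
    	and are not CTF output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NspMsg:
    kind: str         # hypothetical labels: "DATA", "DISI", "DISC"
    sender: str       # "client" or "server"
    length: int = 0   # reassembled message length in bytes
    reason: str = ""  # disconnect reason, only meaningful for DISI


def matches_observed_scenario(msgs: List[NspMsg]) -> bool:
    """True if one link's messages follow the pattern seen in the traces."""
    data = [m for m in msgs if m.kind == "DATA"]
    disi = [m for m in msgs if m.kind == "DISI"]
    if len(data) != 5 or len(disi) != 1:
        return False
    fourth, fifth = data[3], data[4]
    return (
        fourth.sender == "server" and fourth.length > 1400
        and fifth.sender == "client" and fifth.length == 21
        and disi[0].sender == "client" and disi[0].reason == "abort"
    )
```

    	Links that do not match this pattern would be the first ones
    	to look at when searching for a disconnection that is not a
    	"normal" one.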
    
    	I must say that there's so much traffic that in 2 minutes
        I get 2000 binary blocks and some trace records are lost ; so 
    	I'm unable to "follow" each link from its establishment
        up to its disconnection, and I'm unable to decide whether
    	any of the link disconnections that are kept in the trace
    	file are related to the swlup FILNOTACC records.
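
    	One possible way to narrow this down, sketched here under
    	assumptions : since the trace was taken on the ACMS server
    	system itself, the swlup event time and the trace timestamps
    	come from the same clock, so the disconnections recorded close
    	to a swlup event can be listed as candidates. The function
    	name and the 1-second window are hypothetical choices.

```python
from datetime import datetime, timedelta
from typing import List


def disconnects_near(event_time: datetime,
                     disconnect_times: List[datetime],
                     window_seconds: float = 1.0) -> List[datetime]:
    """Return the disconnect timestamps within window_seconds of a swlup event."""
    window = timedelta(seconds=window_seconds)
    return [t for t in disconnect_times if abs(t - event_time) <= window]
```

    	Because trace records are lost, an empty result would prove
    	nothing ; a non-empty one only gives candidates, not a certain
    	match.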
    
    	Now, about the decnet/osi behaviour difference (from decnet IV),
        please see below some extracts from CFS #38060:
    
    ------------------------------------------------------------------------
    	DECnet/OSI returns wrong status codes on link disconnect
        which may be 'breaking' BACKUP and DFS
    
        DECnet/OSI is returning the wrong status on a number of
        occasions when links are disconnected. This appears to involve
        NSP, OSI Transport, and Session Control, ...
    
        As stated above, there are actually a number of problems, but
        the most critical at this instant in time is that when links are
        disconnected DECnet/OSI returns "%SYSTEM-F-FILNOTACC, file not
        accessed on channel" as the error status in the IOSB, and this
        appears to be confusing both BACKUP and DFS to the point where
        they 'hang'.
    
        I can easily reproduce this problem with NSP Transport, but not
        with OSI Transport. I am not sure why, except that OSI Transport
        status returns seem to be so badly broken that the confusing
        status may not be passed up the stack --- 
    
    This is what I expect from "%SYSTEM-F-FILNOTACC, file not
    accessed on channel" error ... I have always thought that this
    error status would be returned in R0 (not the IOSB) if I
    attempted to do a file operation (eg read/write virtual to a
    file device) when I did not have a file open. In network terms I
    would expect this status in R0 if I did a read/write virtual to
    a NET device when I have not done an IO$_ACCESS to set up a
    logical link. Once I have got past the R0 check I do not expect
    to get this error -- ie I have never expected this error in the
    IOSB. If something goes severely wrong after the IO has been
    issued but before it completes then I expect a SS$_ABORT or
    other type of error indicating why it failed. 
    
    What appears to happen with DECnet/OSI is that once a link is
    established you can issue read/write etc, but if the link is
    disconnected some of the outstanding IO at that time are
    completed with FILNOTACC in the IOSB. I have seen IO$_WRITEVBLK
    and IO$_DEACCESS with this error, but not IO$_READVBLK (which
    may or may not be significant). I believe that the IO$_WRITEVBLK
    should fail with PATHLOST, UNREACHABLE, or THIRDPARTY (depending
    upon the type of disconnect), and this would be compatible with
    PhaseIV behaviour. The IO$_DEACCESS should return success. When
    an application is notified of a disconnect the application must
    do an IO$_DEACCESS on the channel to free its end of the link.
    If the application does not do a IO$_DEACCESS then the Session
    Control port is not deleted and it hangs around, using up
    resources. The IO$_DEACCESS should clean up and return success.
    
    			..................................
    Event Type:                        SOLUTION
    Date & Time:                       1-Oct-1996
    Actor:                            BADGE\96563
    Mike Dyer has concluded his changes for $QIO.  These fixes have
    been tested and will be included in the next ECO kit for DNVOSI V6.3
    ECO6.
    
    Fix for Filnotaccess,Unkresult and Remoteshutdown.
    
     Check the QIO UCB for a local disconnect and add
     the mappings for I-DISCDATATRUN and REMOTESHUTDOWN
     Modify; QIO_EXECUTE.B32, QIO_COMPLETION.B32,
             QIO_MAPERR.B32 and QIO_STRUCTURES.SDL
    
    Directory HELP""::ABBYRD$DKA100:[KANSAS.KITS.VAX]
    
    NET$DRIVER.EXE            52  30-SEP-1996 21:21:52.00
    NET$DRIVER.STB             6  30-SEP-1996 21:21:59.00
    NET$OSDRIVER.EXE          74  30-SEP-1996 21:22:03.00
    NET$OSDRIVER.STB           7  30-SEP-1996 21:22:10.00
             copy to sys$common:[sys$ldr]
    ------------------------------------------------------------
    
    
    Personal addendum : the DECnet/OSI ECO 6 release notes do not clearly
                        describe these changes (no details for FILNOTACC) ;
                        the net$driver and net$osdriver shipped with
                        ECO 6 are dated 15-NOV-1996, which makes me think
                        other problems have been fixed as well.