
Conference bulova::decw_jan-89_to_nov-90

Title:DECWINDOWS 26-JAN-89 to 29-NOV-90
Notice:See 1639.0 for VMS V5.3 kit; 2043.0 for 5.4 IFT kit
Moderator:STAR::VATNE
Created:Mon Oct 30 1989
Last Modified:Mon Dec 31 1990
Last Successful Update:Fri Jun 06 1997
Number of topics:3726
Total number of notes:19516

1244.0. "Disappearing Batch Jobs" by AISG::OTOOLE () Tue Aug 08 1989 17:29

    
    
    While running several DECwindows applications remotely in batch
    on an 8800 running VMS V5.1, several jobs terminate with the error
    below. This has been happening with several applications: Notes,
    DECW$CLOCK, etc.
    	
    Has anyone else seen this error message or experienced this before?
    This 8800 is one of two 8800 boot servers in a Mixed Interconnect cluster.
    It has 38 satellites. 
    
	XIO: non-translatable vms error code: 0x2DBA002, vms message:
	%decw-e-cnxabort, connection aborted		
	%XLIB-F-IOERROR, xlib io error
    
    
    Thank you
    Mark 

T.R   Title   User   Personal Name   Date   Lines
1244.1link downSTAR::CYPRYCHTue Aug 08 1989 17:426
    %decw-e-connectabort   (my spelling may be slightly off)
    means that the link between the server and client disconnected.
    If the server node became unreachable, or if the logical link
    (depending on what type of transport) went down, the client
    would generate this message.
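    A quick way to check the transport side when this happens
    (assuming DECnet is the transport -- commands from memory, so
    verify against NCP HELP; <server> is a placeholder for your
    server node name) is something like:

        $ MC NCP SHOW KNOWN LINKS          ! any logical links still up?
        $ MC NCP SHOW NODE <server> STATUS ! is the server node reachable?

    If the node shows unreachable, the cnxabort is just the client's
    view of the lost path.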

1244.2decw-e-cnxabortSTAR::CYPRYCHTue Aug 08 1989 17:442
    the spelling is %decw-e-cnxabort (.1)

1244.3And the answer IS..HYDRA::COARHave you mutated yet to-day?Thu Aug 17 1989 14:109
The implication here is that, if your applications bomb with this message, your
transport is flakey.  In the case of DECnet, it is equivalent to

    %SYSTEM-F-PATHLOST, path to network partner node lost

(among others).  N'est-ce pas?

#ken	:-)}

1244.4basically trueSTAR::CYPRYCHThu Aug 17 1989 15:4018
    Yes, basically, although your transport isn't necessarily
    "flakey" -- though it could be...
    
    You could have just rebooted the server node, which disconnects
    the network link.  Someone could have shut DECnet down.
    Basically: the server could have shut down, the machine could
    have shut down, DECnet could have shut down, the link could have
    aborted (for a flakey reason), or the node could have become
    "unreachable".
    
    So yes, "path to network partner node lost" is what happens,
    but there can be many reasons.  Taking down the session
    altogether disconnects links too.
    
    I think that covers most of the reasons... but there may
    be one or two more.

1244.5Or MAX BROADCAST NONROUTERS too lowSEWANE::MASSEYI left my heart in Software Services.Thu Aug 31 1989 15:2123
Here's the solution that worked for us in St. Louis:

                  <<< QUEEN::PIX1:[PUBLIC.NOTES]EPIC.NOTE;6 >>>
                     -< You can't go wrong with DECwrite >-
================================================================================
Note 1959.21              DECwrite or DECwindows error?                 21 of 21
DCC::HAGARTY "Essen, Trinken und Shaggen..."         13 lines  18-AUG-1989 04:24
                          -< Network configuration! >-
--------------------------------------------------------------------------------
Ahhh Gi'day...

    Sounds like the infamous BROADCAST NONROUTERS problem! MAKE SURE THAT
    THIS IS DONE ON ALL SYSTEMS IN THE LAN, but firstly on yours...

    Count the number of nonrouters on the LAN (say in the region of 300),
    and do a:

    $ MC NCP SET EXEC MAX BROADCAST NONROUTERS 512
    $ MC NCP DEF EXEC MAX BROADCAST NONROUTERS 512

    This will stop the timeouts happening to the other nodes in the LAN! On
    big machines, make it 1024!
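    To see what a node is currently using before changing it, something
    like the following should work (commands from memory -- SET affects
    the running executor, DEFINE the permanent database, and SHOW/LIST
    display them respectively):

        $ MC NCP SHOW EXECUTOR CHARACTERISTICS   ! running value
        $ MC NCP LIST EXECUTOR CHARACTERISTICS   ! permanent database

    Look for "Maximum broadcast nonrouters" in the output.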

1244.6Only applicable to routers, not endnodesMIPSBX::thomasThe Code WarriorThu Aug 31 1989 18:321
1244.7I'm experiencing a similar kind of problem...ASHBY::FEATHERSTONEd FeatherstonWed Sep 06 1989 17:0913
I have an MI cluster with 2 8800's and 34 VAXstations, all running DECwindows.
Almost all the DECwindows applications are running remotely on the 8800's. At
least once or twice a day, everyone will lose from one to all of their remote
DECwindows applications at the same time. The logfiles show the connection
aborted error message. I am going nuts trying to figure out the cause. The
applications that get aborted are not all on the same 8800. Some applications
remain running without a problem. 

(Further info: each 8800 has 128MB of memory, running VMS V5.1, DECwindows V1;
the VAXstations are a mixture of VS-II's, VS-II/GPX's, and VS-2000's.)

Any ideas as to what I should be looking for/at?

1244.8Boot requestsCASEE::CLEOVOULOUMarios CleovoulouThu Sep 07 1989 08:0426
    I'll bet the cause is either cluster transitions or boot request
    multicasts.  We almost-totally cured the same problem by:
    
    a)	isolating our MIC from the rest of the ethernet with bridges,
    
    b)	giving NETACP lots of memory, by use of the NETACP$... logicals.
    	Note: when a boot request comes in, NETACP goes through the
        _entire_ nodename database, starting at node 1.1 and going upwards,
    	looking for an entry with a matching HW address (implemented by
        people in area 2, right :-).  NETACP runs at high priority and pages 
        the system to death if it doesn't have enough memory.
    
    c)	defining the boot parameters for our satellites up against "fake"
        nodes in area 1, so NETACP finds them quickly (we are really in
        area 51!),
    
    d)	defining HW addresses, but not load assist parameters, under fake
        area 1 nodes for nodes still on our ethernet segment but NOT part
        of our cluster, so that NETACP finds them and FAILS to load them
        quickly, rather than thrashing through the entire database to not
        find them.
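    For (b), the tuning would look roughly like the lines below, placed
    in SYSTARTUP before DECnet is started (logical names here are from
    memory and the values are only illustrative -- verify both against
    the VMS networking documentation before using):

        $ ! Before @SYS$MANAGER:STARTNET.COM
        $ DEFINE/SYSTEM NETACP$MAXIMUM_WORKING_SET 10000
        $ DEFINE/SYSTEM NETACP$EXTENT 8192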
    
    Regards,
    
    Marios

1244.9re: boot requests and cluster transitionASHBY::FEATHERSTONEd FeatherstonThu Sep 07 1989 09:1414
Thanks for the info. We already have some of your suggestions in place.

	1. We are on a separate ethernet segment isolated behind a bridge
	   (requirement for clusters in Hudson)

	2. We had already given NETACP lots of memory (WSQUOTA of 10000,
	   in last 14 days of uptime the max working set size was 7600)

We are in area 6 so the search doesn't take a long time, but I hadn't thought
about the impact of systems we don't load being on the cable. I like the idea
of using fake area 1 nodes to handle that and will give it a try.

Is there anything that can help the cluster transitions?

1244.10MARVIN::WARWICKWell, that'll never workThu Sep 07 1989 09:4613
    
    RE: Cluster transitions.
    
    Assuming that cluster transitions are really causing your problem (use
    SHOW CLUSTER to see whether a node entering or leaving coincides with
    your problem occurring), see the conference ELKTRA::CLUSTER, where the
    subject is discussed at interminable length !  There are several things
    you can do to tune a cluster to make the transitions short. I have a
    34-38 node LAVC with two uVAX 3600s as boot nodes, and we just do not
    notice satellites coming and going at all.
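    A quick way to do that correlation is to watch the cluster
    continuously and note when members come and go (field names from
    memory -- check SHOW CLUSTER's own HELP):

        $ SHOW CLUSTER/CONTINUOUS
        Command> ADD SYSTEMS,TRANSITION_TIME

    Then compare the transition times against when users report
    losing their windows.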
    
    Trevor

1244.11Update on reply .7ROLL::FEATHERSTONEd FeatherstonWed Sep 13 1989 10:4611
We seem to have minimized the frequency of the problem using the
MAX BROADCAST NONROUTERS suggestion in an earlier reply, but we have one
guaranteed way of reproducing it: adding a new satellite to the cluster
(nodes coming and going don't appear to trigger the problem, though).
We are now trying to isolate what is actually different at that time
(while the users scream, since each time we add a new node they are
guaranteed to lose some windows) as opposed to the normal comings and
goings of satellites.

				/ed/

1244.12Update on the updateASHBY::FEATHERSTONEd FeatherstonWed Sep 20 1989 10:327
We seem to have the last of the problem solved. When adding new nodes into the
cluster the local disk of the new node was initially MSCP served so a page and
swap file could be built. By not taking this option, we were able to add new
nodes to the cluster without any disconnects happening. 

				/ed/

1244.13DECWIN::JMSYNGEJames M Synge, VMS DevelopmentWed Sep 20 1989 17:184
    Why do you think this would make a difference?
    
    James

1244.14Not sure as to why it made a difference...ROLL::FEATHERSTONEd FeatherstonFri Sep 22 1989 13:2820
...when we determined that adding a node caused the problem, whereas rebooting
did not, the only obvious difference between the two scenarios was the
MSCP serving, so we tried it and voila, adding a node no longer caused the
problem.

A guess as to the reason: when the disk is MSCP served, all the satellites
are forced to see the disk, requiring some amount of resource in the paged pool
area. All the workstations are fairly tight on resources, so possibly this was
enough to push them over the edge (just a guess; don't have the time or resources
to verify it).

As a side note, we are still not preventing it 100% of the time, but with the
changes mentioned previously we have drastically reduced the frequency of the
occurrences. (The MAX BROADCAST NONROUTERS item makes a BIG difference for us. The
value was inadvertently reduced on one of our nodes the other day, and suddenly
people were losing stuff left and right. As soon as we raised it back up, things
stabilized very quickly.)

					/ed/