
Conference bulova::decw_jan-89_to_nov-90

Title:DECWINDOWS 26-JAN-89 to 29-NOV-90
Notice:See 1639.0 for VMS V5.3 kit; 2043.0 for 5.4 IFT kit
Moderator:STAR::VATNE
Created:Mon Oct 30 1989
Last Modified:Mon Dec 31 1990
Last Successful Update:Fri Jun 06 1997
Number of topics:3726
Total number of notes:19516

1244.0. "Disappearing Batch Jobs" by AISG::OTOOLE () Tue Aug 08 1989 17:29

    
    
    While running several DECwindows applications remotely in batch
    on an 8800 running VMS V5.1, several jobs terminate with the error
    below. This has been happening with several applications: Notes,
    DECW$CLOCK, etc.
    	
    Has anyone else seen this error message or experienced this before?
    This 8800 is one of two 8800 boot servers in a Mixed Interconnect cluster.
    It has 38 satellites. 
    
	XIO: non-translatable vms error code: 0x2DBA002, vms message:
	%decw-e-cnxabort, connection aborted		
	%XLIB-F-IOERROR, xlib io error
    
    
    Thank you
    Mark 

T.R   Title   User   Personal Name   Date   Lines
1244.1link downSTAR::CYPRYCHTue Aug 08 1989 17:426
    %decw-e-connectabort   (my spelling may be slightly off)
    means that the link between the server and client disconnected.
    If the server node became unreachable, or if the logical link
    (depending on what type of transport) went down, the client
    would generate this message.
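    A quick way to check the transport side when this happens
    (assuming DECnet is the transport -- commands from memory, so
    verify against NCP HELP; <server> is a placeholder for your
    server node name) is something like:

        $ MC NCP SHOW KNOWN LINKS          ! any logical links still up?
        $ MC NCP SHOW NODE <server> STATUS ! is the server node reachable?

    If the node shows unreachable, the cnxabort is just the client's
    view of the lost path.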

1244.2decw-e-cnxabortSTAR::CYPRYCHTue Aug 08 1989 17:442
    the spelling is %decw-e-cnxabort (.1)

1244.3And the answer IS..HYDRA::COARHave you mutated yet to-day?Thu Aug 17 1989 14:109
The implication here is that, if your applications bomb with this message, your
transport is flakey.  In the case of DECnet, it is equivalent to

    %SYSTEM-F-PATHLOST, path to network partner node lost

(among others).  N'est-ce pas?

#ken	:-)}

1244.4basically trueSTAR::CYPRYCHThu Aug 17 1989 15:4018
    Yes, basically, although your transport isn't necessarily
    "flakey" -- though it could be...
    
    You could have just rebooted the server node, which disconnects
    the network link.  Someone could have shut DECnet down.
    Basically: the server could have shut down, the machine could
    have shut down, DECnet could have shut down, the link could have
    aborted (for a flakey reason), or the node could have become
    "unreachable".
    
    So yes, "path to network partner node lost" is what happens,
    but there can be many reasons.  Taking down the session
    altogether disconnects links too.
    
    I think that covers most of the reasons... but there may
    be one or two more.

1244.5Or MAX BROADCAST NONROUTERS too lowSEWANE::MASSEYI left my heart in Software Services.Thu Aug 31 1989 15:2123
Here's the solution that worked for us in St. Louis:

                  <<< QUEEN::PIX1:[PUBLIC.NOTES]EPIC.NOTE;6 >>>
                     -< You can't go wrong with DECwrite >-
================================================================================
Note 1959.21              DECwrite or DECwindows error?                 21 of 21
DCC::HAGARTY "Essen, Trinken und Shaggen..."         13 lines  18-AUG-1989 04:24
                          -< Network configuration! >-
--------------------------------------------------------------------------------
Ahhh Gi'day...

    Sounds like the infamous BROADCAST NONROUTERS problem! MAKE SURE THAT
    THIS IS DONE ON ALL SYSTEMS IN THE LAN, but firstly on yours...

    Count the number of nonrouters on the LAN (say in the region of 300),
    and do a:

    $ MC NCP SET EXEC MAX BROADCAST NONROUTERS 512
    $ MC NCP DEF EXEC MAX BROADCAST NONROUTERS 512

    This will stop the timeouts happening to the other nodes in the LAN! On
    big machines, make it 1024!
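    To see what a node is currently using before changing it, something
    like the following should work (commands from memory -- SET affects
    the running executor, DEFINE the permanent database, and SHOW/LIST
    display them respectively):

        $ MC NCP SHOW EXECUTOR CHARACTERISTICS   ! running value
        $ MC NCP LIST EXECUTOR CHARACTERISTICS   ! permanent database

    Look for "Maximum broadcast nonrouters" in the output.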

1244.6Only applicable to routers, not endnodesMIPSBX::thomasThe Code WarriorThu Aug 31 1989 18:321
1244.7I'm experiencing a similar kind of problem...ASHBY::FEATHERSTONEd FeatherstonWed Sep 06 1989 17:0913
I have an MI cluster with 2 8800's and 34 VAXstations, all running DECwindows.
Almost all the DECwindows applications are running remotely on the 8800's. At
least once or twice a day, everyone will lose from one to all of their remote
DECwindows applications at the same time. The logfiles show the connection
aborted error message. I am going nuts trying to figure out the cause. The
applications that get aborted are not all on the same 8800. Some applications
remain running without a problem. 

(Further info: each 8800 has 128MB of memory, running VMS V5.1, DECwindows V1;
the VAXstations are a mixture of VS-II's, VS-II/GPX's, and VS-2000's.)

Any ideas as to what I should be looking for/at?

1244.8Boot requestsCASEE::CLEOVOULOUMarios CleovoulouThu Sep 07 1989 08:0426
    I'll bet the cause is either cluster transitions or boot request
    multicasts.  We almost-totally cured the same problem by:
    
    a)	isolating our MIC from the rest of the ethernet with bridges,
    
    b)	giving NETACP lots of memory, by use of the NETACP$... logicals.
    	Note: when a boot request comes in, NETACP goes through the
        _entire_ nodename database, starting at node 1.1 and going upwards,
    	looking for an entry with a matching HW address (implemented by
        people in area 2, right :-).  NETACP runs at high priority and pages 
        the system to death if it doesn't have enough memory.
    
    c)	defining the boot parameters for our satellites up against "fake"
        nodes in area 1, so NETACP finds them quickly (we are really in
        area 51!),
    
    d)	defining HW addresses, but not load assist parameters, under fake
        area 1 nodes for nodes still on our ethernet segment but NOT part
        of our cluster, so that NETACP finds them and FAILS to load them
        quickly, rather than thrashing through the entire database to not
        find them.
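    For (b), the tuning would look roughly like the lines below, placed
    in SYSTARTUP before DECnet is started (logical names here are from
    memory and the values are only illustrative -- verify both against
    the VMS networking documentation before using):

        $ ! Before @SYS$MANAGER:STARTNET.COM
        $ DEFINE/SYSTEM NETACP$MAXIMUM_WORKING_SET 10000
        $ DEFINE/SYSTEM NETACP$EXTENT 8192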
    
    Regards,
    
    Marios

1244.9re: boot requests and cluster transitionASHBY::FEATHERSTONEd FeatherstonThu Sep 07 1989 09:1414
Thanks for the info. We already have some of your suggestions in place.

	1. We are on a separate ethernet segment isolated behind a bridge
	   (requirement for clusters in Hudson)

	2. We had already given NETACP lots of memory (WSQUOTA of 10000,
	   in last 14 days of uptime the max working set size was 7600)

We are in area 6 so the search doesn't take a long time, but I hadn't thought
about the impact of systems we don't load being on the cable. I like the idea
of using fake area 1 nodes to handle that and will give it a try.

Is there anything that can help the cluster transitions?

1244.10MARVIN::WARWICKWell, that'll never workThu Sep 07 1989 09:4613
    
    RE: Cluster transitions.
    
    Assuming that cluster transitions are really causing your problem (use
    SHOW CLUSTER to see whether a node entering or leaving coincides with
    your problem occurring), see the conference ELKTRA::CLUSTER, where the
    subject is discussed at interminable length !  There are several things
    you can do to tune a cluster to make the transitions short. I have a
    34-38 node LAVC with two uVAX 3600s as boot nodes, and we just do not
    notice satellites coming and going at all.
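    A quick way to do that correlation is to watch the cluster
    continuously and note when members come and go (field names from
    memory -- check SHOW CLUSTER's own HELP):

        $ SHOW CLUSTER/CONTINUOUS
        Command> ADD SYSTEMS,TRANSITION_TIME

    Then compare the transition times against when users report
    losing their windows.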
    
    Trevor

1244.11Update on reply .7ROLL::FEATHERSTONEd FeatherstonWed Sep 13 1989 10:4611
We seem to have minimized the frequency of the problem using the
MAX BROADCAST NONROUTERS suggestion in an earlier reply, but we have one
guaranteed way of reproducing it: adding a new satellite to the cluster
(nodes coming and going don't appear to trigger the problem, though).
We are now trying to isolate what is actually different at that time
(while the users scream, since each time we add a new node they are
guaranteed to lose some windows) as opposed to the normal comings and
goings of satellites.

				/ed/

1244.12Update on the updateASHBY::FEATHERSTONEd FeatherstonWed Sep 20 1989 10:327
We seem to have the last of the problem solved. When adding new nodes into the
cluster the local disk of the new node was initially MSCP served so a page and
swap file could be built. By not taking this option, we were able to add new
nodes to the cluster without any disconnects happening. 

				/ed/

1244.13DECWIN::JMSYNGEJames M Synge, VMS DevelopmentWed Sep 20 1989 17:184
    Why do you think this would make a difference?
    
    James

1244.14Not sure as to why it made a difference...ROLL::FEATHERSTONEd FeatherstonFri Sep 22 1989 13:2820
...when we determined that adding a node caused the problem, whereas rebooting
did not, the only obvious difference between the two scenarios was the
MSCP serving, so we tried it and voila, adding a node no longer caused the
problem.

A guess as to the reason: when the disk is MSCP served, all the satellites
are forced to see the disk, requiring some amount of resource in the paged pool
area. All the workstations are fairly tight on resources, so possibly this was
enough to push them over the edge (just a guess; don't have the time or resources
to verify it).

As a side note, we are still not preventing it 100% of the time, but with the
changes mentioned previously we have drastically reduced the frequency of the
occurrences. (The MAX BROADCAST NONROUTERS item makes a BIG difference for us. The
value was inadvertently reduced on one of our nodes the other day, and suddenly
people were losing stuff left and right. As soon as we raised it back up, things
stabilized very quickly.)

					/ed/