
Conference decwet::networker

Title:	NetWorker
Notice:	kits - 12-14, problem reporting - 41.*, basics 1-100
Moderator:	DECWET::RANDALL.com::lenox

Created:	Thu Oct 10 1996
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	750
Total number of notes:	3361

483.0. " NFS problem and savegroup hang " by EVTAI1::POUSSARD () Thu Mar 13 1997 03:24

	Hi,

	I read the remark in the addendum about unavailable NFS file systems, 
but I would like some clarification.


	I would like to know exactly what happens when an NFS server is 
not responding.

	A customer of mine had the classic problem of the "NFS server 
server not responding, still trying" message. The NFS client machine that 
reported this message was a NetWorker (NSR) server, and the NFS file system 
was mounted with the (hard, intr) options.


	The savegroup command that saves the NSR server's file systems began 
at 19:00 and finished at 19:05. The successful completion message is in 
/nsr/logs/messages.
	At 19:45, the "NFS server server not responding, still trying" 
message appeared on the NSR server (and only on the NSR server).

	At 20:00, the savegroup commands that back up all the NSR clients 
began, but all of them became uninterruptible.
	
	The next morning, the customer tried to umount the NFS file system, 
without success. He noticed that all the savegroup commands had been hung 
since 20:00 the evening before.

	He had to reboot the NSR server to clear the NFS problem.



	Assuming that the savegroup command ran from the root account, I told 
him to verify that root's PATH environment variable contained no NFS path 
on the troublesome NFS server. It did not, so the savegroup command should 
not hang, unless savegroup does something like the df command and tries to 
access all file systems, or uses its own PATH.


	Could someone explain this behavior?



	Thanks for your replies



				Gilles.

T.R	Title	User	Personal Name	Date	Lines
483.1	well, let me try	DECWET::EVANS	NSR Engineering	`Thu Mar 13 1997 11:35`	14
	When NFS goes out to lunch, there's not much any application (such as NetWorker) can do, especially when the filesystem is mounted hard.

	NetWorker passes an RPC message from the NSR server to the NSR client (nsrexecd), which then tries to fstat each filesystem to gather info about "local" systems to back up. It's here that fstat hits the NFS mount point, and if NFS is gone, the fstat system call just does not return.

	I see 2 points of failure: the RPC system call and the fstat system call. Both rely upon the network. Thus, this is a system-level issue (NFS, Unix), not really a NetWorker one.

	Did you try to restart your network?
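A minimal sketch of how a caller could avoid blocking on a dead mount, assuming the probe is allowed to run in a throwaway child process. This is not how NetWorker works; the helper name `stat_with_timeout` and the 5-second default are hypothetical. The parent only waits a bounded time for the child, so a stat that never returns costs the parent at most the timeout.

```python
import subprocess
import sys


def stat_with_timeout(path, timeout_s=5.0):
    """Return True if os.stat(path) completes in a child process within timeout_s.

    Hypothetical helper for illustration only. The stat runs in a child so
    that a call blocked on an unreachable hard-mounted NFS server cannot
    hang the caller.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit status 0 means the stat succeeded; non-zero means it raised.
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in an uninterruptible NFS wait may
        # ignore the signal, so we deliberately do not wait for it again.
        child.kill()
        return False
```

The trade-off is that a stuck child may linger until the NFS server comes back; the sketch only keeps the caller responsive.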
483.2	NFS problem on the NSR server	EVTAI1::POUSSARD		`Fri Mar 14 1997 01:30`	8
	The problem here is that the NFS problem occurred on the NSR server, not on the NSR clients, and the savegroups that started at 20:00 had nothing to do with the NSR server's filesystems.

	Gilles.
483.3	Could Networker look only at its target disks?	SANITY::LEMONS	And we thank you for your support.	`Thu Apr 03 1997 07:10`	60
	Hi

	May I re-open this nfs discussion?

	Last night, we had backups on 6 clients time-out and fail, because these 6 clients all had the same nfs disk mounted. NOTE: none of the clients have this nfs disk, or any nfs disk, listed as a partition for NetWorker to back up. And yet, the NetWorker backups hung. Why?

	When I enter just 'df' on one of the systems on which backups timed-out and failed, I see:

	biggun-23: df
	NFS2 fsstat failed for server cadsys : RPC: Timed out
	^c

	Then, I tried this command, which specifically excludes nfs disks:

	biggun-24: df -t nonfs
	Filesystem                512-blocks     Used  Available Capacity  Mounted on
	root_domain#root              199040   121624      63104    66%    /
	/proc                              0        0          0   100%    /proc
	usr_domain#usr               2347072  1512124     785344    66%    /usr
	var_domain#var               1564352   155288    1394752    11%    /var
	iss_work_domain#iss_work     4110480   197330    3867840     5%    /biggun/iss_work
	proj8_domain#proj8           4110480  2123910    1974256    52%    /biggun/proj8
	proj9_domain#proj9           4110480    92428    4006128     3%    /biggun/proj9
	proj10_domain#proj10         4110480       32    4085296     1%    /biggun/proj10
	proj11_domain#proj11         4110480  3692640     396576    91%    /biggun/proj11
	proj12_domain#proj12         4110480   523910    3560448    13%    /biggun/proj12
	alt_root_domain#root          199040    78546     106512    43%    /alt_root
	alt_usr_domain#usr           2347072       32    2303472     1%    /alt_usr
	alt_var_domain#var           1564352       32    1551568     1%    /alt_var
	biggun-25:

	When NetWorker interrogates the disks mounted on the client, does it:

	1. attempt to list all mounted disks
	2. attempt to list all non-NFS mounted disks
	3. attempt to list only the disks it has been told to back up?

	It appears that option #1 is done, where option #3 should be done, and option #2 would at least work. As doing a list of all mounted disks provides no benefit that I can see, I view Networker's attempt to do so a bug.

	Thoughts?

	Thanks!
	tl
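The difference between options #2 and #3 above can be sketched as two filters over a mount table. This is only an illustration, not NetWorker code: it assumes a mount table in the whitespace-separated "device mountpoint fstype ..." layout (as in Linux /proc/mounts, which differs from the Digital UNIX fstab format), and the function names and the set of network filesystem types are hypothetical.

```python
# Filesystem types treated as network mounts in this sketch (an assumption).
NETWORK_FS_TYPES = {"nfs", "nfs3", "nfs4"}


def parse_mounts(text):
    """Parse 'device mountpoint fstype ...' lines into (mountpoint, fstype) pairs."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and not fields[0].startswith("#"):
            mounts.append((fields[1], fields[2]))
    return mounts


def non_nfs_mount_points(mounts):
    """Option #2: every mounted filesystem except network ones."""
    return [mp for mp, fstype in mounts if fstype not in NETWORK_FS_TYPES]


def listed_mount_points(save_sets, mounts):
    """Option #3: only the mount points explicitly listed as save sets."""
    wanted = set(save_sets)
    return [mp for mp, _fstype in mounts if mp in wanted]


# Made-up sample data for illustration.
SAMPLE = """\
root_domain#root / advfs rw 0 0
usr_domain#usr /usr advfs rw 0 0
cadsys:/export /cadsys nfs rw,soft 0 0
"""
```

With the sample table, `non_nfs_mount_points(parse_mounts(SAMPLE))` yields `/` and `/usr`, while `listed_mount_points(["/usr"], parse_mounts(SAMPLE))` yields only `/usr`. Neither filter touches the filesystems themselves; they work purely from the table, which is the point of both options.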
483.4	try #2	DECWET::EVANS	NSR Engineering	`Thu Apr 03 1997 09:10`	11
	NetWorker passes an RPC message from server to client... which client? All the clients in the savegroup. How did it figure out which routing to use? System calls using BIND, which are the same network stuff as NFS.

	NetWorker relies upon system calls to resolve hostnames. If those system calls result in NFS usage, then you're still stuck in NFS-land. Hence the server-side behaviour.

	This is base Legato code, not Digital porting changes; ergo, we need to file an enhancement request with Legato.
483.5		SANITY::LEMONS	And we thank you for your support.	`Thu Apr 03 1997 10:04`	9
	Hi

	The client is NetWorker for Digital UNIX V4.2B.

	Your reply mentions BIND, and resolving system calls to hostnames. Could I take a step back, and ask why NetWorker attempts to get a list of all disks on the system? That, to me, seems like the problem.

	tl
483.6	check for mount points is important to NetWorker correctness	DECWET::CARRUTHERS	Life gets easier when you realize you can't have everything.	`Thu Apr 03 1997 10:23`	7
	and stat/fstat calls are the standard way to determine if any file is a mount point. As Bruce mentioned in /1, this is a system-level (UNIX, NFS) issue.

	{Remember, not all mount points have to be listed in /etc/fstab. Many is the time I have mounted large remote file systems on my desktop at /mnt and left them mounted for days. I sure am glad NetWorker knows not to back up those file systems through my desktop.}
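The usual stat-based test for a mount point compares a directory with its parent: a different device number means a filesystem boundary was crossed. The sketch below shows that general technique (Python's standard `os.path.ismount` uses the same idea); it is not taken from NetWorker, and the function name is hypothetical. Note that the `lstat` call is exactly the step that would block on an unreachable hard-mounted NFS server.

```python
import os


def is_mount_point(path):
    """Return True if path looks like a mount point, judged by stat results.

    A directory is treated as a mount point when it lives on a different
    device than its parent, or when it is its own parent (the root directory).
    """
    here = os.lstat(path)
    parent = os.lstat(os.path.join(path, os.pardir))
    if here.st_dev != parent.st_dev:
        return True  # crossing into another filesystem
    return here.st_ino == parent.st_ino  # same inode as parent: this is "/"
```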
483.7	soft option	BACHUS::DEVOS	Manu Devos DEC/SI Brussels 856-7539	`Thu Apr 03 1997 11:54`	9
	Hi tl (t?)

	You can also change the fstab file so that the NFS filesystem(s) are mounted with the "soft" option. Then, after a reasonable timeout and number of retries, the fstat/stat system calls give up with an error instead of hanging indefinitely...

	Manu.
483.8		SANITY::LEMONS	And we thank you for your support.	`Fri Apr 04 1997 09:06`	44
	Thanks for this discussion. I still think I'm missing the point.

	I understand that NetWorker relies on UNIX and its add-ons (like nfs) to access the disks that it backs up. If UNIX can't access the disk, then NetWorker can't either. I'm certainly okay with that.

	My concern is that I don't want NetWorker backups to fail on a client when it can't access one of the disks. I want NetWorker to do whatever work it can. I don't understand nfs very well, but I do know that we use nfs 'soft' mounts, as in:

	/usr@cadsrv:/server_usr:ro:0:0:nfs:bg,soft,intr,timeo=12,retrans=5,
	retry=10:

	When a new NetWorker client is created, Saveset has a default value of 'All'. So, NetWorker would have to find the list of all the disks on the system, and back up each one. Right? But we don't do that; we explicitly list each disk/partition we want to save. So there is no need for the (apparent) full-system list of disks that NetWorker tries to obtain.

	I feel that, if the list of Savesets is not 'All', then NetWorker should NOT attempt to list all disks, but should check the status of the disks/partitions listed in the Saveset field ONLY. That would step completely around this NFS problem, as we heed NetWorker's suggestion and do not back up any NFS-mounted disks.

	What I don't completely understand is why NetWorker times out after 33 minutes. My read of the man pages for the mount parameters in /etc/fstab is that the NFS disk access should time out after 6 seconds. Any thoughts on that?

	Thanks!
	tl

	[from the ULTRIX V4.3 'man 8nfs mount' man page:]

	retrans=n   Set number of NFS operation retransmissions (not the
	            mount) to n. The retrans= option applies after the mount
	            has succeeded.

	retry=n     Set number of mount failure retries to n. The retry=
	            option applies to the mount command, itself.

	timeo=n     Set NFS timeout to n tenths of a second.
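The 6-second figure above comes from multiplying timeo=12 (1.2 seconds) by retrans=5. A small sketch of that arithmetic follows, with loudly stated assumptions: that each operation is transmitted `retrans` times in total, and, for the second variant, that the timeout doubles after each retransmission, as some NFS clients do. Whether the ULTRIX or Digital UNIX client behaves either way is not confirmed by the man page excerpt; the function name is hypothetical.

```python
def per_operation_timeout(timeo_tenths, retrans, backoff=False):
    """Worst-case seconds before one soft-mounted NFS operation gives up.

    Assumes `retrans` transmissions in total. With backoff=True the wait
    doubles after every transmission (an assumption, not a documented
    property of any particular client).
    """
    wait = timeo_tenths   # current wait, in tenths of a second
    total = 0             # accumulated wait, in tenths of a second
    for _ in range(retrans):
        total += wait
        if backoff:
            wait *= 2
    return total / 10.0


flat = per_operation_timeout(12, 5)                   # 5 waits of 1.2 s each
doubling = per_operation_timeout(12, 5, backoff=True) # 1.2 + 2.4 + 4.8 + 9.6 + 19.2 s
```

Either way the limit applies per NFS operation, so a program that touches the dead mount repeatedly would pay it once per call. That is one possible reason a much longer overall delay could be seen, though the thread does not establish it.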
483.9		DECWET::FARLEE	Insufficient Virtual um...er....	`Fri Apr 04 1997 09:52`	20
	Terry,

	I agree with you that the behavior you suggest is reasonable, and what "should happen". I will try to walk through the code when I get a chance to find out what is really happening, but it won't be for a week or so.

	Can you tell me if the client times out during the probe, or partway through a save? That would distinguish between the two possibilities that I can see:

	1) Regardless of the "savesets" field, we check every mounted filesystem at "probe" time when we're trying to figure out what to save. If this is happening, we'll fix it.

	2) During the saving of a filesystem, we stat each directory that we walk into. If that directory happens to be the mountpoint for an NFS filesystem, we hang. Not sure what we could do about this one.

	Kevin
483.10		KAHLUA::LEMONS	And we thank you for your support.	`Fri Apr 04 1997 10:19`	41
	Hi Kevin

	Thanks for validating my suggestion, and for offering to walk the code at a later date.

	Here are some lines from the /nsr/logs/messages file. Please let me know if they don't answer your question.

	Apr 3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ asavegrp: authtype nsrexec
	Apr 3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ has been inactive for 30 minutes since Thu Apr 3 02:21:12 1997.
	Apr 3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ is being abandoned by asavegrp.
	Apr 3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:probe abandoned.
	Apr 3 05:38:34 robot1 last message repeated 10 times
	Apr 3 05:38:34 robot1 crsupp:
	Apr 3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:/ asavegrp: authtype nsrexec
	Apr 3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:/ has been inactive for 32 minutes since Thu Apr 3 03:25:51 1997.
	Apr 3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:/ is being abandoned by asavegrp.
	Apr 3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:probe abandoned.
	Apr 3 05:38:34 robot1 last message repeated 7 times
	Apr 3 05:38:34 robot1 crsupp:
	Apr 3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:/ asavegrp: authtype nsrexec
	Apr 3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:/ has been inactive for 30 minutes since Thu Apr 3 01:15:07 1997.
	Apr 3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:/ is being abandoned by asavegrp.
	Apr 3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:probe abandoned.
	Apr 3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:probe abandoned.
	Apr 3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat /cadsys/aloe_build: Connection timed out
	Apr 3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat /cadsys/tsc: Connection timed out

	Thanks!
	tl
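To separate clients whose probe was abandoned from clients that failed during a save, log lines of the shape quoted above can be sorted with a small script. This is a sketch only: the regular expression is inferred from these few sample lines rather than from any documented log format, and the function name is hypothetical.

```python
import re

# Matches "... crsupp: * <client>:<rest>"; inferred from the sample lines only.
LINE_RE = re.compile(r"crsupp: \* (?P<client>[^:\s]+):(?P<rest>.*)$")


def classify_failures(log_text):
    """Return (probe_abandoned_clients, save_stat_error_clients), both sorted."""
    probe, save = set(), set()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # e.g. "last message repeated N times" or empty crsupp lines
        client = match.group("client")
        rest = match.group("rest").strip()
        if rest.startswith("probe abandoned"):
            probe.add(client)
        elif "save: cannot stat" in rest:
            save.add(client)
    return sorted(probe), sorted(save)


SAMPLE_LOG = """\
Apr 3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ is being abandoned by asavegrp.
Apr 3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:probe abandoned.
Apr 3 05:38:34 robot1 last message repeated 10 times
Apr 3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat /cadsys/tsc: Connection timed out
"""
```

On the sample, `classify_failures(SAMPLE_LOG)` puts cadosf.hlo.dec.com in the probe list and cadsrv.hlo.dec.com in the save list.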