Conference decwet::networker

Title: NetWorker
Notice: kits - 12-14, problem reporting - 41.*, basics 1-100
Moderator: DECWET::RANDALL.com::lenox
Created: Thu Oct 10 1996
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 750
Total number of notes: 3361

483.0. "NFS problem and savegroup hang" by EVTAI1::POUSSARD () Thu Mar 13 1997 03:24

	Hi,

	I read the remark in the addendum about unavailable NFS file systems, 
but I would like some clarification.


	I would like to know exactly what happens when an NFS server is 
not responding.

	A customer of mine had the classic "NFS server server not 
responding, still trying" problem. The NFS client machine that logged this 
message was a NetWorker (NSR) server, and the NFS file system was mounted 
with the (hard, intr) options.


	The savegroup command that saves the NSR server filesystems started 
at 19:00 and finished at 19:05. The successful completion message is in 
/nsr/logs/messages.
	At 19:45, the "NFS server server not responding, still trying" 
message appeared on the NSR server (and only on the NSR server).

	At 20:00, the savegroup commands that back up all the NSR clients 
started, but they all became uninterruptible.
	
	The next morning, the customer tried to umount the NFS file system, 
without success. He noticed that all the savegroup commands had been hung 
since 20:00 the evening before.

	He had to reboot the NSR server to clear the NFS problem.



	Assuming that the savegroup command ran from the root account, I told 
him to verify that root's PATH environment variable contained no NFS path 
on the problem NFS server. It did not, so the savegroup command should not 
hang, unless savegroup does something like the df command and tries to 
access all file systems, or uses a specific PATH of its own.


	Can someone explain this behavior?



	Thanks for your replies



				Gilles.
T.R  Title  User  Personal Name  Date  Lines
483.1  well, let me try  DECWET::EVANS  NSR Engineering  Thu Mar 13 1997 11:35  14
when NFS goes out to lunch, there's not much *any* application (such as
 NetWorker) can do. Especially when the filesystem is mounted hard.

NetWorker passes an RPC message from NSR-server to NSR-client (nsrexecd)
 which then  tries to fstat each filesystem to gather info about "local"
 systems to backup. It's here that fstat hits the NFS mountpoint, and if
 NFS is gone, the fstat system call just does not return.

I see 2 points of failure: the RPC system call, and the fstat system call -
 both rely upon the network.

Thus, this is a system-level issue (NFS, Unix), not really a NetWorker one.

Did you try to restart your network???
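
To illustrate the hang described above: a stat on a dead, hard-mounted NFS
path may simply never return, so a caller that wants to stay responsive has
to issue the stat somewhere it can walk away from. What follows is a minimal
Python sketch, not NetWorker code. The name stat_with_deadline and the
5-second default are hypothetical, chosen only for illustration.

```python
import subprocess
import sys
import time


def stat_with_deadline(path, seconds=5.0):
    """Stat `path` in a child process and wait at most `seconds`.

    Returns True if the stat succeeded, False if it failed with an error,
    and None if the child had not finished by the deadline (for example,
    because it is blocked on an unresponsive NFS server).
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        rc = child.poll()
        if rc is not None:
            return rc == 0
        time.sleep(0.05)
    # Best effort only: a child blocked on a hard mount may ignore the
    # signal, so we deliberately do not wait for it to exit.
    child.kill()
    return None


if __name__ == "__main__":
    print(stat_with_deadline("/"))
```

The parent never calls stat itself, so only the child can get stuck. On a
hard mount the stuck child may linger until the NFS server comes back.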
483.2  NFS problem on the NSR server  EVTAI1::POUSSARD  Fri Mar 14 1997 01:30  8
    	The problem here is that the NFS problem occurred on the NSR server,
    not on the NSR clients, and the savegroups which started at 20:00 had 
    nothing to do with the NSR server filesystems.
    
               
    	Gilles.
    
    
483.3  Could Networker look only at its target disks?  SANITY::LEMONS  And we thank you for your support.  Thu Apr 03 1997 08:10  60
    Hi
    
    May I re-open this nfs discussion?  Last night, we had backups on 6
    clients time-out and fail, because these 6 clients all had the same nfs
    disk mounted.  NOTE: none of the clients have this nfs disk, or any nfs
    disk, listed as a partition for NetWorker to back up.  And yet, the
    NetWorker backups hung.  Why?
    
    When I enter just 'df' on one of the systems on which backups timed-out
    and failed, I see:
    biggun-23: df
    NFS2 fsstat failed for server cadsys : RPC: Timed out
    ^c
    Then, I tried this command, which specifically excludes nfs disks:
    biggun-24: df -t nonfs
    Filesystem                512-blocks        Used   Available Capacity  Mounted on
    root_domain#root              199040      121624       63104    66%    /
    /proc                              0           0           0   100%    /proc
    usr_domain#usr               2347072     1512124      785344    66%    /usr
    var_domain#var               1564352      155288     1394752    11%    /var
    iss_work_domain#iss_work     4110480      197330     3867840     5%    /biggun/iss_work
    proj8_domain#proj8           4110480     2123910     1974256    52%    /biggun/proj8
    proj9_domain#proj9           4110480       92428     4006128     3%    /biggun/proj9
    proj10_domain#proj10         4110480          32     4085296     1%    /biggun/proj10
    proj11_domain#proj11         4110480     3692640      396576    91%    /biggun/proj11
    proj12_domain#proj12         4110480      523910     3560448    13%    /biggun/proj12
    alt_root_domain#root          199040       78546      106512    43%    /alt_root
    alt_usr_domain#usr           2347072          32     2303472     1%    /alt_usr
    alt_var_domain#var           1564352          32     1551568     1%    /alt_var
    biggun-25:
    
    When NetWorker interrogates the disks mounted on the client, does it:
    1. attempt to list all mounted disks
    2. attempt to list all non-NFS mounted disks
    3. attempt to list only the disks it has been told to back up?
    
    It appears that option #1 is done, where option #3 should be done, and
    option #2 would at least work.
    
    As doing a list of all mounted disks provides no benefit that I can
    see, I view NetWorker's attempt to do so as a bug.
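
The three options listed above can be sketched as follows. This is only an
illustration of the difference between them, not how NetWorker actually
builds its list. The mount table, the file system types, the saveset list,
and the function name filesystems_to_probe are all hypothetical.

```python
# Hypothetical mount table: (mount point, file system type).
MOUNTS = [
    ("/", "advfs"),
    ("/usr", "advfs"),
    ("/biggun/proj8", "advfs"),
    ("/cadsys", "nfs"),
]


def filesystems_to_probe(mounts, strategy, savesets=()):
    """Return the mount points a client would touch under each option."""
    if strategy == "all":        # option 1: every mounted disk
        return [point for point, _ in mounts]
    if strategy == "non_nfs":    # option 2: skip NFS mounts
        return [point for point, fstype in mounts if fstype != "nfs"]
    if strategy == "savesets":   # option 3: only what was asked for
        wanted = set(savesets)
        return [point for point, _ in mounts if point in wanted]
    raise ValueError("unknown strategy: %r" % (strategy,))


if __name__ == "__main__":
    for name in ("all", "non_nfs", "savesets"):
        print(name, filesystems_to_probe(MOUNTS, name, ["/", "/usr"]))
```

Only the first option ever touches the NFS mount point, which is where an
unresponsive server would be noticed.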
    
    Thoughts?
    
    Thanks!
    tl
483.4  try #2  DECWET::EVANS  NSR Engineering  Thu Apr 03 1997 10:10  11
NetWorker passes an RPC message from server to client... which client??

  all the clients in the savegroup. How did it figure out which routing to
 use?? system calls using BIND, which are the same network stuff as NFS.

NetWorker relies upon system calls to resolve hostnames. If those system
 calls result in NFS usage occurring, then you're still stuck in NFS-land.
 Hence the server-side behaviour.

This is base Legato code, not Digital porting changes, ergo, we need to
 file an enhancement request to Legato.
483.5  SANITY::LEMONS  And we thank you for your support.  Thu Apr 03 1997 11:04  9
    Hi
    
    The client is NetWorker for Digital UNIX V4.2B.
    
    Your reply mentions BIND, and resolving system calls to hostnames. 
    Could I take a step back, and ask why NetWorker attempts to get a list
    of all disks on the system?  That, to me, seems like the problem.
    
    tl
483.6  check for mount points is important to NetWorker correctness  DECWET::CARRUTHERS  Life gets easier when you realize you can't have everything.  Thu Apr 03 1997 11:23  7
and stat/fstat calls are the standard way to determine if any file is a mount
point.  As Bruce mentioned in /1, this is a system-level (UNIX, NFS) issue.  

{Remember, not all mount points have to be listed in /etc/fstab.
Many is the time I have mounted a large, remote file system on my desktop at
/mnt and left it mounted for days.  I sure am glad NetWorker knows 
not to back up those file systems through my desktop.}
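
For readers unfamiliar with the stat-based check mentioned above, here is a
minimal Python sketch of the usual technique: a directory is a mount point
if it lives on a different device than its parent, or if it is its own
parent (the root). This illustrates the general idea only, not NetWorker's
code, and the function name is_mount_point is hypothetical.

```python
import os


def is_mount_point(path):
    """Return True if `path` is the root of a mounted file system."""
    path = os.path.realpath(path)
    # Each of these stat calls is exactly the kind of call that blocks
    # when `path` sits on an unresponsive hard-mounted NFS server.
    here = os.stat(path)
    parent = os.stat(os.path.join(path, os.pardir))
    if here.st_dev != parent.st_dev:
        return True                       # crossed onto another device
    return here.st_ino == parent.st_ino   # "/" is its own parent


if __name__ == "__main__":
    print(is_mount_point("/"))
```

Python's standard library offers os.path.ismount for the same purpose.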
483.7  soft option  BACHUS::DEVOS  Manu Devos DEC/SI Brussels 856-7539  Thu Apr 03 1997 12:54  9
    Hi tl (t?)
    
    You can also change the fstab file so that the NFS filesystem(s) are
    mounted with the "soft" option. Then, after a reasonable number of
    timeouts and retries, the fstat/stat system calls give up with an
    error instead of hanging indefinitely...
    
    Manu.
    
483.8  SANITY::LEMONS  And we thank you for your support.  Fri Apr 04 1997 10:06  44
    Thanks for this discussion.  I still think I'm missing the point.  I
    understand that NetWorker relies on UNIX and its add-ons (like nfs) to
    access the disks that it backs up.  If UNIX can't access the disk, then
    NetWorker can't either.  I'm certainly okay with that.
    
    My concern is that I don't want NetWorker backups to fail on a client,
    when it can't access one of the disks.  I want NetWorker to do whatever
    work it can.  I don't understand nfs very well, but I do know that we
    use nfs 'soft' mounts, as in:
    
    /usr@cadsrv:/server_usr:ro:0:0:nfs:bg,soft,intr,timeo=12,retrans=5,retry=10:
    
    When a new NetWorker client is created, Saveset has a default value
    of 'All'.  So, NetWorker would have to find the list of all the disks
    on the system, and back up each one.  Right?
    
    But we don't do that; we explicitly list each disk/partition we want to
    save.  So there is no need for the (apparent) full-system list of disks
    that NetWorker tries to obtain.
    
    I feel that, if the list of Savesets is not 'All', then NetWorker
    should NOT attempt to list all disks, but should check the status of
    the disks/partitions listed in the Saveset field ONLY.  That would step
    completely around this NFS problem, as we heed NetWorker's suggestion,
    and do not back up any NFS-mounted disks.
    
    What I don't completely understand is why NetWorker times out after 33
    minutes.  My read of the man pages for the mount parameters in
    /etc/fstab is that the NFS disk access should time out after 6 seconds.
    Any thoughts on that?
    
    Thanks!
    tl
    
    [from the ULTRIX V4.3 'man 8nfs mount' man page:]
    retrans=n     Set number of NFS operation retransmissions (not the
    mount) to n. The retrans= option applies after the mount has succeeded.
    
    retry=n       Set number of mount failure retries to n. The retry=
    option applies to the mount command, itself.
    
    timeo=n       Set NFS timeout to n tenths of a second.
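
The 6-second figure above comes from multiplying timeo by retrans. As a
rough worked example in Python, here is that arithmetic next to what it
would become if the client doubled its timeout after every retransmission.
The doubling is an assumption: many NFS clients back off this way, but
whether it applies to this system is not confirmed here. Treating retrans
as the total number of attempts is also an assumption, and the function
name total_wait_tenths is hypothetical.

```python
def total_wait_tenths(timeo, retrans, backoff=1):
    """Total wait for ONE NFS operation, in tenths of a second.

    timeo   -- initial timeout in tenths of a second (timeo=n)
    retrans -- number of attempts before giving up (assumed meaning)
    backoff -- multiplier applied to the timeout after each attempt
               (1 = constant timeout, 2 = assumed doubling)
    """
    total = 0
    wait = timeo
    for _ in range(retrans):
        total += wait
        wait *= backoff
    return total


if __name__ == "__main__":
    # Values taken from the fstab entry: timeo=12, retrans=5.
    print(total_wait_tenths(12, 5) / 10, "seconds without backoff")
    print(total_wait_tenths(12, 5, backoff=2) / 10, "seconds with doubling")
```

Either figure applies to a single NFS operation. A command that issues many
operations against the dead server pays that cost once per operation, so
the overall delay can be much longer than one timeout cycle.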
    
483.9  DECWET::FARLEE  Insufficient Virtual um...er....  Fri Apr 04 1997 10:52  20
Terry,

I agree with you that the behavior you suggest is reasonable, and
what "should happen".  I will try to walk through the code when I get 
a chance to find out what is really happening, but it won't be for a week or so.

Can you tell me if the client times out during the probe, or partway
through a save?  That would distinguish between the two possibilities that
I can see:

1) Regardless of the "savesets" field, we check every mounted filesystem
	at "probe" time when we're trying to figure out what to save.
	If this is happening, we'll fix it.

2) During the saving of a filesystem, we stat each directory that
	we walk into.  If that directory happens to be the mountpoint
	for an NFS filesystem, we hang.  Not sure what we could do 
	about this one.

Kevin
483.10  KAHLUA::LEMONS  And we thank you for your support.  Fri Apr 04 1997 11:19  41
    Hi Kevin
    
    Thanks for validating my suggestion, and for offering to walk the code
    at a later date.
    
    Here are some lines from the /nsr/logs/messages file.  Please let me
    know if they don't answer your question.
    
    Apr  3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ asavegrp: authtype nsrexec
    Apr  3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ has been inactive for 30 minutes since Thu Apr  3 02:21:12 1997.
    Apr  3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ is being abandoned by asavegrp.
    Apr  3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:probe abandoned.
    Apr  3 05:38:34 robot1 last message repeated 10 times
    Apr  3 05:38:34 robot1 crsupp:
    Apr  3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:/ asavegrp: authtype nsrexec
    Apr  3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:/ has been inactive for 32 minutes since Thu Apr  3 03:25:51 1997.
    Apr  3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:/ is being abandoned by asavegrp.
    Apr  3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:probe abandoned.
    Apr  3 05:38:34 robot1 last message repeated 7 times
    Apr  3 05:38:34 robot1 crsupp:
    Apr  3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:/ asavegrp: authtype nsrexec
    Apr  3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:/ has been inactive for 30 minutes since Thu Apr  3 01:15:07 1997.
    Apr  3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:/ is being abandoned by asavegrp.
    Apr  3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:probe abandoned.
    Apr  3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:probe abandoned.
    Apr  3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat /cadsys/aloe_build: Connection timed out
    Apr  3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat /cadsys/tsc: Connection timed out
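
To sort entries like the ones above into "abandoned during the probe" and
"stat failed during a save", a small Python sketch along these lines may
help. It assumes each log entry has first been joined onto a single line,
and it matches only the message wording seen in this excerpt. The function
name classify and the regular expression are hypothetical, not part of any
NetWorker tool.

```python
import re
from collections import Counter

# Matches "* <client>:<rest of message>" as seen in the excerpt.
ENTRY = re.compile(r"\* (?P<client>[\w.-]+):(?P<rest>.*)$")

SAMPLE = [
    "Apr  3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ is being abandoned by asavegrp.",
    "Apr  3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:probe abandoned.",
    "Apr  3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat /cadsys/tsc: Connection timed out",
]


def classify(lines):
    """Count probe-time and save-time failures per client."""
    counts = Counter()
    for line in lines:
        match = ENTRY.search(line)
        if not match:
            continue
        client, rest = match.group("client"), match.group("rest")
        if rest.startswith("probe abandoned"):
            counts[(client, "probe")] += 1
        elif "save: cannot stat" in rest:
            counts[(client, "save")] += 1
    return counts


if __name__ == "__main__":
    for (client, phase), n in sorted(classify(SAMPLE).items()):
        print(client, phase, n)
```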
    
    Thanks!
    tl