T.R | Title | User | Personal Name | Date | Lines |
---|
483.1 | well, let me try | DECWET::EVANS | NSR Engineering | Thu Mar 13 1997 11:35 | 14 |
| When NFS goes out to lunch, there's not much *any* application (such as
NetWorker) can do, especially when the filesystem is mounted hard.
NetWorker passes an RPC message from NSR-server to NSR-client (nsrexecd),
which then tries to fstat each filesystem to gather info about "local"
systems to back up. It's here that fstat hits the NFS mountpoint, and if
NFS is gone, the fstat system call just does not return.
I see 2 points of failure: the RPC call and the fstat system call;
both rely upon the network.
Thus, this is a system-level issue (NFS, Unix), not really a NetWorker one.
Did you try to restart your network???
|
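To illustrate the hang described in .1, here is a minimal sketch in Python (not NetWorker code) of how a caller can avoid blocking forever on a stat of a dead hard-mounted filesystem. The function name `stat_with_timeout` and the 5-second default are assumptions made for illustration only. The stuck call itself cannot be cancelled; the sketch only lets the caller give up waiting.

```python
import os
import threading


def stat_with_timeout(path, timeout_seconds=5.0):
    """Return ("ok", stat_result), ("error", exception) or ("timeout", None).

    Hypothetical helper: the stat runs in a separate thread so that the
    caller regains control even if the call never returns.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = ("ok", os.stat(path))
        except OSError as exc:
            outcome["value"] = ("error", exc)

    # The worker thread may stay blocked inside the kernel for as long as
    # the NFS server is unreachable; we only stop waiting for it.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)
    if thread.is_alive():
        return ("timeout", None)
    return outcome["value"]
```

A caller would treat `"timeout"` as "skip this filesystem and carry on" rather than as a fatal error.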
483.2 | NFS problem on the NSR server | EVTAI1::POUSSARD | | Fri Mar 14 1997 01:30 | 8 |
| The problem here is that the NFS problem occurred on the NSR server,
not on the NSR clients, and the savegroups, which started at 20:00, had
nothing to do with the NSR server filesystems.
Gilles.
|
483.3 | Could Networker look only at its target disks? | SANITY::LEMONS | And we thank you for your support. | Thu Apr 03 1997 08:10 | 60 |
| Hi
May I re-open this nfs discussion? Last night, we had backups on 6
clients time out and fail, because these 6 clients all had the same nfs
disk mounted. NOTE: none of the clients have this nfs disk, or any nfs
disk, listed as a partition for NetWorker to back up. And yet, the
NetWorker backups hung. Why?
When I enter just 'df' on one of the systems on which backups timed out
and failed, I see:
biggun-23: df
NFS2 fsstat failed for server cadsys : RPC: Timed out
^c
Then, I tried this command, which specifically excludes nfs disks:
biggun-24: df -t nonfs
Filesystem                 512-blocks     Used  Available  Capacity  Mounted on
root_domain#root               199040   121624      63104       66%  /
/proc                               0        0          0      100%  /proc
usr_domain#usr                2347072  1512124     785344       66%  /usr
var_domain#var                1564352   155288    1394752       11%  /var
iss_work_domain#iss_work      4110480   197330    3867840        5%  /biggun/iss_work
proj8_domain#proj8            4110480  2123910    1974256       52%  /biggun/proj8
proj9_domain#proj9            4110480    92428    4006128        3%  /biggun/proj9
proj10_domain#proj10          4110480       32    4085296        1%  /biggun/proj10
proj11_domain#proj11          4110480  3692640     396576       91%  /biggun/proj11
proj12_domain#proj12          4110480   523910    3560448       13%  /biggun/proj12
alt_root_domain#root           199040    78546     106512       43%  /alt_root
alt_usr_domain#usr            2347072       32    2303472        1%  /alt_usr
alt_var_domain#var            1564352       32    1551568        1%  /alt_var
biggun-25:
When NetWorker interrogates the disks mounted on the client, does it:
1. attempt to list all mounted disks
2. attempt to list all non-NFS mounted disks
3. attempt to list only the disks it has been told to back up?
It appears that option #1 is done, where option #3 should be done, and
option #2 would at least work.
As doing a list of all mounted disks provides no benefit that I can
see, I view NetWorker's attempt to do so as a bug.
Thoughts?
Thanks!
tl
|
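The three options listed in .3 can be sketched as a small selection function. This is an illustration in Python, not how NetWorker is implemented; the `Mount` type, the strategy names, and the sample mount table (including the `advfs` type label and the NFS entry) are all assumptions made up for the example.

```python
from typing import NamedTuple


class Mount(NamedTuple):
    device: str
    mount_point: str
    fs_type: str


def mounts_to_probe(mounts, strategy, save_sets=()):
    """Pick which mount points would be stat'ed under each option."""
    if strategy == "all":            # option 1: every mounted disk
        return [m.mount_point for m in mounts]
    if strategy == "non_nfs":        # option 2: skip NFS mounts
        return [m.mount_point for m in mounts
                if not m.fs_type.startswith("nfs")]
    if strategy == "listed_only":    # option 3: only configured save sets
        wanted = set(save_sets)
        return [m.mount_point for m in mounts if m.mount_point in wanted]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical mount table; the last entry stands in for the shared NFS disk.
SAMPLE_MOUNTS = [
    Mount("root_domain#root", "/", "advfs"),
    Mount("usr_domain#usr", "/usr", "advfs"),
    Mount("proj8_domain#proj8", "/biggun/proj8", "advfs"),
    Mount("/some_export@cadsys", "/nfs_mount", "nfs"),
]
```

Under `"listed_only"` with save sets `["/", "/usr"]`, the NFS mount point is never touched, which is the behaviour being asked for here.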
483.4 | try #2 | DECWET::EVANS | NSR Engineering | Thu Apr 03 1997 10:10 | 11 |
| NetWorker passes an RPC message from server to client... which client??
All the clients in the savegroup. How did it figure out which routing to
use?? System calls using BIND, which are the same network stuff as NFS.
NetWorker relies upon system calls to resolve hostnames. If those system
calls result in an NFS usage occurring, then you're still stuck in NFS-land.
Hence the server-side behaviour.
This is base Legato code, not Digital porting changes; ergo, we need to
file an enhancement request with Legato.
|
483.5 | | SANITY::LEMONS | And we thank you for your support. | Thu Apr 03 1997 11:04 | 9 |
| Hi
The client is NetWorker for Digital UNIX V4.2B.
Your reply mentions BIND, and system calls that resolve hostnames.
Could I take a step back, and ask why NetWorker attempts to get a list
of all disks on the system? That, to me, seems like the problem.
tl
|
483.6 | check for mount points is important to NetWorker correctness | DECWET::CARRUTHERS | Life gets easier when you realize you can't have everything. | Thu Apr 03 1997 11:23 | 7 |
| and stat/fstat calls are the standard way to determine if any file is a mount
point. As Bruce mentioned in /1, this is a system-level (UNIX, NFS) issue.
{Remember, not all mount points have to be listed in /etc/fstab.
Many is the time I have mounted large, remote file systems on my desktop at
/mnt and left them mounted for days. I sure am glad NetWorker knows
not to back up those file systems through my desktop.}
|
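For readers unfamiliar with the stat-based check mentioned in .6, the usual technique compares the device number of a directory with that of its parent. Below is a minimal Python sketch of that idea (similar in spirit to `os.path.ismount`); the function name `is_mount_point` is an assumption, and this is not NetWorker's code. Note that both `lstat` calls are exactly the kind of call that would block on a dead hard-mounted NFS filesystem.

```python
import os
import stat


def is_mount_point(path):
    """Return True if `path` looks like the root of a mounted filesystem."""
    here = os.lstat(path)
    if stat.S_ISLNK(here.st_mode):
        # A symbolic link is never itself a mount point.
        return False
    parent = os.lstat(os.path.join(path, os.pardir))
    if here.st_dev != parent.st_dev:
        # Different device than the parent: something is mounted here.
        return True
    # Same device and same inode as the parent: this is "/" itself.
    return here.st_ino == parent.st_ino
```

This works for mounts that are not listed in /etc/fstab, which is why a backup tool cannot rely on the fstab file alone.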
483.7 | soft option | BACHUS::DEVOS | Manu Devos DEC/SI Brussels 856-7539 | Thu Apr 03 1997 12:54 | 9 |
| Hi tl (t?)
You can also change the fstab file such that the NFS filesystem(s) are
mounted with the "soft" option. So, after a reasonable number of
timeouts and retries, the fstat/stat system calls give up with an
error instead of hanging indefinitely...
Manu.
|
483.8 | | SANITY::LEMONS | And we thank you for your support. | Fri Apr 04 1997 10:06 | 44 |
| Thanks for this discussion. I still think I'm missing the point. I
understand that NetWorker relies on UNIX and its add-ons (like nfs) to
access the disks that it backs up. If UNIX can't access the disk, then
NetWorker can't either. I'm certainly okay with that.
My concern is that I don't want NetWorker backups to fail on a client,
when it can't access one of the disks. I want NetWorker to do whatever
work it can. I don't understand nfs very well, but I do know that we
use nfs 'soft' mounts, as in:
/usr@cadsrv:/server_usr:ro:0:0:nfs:bg,soft,intr,timeo=12,retrans=5,retry=10:
When a new NetWorker client is created, Saveset has a default value
of 'All'. So, NetWorker would have to find the list of all the disks
on the system, and back up each one. Right?
But we don't do that; we explicitly list each disk/partition we want to
save. So there is no need for the (apparent) full-system list of disks
that NetWorker tries to obtain.
I feel that, if the list of Savesets is not 'All', then NetWorker
should NOT attempt to list all disks, but should check the status of
the disks/partitions listed in the Saveset field ONLY. That would step
completely around this NFS problem, as we heed NetWorker's suggestion,
and do not back up any NFS-mounted disks.
What I don't completely understand is why NetWorker times out after 33
minutes. My read of the man pages for the mount parameters in
/etc/fstab is that the NFS disk access should time out after 6 seconds.
Any thoughts on that?
Thanks!
tl
[from the ULTRIX V4.3 'man 8nfs mount' man page:]
retrans=n   Set number of NFS operation retransmissions (not the
            mount) to n. The retrans= option applies after the
            mount has succeeded.
retry=n     Set number of mount failure retries to n. The retry=
            option applies to the mount command, itself.
timeo=n     Set NFS timeout to n tenths of a second.
|
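The 6-second figure in .8 follows from multiplying timeo (in tenths of a second) by retrans. The Python sketch below reproduces that arithmetic from the option string quoted in the note. The function names are made up for illustration, and the calculation assumes no backoff between retransmissions; a real NFS client may wait longer on each retry, and every separate stat of a path on the dead mount would pay the delay again, so this is a lower bound for one operation rather than an explanation of the 33 minutes.

```python
def parse_mount_options(option_string):
    """Split "bg,soft,timeo=12" into {"bg": True, "soft": True, "timeo": 12}."""
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        if sep and value.isdigit():
            options[name] = int(value)
        elif sep:
            options[name] = value
        else:
            options[name] = True
    return options


def naive_soft_timeout_seconds(options):
    """timeo is in tenths of a second; assume retrans tries with no backoff."""
    return options["timeo"] * options["retrans"] / 10


opts = parse_mount_options("bg,soft,intr,timeo=12,retrans=5,retry=10")
print(naive_soft_timeout_seconds(opts))  # 12 * 5 / 10 = 6.0 seconds
```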
483.9 | | DECWET::FARLEE | Insufficient Virtual um...er.... | Fri Apr 04 1997 10:52 | 20 |
| Terry,
I agree with you that the behavior you suggest is reasonable, and
what "should happen". I will try to walk through the code when I get
a chance to find out what is really happening, but it won't be for a week or so.
Can you tell me if the client times out during the probe, or partway
through a save? That would distinguish between the two possibilities that
I can see:
1) Regardless of the "savesets" field, we check every mounted filesystem
at "probe" time when we're trying to figure out what to save.
If this is happening, we'll fix it.
2) During the saving of a filesystem, we stat each directory that
we walk into. If that directory happens to be the mountpoint
for an NFS filesystem, we hang. Not sure what we could do
about this one.
Kevin
|
483.10 | | KAHLUA::LEMONS | And we thank you for your support. | Fri Apr 04 1997 11:19 | 41 |
| Hi Kevin
Thanks for validating my suggestion, and for offering to walk the code
at a later date.
Here are some lines from the /nsr/logs/messages file. Please let me
know if they don't answer your question.
Apr 3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ asavegrp: authtype nsrexec
Apr 3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ has been inactive for 30 minutes since Thu Apr 3 02:21:12 1997.
Apr 3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ is being abandoned by asavegrp.
Apr 3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:probe abandoned.
Apr 3 05:38:34 robot1 last message repeated 10 times
Apr 3 05:38:34 robot1 crsupp:
Apr 3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:/ asavegrp: authtype nsrexec
Apr 3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:/ has been inactive for 32 minutes since Thu Apr 3 03:25:51 1997.
Apr 3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:/ is being abandoned by asavegrp.
Apr 3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:probe abandoned.
Apr 3 05:38:34 robot1 last message repeated 7 times
Apr 3 05:38:34 robot1 crsupp:
Apr 3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:/ asavegrp: authtype nsrexec
Apr 3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:/ has been inactive for 30 minutes since Thu Apr 3 01:15:07 1997.
Apr 3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:/ is being abandoned by asavegrp.
Apr 3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:probe abandoned.
Apr 3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:probe abandoned.
Apr 3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat /cadsys/aloe_build: Connection timed out
Apr 3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat /cadsys/tsc: Connection timed out
Thanks!
tl
|
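Kevin's question in .9 (probe versus mid-save) can be answered mechanically from messages like those in .10. The following Python sketch sorts such lines into "probe abandoned" clients and "cannot stat" failures during a save. The regular expression and the function name `classify` are assumptions based only on the lines quoted in this thread (with wrapped lines rejoined), not on any documented NetWorker log format.

```python
import re

# Matches "... * <client>:<target> <message>" as seen in the quoted lines.
LINE_RE = re.compile(r"\* (?P<client>[\w.-]+):(?P<target>\S+) (?P<message>.*)$")
STAT_PREFIX = "save: cannot stat "


def classify(lines):
    """Return (clients abandoned at probe time, {client: [unstat-able paths]})."""
    probe_abandoned = set()
    stat_failures = {}
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        client, target, message = match.group("client", "target", "message")
        if target == "probe" and message.startswith("abandoned"):
            probe_abandoned.add(client)
        elif message.startswith(STAT_PREFIX):
            path = message[len(STAT_PREFIX):].split(":", 1)[0]
            stat_failures.setdefault(client, []).append(path)
    return sorted(probe_abandoned), stat_failures


# A few lines taken from the messages file quoted in this note.
SAMPLE_LOG = [
    "Apr 3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:probe abandoned.",
    "Apr 3 05:38:34 robot1 last message repeated 10 times",
    "Apr 3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:probe abandoned.",
    "Apr 3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat /cadsys/aloe_build: Connection timed out",
    "Apr 3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat /cadsys/tsc: Connection timed out",
]
```

Run over the sample, this puts cadosf and cadtls in the probe group and cadsrv in the stat-failure group, which suggests both of Kevin's cases may be present in the same night's log.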