NFS notes



Background

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

NFS is a *nix way of using a remote filesystem almost as if it were local.


Versions

NFSv1 was Sun's in-house test version

NFSv2 (1989), RFC 1094

  • UDP only
  • 2GB filesize limit (32-bit offsets; later versions do 64-bit)

NFSv3 (1995), RFC 1813

  • added TCP transport (which itself has a few upsides)
  • added locking
  • more efficient writing(verify)

NFSv4 (2000, 2003), RFC 3530

  • added state
  • better security (how?)
  • better performance (how?)
  • uses only port 2049.
Doesn't need to interact with rpcbind, lockd, or rpc.statd. Will locally interact with rpc.mountd

NFSv4.1 (2010), RFC 5661

  • adds parallel NFS (pNFS), which means striping data onto multiple servers (sort of a network RAID0). Metadata and locking are still centralized.
  • preliminary linux support since (approx) 2.6.32 [1]
  • http://lwn.net/Articles/313437/





Related services/processes

For NFS to work you need

  • rpcbind, and port 111 open
  • nfsd, and port 2049 open
  • mountd, on an arbitrary port chosen at startup, though it can be fixed for ease of firewalling (example below)

Since NFS is often used only within subnets/VLANs, this "needs to be open" often isn't much of an issue.
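
If you do want to pin mountd to a fixed port for firewalling (mentioned above), one way on debian/ubuntu is via /etc/default/nfs-kernel-server -- a sketch, with an arbitrary example port, and option names that may vary between versions:

RPCMOUNTDOPTS="--manage-gids --port 33333"

...then restart the NFS server and allow that port alongside 111 and 2049.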


rpcbind and portmap(per)

Versions 3 and 4 of this service are known as rpcbind; version 2 was known as portmap(per).

A service on TCP/UDP port 111.

Takes requests for ONC-RPC program/version pairs and returns the UDP/TCP port that service is running on. It is the first thing ONC-RPC clients (such as NFS clients) talk to, and it also allows version negotiation.


If you want to see what sort of thing it reports, run:

rpcinfo -p 

rpcinfo is also a nice test whether you can reach the portmapper on another host.
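
For example, from a client (the hostname is a placeholder):

rpcinfo -p fileserver

...should list portmapper, nfs, and mountd entries if the server is reachable and its services are registered.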


nfsd

Receives service requests, and typically hands them off to mountd.


mountd

Carries out requests received from nfsd.





Setting up

Server install

Install NFS server. This tends to also set it up to start automatically.

In debian/ubuntu you probably want:

sudo apt-get install nfs-kernel-server


The default exports file is likely empty, so you'll certainly want to look at the next step:

Server config - shares (/etc/exports)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

/etc/exports is a list of local paths to expose to the network. It has entries like:

/home 192.168.0.1(rw) 192.168.0.2(rw)
/files *(rw,all_squash,subtree_check)
/home 192.168.0.0/255.255.255.0(rw)
/files *(rw,sync,no_subtree_check,anonuid=222,anongid=1001)

... in other words: a path, and one or more specifications of which hosts/nets can access it and which options apply to them.


After editing, you probably want to tell the daemon to re-read this file:

sudo exportfs -ra

(You can also restart the daemon, but that's not very nice to connected clients)
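
To check what is currently being exported, and with which effective options, you can use exportfs on the server, or showmount from a client (servername being a placeholder):

sudo exportfs -v

showmount -e servername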


Specifying hosts/networks that can access a share:

  • specific IP
  • network, as in IP plus netmask (style: /22 or /255.255.252.0. Default netmask is effectively /32)
  • * means all hosts (see also hosts.allow, hosts.deny - and your firewall)
  • @group refers to a NIS netgroup


For each such host/net, you can apply further options:

  • tcp or udp
NFSv4 default is tcp
TCP is generally preferred - performs better on loaded networks (for a few reasons), fewer stale handle issues.
  • ro or rw - for read-only or read-write sharing
  • user squashing controls who things get done as on the server (not very useful as a general UID translator)
    • root_squash (default) - remote root is locally mapped to local anonymous user
    • no_root_squash - remote root is local root (only do this when you know you really want it. Can make sense on thin clients)
    • all_squash - map all incoming uids/gids to the local anonymous user.
    • anonuid and anongid - specify the anonymous user to use (per-share setting)
  • sync or async (largely a speed tweak)
    • sync means nfs will reply only once writes are committed to disk
    • async means it reacts before this. Has to be requested explicitly because technically it violates NFS specs. The risk is basically that of any writeback cache: data loss when the system is interrupted (e.g. power loss) before data is actually written to disk.
    • the default used to be async; current nfs-utils defaults to sync(verify)
  • no_subtree_check (also a speed tweak)
    • NFS usually checks that each served file is in the exported tree. That check isn't trivial, so disabling that makes some things faster
    • subtree checking doesn't deal well with files being renamed while a client has them open(verify)
    • has some security implications when you also use no_root_squash (verify)
  • wdelay / no_wdelay
if the server suspects another related write will follow immediately, it can delay committing the current write to disk, which can reduce disk seeks and increase write speed.
wdelay makes sense when you know a lot of writes are sequential;
if you know most are random, the delay is unnecessary and you're better off disabling it (no_wdelay)
  • secure (default) / insecure - require / don't require requests to come from client ports below 1024
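
Putting a few of those options together (paths, addresses, and ids made up for illustration):

/srv/data     192.168.1.0/24(ro,sync,no_subtree_check)
/srv/scratch  192.168.1.0/24(rw,async,no_subtree_check,all_squash,anonuid=1001,anongid=1001)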

Server config - access control (hosts)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Optional, and it may be more centralized/easier to manage this via your general firewall.


...yet NFS tutorials often do it via the /etc/hosts* files, for example:

in /etc/hosts.allow:

ALL: 2.11.1.2[4-5]
ALL: 192.168.1.*


...or, instead of ALL:

portmap: 192.168.0.1 , 192.168.0.2
lockd: 192.168.0.1 , 192.168.0.2
rquotad: 192.168.0.1 , 192.168.0.2
mountd: 192.168.0.1 , 192.168.0.2
statd: 192.168.0.1 , 192.168.0.2

and in /etc/hosts.deny

portmap: ALL
lockd:ALL
mountd:ALL
rquotad:ALL
statd:ALL

On permissions, names and IDs

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


NFS itself sends UIDs and GIDs -- which are just numbers. The user names you see (and the ownership of files you create) come from the NFS client's own account list. Group membership is (implicitly) also the client's.


You can get away with that if you have only one client that is the sole user of the NFS server, since then there is only one view on the data. So hooking a large file server up to just one workhorse is simple.


When you share an NFS export among multiple clients, things are more complex.

You can't really avoid either synchronizing the user lists across hosts, or mapping them somehow. NIS (YP) is one way of automating that management.


Note that NFSv4 does things a little differently than NFS2 and 3. (I have yet to figure out how)
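
One NFSv4-specific detail worth knowing: file ownership goes over the wire as name@domain strings, translated by rpc.idmapd, so client and server generally want a matching Domain setting in /etc/idmapd.conf. A minimal sketch (the domain is just an example):

[General]
Domain = example.com

[Mapping]
Nobody-User = nobody
Nobody-Group = nogroup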

Client mount

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

An NFS fstab line looks like:

host:/exported/path      /mountpoint   nfs4  defaults  0 0


filesystem types:

  • nfs - for NFSv2 and NFSv3 (Default is to try 3, fall back to 2. You can force one via nfsvers=2 or nfsvers=3)
  • nfs4 - NFSv4 (doesn't do 3 or 2(verify))
    • typically preferable now, both for features and performance
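
A one-off mount from the command line looks something like the following, the second line forcing NFSv3 (host and paths are placeholders):

sudo mount -t nfs4 fileserver:/exported/path /mountpoint

sudo mount -t nfs -o nfsvers=3 fileserver:/exported/path /mountpoint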


Speed tweaks (client)

Noatime

See the general noatime discussion.

Arguably that's the only one that has much effect these days.


Attribute cache

ac (default) / noac - whether clients cache file attributes (refreshed regularly; the timeouts are tunable, roughly in the 3..60 second range by default).

Makes the client react faster, but means multiple clients looking at the same thing won't see/detect file changes (e.g. updates by mtime) as quickly.


This can break things that make specific assumptions (which exactly?(verify)).

You can often work around that, e.g. with file locking (or lock files?) for signalling between them
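
If the attribute cache gets in the way, relevant mount options include noac (disable it) and actimeo (set all the attribute-cache timeouts at once). For example, in fstab (values illustrative):

fileserver:/exported/path   /mountpoint   nfs4   rw,actimeo=3   0 0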


Data cache

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.




wsize, rsize

wsize, rsize - buffer sizes for writes and reads.

The best choice can vary with kernel, NFS implementation, number of clients, hardware, and sync/async (e.g. kernels before 2.6 had a relevant limitation that you no longer need to care about).

tl;dr: leave them be, particularly with NFSv4 (unless you want to do thorough testing)

  • Defaults and limitations vary: often 4KByte or 8KByte(verify), or auto-negotiating the largest supported value(verify) - which is roughly why explicitly adding rsize=8192,wsize=8192 can make things a little more predictable for admins
  • Minimum is 4KByte(verify); the value is rounded to the nearest multiple of 1KByte (or nearest power of two?)(verify)
  • Recent clients and servers may support much larger block sizes(verify)
  • NFSv4 behaviour is basically sane?(verify)
  • NFSv3 has a maximum defined by the server, often 32KByte or 64KByte(verify)
  • NFSv2 protocol has an implicit maximum of 8KByte
  • When using UDP you may wish to keep this at or below your MTU(verify)
  • risks:
    • small reads are slower with a large block size?(verify)
    • handling many clients with very large sizes tends to be choppy(verify)
    • some (mostly older) network cards/drivers don't deal with large chunks well at all
    • larger blocks will more easily cause timeouts, so if you set larger blocks, also set a higher timeout value; if you set this large on TCP, look at your TCP (receive and transmit) window sizes as well
    • a single-client benchmark is not remotely representative of issues under real load
    • it isn't yet clear to me how large is too large, but over 128KByte it seems to degrade more easily
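
If you do want to pin them, they are just mount options, e.g. in fstab (sizes and names illustrative):

fileserver:/exported/path   /mountpoint   nfs   rw,rsize=32768,wsize=32768   0 0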




Historical

  • ≤NFSv2: async (but since NFSv3 this probably isn't worth it anymore!)

On hanging accesses

If you mind these, you may prefer hard,intr over the default (hard,nointr).


Avoid soft, or know what that means to robustness:

  • on read-only filesystems it means broken-off reads (often acceptable),
  • on read-write filesystems it may mean half-written data (often not so acceptable).
of course, no network filesystem deals gracefully with a more-than-transient disconnect
  • intr / nointr (default is nointr)
with intr, we allow signals (think Ctrl-C, kill) to interrupt a client's NFS calls
upside: we can break hanging
downside: any impatient user can cause half-updated/half-finished files
Arguably better than soft/timeout (see below), because you can avoid automatically failing, but still avoid fairly hard locks.


How to handle NFS request timeouts:

  • hard (default) - retry indefinitely.
    • The cause of the "various filesystem things are hanging" problem, basically because this means a client will wait until it reconnects to a server and complete its operations(verify). Is also safer for your data, for the same reason.
  • soft - time out after some amount of tries.
    • Useful when you want to deal with servers that went offline.
    • ...but this can break reading programs, and corrupt data if you abort in the middle of writes.
    • Arguably, soft is more acceptable on read-only than on read-write filesystems
    • If you don't change the defaults of the two settings mentioned below, we fail after 18 seconds (6+6+6) on TCP (42 seconds (6+12+24) on UDP). (verify)
  • how often / how many:
    • timeo - the retry interval, in tenths of a second. Default is 60 (6 seconds). NFS over UDP does doubling backoff up to 60 seconds (600); NFS over TCP retries at the given value each time.
    • retrans - (default: 3, 4, or 5) how many retries to do before giving up (when using soft), or before emitting a message and continuing (when using hard)


RPC can't really tell the difference between congestion, server load, or even a stopped NFS server. So yes, stopping an NFS server will cause a lot of waiting on all clients, and there is little you can do about that. (It's a fairly low-level service you're taking away, after all.)

Note that there is a significant difference between UDP and TCP transport: UDP is fire-and-forget, and will retransmit the whole request (which may be large). TCP has reliable transport by itself, so timeout/retransmit only really happens when the server has serious resource problems (arguably meaning you should be prepared to wait longer, i.e. use highish timeout values, that are longer than TCP timeouts)

If you use soft, you don't want to set the timeout too low, because you don't want to abort IO if the problem was temporary (e.g. congestion, server load). And when congestion/load are to some degree expected, perhaps you should fix that rather than use mount options that may break programs.
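
As one concrete shape of that tradeoff (values illustrative, and mind the warnings about soft above):

sudo mount -t nfs -o soft,timeo=600,retrans=2 fileserver:/exported/path /mountpoint

...which retries at 60-second intervals for a couple of tries and then gives up, rather than hanging forever.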

useful commands

Semi-sorted

NFS and quotas

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Speed tweaks (server)

General

  • Use TCP if you can (UDP is sensitive to congestion, fragmenting)
  • Use NFSv3 over v2 if you can; v3 can do asynchronous writes more safely
  • Use NFSv4 over v3 if you can, v4 has some details that allow for more aggressive caching (e.g. file delegation)
  • Ensure your system-wide TCP window settings are high enough for your typical RTT (on both sides)

At least 256KB may be a good idea
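
On Linux that comes down to the usual TCP window sysctls, e.g. in /etc/sysctl.conf (example values only: a 256KB default window as suggested above, with larger maxima; tune to your bandwidth-delay product):

net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
net.ipv4.tcp_rmem = 4096 262144 4194304
net.ipv4.tcp_wmem = 4096 262144 4194304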




Server-specific

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
  • increase the number of threads nfsd allows. Spreading work over more threads is usually preferable to making clients wait. At least a few per server CPU core can't seem to hurt; the value also depends on how many clients you serve (see below for how to change it).
You can check how necessary this is by going to clients and running nfsstat -rc, which should show a lack of retransmissions (situations where clients had to whine because they didn't get served), e.g.:
Client rpc stats:
calls      retrans    authrefrsh
508486     0          508478

Client rpc stats:
calls      retrans    authrefrsh
3409166    330        0
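
To actually change the thread count: on debian/ubuntu it is typically RPCNFSDCOUNT in /etc/default/nfs-kernel-server (followed by a service restart), and you can also adjust it on the fly (16 being just an example value):

sudo rpc.nfsd 16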


Consider:

  • higher readahead (example below)
  • simpler scheduler for less latency (but do real evaluation for real workloads!), e.g.
echo deadline > /sys/block/sdb/queue/scheduler
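
Readahead is a per-blockdevice setting; one way to raise it (device and value are examples; the value is in 512-byte sectors):

sudo blockdev --setra 4096 /dev/sdb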


WebNFS

Windows

Errors

mount.nfs4: Protocol not supported

Likely comes from the portmapper - the step that helps negotiate version and transport.

In which case it often either means:

  • The transport is the issue
e.g. TCP versus UDP, IPv4 versus IPv6
  • the two sides cannot agree on a version
since 4 is backwards compatible with 3 and 2(verify), this usually only happens in very specific cases, e.g. when the client forces 4 and the server is 3
One past example was Redhat/fedora disabling 4 in favour of 4.1 (verify)


In both cases, it tends to be very informative to go to the client and run:

rpcinfo servername |egrep "service|nfs"

(In my case it showed only IPv6 listens -- because gluster bound its own NFS server (on IPv4 only))


If it's not clearly this, check your server logs


Hanging

What to do about hanging requests depends a little on the operation.


When the server really up and left

Will lead to hanging directory requests.

You can remove the mountpoint with umount -f -l (or just -l), where the -l (lazy) detaches the mount from the filesystem, and postpones the actual umount (in this case to never).

This is not a true fix, but means no one should be left hanging.
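
For example:

sudo umount -f -l /mountpoint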



Hanging mount request

Often means rpcbind or nfsd are not reachable (firewalled).

...because when you mount, what happens is roughly:

  • The client asks the server (rpcbind) which port the NFS server is using
  • the client connects to the NFS server (nfsd)
  • nfsd passes the request to mountd
which determines whether access is permitted

You can check whether your client can see rpcbind with rpcinfo -p IP


If rpcbind is unreachable but nfsd is, it may be that the mount request gives up on that part, then tries nfsd directly and succeeds(verify)


Also consider

  • name resolve issues (particularly on the server)


http://www.troubleshooters.com/linux/nfs.htm

Stale file handles, rm: cannot remove `.nfssomebignumber': Device or resource busy

Those .nfs[bignumber] files appear when a currently open file on an NFS share gets removed. They will disappear when the process that has them open closes it.

You can check what process has them open with something like lsof . Keep in mind that this may be a process on another host.(verify)


The above describes one form of stale file handles.

When this happens with directories, you'll see something like:

$ ls
.: Stale File Handle


mount.nfs: Stale NFS file handle

pNFS

See also