Networking notes - IPv4 notes

From Helpful
Revision as of 18:26, 30 July 2024 by Helpful (talk | contribs) (→‎The send and receive windows)


IPv4 addresses and nets

IPv4 network notation

The following are the same:

  • 192.0.2.0/255.255.255.0
  • 192.0.2.0/24 (CIDR notation)
  • 192.0.2/24
  • 192.0.2.


192.0.2/24 is an example of CIDR notation with the trailing zero octets left off. Similarly, 10/8 means 10.0.0.0/8.

192.168. or 127. is another lazy variant, but less powerful as it's basically a start-of-string match, and implicitly octet-boundary-only.

Applications vary in which they support. Usually they stick to the first one or two.


These examples all mention masks/subnets split at whole octets. When you refer to subnets, you say how many (leftmost) bits matter to its definition.

Depending on what you are configuring, you can specify both (sub)net address and a netmask, such as 192.168.110.168/29. The application you hand this to may require you to have the IP have the rightmost bits cleared, as in that example (29 having a 3-bit host part, and 168 being 10101000). The rest of the bits are used by things on that net, so don't matter to the net specification.
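As a small illustration of the "rightmost bits cleared" requirement, here is a sketch using Python's standard ipaddress module, which happens to be one such strict consumer by default. The .170 address is just an arbitrary address inside the same /29, picked for the example.

```python
import ipaddress

# 192.168.110.168/29: the three host bits (rightmost) are already zero.
net = ipaddress.ip_network("192.168.110.168/29")
print(net.network_address, net.netmask)

# 192.168.110.170/29 has host bits set, so strict parsing rejects it...
try:
    ipaddress.ip_network("192.168.110.170/29")
    strict_ok = True
except ValueError:
    strict_ok = False

# ...while strict=False clears the host bits for you.
lenient = ipaddress.ip_network("192.168.110.170/29", strict=False)
print(strict_ok, lenient)
```

Other tools make their own choice here, so this only shows the idea, not what any particular application does.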

IPv4 special addresses

Main relevant RFCs:

  • RFC3330, 'Special-Use IPv4 Addresses' (refers to most other RFCs mentioned here; since superseded by RFC 5735 and then RFC 6890)
  • RFC1918, 'Address Allocation for Private Internets'
  • RFC1700, 'Assigned Numbers'


Well known special cases (see mainly RFC 1700 for these):

  • 255.255.255.255/32 is for 'limited broadcast', in that it should not be routed between nets. (technically in 240/4, see below)
  • 0.0.0.0/32 is regularly used with a meaning like 'any local IP' (a sort of 'this host'), particularly when configuring what IP(s)/adapters a service should listen on.
  • 127/8: Anything to this network should go to the local host, without going to a network card
It is common convention to configure 127.0.0.1/32 on a virtual loopback device.
This net is typically used for local-only uses such as IPC and some services.
Local-only IPC can often also be served via (unix) domain sockets, which you can use much like network sockets but which are cheaper than going through the full network stack, since the kernel knows beforehand that it's local-only.


Private networks

The private network ranges are defined by RFC 1918:

  • 10.0.0.0/8, that is, 10.0.0.0 through 10.255.255.255
  • 172.16.0.0/12, that is, 172.16.0.0 through 172.31.255.255
  • 192.168.0.0/16, that is, 192.168.0.0 through 192.168.255.255


These are private in that these will not be routed onto the public internet.

This means you can use these networks at will, with no fear of conflicting with anyone else who uses such networks elsewhere.


To get hosts on IPv4 private nets to talk to the internet, you will need a gateway that does address translation (at home this will be your modem/router).

To the rest of the world, you look like you come from that gateway's public IP.


They were defined to delay the exhaustion of public IPv4 addresses. While these ~17.9 million addresses cannot exist on the public internet, hosts on a private net also don't count towards the use of public IP addresses.
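To check the size of the private ranges, a quick sketch with Python's standard ipaddress module (the variable names are just for this example):

```python
import ipaddress

private_nets = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]

sizes = {}
for cidr in private_nets:
    net = ipaddress.ip_network(cidr)
    sizes[cidr] = net.num_addresses
    # net[0] and net[-1] are the first and last address in the range
    print(f"{cidr:<16} {net[0]} through {net[-1]}  ({net.num_addresses} addresses)")

total = sum(sizes.values())
print("total:", total)
```

The total comes to a little under 18 million, which is tiny next to the roughly 4.3 billion addresses in all of IPv4.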

Which is why private nets are very commonly used for the home LAN behind your modem, and in other cases where you care that hosts can reach the internet, but the internet doesn't need to initiate connections to you directly.


Notes:

  • Some of the above statements have exceptions - see e.g. the subnets used in VPN, what VLANs do to subnets, and more.
  • This is mostly an IPv4 concept; in IPv6 it's a different story.

Other special cases

Other special cases:

  • 240/4 (240.0.0.0-255.255.255.255): "Reserved for future use"
  • 192.0.2/24: intended for examples in documentation and code, and should not be used on the internet (nor should it be confused with 192.168/16)
  • 198.18/15: to be used for bandwidth tests (see RFC2544)


There are various ranges reserved by IANA but not used (and sometimes unlikely ever to be used for the purpose they were apparently reserved for).

This includes 5/8 (which hamachi used) and various others, apparently including 128/8, 191.255/16, 192.0.0/24, 223.255.255/24 and the already-mentioned 240/4.


Also related:


Multicast
Link-only network

The 169.254/16 link-only network is an unroutable network described by RFC3927. (Microsoft has its own name for it, Automatic Private IP Addressing (APIPA))

These addresses may be assigned when a host configured to use DHCP doesn't get a response. (Note: it may keep trying to reach DHCP while such an IP is configured, it may not).


The idea is roughly that if your DHCP server is broken, hosts on that subnet may still see each other and communicate: every computer with this behaviour will end up on the same subnet, and probably not with the same IP (it's a 64K-large range and each host picks a number in it randomly).


In practice it isn't too useful because people want internets, but I've seen it used to copy a few files.



Bogons and martians

Bogons refer to packets that claim to come from networks that are currently not allocated for anyone to use (bogon space).


Martians are packets that claim to come from unroutable nets.


Both are good indicators that the traffic is spoofed, or sometimes that a router is very dumb or malfunctioning.



Subnetting(, routing)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Routed networking is largely concerned not with nodes or hosts, but with the network ranges they are on: if routers can put a packet on the net that the target host is on, switches and hosts themselves take care of the last step.


An IP (sub)network is often written/stored as an address combined with a mask, which together imply the range of addresses that are on it.


From such a (IP_on_a_subnet, subnet_mask) pair you can derive the size of the net, the network address, the broadcast address, and the addresses that may be hosts.

(When referring to a network, you usually use the network address as the IP to mention in that IP/mask pair, e.g. 192.168.1.0/24, but in a lot of practical cases, any IP in that range (say, 192.168.1.47/24) will work equivalently, exactly because network code can derive the network address with just one bitmask operation)
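The derivation is short enough to sketch with plain integer bit operations. The function name and the returned dict are made up for this example; ipaddress is only used to convert between dotted notation and integers.

```python
import ipaddress

def describe(ip: str, mask: str) -> dict:
    """Derive network, broadcast, size and host range from an (IP, mask) pair."""
    ip_int = int(ipaddress.IPv4Address(ip))
    mask_int = int(ipaddress.IPv4Address(mask))

    network = ip_int & mask_int                      # the one bitmask operation
    broadcast = network | (~mask_int & 0xFFFFFFFF)   # all host bits set
    to_ip = lambda n: str(ipaddress.IPv4Address(n))
    return {
        "network": to_ip(network),
        "broadcast": to_ip(broadcast),
        "size": broadcast - network + 1,
        # First/last host only make sense for masks shorter than /31.
        "first_host": to_ip(network + 1),
        "last_host": to_ip(broadcast - 1),
    }

info = describe("192.168.1.47", "255.255.255.0")
print(info)
```

Any address in the range gives the same result, which is why 192.168.1.47/24 and 192.168.1.0/24 are often interchangeable in practice.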


There are different equivalent ways to refer to a network:

192.168.124.168/255.255.255.248  (a little more reflective of the network stack logic)
192.168.124.168/29               (equivalent, shorter for us to write)

The /number refers to how many leftmost bits are set (Technically, consecutive set bits are not strictly necessary in (sub)networks, but there is more confusion than added value to non-consecutive-bit network masking).

You can also mention them by start and end address, although this is more for reporting than anything else. It can be convenient for human checks when subnets don't start or end on byte edges (anything except /0, /8, /16, /24, and /32). For example, '145.99.239.168 through 145.99.239.175' makes more immediate sense to me than 145.99.239.168/29.


When splitting a given net into subnets, you could think of it as using the bits for three different things: the existing network, the new subnetwork, and the host bits.

Note: this is only an aid to modelling, as there is no hard difference between the network and subnetwork bits, but I'd say it helps.

For example, if you take the private 192.168.0.0/16 network (192.168.0.0 through 192.168.255.255) and want to use it as 256 separate /24 networks (192.168.0.0/24, 192.168.1.0/24, 192.168.2.0/24, etc.), on a bit level you can think of it as:

nnnnnnnn nnnnnnnn ssssssss hhhhhhhh
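The same split, sketched with Python's standard ipaddress module plus a bit-level look at one address. The address 192.168.2.47 is an arbitrary example.

```python
import ipaddress

parent = ipaddress.ip_network("192.168.0.0/16")
subnets = list(parent.subnets(new_prefix=24))
print(len(subnets), subnets[0], subnets[1], subnets[-1])

# Bit-level view of one address: 16 network bits, 8 subnet bits, 8 host bits.
addr = int(ipaddress.IPv4Address("192.168.2.47"))
subnet_index = (addr >> 8) & 0xFF   # the 'ssssssss' part
host_part = addr & 0xFF             # the 'hhhhhhhh' part
print(subnet_index, host_part)
```

Here the address falls in the third /24 (index 2) as host number 47 within it.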


Note that if one host has the right IP but the wrong netmask (such as 192.168.1.1/16 when it should be 192.168.1.1/24), things may go weird. Directed traffic may still work, but that host's broadcasts probably will not, because it derives a different broadcast address (192.168.255.255 rather than 192.168.1.255).



The earlier note about "routing is towards networks, not nodes" is true mostly because of what routing tables contain. What they contain also implies what sort of address the source/target is. For example, a routing table may look something like:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
145.99.239.168  0.0.0.0         255.255.255.248 U     0      0        0 eth0
192.168.0.0     0.0.0.0         255.255.255.0   U     0      0        0 eth1
5.0.0.0         0.0.0.0         255.0.0.0       U     0      0        0 ham0
127.0.0.0       0.0.0.0         255.0.0.0       U     0      0        0 lo
0.0.0.0         192.168.2.1     0.0.0.0         UG    0      0        0 eth0

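When deciding which interface a packet should go to, the packet's destination IP is tested against each entry (look at destination/genmask). The most specific entry that applies (the one with the longest mask) is used; in a table listed from most to least specific, as above, that is also the first one that matches.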


The last entry is often the default gateway, in an "anything that wasn't matched by the above is routed here" sort of rule.

This is handy in home networks, because that will include all of the public internet, and you know that your gateway will be doing the right thing (NAT) on your behalf.
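A rough sketch of that lookup, using the example table above. It picks the most specific matching entry (longest mask), which for this table gives the same answer as taking the first match in the listed order. The function name and tuple layout are made up for illustration; a real kernel uses more efficient structures.

```python
import ipaddress

# (destination, genmask, gateway, interface), copied from the table above.
ROUTES = [
    ("145.99.239.168", "255.255.255.248", "0.0.0.0", "eth0"),
    ("192.168.0.0", "255.255.255.0", "0.0.0.0", "eth1"),
    ("5.0.0.0", "255.0.0.0", "0.0.0.0", "ham0"),
    ("127.0.0.0", "255.0.0.0", "0.0.0.0", "lo"),
    ("0.0.0.0", "0.0.0.0", "192.168.2.1", "eth0"),
]

def lookup(dst: str):
    """Return (gateway, interface) for a destination IP, or None if no route."""
    dst_int = int(ipaddress.IPv4Address(dst))
    best = None
    for dest, mask, gateway, iface in ROUTES:
        mask_int = int(ipaddress.IPv4Address(mask))
        dest_int = int(ipaddress.IPv4Address(dest))
        if (dst_int & mask_int) == dest_int:
            # For contiguous masks, a larger integer means a longer prefix.
            if best is None or mask_int > best[0]:
                best = (mask_int, gateway, iface)
    return None if best is None else (best[1], best[2])

print(lookup("192.168.0.20"))   # on a directly attached net
print(lookup("8.8.8.8"))        # only the default route matches
```

A gateway of 0.0.0.0 in the result means "no gateway needed, the destination is on the attached net".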



VLANs

NAT

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


NAT itself refers to translating addresses, which can be a somewhat confusing subject.

  • source NAT is stateful, and does half of full NAT, rewriting source addresses/ports
a.k.a. one-to-many NAT, mainly used for outgoing connections to public services.
  • destination NAT is stateful, and does the other half, rewriting target addresses/ports.

(So yes, full stateless NAT can be imitated with two stateful NATs)

  • In the same context, full NAT rewrites packets to the end of making networks appear differently
stateless in the sense that there is no need for memory of connections, as there often is with source NAT and destination NAT


The first two are in some contexts called SNAT and DNAT (e.g. varied *nix networking), but other things have been acronym'd SNAT, and multiple things are called DNAT. Also, what linux calls SNAT and DNAT are both types of static NAT, not dynamic NAT (and there are distinct things called dynamic NAT). Aren't acronyms just so useful?

Note that you also need ip forwarding on for most of this, and some distros have it off by default, in which case you need to:

echo 1 > /proc/sys/net/ipv4/ip_forward

...or, preferably, set it in the configuration that the system applies at bootup (on many distros, net.ipv4.ip_forward=1 in /etc/sysctl.conf or a file under /etc/sysctl.d/).



Source NAT and masquerading

In practice, Source NAT is mostly used to share an internet connection in a home network which internally has a private (IPv4) network.


In that case there is one host that does the translation (which is very typically also that network's gateway).

It takes connections from the local net that it notes are bound to go outside, rewrites the source IP so that the response will come back to the IP that that gateway has on said outside (where outside = internet).


It remembers that connection, so that once a packet does get back, it knows which local/private address to move that packet to.

Note that there is technically a destination-NAT step implicit in every source-NAT situation: the reply packet's destination is rewritten back to the local address (iptables remembered this) and the packet gets back to the original node. Yet the way this is usually configured, this is implied by the source-NATting.


e.g. in linux,

iptables -t nat -A POSTROUTING -o ppp0 -j SNAT --to $PUBLIC_IP

...where PUBLIC_IP is the SNATting node's outside, public IP.


Masquerade is a variant of SNAT for this that automatically picks the outgoing interface's IP as the new source IP, which is useful when that IP is assigned dynamically. The following is similar to the SNAT rule above:

iptables -t nat -A POSTROUTING -o ppp0 -j MASQUERADE
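To make the "remembers that connection" part concrete, here is a toy model of the bookkeeping a source-NAT gateway does. It is a sketch under simplifying assumptions (no timeouts, no protocol distinction, sequential port allocation), not how netfilter implements it. The class, its methods, and the addresses (taken from documentation ranges) are all hypothetical.

```python
from itertools import count

class SourceNat:
    """Toy model of a source-NAT gateway's connection memory."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self._ports = count(first_port)
        self._out = {}    # (private_ip, private_port, remote) -> public_port
        self._back = {}   # (public_port, remote) -> (private_ip, private_port)

    def outgoing(self, src_ip, src_port, dst_ip, dst_port):
        """Rewrite the source of a packet leaving the private net."""
        remote = (dst_ip, dst_port)
        key = (src_ip, src_port, remote)
        if key not in self._out:
            public_port = next(self._ports)
            self._out[key] = public_port
            self._back[(public_port, remote)] = (src_ip, src_port)
        return (self.public_ip, self._out[key], dst_ip, dst_port)

    def incoming(self, src_ip, src_port, dst_port):
        """Find the private host a reply belongs to (the implicit DNAT step)."""
        return self._back.get((dst_port, (src_ip, src_port)))

nat = SourceNat("203.0.113.5")
sent = nat.outgoing("192.168.0.10", 51000, "198.51.100.7", 443)
reply_target = nat.incoming("198.51.100.7", 443, sent[1])
stranger = nat.incoming("198.51.100.99", 443, sent[1])
print(sent, reply_target, stranger)
```

The last lookup returns None: a packet from a host we never contacted has no remembered connection, which is why unsolicited incoming traffic needs explicit destination NAT.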

Destination NAT

Using Destination NAT (linux abbreviates this DNAT) by itself means rewriting things that come in.

It is probably mostly used for port forwarding: if you e.g. have a routing DSL modem and want to connect to an internal computer's remote desktop, web server, or whatever else, use rules like:

iptables -t nat -A PREROUTING -d $externalip -p tcp -m tcp --dport 3389 -j DNAT --to-destination $internalip:3389
iptables -t nat -A PREROUTING -d $externalip -p tcp -m tcp --dport 80   -j DNAT --to-destination $internalip:80

...assuming said modem takes such rules somehow, or is connected to a computer that does so for it. The logic's sound.



STUN, TURN, ICE

https://dyte.io/blog/webrtc-102-demystifying-ice/



See also:

UPnP, IGD, NAT-PMP, etc.

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Two different protocols with the same intent: negotiating automatic port forwarding, so that manually configuring modem DNAT for port forwarding becomes unnecessary.

UPnP implements IGD and is Microsoft's implementation, NAT-PMP is Apple's. The latter came later(verify) and at least initially seemed somewhat restricted to Apple products, and (perhaps so) UPnP is better supported.(verify)



IPv4 multicast

IP-level implementation of the more general multicast idea


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


IGMP

Internet Group Management Protocol (IGMP) is how devices and multicast-aware switches make IPv4 multicast actually go.



ICMP

Meanings of common messages

ICMP (Internet Control Message Protocol) is essentially a meta-protocol for IP that signals some common problems, including physical, routing and configuration problems.


The meaning of some common user messages that come in via ICMP (there are a bunch more, see e.g. [2]):

  • Net Unreachable (from a router)
basically means a router along the way has no applicable route for the address,
meaning it doesn't know where to send a packet to get it to its target network.
If you know the host is up, but you get this error, it usually means a routing table somewhere is faulty - and some (or all) hosts elsewhere can't get to it
If you get this on basically all addresses, on your home network, this often means you do not have a default route/gateway set, or that gateway is not properly set up. (...which is basically the same case, but in the other direction)
  • Host Unreachable (from a router) - the route to the network the host is in is known, but the host within it cannot be seen by whatever node is returning this.
The host may simply have recently gone offline, or may be purposefully unresponsive(verify). One reason is ARP lookup failure(verify).
This itself can be caused by bad subnet/routing config, where the router thinks it is routing to the correct subnet, but the host does not respond on that subnet(verify).
  • Protocol Unreachable (from the target host(verify))
Assuming you're using IP, it often means a very minimal device that has omitted support for either UDP or TCP, and is signaling this.
  • Port Unreachable (from the target host(verify))
Host is reachable and the protocol responds, but the requested port is not responding,
probably either because nothing is listening/accepting on it, or the firewall is blocking this port.


The ICMP Cannot Fragment message happens when packets were sent with the 'do not fragment' flag set, but a router knows (according to the applicable route) it has to fragment.

Unnecessary fragmenting can easily happen e.g. when you tunnel things, such as with IPsec, or when your internet connection is PPPoA or PPPoE (PPP over ATM or Ethernet, respectively).

To minimize unnecessary fragmenting, Path MTU discovery finds the largest MTU that works along a path, basically by trying a number of sizes (with the 'do not fragment' flag set) and seeing when this message stops happening.
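The "try sizes until the message stops" idea can be sketched as a search against a simulated path. This is an illustration only: the probe function and the list of per-link MTUs are stand-ins for actually sending packets, and real implementations can shortcut the search because the ICMP message usually reports the next hop's MTU.

```python
def probe(size, link_mtus):
    """Simulate a don't-fragment packet: True if it fits through every link."""
    return all(size <= mtu for mtu in link_mtus)

def discover_path_mtu(link_mtus, low=68, high=1500):
    """Binary-search the largest size that passes without a 'cannot fragment'.

    68 is the smallest MTU an IPv4 link may have; 1500 is assumed here to be
    our own interface's MTU.
    """
    if not probe(low, link_mtus):
        raise ValueError("even the minimum size does not fit")
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid, link_mtus):
            low = mid          # fits: try larger
        else:
            high = mid - 1     # would need fragmenting: try smaller
    return low

# Hypothetical path: plain Ethernet, a PPPoE hop, a tunnel, Ethernet again.
path = [1500, 1492, 1400, 1500]
print(discover_path_mtu(path))
```

The result is the smallest MTU along the path, since that link is the one that decides whether fragmenting is needed.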

TCP

TCP windows

TCP needs its packets to be acknowledged, as part of guaranteed delivery.

There is a maximum amount of unacknowledged data (correlates to unacknowledged packets) that can be in flight at one time.

This makes it a sliding window protocol with extra bookkeeping at both ends, which has some implications.


tl;dr:

  • if you want throughput, both windows need to be large enough. "Large enough" depends on the physical bandwidth and a typical connection's RTT, see BDP
  • the "just increase all the numbers" approach has its limits, in that this can cause things to stutter more on congestion (verify)
  • the autotuning we now have works well.
you may wish to increase the bounds it works in if you have higher-BDP links. Look at net.core.rmem_max, net.core.wmem_max, net.core.rmem_default, net.core.wmem_default, net.core.optmem_max, net.ipv4.tcp_rmem, net.ipv4.tcp_wmem


  • Some software may use its own too-small buffers, meaning larger windows won't help its speed
(includes old OpenSSH(verify), old Samba(verify))


The below focuses mostly on how the window, and its settings, could limit the throughput of individual TCP connections. Each connection has its own window, so enough separate connections will typically balance each other, and at some point combine to saturate your interface.


The send and receive windows

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


The transmit window stores packet data until it gets a corresponding ACK packet from the receiving end - and not earlier, should retransmission be necessary.

When full, the TCP connection's sending will stall until there is space in the window again.

A transmit window can be too small for a particular end-to-end connection, in the sense that the window fills up because acknowledgments come in steadily but relatively late, meaning sends stall for a reason other than having used all the bandwidth that is available.

Too large is less serious: it amounts to using that buffer as a queue for packets that are not yet in flight, which only takes more (kernel) memory than necessary. There's a little more to it than that, though.


The receive window (RWIN) stores incoming packets. The receiving end's network stack (in the kernel) needs to do two things: ACKnowledge the packet(s) to the sender, and move the data into the application that asked for it.


When the receive window is very small, it may not move out the data fast enough, which would mean the window fills up and further incoming packets can only be dropped, which from the sender's view means it stops ACKing, resulting in congestion control and retransmission -- which looks like a lower speed, stop-and-go flow, and wastes bandwidth.

Also, the application itself pulls that data, so if it spends a lot of time doing other things instead, this can effectively have the same effect as too small a receive window. That said, a lot of the time and for a lot of cases, applications can move it faster than the network can.


Real-world behaviour and speed

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Some relatively fixed things


Latency has some quantifiable bounds because of physics,

particularly the speed of light:
a round-trip time over the atlantic ocean can't do much better than ~30ms (assuming 4500km to cross, at the speed of light in vacuum).
most satellite uplinks will have an RTT of at least 250ms (for geostationary orbits; starlink's low earth orbit could in theory do that 20ms they no longer seem to aim for or promise, but expect at least 80ms average in practice, and there are other limits)
Also, the physics of cables - very roughly, assume that fiber and copper go at around 0.7 times the speed of light(verify).
Also, switching overhead is a thing (~=copying a packet from one table to another). It's fast, but not instant.
Also, broadband often seems to add 5-10ms to the RTT just getting down the street over a different medium.(verify)
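The physics bound is simple arithmetic. A sketch, where the 4500km distance and the 0.7 velocity factor are the rough figures from the notes above rather than measured values:

```python
C_KM_PER_S = 299_792.458   # speed of light in vacuum

def min_rtt_ms(one_way_km: float, velocity_factor: float = 1.0) -> float:
    """Lower bound on round-trip time over a straight path of the given length."""
    return 2 * one_way_km / (C_KM_PER_S * velocity_factor) * 1000

vacuum = min_rtt_ms(4500)        # light in vacuum
cable = min_rtt_ms(4500, 0.7)    # signal at ~0.7c in fiber or copper
print(f"{vacuum:.1f} ms in vacuum, {cable:.1f} ms at 0.7c")
```

So the ~30ms figure is the vacuum bound, and a cable at 0.7c already pushes it to roughly 43ms before any switching or detours are counted.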



Other quantification

There is an estimation based on bandwidth-delay product (BDP), a term which is literal:

link_bandwidth * round_trip_time

That multiplication is easily a large number when

link speeds are high (e.g. gigabit, fiber),
the delay is high (e.g. satellite's 250+ ms),
and especially both (e.g. satellite)
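A few worked numbers for that product, as a sketch. The link speeds and RTTs below are illustrative picks, not measurements.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: how much data fits 'in the pipe' at once."""
    return bandwidth_bits_per_s * rtt_s / 8

examples = {
    "gigabit LAN, 0.5 ms": bdp_bytes(1e9, 0.0005),
    "gigabit WAN, 100 ms": bdp_bytes(1e9, 0.100),
    "50 Mbit/s satellite, 250 ms": bdp_bytes(50e6, 0.250),
}
for name, value in examples.items():
    print(f"{name:<30} {value / 1024:10.0f} KiB")
```

A sender needs to keep about this much unacknowledged data in flight to fill the link, which is why the same gigabit link needs a roughly 200 times larger window over a 100 ms path than on a LAN.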

For end users, the internet has highish BDP -- sometimes termed a Long-Fat Network[3] -- because while bandwidth may be high throughout, end-to-end latency quickly becomes a limiting factor.


TCP windows, RTT, and bandwidth

The TCP window is roughly the amount of unacknowledged data, assumed currently in flight somewhere.

If this is limited, that limits the speed. For the same sized window, higher latency means slower acknowledgment, and you can imagine that means lower transfer speed.

For high-BDP links, the size of the TCP receive window easily keeps a connection's transfer slower than it could be.

Window size is communicated.

In TCP it originally was a 16-bit field used as-is, meaning a maximum of 64KiB per round trip, which roughly means

  • 250ms latency: 2MBit/s (≈250KB/s) (e.g. a classic satellite connection)
  • 150ms latency: 3.5MBit/s (≈400KByte/s)
  • 50ms latency: 10MBit/s (≈1.3MByte/s) (e.g. a not so good internet connection)
  • 15ms latency: 35MBit/s (≈4.5MByte/s) (e.g. a good internet connection)
  • 5ms latency: 100MBit/s (≈13MByte/s)
  • 0.5ms latency: 1000MBit/s (≈130MByte/s) (e.g. home LAN)
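The figures in that list follow from dividing the window by the round-trip time. A sketch that reproduces them (the function name is made up for this example):

```python
WINDOW_BYTES = 64 * 1024   # roughly the largest window without scaling

def max_throughput_mbit(window_bytes: int, rtt_ms: float) -> float:
    """At most one window of data can be delivered per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6

for rtt in (250, 150, 50, 15, 5, 0.5):
    print(f"{rtt:>5} ms: {max_throughput_mbit(WINDOW_BYTES, rtt):7.1f} Mbit/s")
```

This is a per-connection ceiling that holds regardless of how fast the link itself is.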


...which is why that 64K window was long ago worked around using TCP window scaling, a backwards-compatible option negotiated in the TCP handshake that lets you say "so when I say size X, I actually mean something bigger" (specifically, the advertised value shifted left by an agreed number of bits, i.e. multiplied by a power of two).

Most modern TCP stacks implement window scaling. (Some older routers don't understand window scaling very well, which can do more harm than good to your connection. These problems are now fairly rare, so you can usually safely enable this, and you won't see slowdown due to small TCP window sizes.)


"why not just increase that window? Wouldn't high window size always be better?"

Setting buffer/window sizes very high may help maximum speed. Yet setting sensible values for a specific link is often preferable.

If congestion or packet loss weren't a thing, then very high settings are harmless.

Congestion and packet loss are a thing, though. Different protocols make different tradeoffs. Around TCP, there is some tradeoff between high throughput, and consistency of responses under packet loss.

The best tradeoff around congestion is a whole thing, though - we are teetering close to some deeper network theory that consists of just a lot of jargon in a row.

Practically, though, around TCP, setting a realistic RWIN will often see the same throughput as a larger RWIN, but with more stable/predictable RTT jitter(verify), which can matter.

Also, you may want to react to loss/congestion. This is why receive-window auto-tuning is a good idea.


Delayed, Cumulative, and Selective ACKs

RWIN and the MSS

Linux tweaking practice

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Congestion avoidance

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


TCP slow start

https://en.wikipedia.org/wiki/TCP_congestion_control#Slow_start

Interesting TCP states

See TCP state diagrams: (google search).


For a quick and dirty summary of yours, per protocol (except for unix sockets):

netstat -pna | egrep '^(tcp|udp)' | \
  sed -r 's/[- ]{2,}/ /g' | cut -d ' ' -f 1,6 | sed -r 's/(udp).*/\1/' | \
  sort | uniq -c | sort -rn

Opening

TCP handshake
TCP: Possible SYN flooding on port 80. Sending cookies
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


The TCP handshake is c→s:SYN, c←s:SYN+ACK, c→s:ACK.

If a client sends only the initial SYN, but never sends that third step's ACK, then the server keeps these half-open connections around for a while.


A networking stack will place a limit to connections in that state, and once reached, denies new connections.

This typically happens for one of two reasons:

  • Very busy servers, seeing a very high amount of legitimate connections (particularly when most connections are also very short-lived)
you can consider
increasing net.ipv4.tcp_max_syn_backlog, the global max (verify), from the default 512 or 1024 (for memory-restricted machines) to e.g. 4096
increasing net.core.somaxconn, which seems to be the per-port max backlog (verify), from the default 128 to e.g. 1024
Keep in mind specific programs can request (/be compiled with?(verify)) another value
  • A SYN flood - sometimes a buggy program, but often someone intentionally sending a mass of such initial packets
...in the hope of effectively being a Denial of Service attack
SYN cookies are a useful way to deal with that.
It's a clever way to not queue all requests, while still being able to deal with real clients completing the handshake. (...and while it does not violate specs, it does have some caveats, which is why you would not want this as default behaviour, but it is a very nice fallback while under attack)
note that overly aggressive port scans can also do this
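The core trick of SYN cookies can be sketched as follows: instead of queueing state per SYN, the server derives its initial sequence number from the connection's addresses plus a secret, and recomputes it when the final ACK arrives. This is a simplified illustration, not the Linux implementation: real cookies also squeeze things like the MSS into those 32 bits, and the secret, counter, and function names here are hypothetical.

```python
import hashlib
import hmac
import struct

SECRET = b"hypothetical-server-secret"

def _cookie(src, sport, dst, dport, client_isn, counter):
    msg = f"{src}:{sport}>{dst}:{dport}|{client_isn}|{counter}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return struct.unpack("!I", digest[:4])[0]   # 32 bits, like a sequence number

def make_syn_ack_seq(src, sport, dst, dport, client_isn, counter):
    """On SYN: derive our initial sequence number, and store nothing."""
    return _cookie(src, sport, dst, dport, client_isn, counter)

def ack_is_valid(src, sport, dst, dport, client_isn, ack_number, counter):
    """On the final ACK: recompute the cookie and compare."""
    expected = (ack_number - 1) & 0xFFFFFFFF
    return any(
        _cookie(src, sport, dst, dport, client_isn, c) == expected
        for c in (counter, counter - 1)   # tolerate one tick of the slow counter
    )

conn = ("198.51.100.7", 51000, "203.0.113.5", 80, 12345)
seq = make_syn_ack_seq(*conn, counter=100)
real_client = ack_is_valid(*conn, ack_number=(seq + 1) & 0xFFFFFFFF, counter=101)
stale_or_forged = ack_is_valid(*conn, ack_number=(seq + 1) & 0xFFFFFFFF, counter=105)
print(real_client, stale_or_forged)
```

A client that never completes the handshake costs the server only one hash computation, while a real client's ACK carries enough information to rebuild the connection.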

Closing

Notes:

  • Closing a connection is a cooperative thing between both sides.
in other words, TCP requires both ends to agree for the connection to be completely closed
  • You get TIME_WAIT if the local side closes first, and CLOSE_WAIT if the remote end closes first.
  • A half-closed connection (only one side closed) is a perfectly valid state


CLOSE_WAIT

See e.g. this diagram


CLOSE_WAIT is a specific stage in that.

Basically:

The other end has sent us a FIN, to signal it considers the connection closed - an "I have no more data to send"
Our side has acknowledged this one-directional close.
Our side has not (yet) decided to close from this side.


In most real-world cases, most protocols on top

will say how to explicitly agree to close (e.g. might have sent data signifying a close to the server) and now send some form of "okay, bye now, you can close too" message

AND/OR

will, via protocol specs, say when a close should be done by implication (see e.g. details to HTTP 1.0 and 1.1 (non)-persistent connections).


...if not, then our side still has a connection open.

At TCP level such a one-way connection is perfectly valid, sometimes even useful.

So the TCP stack will not close it automatically - there are no related TCP timeouts.


If intentional, both sides will still have the socket, and we can continue sending on it in the leftover direction.
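A small sketch of such an intentional one-way connection over loopback TCP, using only the standard socket module. The variable names are made up for the example; 'ours' plays the side that ends up in CLOSE_WAIT.

```python
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
listener.listen(1)

remote = socket.create_connection(listener.getsockname())
ours, _ = listener.accept()

remote.shutdown(socket.SHUT_WR)          # remote sends FIN: "no more data from me"
eof = ours.recv(1024)                    # an empty read signals the remote's close

# Our socket is now in CLOSE_WAIT, and the leftover direction still works.
ours.sendall(b"still here")
reply = remote.recv(1024)
print(eof, reply)

ours.close()
remote.close()
listener.close()
```

Until our side calls close(), the kernel keeps the socket around indefinitely, which is exactly the state that piles up when a program forgets to do so.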

If this is not intentional, then the other side has left, but our local side thinks the connection is merely one-sided, and will realize it's dead only once we next try to send on it, at which point the other side's network stack says "erm, that socket doesn't even exist", which might be microseconds or weeks later.


In some cases you may know a specific CLOSE_WAIT connection will not communicate, such as when the protocol on top is purely query-response (like HTTP 1.x), but it's up to that protocol to be proper about connection closes. It is effectively a minor flaw in the protocol specs to not have specified that that connection close happen as early as sensible for it.


Seeing a small fraction of CLOSE_WAIT sockets may be okay, simply because even when the protocol deals with it some way or other (specced timeouts, heartbeat packets, etc), it may take some time. Even when well defined and short, having a lot of connections means you'll always have a few in this state, and that's expected.

You may consider increasing the maximum open sockets, but this is otherwise fine.


Seeing many sockets in CLOSE_WAIT usually points to them not being cleaned up within sensible time, and you should complain to the respective programmer.


TIME_WAIT

A stage in a closing TCP connection. See e.g. this diagram


Our side initiated the connection close.

After that, the socket is intentionally kept in the TIME_WAIT state before being removed, officially for twice the Maximum Segment Lifetime (MSL), where the MSL is usually something like 120, 90, 60, or 30 seconds.


The purposes seem to be:

  • avoid duplicate segments to be accepted from an unrelated connection
that takes a relatively pathological situation:
same socket, in the 5-tuple sense: (proto, local_addr, local_port, remote_addr, remote_port).
where at least one of those port choices is likely to come from the host's ephemeral ports pool (so comes from an incremental or random choice in a range of thousands or tens of thousands)
in practice the same sequence number as well(verify)
making it statistically very unlikely to happen
...but is technically possible to happen accidentally, and a little likelier to do purposefully, so the network stack plays it rather safe by default.
  • (also to discard segments that come in after we close a connection, but that's implied by there not being a connection (verify))
  • avoid the other end sticking in a half-finished teardown for a while (verify)



Good network stacks can easily deal with thousands of such sockets in soon-to-disappear states with barely measurable performance differences(verify), and a socket takes on the order of 1.5KB in memory, so these TIME_WAIT sockets usually have very little impact on resources or performance.


If you expect to see so many short term connections that you can run into some limits, you can:

  • allocate more ephemeral ports, see /proc/sys/net/ipv4/ip_local_port_range
...some systems seem to allocate ~4k by default, others ~30k. If the latter is not enough, think about what you're doing.
  • tell the OS to remove these sockets faster
...but be aware that this is a game of weighing probability with likely client behaviour.
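On linux, you can set the time a socket will linger by setting tcp_fin_timeout (note: this is not the MSL itself; the kernel documentation describes it as how long orphaned connections stay in FIN_WAIT_2, so it may not shorten TIME_WAIT as such):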
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
sysctl -w net.ipv4.tcp_fin_timeout=30
  • tell the OS TIME_WAIT sockets can be reused immediately under certain near-safe conditions
e.g. linux can be told to do so only when the new timestamp is bigger than the most recent timestamp - as that avoids the duplicate-segment case
See net.ipv4.tcp_tw_reuse
however, some load balancers and firewalls are known to expect more compliant behaviour, and may reject such reuse (See also RFC 1122)


Effect on restarting services

If you have a daemon listening on a specific port, and that was left in TIME_WAIT, this amounts to having to wait until you can restart a server on the same port.

Note that if a listening socket has no connections, the port can be safely closed and re-used immediately by the same sort of server. Take a look at SO_REUSEADDR (basically, if the local_address/port combination you try to bind to is in TIME_WAIT, you can reuse it. If it is in another state (probably 'in use'), it will still fail).
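Setting that option is a single call before bind(). A minimal sketch with the standard socket module; port 0 is a placeholder so the example runs anywhere, where a daemon would use its fixed port.

```python
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Must be set before bind() to have any effect on that bind.
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

server.bind(("127.0.0.1", 0))
server.listen(5)
reuse_enabled = server.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
print("SO_REUSEADDR enabled:", reuse_enabled)
server.close()
```

Most server frameworks set this for you, which is why many daemons restart without the "address already in use" wait.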

See also http://hea-www.harvard.edu/~fine/Tech/addrinuse.html


TCP_NODELAY and Nagle's Algorithm

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Without Nagle's algorithm, each TCP socket write() becomes a packet on the network (or, if larger than a single packet can be, multiple packets).


This means data is sent as soon as possible, which is great for interactive things.

For example, remote shells (like SSH) and game netcode probably want Nagle off.


The worst case behaviour is sort of bad, though. If you manage to write() single bytes, then every TCP packet will contain 1 byte of payload, wrapped in over forty bytes of networking header, and each packet needs its own acknowledgement packet as well.

If these packets happen at the rate that you hit your keyboard at, then no one much will notice or care. However, if a program sends a lot of data in unnecessarily small chunks, not having Nagle is somewhat inefficient:

  • it can lead to unnecessarily high NIC interrupt load (and implied CPU load, probably kernel time)
which is one among a few reasons you may never reach the top NIC speed
  • you can more easily congest (e.g. oldschool POTS modems could congest more easily even when you're barely sending any (user) data. ...but this is rarely relevant anymore)
  • you can slow down just your own connection due to limited amount of packets in flight (read up on ACKing strategies, TCP windows)
which also lowers your transfer speeds
  • switches doing more routing decisions (leads to slightly higher latency), more packet overhead (sizewise), and some other overheads

Note that most of these effects are small.


With Nagle's algorithm, small writes are coalesced into larger packets - also meaning fewer packets (and fewer ACKs).

Nagle is basically a fix for network code that uses smaller writes than it should. It can actually increase bandwidth for moderately bulky transfers, but increases latency for interactive stuff.

It's on by default.


It does bother many realtime systems, which is why you might want to turn it off.

It can also interact badly with delayed-ACK mechanisms - it apparently causes high latency in some circumstances.(verify)




More specifically

Nagle's algorithm says you may not send a runt if you currently have unacknowledged data in flight.

That definition means it must wait until either

all data in flight is ACKnowledged (this also acts as a sort of timeout)
it has enough data that it couldn't cram more into a single packet (relates to MSS)

This means a mass of small writes generally get coalesced into a single packet, and isolated small writes get written soon enough.
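That rule fits in a few lines. A sketch of just the send decision, ignoring timers, window limits and everything else a real stack considers; the function name and parameters are invented for the example.

```python
def nagle_flush(buffered: int, mss: int, unacked_in_flight: bool) -> int:
    """Return how many of the buffered bytes may be sent right now."""
    full_segments = (buffered // mss) * mss
    if full_segments:
        return full_segments      # full-sized packets can always go
    if buffered and not unacked_in_flight:
        return buffered           # a runt may go only when nothing is unACKed
    return 0                      # otherwise: hold it and keep coalescing

print(nagle_flush(1, 1460, unacked_in_flight=False))     # isolated small write
print(nagle_flush(1, 1460, unacked_in_flight=True))      # small write behind unACKed data
print(nagle_flush(3000, 1460, unacked_in_flight=True))   # bulk data
```

In the bulk case two full segments go out and the 80-byte remainder waits, which is the "last packet delayed" effect.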


Note that:

  • a one-sided transfer, e.g. a download, will have its last packet delayed longer than the rest (until the ACKs happen)
benchmarkers beware.
  • Nagle is not necessary for any app which is aware it should send large packets
e.g. those that send things in larger chunks
  • Nagle is counterproductive for latency-sensitive things
if small writes must arrive as soon as possible, the app will probably disable it.
TCP_NODELAY disables Nagle's algorithm
  • a good example of the previous is remote shells
yes, they may send single bytes for keypresses
Yet it's the fastest way to get that key acknowledged and on the screen,
and the latest network device that cared about rate of packets at typing speed was roughly the 14.4K POTS modem.(verify)
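Disabling Nagle is a per-socket option. A minimal sketch with the standard socket module, on a socket that is never connected, just to show the call:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

nagle_off_before = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)   # disable Nagle
nagle_off_after = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0

print(nagle_off_before, nagle_off_after)
sock.close()
```

The first value shows the default (Nagle on, so TCP_NODELAY off). Programs that set this should then take care to do their own batching of writes.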


Note that even in the cases where it helps in general, the last packet will tend to sit on the sending side longer than the rest - until the previous gets ACKed(verify). Which is soon enough (longer with delayed ACKs, see below).







UDP

https://en.wikipedia.org/wiki/Internet_Group_Management_Protocol

Connection errors

Connection refused