Networking notes - IP related



DHCP notes

See also:

  • RFC 1541, Dynamic Host Configuration Protocol (obsoleted by RFC 2131)
  • RFC 2131, Dynamic Host Configuration Protocol


There is a POSIX socket API, evolved from the earlier Berkeley (BSD) sockets API.

Technically it was a standardization of an early TCP/IP interface, one that also allowed a wider set of protocols.

other sockety things

Local sockets / unix domain sockets


BSD calls it PF_UNIX, which is #define'd to the same value as AF_UNIX / AF_LOCAL.

Both ends of such a socket connect, via code[1], to what looks like a file on the filesystem.

On the code side it presents a fairly normal socket interface - you can e.g. send() and recv()

Security is mostly the filesystem permissions to that socket file.

Since this is essentially its own tiny subsystem, it fundamentally cannot be network-routed. Since it does not involve IP, there are no ports.

Note this is distinct from localhost IP, which is just a special case - but still within the network stack.

While this is also not routable, that's because of the network stack refusing to do so - but since that network stack is still involved, local sockets are simpler and a little faster in comparison.

This makes them useful for local IPC (both because it's fast and you can't accidentally open it up to the network with the wrong listen() call)

Note that it is also distinct from named pipes.
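As a rough illustration of the points above, here is a minimal Python sketch of a local socket round trip. The function name and the socket file name are made up for this example, and both ends live in one process only to keep it short.

```python
import os
import socket
import tempfile


def unix_echo_roundtrip(payload: bytes) -> bytes:
    """Send payload over a unix domain socket and return what gets echoed back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "demo.sock")  # hypothetical socket file name
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(path)        # this creates the socket file
            os.chmod(path, 0o600)    # security is the file's permissions
            server.listen(1)
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
                client.connect(path)         # a path, not an IP and port
                conn, _ = server.accept()
                with conn:
                    client.sendall(payload)  # the usual send()/recv() interface
                    conn.sendall(conn.recv(4096))
                    return client.recv(4096)
```

Note that nothing here names an IP address or port, so there is no listen() call that could expose it to the network.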


Abstract sockets

Abstract sockets are local sockets (AF_LOCAL/AF_UNIX, see previous section) but in the special-cased abstract namespace.

The abstract namespace basically means that it lets you bind sockets to names that are managed via the kernel, rather than filenames on a filesystem.

The way this is implemented in linux, you create a local socket in the usual AF_LOCAL way, but hand in a path string that starts with a null byte.

So basically, you bind() or connect() to something like "\x00myname"


Upsides:

  • don't have to avoid existing socket files on filesystem
  • don't have to clean up socket files (kernel removes them when last reference closes)

Downsides:

  • no permissions
  • linux-only

This makes them useful for IPC on linux, but only when trusting everything on the host.
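A minimal sketch of the null-byte convention, in Python and assuming a linux host. The socket name and function name are invented for this example.

```python
import os
import socket


def abstract_socket_roundtrip(payload: bytes) -> bytes:
    """Pass payload through a socket bound in linux's abstract namespace."""
    # The leading null byte selects the abstract namespace (linux-only).
    # The rest of the name is made up; the pid just keeps it unique.
    address = b"\0example-abstract-%d" % os.getpid()
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(address)      # no file appears on the filesystem
        server.listen(1)
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            client.connect(address)
            conn, _ = server.accept()
            with conn:
                client.sendall(payload)
                return conn.recv(4096)
    # nothing to unlink: the kernel drops the name when the last reference closes
```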

Support: CHECKME

at the very least 2.6 kernels, and enabled.
recent kernels can be assumed to have it(verify), namespaces have been important for a while.


Named sockets are an ambiguous name, in that some people use them to refer to abstract sockets, some to named pipes.

Named pipes

A special kind of filesystem entry.

Similar to domain sockets in that they are a filesystem thing.

Different in that they do not present a socket interface at all, just a FIFO-style bytestream, to whatever other process also opens it.

Created with mkfifo
(syscall and utility share the same name)
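A small Python sketch of the FIFO behaviour, assuming the call meant here is mkfifo (exposed in Python as os.mkfifo). The file name is made up, and the read end is opened non-blocking only so that a single process can hold both ends.

```python
import os
import tempfile


def fifo_roundtrip(payload: bytes) -> bytes:
    """Write payload into a named pipe and read it back out."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "demo.fifo")  # hypothetical name
        os.mkfifo(path, 0o600)                    # the special filesystem entry
        # Opening a FIFO normally blocks until the other end is opened too;
        # O_NONBLOCK on the read end avoids that in this one-process demo.
        read_fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
        write_fd = os.open(path, os.O_WRONLY)
        try:
            os.write(write_fd, payload)           # plain bytestream, no socket API
            return os.read(read_fd, 4096)
        finally:
            os.close(write_fd)
            os.close(read_fd)
```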

Addresses and nets

IPv4 network notation

The following are the same:

  • 192.0.2.0/24 (CIDR notation)
  • 192.0.2/24
  • 192.0.2.

192.0.2/24 is an example of CIDR without trailing zeroes. Similarly, 10/8 means 10.0.0.0/8.

192.168. or 127. is another lazy variant, but less powerful as it's basically a start-of-string match, and implicitly octet-boundary-only.

Applications vary in which they support. Usually they stick to the first one or two.

These examples all mention masks/subnets split at whole octets. When you refer to subnets, you say how many (leftmost) bits matter to its definition.

Depending on what you are configuring, you can specify both a (sub)net address and a netmask, for example a /29 net whose address ends in .168. The application you hand this to may require the IP to have its rightmost (host) bits cleared, as in that example (/29 leaves a 3-bit host part, and 168 is 10101000, so its last three bits are zero). The rest of the bits are used by things on that net, so don't matter to the net specification.
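To make the notations concrete, here is a hedged Python sketch. Python's ipaddress module only accepts the full forms, so the helper below (a made-up name, not a library function) pads the lazy variants first. The addresses come from the documentation range.

```python
import ipaddress


def parse_short_net(text: str) -> ipaddress.IPv4Network:
    """Accept '192.0.2.0/24', '192.0.2/24', '10/8' and '192.0.2.' style input."""
    address, _, prefix = text.partition("/")
    octets = [part for part in address.split(".") if part]
    if not prefix:
        # start-of-string style: the mask is implied by the octets given
        prefix = str(8 * len(octets))
    octets += ["0"] * (4 - len(octets))
    # strict=False clears any set host bits instead of raising ValueError
    return ipaddress.ip_network(".".join(octets) + "/" + prefix, strict=False)
```

With strict=True (the default in ipaddress) an address with host bits set is rejected, which is the "may require the rightmost bits cleared" behaviour mentioned above.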

IPv4 special addresses

Main relevant RFCs:

  • RFC3330, 'Special-Use IPv4 Addresses' (refers to most other RFCs mentioned here)
  • RFC1918, 'Address Allocation for Private Internets'
  • RFC1700, 'Assigned Numbers'

Well known special cases (see mainly RFC 1700 for these):

  • 255.255.255.255 is for 'limited broadcast', in that it should not be routed between nets. (technically in 240/4, see below)
  • 0.0.0.0 is regularly used with a meaning like 'any local IP' (a sort of 'this host'), particularly when configuring what IP(s)/adapters a service should listen on.
  • 127/8: Anything to this network should go to the local host, without going to a network card.
    It is common convention to configure 127.0.0.1 on a virtual loopback device.
    This net is typically used for local-only uses such as IPC and some services.
    Local IPC can also be served via domain sockets, which you can use much like network sockets but which are cheaper than putting things on the full network stack, since the kernel knows beforehand it's local-only.

Private networks

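The private network ranges are defined by RFC 1918:

  • 10.0.0.0/8, that is, 10.0.0.0 through 10.255.255.255
  • 172.16.0.0/12, that is, 172.16.0.0 through 172.31.255.255
  • 192.168.0.0/16, that is, 192.168.0.0 through 192.168.255.255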

These are private in that these will not be routed onto the public internet.

This means you can use these networks at will, with no fear of conflicting with anyone else who does.
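A short Python sketch that checks membership in the three RFC 1918 ranges and counts the addresses they cover. The helper name is made up for this example.

```python
import ipaddress

# The three private ranges from RFC 1918
RFC1918_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """True if the address falls inside one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in RFC1918_NETS)


# Total number of addresses in the three ranges together
TOTAL_PRIVATE = sum(net.num_addresses for net in RFC1918_NETS)
```

The explicit list is used rather than ipaddress's is_private attribute, because that attribute also covers other reserved ranges.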

To get hosts on IPv4 private nets to talk to the internet, you will need a gateway (which at home will be your modem).

To the rest of the world, you look like you come from the public IP of your gateway (e.g. modem).

They were defined to delay the exhaustion of public IPv4 addresses. While these ~17.9 million addresses cannot exist on the internet, hosts on a private net don't count towards the use of public IP addresses.

Which is why private nets are very commonly used for the home LAN behind your modem, and in other cases where you care that hosts can reach the internet, but the internet doesn't need to initiate connections with you directly.


  • Some of the above statements have exceptions - see e.g. the subnets used in VPN, what VLANs do to subnets, and more.
  • This is mostly an IPv4 implementation concept, in IPv6 it's a different story.

Other special cases

Other special cases:

  • 240/4 (240.0.0.0 through 255.255.255.255): "Reserved for future use"
  • 192.0.2/24: intended for examples in documentation and code, and should not be used on the internet (nor should it be confused with 192.168/16)
  • 198.18/15: to be used for bandwidth tests (see RFC2544)

There are various ranges reserved by IANA but not used (and, sometimes, unlikely to be used for the purpose they seem likely to be reserved to).

This includes 5/8 (which hamachi used) and various others, apparently including 128/8, 191.255/16, 192.0.0/24, 223.255.255/24 and the already-mentioned 240/4.


Link-only network

The 169.254/16 link-only network is an unroutable network described by RFC3927.

(Microsoft has its own name for it, Automatic Private IP Addressing (APIPA))

These addresses are assigned when a host configured to use DHCP doesn't get a response. (Note: it may keep trying to reach DHCP while such an IP is configured, it may not).

The idea is roughly that every computer with this behaviour will end up on the same subnet (and it being a random choice in a 64K-large range means they will usually not pick the same IP), and so can communicate with others on the same segment but no further (this subnet is non-routable), and that this could be more useful than nothing.

In practice it often isn't that useful, but I've seen it used to copy a few files.


Bogons and martians

Bogons refer to packets that claim to come from networks that are currently unused (bogon space).

Martians are packets that claim to come from unroutable nets.

Both are good indicators that the traffic is spoofed, or sometimes that a router is very dumb or malfunctioning.


Subnetting(, routing)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Routed networking is largely concerned not with nodes or hosts, but with the network ranges they are on: if routers can put a packet on the net that the target host is on, switches and hosts themselves take care of the last step.

An IP (sub)network is often written/stored as an address combined with a mask, which together imply the range of addresses it covers.

From such a (IP_on_a_subnet, subnet_mask) pair you can derive the size of the net, the network address, the broadcast address, and the addresses that may be hosts.

(When referring to a network, you usually use the network address as the IP in that IP/mask pair, but in a lot of practical cases any IP in that range will work equivalently, exactly because network code can derive the network address with just one bitmask operation.)
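The derivation can be sketched with plain bit operations. This is a minimal Python illustration; the function name and the example addresses are my own, not from the text above.

```python
import ipaddress


def describe_net(ip: str, mask: str) -> dict:
    """Derive network, broadcast and sizes from any (IP_on_a_subnet, subnet_mask) pair."""
    ip_int = int(ipaddress.IPv4Address(ip))
    mask_int = int(ipaddress.IPv4Address(mask))
    network = ip_int & mask_int                      # the one bitmask operation
    broadcast = network | (~mask_int & 0xFFFFFFFF)   # all host bits set
    return {
        "network": str(ipaddress.IPv4Address(network)),
        "broadcast": str(ipaddress.IPv4Address(broadcast)),
        "size": broadcast - network + 1,
        # network and broadcast addresses are not usable as hosts
        "usable_hosts": max(broadcast - network - 1, 0),
    }
```

For example, describe_net("192.0.2.77", "255.255.255.192") puts that host on 192.0.2.64 with broadcast 192.0.2.127.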

There are different equivalent ways to refer to a network:

  • address plus netmask (a little more reflective of the network stack logic)
  • address plus /number, i.e. CIDR (equivalent, shorter for us to write)

The /number refers to how many leftmost bits are set (Technically, consecutive set bits are not strictly necessary in (sub)networks, but there is more confusion than added value to non-consecutive-bit network masking).

You can also mention them by start and end address, although this is more for reporting than anything else. It can be convenient for human checks when subnets don't start or end on byte edges (anything except /0, /8, /16, /24, and /32). For example, an explicit 'start through end' range makes more immediate sense to me than the equivalent address/mask pair.

When splitting a given net into subnets, you could think of it as using the bits for three different things: the existing network, the new subnetwork, and the host bits.

Note: this is only an aid to modelling, as there is no hard difference between the network and subnetwork bits, but I'd say it helps.

For example, if you take the private network 192.168.0.0/16 (192.168.0.0 through 192.168.255.255) and want to use it as 256 separate /24 networks (192.168.0.0/24, 192.168.1.0/24, 192.168.2.0/24, etc.), on a bit level you can think of it as:

nnnnnnnn nnnnnnnn ssssssss hhhhhhhh
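A Python sketch of that split, assuming the /16 in question is 192.168.0.0/16. The split_fields helper is a made-up name that just cuts an address along the n/s/h layout shown above.

```python
import ipaddress

# Split the /16 into its 256 /24 subnets
net = ipaddress.ip_network("192.168.0.0/16")
subnets = list(net.subnets(new_prefix=24))


def split_fields(address: str) -> tuple:
    """Return the (network, subnet, host) bit fields of an address in this layout."""
    value = int(ipaddress.IPv4Address(address))
    return value >> 16, (value >> 8) & 0xFF, value & 0xFF
```

So 192.168.5.77 is host 77 on subnet number 5, that is, on subnets[5].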

Note that if one host has the right IP but the wrong netmask, this may go weird. Directed traffic may still work, but that host's broadcasts will probably not.

The earlier note about "routing is towards networks, not nodes" is true mostly because of what routing tables contain. What they contain also implies what sort of address the source/target is. For example, a routing table may look something like:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
                                                U     0      0        0 eth0
                                                U     0      0        0 eth1
                                                U     0      0        0 ham0
                                                U     0      0        0 lo
                                                UG    0      0        0 eth0

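When deciding to what interface a packet should go, the packet's destination IP is tested against each entry (look at destination/genmask). The most specific entry that applies (the one with the longest mask) is used; since tables are usually listed most-specific-first, this tends to look like "the first rule that applies".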

The last entry is often the default gateway, in an "anything that wasn't matched by the above is routed here" sort of rule.

This is handy in home networks, because that will include all of the public internet, and you know that your gateway will be doing the right thing (NAT) on your behalf.
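The lookup can be sketched as follows. This is a toy Python model with a made-up table (the nets, gateway and interface assignments are assumptions for illustration), picking the most specific matching entry, which for a table ordered most-specific-first is the same as the first match.

```python
import ipaddress

# (destination net, gateway or None for directly attached, interface); all made up
ROUTES = [
    ("192.168.1.0/24", None, "eth0"),
    ("10.0.0.0/8", None, "eth1"),
    ("127.0.0.0/8", None, "lo"),
    ("0.0.0.0/0", "192.168.1.1", "eth0"),  # default gateway: matches anything
]


def lookup(destination: str, routes=ROUTES):
    """Return (interface, gateway) for a destination, or None if nothing matches."""
    ip = ipaddress.ip_address(destination)
    candidates = []
    for net_text, gateway, iface in routes:
        net = ipaddress.ip_network(net_text)
        if ip in net:  # same test as (ip & genmask) == destination
            candidates.append((net.prefixlen, iface, gateway))
    if not candidates:
        return None
    _, iface, gateway = max(candidates, key=lambda entry: entry[0])
    return iface, gateway
```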



Switches, routers, hubs, bridges, etc.




NAT itself refers to translating addresses. It can be a somewhat confusing subject.

  • Full NAT is stateless, and somewhat like routing except that it rewrites packets in order to make networks appear differently
  • SNAT is stateful, and does half of full nat, rewriting source addresses/ports
  • DNAT is stateful, and does the other half, rewriting target addresses/ports.

(So yes, full stateless NAT can be imitated with two stateful NATs)

Note that you also need ip forwarding on for most of this, and some distros have it off by default, in which case you need to:

echo 1 > /proc/sys/net/ipv4/ip_forward

...or, to make it stick, set net.ipv4.ip_forward=1 in whichever configuration file your system applies at bootup (often /etc/sysctl.conf).

SNAT and masquerading

In practice, SNAT is often used to assist sharing an internet connection in a home network.

In this case, there is one gateway host that does the SNATting. It takes connections from the local net, rewrites the source IP so that the response will come back to it. Since it remembers where the connection came from, the data it gets back can easily be sent where the connection actually originated (which is a DNAT step implicit in every SNAT: The packet's destination is rewritten back to the local net (iptables remembered this) and the packet gets back to the original node).
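The remembering can be modelled with a small table. This is a toy Python sketch, not how iptables/conntrack is actually implemented: real connection tracking keys on the full connection tuple, and the class name, port numbering and addresses here are all made up.

```python
class ToySnat:
    """Toy model of a SNATting gateway: rewrite outbound, un-rewrite replies."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.by_internal = {}      # (internal ip, internal port) -> public port
        self.by_public_port = {}   # public port -> (internal ip, internal port)

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """The SNAT step: replace the source with our public IP and a port of ours."""
        key = (src_ip, src_port)
        if key not in self.by_internal:
            self.by_internal[key] = self.next_port
            self.by_public_port[self.next_port] = key
            self.next_port += 1
        return (self.public_ip, self.by_internal[key], dst_ip, dst_port)

    def inbound(self, src_ip, src_port, dst_ip, dst_port):
        """The implicit DNAT step: point a reply back at the original node."""
        internal = self.by_public_port.get(dst_port)
        if dst_ip != self.public_ip or internal is None:
            return None            # no remembered state, so nowhere to send it
        return (src_ip, src_port, internal[0], internal[1])
```

The None case is also why, without extra rules, the outside cannot initiate connections to hosts behind such a gateway.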

(Where PUBLIC_IP is the SNATting node's outside, public IP):

iptables -t nat -A POSTROUTING -o ppp0 -j SNAT --to-source $PUBLIC_IP

Masquerade, also called dynamic NAT (sometimes dNAT, but usually not, since it's too easily confused with DNAT) is a variant of using SNAT this way, meant for interfaces whose IP can change. It automatically picks the interface's current IP for the new source IP. The following is similar to the SNAT rule above.

iptables -t nat -A POSTROUTING -o ppp0 -j MASQUERADE


Using DNAT by itself means rewriting things that come in.

It is probably mostly used for port forwarding: if you e.g. have a routing DSL modem and want to connect to an internal computer's remote desktop, web server, or whatever else, use rules like:

iptables -t nat -A PREROUTING -d $externalip -p tcp -m tcp --dport 3389 -j DNAT --to-destination $internalip:3389
iptables -t nat -A PREROUTING -d $externalip -p tcp -m tcp --dport 80   -j DNAT --to-destination $internalip:80

...assuming said modem takes such rules somehow, or is connected to a computer that does so for it. The logic's sound.

You can also use DNAT to p



UPnP, IGD, NAT-PMP, etc.


Two different protocols with the same intent: negotiating automatic port forwarding, so that manually configuring modem DNAT for port forwarding becomes unnecessary.

UPnP implements IGD and is Microsoft's implementation, NAT-PMP is Apple's. The latter came later(verify) and at least initially seemed somewhat restricted to Apple products, and (perhaps so) UPnP is better supported.(verify)


IPv6 notes

If it's existed for 20 years, are we using it yet?

IPv6 address notation

IPv6 addresses are 128-bit numbers. RFC 1884 specified that everything should recognize the following forms:

  • full form: eight chunks of 16-bit hexadecimal numbers, like
  • abbreviated form, based on the two rules that:
    • leading zeroes in a block can be removed. 0000 can be 0, 0123 can be 123, etc.
    • :: means "fill with 0000 blocks" which can appear at any point (beginning, middle, end), but at most once (and then typically/always the leftmost(verify)), to avoid ambiguities
  • alternative form: a convenience form in which you specify the last two 16-bit blocks as four decimal numbers (for specifying IPv4 addresses in IPv6 notation) while leaving the rest as hex:
  which can be abbreviated as
  and is also just another way of writing
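The three forms can be tried out with Python's ipaddress module. The addresses below are my own examples from the documentation ranges, not necessarily the ones this page originally showed.

```python
import ipaddress

# full form and abbreviated form of the same address
full = ipaddress.ip_address("2001:0db8:0000:0000:0000:0000:0000:0001")
short = ipaddress.ip_address("2001:db8::1")

# alternative form: last two 16-bit blocks written as a dotted IPv4 address
mapped_dotted = ipaddress.ip_address("::ffff:192.0.2.1")
mapped_hex = ipaddress.ip_address("::ffff:c000:201")   # 192.0.2.1 is c0.00.02.01


def is_valid_ipv6(text: str) -> bool:
    """True if text parses as an IPv6 address."""
    try:
        ipaddress.IPv6Address(text)
        return True
    except ValueError:
        return False
```

A second :: in one address is rejected, since it would be ambiguous how many zero blocks each one stands for.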

IPv6 special addresses and ranges

IP/Network           Purpose
(abbreviated, CIDR)  
-------------------- --------------
::/8                 Unassigned, but contains special addresses:
::1/128              * loopback
::/128               * unspecified/any  (all zeroes, like 0.0.0.0 in IPv4)
::/96                * IPv4 compatible addresses (obsolete)
::ffff:0:0/96        * IPv4-mapped addresses

FE80::/9             Private ranges:
FE80::/10            * link-local unicast 
                       (similar to IPv4's 169.254/16)
FEC0::/10            * site-local unicast
                       (similar to IPv4's 10/8, 172.16/12 and 192.168/16)

FC00::/7             Unique Local Addresses, similar in purpose to FEC0::/10
                     This includes two ranges, FC00::/8 and FD00::/8, 
                     which can be used for /48 subnets 
                     that are only site-routable
                     (40 bits are to be filled in randomly, to avoid 
                      collision should such nets ever merge)
2002::/16            6to4

FF00::/8             Multicast

0200::/7             Reserved for NSAP address allocation

...and a number of unassigned ranges.
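As an illustration of the Unique Local Address layout in the table, here is a Python sketch that builds a /48 under FD00::/8 with 40 random bits. It uses plain random bits rather than any particular generation algorithm, and the function name is made up.

```python
import ipaddress
import secrets


def random_ula_prefix() -> ipaddress.IPv6Network:
    """Build a /48 of the form fd + 40 random bits, leaving the rest zero."""
    global_id = secrets.randbits(40)
    # 8 prefix bits (0xfd) + 40 random bits = 48 bits; what remains of the
    # 128 is 16 bits for subnets and 64 bits for hosts
    value = (0xFD << 120) | (global_id << 80)
    return ipaddress.IPv6Network((value, 48))
```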


IPv6 addresses may be 128-bit, but the amount of things you hand out is actually more like 48-bit

IPv6 general use addresses and ranges

IPv4-mapped addresses, 6to4, and Teredo

Connection errors

Connection refused


TCP windows

TCP needs its packets to be acknowledged, as part of guaranteed delivery.

This makes it a sliding window protocol with extra bookkeeping at both ends.


  • if you want throughput, both windows need to be large enough. "Large enough" depends on the physical bandwidth and a typical connection's RTT, see BDP
  • the "just increase all the numbers" approach has its limits, in that this can cause things to stutter more on congestion (verify)
  • the autotuning we now have works well.
    You may wish to increase the bounds it works in if you have higher-BDP links. Look at net.core.rmem_max, net.core.wmem_max, net.core.rmem_default, net.core.wmem_default, net.core.optmem_max, net.ipv4.tcp_rmem, net.ipv4.tcp_wmem
  • Some software may use its own too-small buffers, meaning larger windows won't help its speed
    (includes old OpenSSH(verify), old Samba(verify))

The below focuses mostly on how the window, and its settings, could limit throughput of individual TCP connections. Each connection has its own window, so enough separate connections will typically balance each other, and at some point combine to saturate your interface.

The send and receive windows


The transmit window stores packets until it gets a corresponding ACK packet from the receiving end - and not earlier, should retransmission be necessary.

When full, the TCP connection will stall until there is space in the window again.

Too small a transmit window (for a particular end-to-end connection) is the case where the window fills up due to higher latency, so the connection stalls and ends up using less than the bandwidth there actually is.

Too large is less serious: If the transmitting side stores more packets than will be in flight, this takes more (kernel) memory than necessary, and if things stutter, they may stutter slightly harder (why?(verify)), though rarely in a serious way.

The receive window (RWIN) stores incoming packets. The receiving end's network stack (in the kernel) needs to do two things: acknowledge the packet(s), and move the data into the application layer.

When the receive window is very small, it may not move out the data fast enough, which would mean the window fills up and further incoming packets can only be dropped, which from the sender's view means it stops ACKing, resulting in congestion control and retransmission -- which looks like a lower speed, stop-and-go flow, and wastes bandwidth.

Slightly too small can limit transfer speed somewhat, because the receiving end will send ACKs which signal how much space is currently in its RWIN, so that the sending side can tune its transmission speed somewhat.

(In theory, delays in moving data out of the receive window to the app also means RWIN can fill up faster. The TCP standard doesn't say much about how soon to move data, though you can generally assume it's faster than your network is)

(you can use wireshark or such to analyse this. In its case, look for tcp.analysis.window_update)

The proper value for both relates to typical bandwidth and typical round-trip time.

The window size will typically adjust to a healthy value, though(verify)

Real-world behaviour and speed


When you want to use all physical bandwidth on a no-loss local-ish connection, you want

  • the sending side to spend no time waiting on the required ACKing before it can continue, and
  • the receiving end to be large enough to handle the bulk such a sender can send.

You can rarely change bandwidth or delay (...other than dedicating a physical line), so this usually means buffer size tweaks.

Roughly speaking, you'd want that buffer to be the size of unacknowledged data, which is basically the BDP.

The bandwidth-delay product (BDP):

link_bandwidth * round_trip_time

That multiplication is easily a large number when

link speeds are high (e.g. gigabit, fiber),
the delay is high (e.g. satellite's ~500ms),
and especially both (e.g. satellite)
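The multiplication is easy to try. This minimal Python sketch takes bandwidth in bits per second and RTT in seconds (my choice of units) and gives the result in bytes.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: how many bytes are in flight on a full pipe."""
    return bandwidth_bits_per_s * rtt_s / 8


# Roughly the cases mentioned above (the figures are illustrative assumptions)
examples = {
    "gigabit LAN, 0.5ms": bdp_bytes(1e9, 0.0005),
    "100Mbit/s internet link, 50ms": bdp_bytes(100e6, 0.05),
    "100Mbit/s over satellite, 500ms": bdp_bytes(100e6, 0.5),
}
```

The satellite case needs on the order of 6MB of buffer to keep the link full, against tens of KB on the LAN.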

Sometimes there are quantifiable bounds, e.g. the speed of light (and that fiber and copper are effectively ~0.7c) says trans-ocean round-trip time can't be much better than ~30ms, and that going via a satellite in geostationary orbit costs at least ~250ms each way.

Setting buffer/window sizes very high may help, but setting things to sane numbers is often better.

In theory very high settings are harmless, but if congestion or packet loss is to be expected, and you do not (or cannot) use Selective ACKs, then you can at best use Cumulative ACKs - which do not deal as well with even small amounts of packet loss. In this case, a well-calculated RWIN will often mean the same speed as a larger one - but with more stable/predictable RTT jitter, which can matter.

As such, RWIN is best tuned to the actual BDP of the connection, and should react to loss/congestion. This is why receive-window auto-tuning is a good idea.

For end users, the internet is a Long-Fat Network, because while bandwidth may be high throughout, end-to-end latency quickly becomes a limiting factor.

This basically means that for high-BDP links, the size of the TCP receive window easily keeps a single connection slower than it could be. (This is one cause of the "single-connection-is-slow, many-connections-can-saturate" effect, but not the only. Another common cause is that a protocol spends time doing work between actual sends. The higher the speed of sending, the more that matters, relatively speaking)

Window size is communicated. In TCP it is a 16-bit field that originally could only indicate 64K, which now often isn't enough for good performance - see the following list.

Without window scaling, assuming the maximum 64KiB,...

  • 250ms latency: 2Mbit/s (≈250KByte/s)
  • 150ms latency: 3.5Mbit/s (≈400KByte/s)
  • 50ms latency: 10Mbit/s (≈1.3MByte/s) (e.g. a not so good internet connection)
  • 15ms latency: 35Mbit/s (≈4.5MByte/s) (e.g. a good internet connection)
  • 5ms latency: 100Mbit/s (≈13MByte/s)
  • 0.5ms latency: 1000Mbit/s (≈130MByte/s) (e.g. home LAN)

And yes, you can take that roughly as a "for this speed I need lower than this latency" list.

...though windows apply per connection, so these speeds apply per connection. More connections are often an easy scale-up (until you saturate a link somewhere).

For this reason, you typically want to use TCP window scaling, an option negotiated in the TCP handshake that lets you say "I actually mean a power-of-two multiple of this" (a shift of up to 14 bits, see RFC 7323).
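The per-connection limits listed above follow from "one window per round trip". Here is a small Python sketch, with the window scaling shift as an optional parameter; the function and parameter names are made up.

```python
def max_throughput_bits_per_s(rtt_s: float, window_bytes: int = 65535,
                              scale_shift: int = 0) -> float:
    """Upper bound for one connection: at most one window's worth per round trip."""
    effective_window = window_bytes << scale_shift   # scaling shifts the 16-bit value left
    return effective_window * 8 / rtt_s


# Largest window expressible with the maximum shift of 14
LARGEST_SCALED_WINDOW = 65535 << 14
```

For example, max_throughput_bits_per_s(0.25) comes out at about 2.1 million, matching the 250ms entry, while the largest scaled window is about 1GiB.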

Most TCP stacks (linux and otherwise(verify)) implement window scaling.

Some older routers don't understand window scaling very well, which can do more harm than good to your connection. These problems are now fairly rare, so you can usually safely enable this, and you won't see slowdown due to small TCP window sizes.

Delayed, Cumulative, and Selective ACKs

RWIN and the MSS

Linux tweaking practice


Congestion avoidance


Interesting TCP states

See TCP state diagrams: (google search).

For a quick and dirty summary of yours, per protocol (except for unix sockets):

netstat -pna | egrep '^(tcp|udp)' | \
  sed -r 's/[- ]{2,}/ /g' | cut -d ' ' -f 1,6 | sed -r 's/(udp).*/\1/' | \
  sort | uniq -c | sort -rn
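A similar per-state count can be sketched in Python by reading linux's /proc/net/tcp, where the fourth column is the state as a hex code. The function name is made up, and the mapping covers the usual kernel state codes.

```python
from collections import Counter

# State codes as they appear in the 'st' column of /proc/net/tcp
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(lines) -> Counter:
    """Count sockets per TCP state from /proc/net/tcp style lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # data rows start with a slot number like '0:'; this skips the header
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts
```

On a linux host you would feed it an open file, e.g. count_tcp_states(open("/proc/net/tcp")).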


TCP: Possible SYN flooding on port 80. Sending cookies

The SYN flag (in the TCP header) is only used at the start of a new connection, as part of the three-way handshake (SYN request, SYN/ACK response, ACK, and the connection is established).

If a client sends only the request, but not the final acknowledgment, the server keeps these half-open connections around. Once the limit of these half-open connections is reached, it has to deny new connections.

This happens for one of two reasons:

  • A SYN flood - sometimes a buggy program, but often someone intentionally sending a mass of such initial packets, effectively a Denial of Service attack.
    SYN cookies are a useful way to deal with that. It's a clever way to not queue all requests, yet still deal with real clients completing the handshake. (...and while it does not violate specs it does have some caveats, which is why you would not want this as default behaviour, but it is a very nice fallback while under attack)
  • Very busy servers, seeing a lot of legitimate connections (particularly with very short-lived connections).
    You can consider:
    • increasing net.ipv4.tcp_max_syn_backlog, the global max (verify), from the default 512 or 1024 (for memory-restricted machines) to e.g. 4096
    • increasing net.core.somaxconn, which seems to be the per-port max backlog (verify), from the default 128 to e.g. 1024
    Keep in mind specific programs can request (/be compiled with?(verify)) another value


SYN_SENT

Our side, acting as a client, has tried to set up a connection and sent SYN, but has not received a SYN/ACK.

Since the remote side would often either complete the connection or reject it, this usually means a remote firewall that drops packets instead of rejecting them (...usually to make life a little harder for port scanners and DoS attacks, by having the client network stack leave a connection open for some time(verify))



On closing connections:

  • TCP requires both ends to signal the close (via FIN packets)
  • You get TIME_WAIT if the local side closes first, and CLOSE_WAIT if the remote end closes first.
  • A half-closed (only one side closed) connection is a perfectly valid state


CLOSE_WAIT

Technically means:

The other end has sent us a FIN, to signal it considers the connection closed.
The local side has acknowledged this one-directional close.
The local side has not (yet) decided to close the connection.

(In other words, closing is a three-way thing, just as establishing one is)

So the remote has said "I have no more data to send" (which is what CLOSE means).

In most cases, the protocols will also have just sent some form of "ok bye now (you can close too)" message, and/or the protocol specs say when a close is implied (see e.g. the details on persistent connections in HTTP 1.0 and 1.1).

If neither of those, the receiving side still has a connection open. At TCP level such a one-way connection is perfectly valid, and sometimes useful, so the TCP stack will not close it automatically - there are no related TCP timeouts.

In some cases you may know this is completely useless (e.g. because the protocol on top is query-response), but it's up to that protocol to be proper about connections.

If this is not intentional, the other side has left, but our local side thinks the connection is merely one-sided, and will only realize it's dead if and when it tries to send on it (and the other side says "that connection doesn't even exist"), which could be microseconds or weeks later.

Seeing some may be okay, because it often means a close happens within predictable time, e.g. because of heartbeat packets, or specced timeouts. In some cases adding some timeout in your code fixes all.

It may be a good idea to increase the maximum open sockets, but this is otherwise fine.

Seeing many sockets in CLOSE_WAIT usually points to them not being cleaned up within sensible time, and you should complain to the respective programmer.
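What "cleaning up" means for the programmer can be sketched as: notice the remote close, then close your own side. A minimal Python illustration follows; the function name is made up.

```python
import socket


def read_until_peer_closes(conn: socket.socket) -> bytes:
    """Read until the peer signals its close, then close our side as well."""
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:       # b"" means the other end has sent its FIN
            break
        chunks.append(data)
    # Without this close() a TCP socket would sit in CLOSE_WAIT until the
    # process exits or the object happens to get garbage collected.
    conn.close()
    return b"".join(chunks)
```

The same pattern applies to any stream socket; for a quick try-out without a network, socket.socketpair() gives two connected local sockets.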


TIME_WAIT

Our side initiated the connection close.

After that finishes, the socket is intentionally kept in the TIME_WAIT state before being removed, officially for twice the Maximum Segment Lifetime (MSL), where the MSL is usually something like 120, 90, 60, or 30 seconds.

The purposes seem to be:

  • avoid duplicate segments being accepted by an unrelated connection
    which is unlikely, as for that to happen the new connection would need to be the same socket in the 5-tuple sense: (proto, local_addr, local_port, remote_addr, remote_port).
    and in practice we could also use the timestamp, which makes it much less likely
  • (also to discard segments that come in after we close a connection, but that's implied by there not being a connection (verify))
  • avoid the other end sticking in a half-finished teardown for a while (verify)

Accepting a packet to the wrong connection takes a pretty pathological situation (same proto, local_addr, local_port, remote_addr, remote_port, and sequence), but is technically possible, so the network stack shouldn't ignore the possibility.

Good network stacks can easily deal with thousands of such sockets in soon-to-disappear states with barely measurable performance differences(verify), and a socket takes on the order of 1.5KB in memory, so these TIME_WAIT sockets usually have very little impact on resources or performance.

If you expect to see so many short-term connections that you can run into some limits, you can:

  • allocate more ports for outgoing connections, see /proc/sys/net/ipv4/ip_local_port_range
    ...some systems seem to allocate ~4k by default, others ~30k. If the latter is not enough, think about what you're doing.
  • tell the OS to remove these sockets faster
    ...but be aware that this is a game of weighing probability with likely client behaviour.
    On linux, tcp_fin_timeout is often suggested for this (note: this is not the MSL itself; it sets how long an orphaned socket may stay in FIN_WAIT_2, while the TIME_WAIT length itself is fixed in the kernel). Either of:
    echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
    sysctl -w net.ipv4.tcp_fin_timeout=30
  • tell the OS TIME_WAIT sockets can be reused immediately under certain near-safe conditions
    e.g. linux can be told to do so only when the new timestamp is bigger than the most recent timestamp - as that avoids the duplicate-segment case
    See net.ipv4.tcp_tw_reuse
    however, some load balancers and firewalls are known to expect more compliant behaviour, and may reject such reuse (See also RFC 1122)

Effect on restarting services

If you have a daemon listening on a specific port, and its connections were left in TIME_WAIT, this can amount to having to wait before you can restart the server on the same port.

Note that if a listening socket has no connections, the port can be safely closed and re-used immediately by the same sort of server. Take a look at SO_REUSEADDR (basically, if the local_address/port combination you try to bind to is in TIME_WAIT, you can reuse it. If it is in another state (probably 'in use'), it will still fail).
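In code, SO_REUSEADDR is a socket option set before bind(). Here is a minimal Python sketch; the function name and defaults are made up, and port 0 just asks the OS for any free port.

```python
import socket


def make_listener(port: int = 0, host: str = "127.0.0.1") -> socket.socket:
    """Create a listening TCP socket that may bind over leftover TIME_WAIT entries."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() to have any effect on that bind
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(128)
    return server
```

A restarted daemon that builds its listener like this does not have to wait for old connections on that port to leave TIME_WAIT.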


TCP_NODELAY and Nagle's Algorithm



Meanings of common messages

ICMP (Internet Control Message Protocol) is essentially a meta-protocol for IP that signals some common problems, including physical, routing and configuration problems. The meaning of some common messages:

  • Net Unreachable (from a router) - Signals that there is no applicable route, meaning we don't know where to send a packet to get it to its target network.
    • If the host is up, but cannot be reached from a host outside the net it is on (or only from some), this usually means a routing table somewhere is faulty.
    • If you cannot get onto the internet on your home network, this often means you do not have a default route/gateway, or that gateway is not properly set up. (...which is basically the same case, but in the other direction)
  • Host Unreachable (from a router) - the route to the network the host is in is known, but the host within it cannot be seen by whatever node is returning this. The host may simply have recently gone offline, or may be purposefully unresponsive(verify). One reason is ARP lookup failure(verify). This itself can be caused by bad subnet/routing config, in which case the router thinks it is routing to the correct subnet, but the host does not respond on it(verify).
  • Protocol Unreachable (from the target host(verify))
    • Assuming you're using IP, it often means a very minimal device that has omitted support for either UDP or TCP, and is signaling this.
  • Port Unreachable (from the target host(verify)) - Host is reachable and the protocol responds, but the requested port is not responding, probably either because nothing is listening/accepting on it, or the firewall is blocking this port.

There are more; see e.g. [3]

The ICMP Cannot Fragment message happens when packets were sent with the 'do not fragment' flag set, but a router would have to fragment them to forward them along its route. It is now commonly seen as part of path MTU discovery.

Fragmenting can be a problem when you tunnel things, such as with IPsec, or when your internet connection is PPPoA or PPPoE (PPP over ATM or Ethernet, respectively).