Networking notes - IP related



DHCP notes

See also:

  • RFC 1541, Dynamic Host Configuration Protocol (obsoleted by RFC 2131)
  • RFC 2131, Dynamic Host Configuration Protocol updated by:

Sockets

There is a POSIX socket API, which evolved from the earlier Berkeley (BSD) sockets API.

It started out as a standardization of the early TCP interface, designed so that it also allows a wider set of protocols.

other sockety things

Local sockets / unix domain sockets

POSIX calls it PF_LOCAL

BSD calls it PF_UNIX, which is #define'd to the same thing.


Both ends of such a socket connect, in code, to what looks like a file on the filesystem. To that code it presents a fairly normal socket interface (e.g. you can send() and recv()).

Security is mostly the filesystem permissions to that socket file.


Since this is essentially its own tiny subsystem, it fundamentally cannot be network-routed. Since it does not involve IP, there are no ports.

Note this is distinct from localhost IP, which is just a special case within the network stack. While this is also not routable, that's because of the network stack refusing to do so - but since that network stack is still involved, local sockets are simpler and a little faster in comparison.

This makes them useful for local IPC (both because it's fast and you can't accidentally open it up to the network with the wrong listen() call)
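As a minimal sketch of what "a fairly normal socket interface" means here, the following echoes a message over a local stream socket. The file name demo.sock and the function name are made up for illustration, and it assumes a Unix-like system.

```python
import os
import socket
import tempfile


def unix_socket_roundtrip(message: bytes) -> bytes:
    """Send message over a local (AF_UNIX) stream socket and return the echo."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "demo.sock")   # hypothetical socket file name

        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        server.bind(path)            # this creates the socket file
        server.listen(1)

        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)         # addressed by path: no IP, no port
        conn, _ = server.accept()

        client.sendall(message)
        conn.sendall(conn.recv(1024))    # echo back whatever arrived
        reply = client.recv(1024)

        for s in (conn, client, server):
            s.close()
        return reply
```

Access control in this sketch is just whatever permissions the temporary directory and the socket file end up with.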


Note that it is also distinct from named pipes.



Abstract sockets

These are local sockets (AF_LOCAL/AF_UNIX) but in the abstract namespace.

The main difference is that it lets you bind sockets to names (known by the kernel), rather than files on the filesystem.

The way this is implemented in linux is that you create a local socket in the usual AF_LOCAL way, but for the path you hand in a string that starts with a null byte. So basically you can bind() or connect() to something like "\x00myname"


Upsides:

  • don't have to avoid existing socket files on filesystem
  • don't have to clean up socket files (kernel removes them when last reference closes)

Caveats:

  • no permissions
  • linux-only

This makes them useful for IPC, but only when trusting everything on the host.
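A small sketch of the null-byte trick, assuming Linux (elsewhere the bind will likely fail). The name used is arbitrary and only made unique with the process ID for this example.

```python
import os
import socket
import sys


def abstract_roundtrip(message: bytes, name: bytes = b"") -> bytes:
    """Pass message through a socket bound in the abstract namespace (Linux)."""
    if not name:
        name = b"demo-abstract-" + str(os.getpid()).encode()   # hypothetical name
    address = b"\0" + name       # leading null byte selects the abstract namespace

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(address)         # no file appears on the filesystem
    server.listen(1)

    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(address)
    conn, _ = server.accept()

    client.sendall(message)
    data = conn.recv(1024)

    for s in (conn, client, server):
        s.close()                # the name goes away with the last reference
    return data
```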


Support: CHECKME

at the very least 2.6 kernels, and enabled.
recent kernels can be assumed to have it, namespaces are central now.(verify)



Named pipes

Similar to domain sockets in that they are a filesystem thing.

Different in that they do not present a socket interface at all, just bytestreams (FIFO-style), to whatever other process opens that thing as a file.

Created with
mkfifo
(syscall and utility share the same name)
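To illustrate the "no socket interface, just a bytestream" point, here is a sketch that creates a FIFO and pushes bytes through it using plain file operations. The file name is made up, and it assumes a Unix-like system.

```python
import os
import tempfile
import threading


def fifo_roundtrip(message: bytes) -> bytes:
    """Write message into a named pipe from one thread, read it in another."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "demo.fifo")   # hypothetical file name
        os.mkfifo(path)                            # wraps the mkfifo call

        def writer():
            # Opening for writing blocks until something opens it for reading.
            with open(path, "wb") as f:
                f.write(message)

        t = threading.Thread(target=writer)
        t.start()
        with open(path, "rb") as f:    # plain file API, no send()/recv()
            data = f.read()            # returns once the writer closes its end
        t.join()
        return data
```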

Addresses and nets

*-cast

  • Unicast: Targeted at a specific host via a specific address
combines well with a name service
probably almost all traffic on the internet is unicast
(at least directly speaking; it is sometimes useful to multicast through tunnels, e.g. connecting two clusters of hosts)


  • Broadcast: Targeted at everyone (often "everyone nearby", since it's a bad idea to route broadcasts)
Things like DHCP are broadcast, because you don't have an address yet
Easy for truly local things
Downsides: Local-only may be too local. Also not good for bandwidth when used for anything bulky


  • Multicast: Targeted at anyone registered to listen to a multicast net
Ideally, this can save a lot of bandwidth over lots-of-unicast connections -- when a lot of the path knows it needs to carry only one copy and can split off copies only near receivers
MDNS, the thing that is probably giving you those .local addresses, works via (local) multicast
some service/device detection (SLP, printers, some storage) uses multicast, mostly so that they can be found before that device is configured for multicast (useful during setup)
geocast refers to multicast that is specialized in "send it to different geographic regions"


  • Anycast:
usually works out as "any node that listens on this specific (so yes, shared) address"
implementation of this idea varies.
For example, with IPv4 you can have multiple hosts with the same IP and use BGP. While BGP is intended to select a good route to a single destination, in this case it is effectively choosing a destination. (This trick works better with stateless protocols like UDP than with stateful ones like TCP, because when BGP changes route (which TCP would normally recover from) it will in this case change destination host)
IPv6 directly supports anycast -- within a subnet(verify)
(note that this is similar to the round-robin DNS trick, but distinct: anycast works at IP addressing / routing level, round-robin works purely in name resolution before that)

IPv4 network notation

The following are the same:

  • 192.0.2.0/255.255.255.0
  • 192.0.2.0/24 (CIDR notation)
  • 192.0.2/24 or things like 10/8 or even 0/8 - a lazy variation on CIDR style that omits trailing zeroes
(so these mean 192.0.2.0/24, 10.0.0.0/8, and 0.0.0.0/8, respectively)
  • 192.0.2. or 192.168. or 127. - another lazy variant (that is also less powerful - it's basically a start-of-string match, and so implicitly limited to octet boundaries). Apache supports it, but I can't think of anything else right now.

Applications vary in which they support. Usually they stick to the top one or two.


These examples all mention masks/subnets split at whole octets. When you refer to subnets, you say how many (leftmost) bits matter to its definition.

Depending on what you are configuring, you can specify both (sub)net address and a netmask, such as 192.168.110.168/29. The application you hand this to may require you to have the IP have the rightmost bits cleared, as in that example (29 having a 3-bit host part, and 168 being 10101000). The rest of the bits are used by things on that net, so don't matter to the net specification.
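As one concrete example of an application's choices, Python's standard ipaddress module accepts the first two notations, rejects the lazy ones, and by default insists on cleared host bits. This is a sketch of that one library's behaviour, not a statement about tools in general; the helper name is made up.

```python
import ipaddress

# Netmask notation and CIDR notation describe the same network.
a = ipaddress.ip_network("192.0.2.0/255.255.255.0")
b = ipaddress.ip_network("192.0.2.0/24")
same = (a == b)


def strict_parse_fails(text: str) -> bool:
    """True if ipaddress refuses to parse text as a network."""
    try:
        ipaddress.ip_network(text)
        return False
    except ValueError:
        return True


# 168 is 10101000, so the 3 host bits of a /29 are already clear.
aligned = ipaddress.ip_network("192.168.110.168/29")

# 170 is 10101010: host bits set, so strict parsing refuses it...
rejected = strict_parse_fails("192.168.110.170/29")
# ...while strict=False masks the host bits away for you.
relaxed = ipaddress.ip_network("192.168.110.170/29", strict=False)

# The lazy "omit trailing zeroes" variant is not understood by this module.
lazy_rejected = strict_parse_fails("10/8")
```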

IPv4 special addresses

Main relevant RFCs:

  • RFC3330, 'Special-Use IPv4 Addresses' (refers to most other RFCs mentioned here)
  • RFC1918, 'Address Allocation for Private Internets'
  • RFC1700, 'Assigned Numbers'


The private network ranges are defined by RFC 1918:

  • 10.0.0.0/8, that is, 10.0.0.0 through 10.255.255.255
  • 172.16.0.0/12, that is, 172.16.0.0 through 172.31.255.255
  • 192.168.0.0/16, that is, 192.168.0.0 through 192.168.255.255

These will not be routed by hardware by default. To get onto the net, you will need a gateway. (there are also exceptions such as VPN, and VLANs also touch on this). Their not being routed means there can be any number of these nets, which will not conflict with each other. As such, they are e.g. very commonly used for the LAN behind your internet modem.


Well known special cases (see mainly RFC 1700 for these):

  • 255.255.255.255/32 is for 'limited broadcast', in that it should not be routed between nets. (technically in 240/4, see below)
  • 0.0.0.0/32 is regularly used with a meaning like 'any local IP' (a sort of 'this host'), particularly when configuring what IP(s)/adapters a service should listen on.
  • 127/8: Anything to this network should go to the local host, without going to a network card
It is common convention to configure 127.0.0.1/32 on a virtual loopback device.
This net is typically used for local-only uses such as IPC and some services.
Local-only services can alternatively be served via domain sockets, which you can use much like network sockets but which are cheaper than putting things on the full network stack, since the kernel knows beforehand that it's local-only.


Other special cases:

  • 224/4 (224.0.0.0-239.255.255.255): Meant for multicast (see RFC3171)
  • 240/4 (240.0.0.0-255.255.255.255): "Reserved for future use"
  • 192.0.2/24: intended for examples in documentation and code, and should not be used on the 'net (nor should it be confused with 192.168/16)
  • 198.18/15: to be used for bandwidth tests (see RFC2544)
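Several of these special ranges can be checked programmatically. The sketch below leans on the flags that Python's ipaddress module exposes; which ranges that module counts under each flag is its own choice and has shifted a little between Python versions, so treat the labels as indicative. The classify helper is made up for this example.

```python
import ipaddress


def classify(text: str) -> list:
    """Return the special-use labels that ipaddress reports for an address."""
    ip = ipaddress.ip_address(text)
    labels = []
    if ip.is_unspecified:
        labels.append("unspecified")   # 0.0.0.0
    if ip.is_loopback:
        labels.append("loopback")      # 127/8
    if ip.is_link_local:
        labels.append("link-local")    # 169.254/16
    if ip.is_multicast:
        labels.append("multicast")     # 224/4
    if ip.is_private:
        labels.append("private")       # RFC 1918 ranges, among others
    if ip.is_reserved:
        labels.append("reserved")      # 240/4
    return labels


for example in ("10.1.2.3", "172.31.255.255", "172.32.0.1",
                "127.0.0.1", "169.254.10.20", "224.0.0.251", "240.0.0.1"):
    print(example, classify(example))
```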


There are various ranges reserved by IANA but not used (and, sometimes, unlikely to be used for the purpose they seem likely to be reserved to).

This includes 5/8 (which hamachi used) and various others, apparently including 128/8, 191.255/16, 192.0.0/24, 223.255.255/24 and the already-mentioned 240/4.




Link-only network

The 169.254/16 link-only network (Microsoft has its own name for it, Automatic Private IP Addressing (APIPA)) is an unroutable network described by RFC 3927.


These addresses are assigned by a host to itself, typically when it is configured to look for a DHCP server but none responds. It may keep trying to reach DHCP while such an IP is configured.

The IP in this range is chosen by the host itself, randomly from its 64K-large range, which likely avoids collisions from other hosts doing the same, particularly since the non-routeability means you'll only see others on the same switch.

The idea is roughly that every computer that does this will end up on the same subnet, and so can communicate with others on the same segment, and that that could be better than nothing.

For example, when you directly connect two computers (configured for DHCP), you can copy files between them without having to bother with manual network configuration.



Bogons and martians

Bogons refer to packets that claim to come from networks that are currently unused (bogon space).

Martians are packets that claim to come from unroutable nets.

Both are good indicators that the traffic is spoofed, or sometimes that a router is very dumb, or malfunctioning.


IPv6 address notation

IPv6 addresses are 128-bit numbers. RFC 1884 specified that everything should recognize the following forms:

  • Full form: eight chunks of 16-bit hexadecimal numbers:
FE12:4567:3C23:4984:0011:0000:0000:0006
  • abbreviated form, based on the two rules that:
    • leading zeroes in a block can be removed. 0000 can be 0, 0123 can be 123, etc.
    • :: means "fill with 0000 blocks" which can appear at any point (beginning, middle, end), but at most once (to avoid ambiguities)
  • alternative form: a convenience form in which you specify the last two 16-bit blocks as four decimal numbers (for specifying IPv4 addresses in IPv6 notation) while leaving the rest as hex:
0:0:0:0:0:FFFF:12.127.65.3 
  which can be abbreviated as
::FFFF:12.127.65.3
  and is also just another way of writing
::FFFF:C7F:4103
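The abbreviation rules and the mixed IPv4 form can be checked with Python's ipaddress module; a small sketch using the example addresses above:

```python
import ipaddress

full = ipaddress.ip_address("FE12:4567:3C23:4984:0011:0000:0000:0006")
short = full.compressed       # leading zeroes dropped, one zero run becomes ::
long_again = full.exploded    # back to eight four-digit blocks

# The dotted-decimal tail is just another spelling of the last 32 bits.
mixed = ipaddress.ip_address("::FFFF:12.127.65.3")
hex_only = ipaddress.ip_address("::FFFF:C7F:4103")
embedded_v4 = mixed.ipv4_mapped   # the IPv4 address carried inside


def parses(text: str) -> bool:
    """True if text is accepted as an IP address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


# "::" may appear at most once, otherwise the address is ambiguous.
double_gap_ok = parses("1::2::3")
```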

IPv6 special addresses and ranges

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
IP/Network           Purpose
(abbreviated, CIDR)  
-------------------- --------------
::/8                 Unassigned, but contains special addresses:
::1/128              * loopback
::/128               * unspecified/any  (all zeroes, like 0.0.0.0 in IPv4)
::/96                * IPv4 compatible addresses (obsolete)
::ffff:0:0/96        * IPv4-mapped addresses

2000::/3             General allocation ('global unicast')

FE80::/9             Private ranges:
FE80::/10            * link-local unicast 
                       (similar to IPv4's 169.254.0.0/16)
FEC0::/10            * site-local unicast
                       (similar to IPv4's 10/8, 172.16/12 and 192.168/16)

FC00::/7             Unique Local Addresses, similar in purpose to FEC0::/10
                     This includes two ranges, FC00::/8 and FD00::/8, 
                     which can be used for /48 subnets 
                     that are only site-routable
                     (40 bits are to be filled in randomly, to avoid 
                      collision should such nets ever merge)

2002::/16            6to4
2001:db8::/32        To be used in examples (like IPv4's 192.0.2.0/24)

FF00::/8             Multicast

0200::/7             Reserved for NSAP address allocation

...and a number of unassigned ranges.
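A sketch of the "40 bits filled in randomly" idea for a Unique Local /48, plus a few membership checks against ranges from the table. The function name is made up, and it simply draws random bits rather than following any particular generation procedure.

```python
import ipaddress
import secrets


def random_ula_prefix() -> ipaddress.IPv6Network:
    """Pick a random /48 inside FD00::/8 (8 fixed bits + 40 random bits)."""
    global_id = secrets.randbits(40)
    value = (0xFD << 120) | (global_id << 80)   # remaining 80 bits stay zero
    return ipaddress.IPv6Network((value, 48))


ULA = ipaddress.ip_network("fc00::/7")
DOCUMENTATION = ipaddress.ip_network("2001:db8::/32")
GLOBAL_UNICAST = ipaddress.ip_network("2000::/3")

prefix = random_ula_prefix()
print(prefix, "is inside", ULA, ":", prefix.subnet_of(ULA))
```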



Subnetting(, routing)


Routed networking is largely concerned not with nodes or hosts, but with networks: if routers can put a packet on the net that the target host is on, switches and hosts themselves take care of the last step.


An IP (sub)network is often written/stored as an address combined with a mask, which together imply a range of addresses. From such a (an_IP_on_a_subnet, subnet_mask) pair you can derive the size of the net, the network address, and the broadcast address. (Note: when referring to a network, you usually use the network address as the IP to mention in that IP/mask pair)


There are different equivalent ways to refer to a network:

192.168.124.168/255.255.255.248  (a little more reflective of the network stack logic)
192.168.124.168/29               (equivalent, shorter for us to write)

The /number refers to how many leftmost bits are set (Technically, consecutive set bits are not strictly necessary in (sub)networks, but there is more confusion than added value to non-consecutive-bit network masking).

You can also mention them by start and end address, although this is more for reporting than anything else. It can be convenient for human checks when subnets don't start or end on byte edges (anything except /0, /8, /16, /24, and /32). For example, '145.99.239.168 through 145.99.239.175' makes more immediate sense to me than 145.99.239.168/29
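Deriving network address, broadcast address and size from an (IP, mask) pair is a bitwise AND plus some counting. A sketch using Python's ipaddress module; the describe helper is made up for this example.

```python
import ipaddress


def describe(ip_and_mask: str) -> dict:
    """Derive network details from any 'address/mask' or 'address/prefix' pair."""
    iface = ipaddress.ip_interface(ip_and_mask)   # host bits may be set here
    net = iface.network                           # address AND mask
    return {
        "network": str(net.network_address),
        "broadcast": str(net.broadcast_address),
        "prefix": net.prefixlen,
        "size": net.num_addresses,
    }


# Any host address on the net leads to the same answer.
print(describe("192.168.124.170/255.255.255.248"))
print(describe("145.99.239.168/29"))
```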


When subnetting a given net, you could think of it as using the bits for three different things: the existing network, the new subnetwork, and the host bits.


For example, if you take the private 192.168.0.0/16 network (192.168.0.0 through 192.168.255.255) and split it into 256 separate /24 networks (192.168.0.0/24, 192.168.1.0/24, 192.168.2.0/24, etc.), on a bit level you can think of it as:

nnnnnnnn nnnnnnnn ssssssss hhhhhhhh

...although this is only an aid to modelling, as there is no hard difference between the network and subnetwork bits.
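The /16-into-/24s split can be reproduced with Python's ipaddress module; the bits helper is made up here, just to line the result up with the n/s/h picture above.

```python
import ipaddress

parent = ipaddress.ip_network("192.168.0.0/16")
subnets = list(parent.subnets(new_prefix=24))   # 8 extra network bits


def bits(address: str) -> str:
    """Render an IPv4 address as four space-separated octets in binary."""
    raw = format(int(ipaddress.ip_address(address)), "032b")
    return " ".join(raw[i:i + 8] for i in range(0, 32, 8))


print(len(subnets), "subnets, from", subnets[0], "to", subnets[-1])
print("nnnnnnnn nnnnnnnn ssssssss hhhhhhhh")
print(bits("192.168.2.0"))
```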


Note that if one computer has the right IP but the wrong netmask (such as 192.168.0.0/255.255.0.0 when it should have 192.168.0.0/255.255.255.0), directed traffic may work but its (specific-net) broadcasts may not.


The earlier point about networks over nodes holds for IP networking stacks. Their routing tables primarily deal with sources/targets and masks, and the combination implies what sort of address the source/target is. For example, a routing table may look something like:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
145.99.239.168  0.0.0.0         255.255.255.248 U     0      0        0 eth0
192.168.0.0     0.0.0.0         255.255.255.0   U     0      0        0 eth1
5.0.0.0         0.0.0.0         255.0.0.0       U     0      0        0 ham0
127.0.0.0       0.0.0.0         255.0.0.0       U     0      0        0 lo
0.0.0.0         192.168.2.1     0.0.0.0         UG    0      0        0 eth0

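When deciding to what interface a packet should go, the packet's destination IP is tested against each entry (look at destination/genmask). The most specific entry that applies (the one with the longest mask) is used; in a listing ordered like the one above, that is also the first one that applies.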

The last entry is often the (default) gateway, which is the fallback that everything gets sent to if nothing else applies. This is handy in home networks, because anything not meant for your local LAN is probably something on the internet, so your DSL/cable modem is the obvious default gateway.
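A toy version of that lookup, using the example table above. Real stacks pick the most specific matching route (longest prefix) and use more efficient data structures plus metrics for tie-breaking; this sketch only shows the selection rule, and its names are made up.

```python
import ipaddress

# (destination network, gateway or None for directly connected, interface)
ROUTES = [
    ("145.99.239.168/29", None, "eth0"),
    ("192.168.0.0/24", None, "eth1"),
    ("5.0.0.0/8", None, "ham0"),
    ("127.0.0.0/8", None, "lo"),
    ("0.0.0.0/0", "192.168.2.1", "eth0"),   # default route matches everything
]


def lookup(dest: str):
    """Return (interface, gateway) of the most specific route matching dest."""
    ip = ipaddress.ip_address(dest)
    matches = []
    for net_text, gateway, iface in ROUTES:
        net = ipaddress.ip_network(net_text)
        if ip in net:
            matches.append((net.prefixlen, iface, gateway))
    # The longest prefix wins; the /0 default only wins when nothing else matched.
    _, iface, gateway = max(matches, key=lambda m: m[0])
    return iface, gateway
```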





Switches, routers, hubs, bridges, etc.


NAT


NAT itself refers to translating addresses. It can be a somewhat confusing subject.

  • Full NAT is stateless, and somewhat like routing except that it rewrites packets in order to make networks appear different
  • SNAT is stateful, and does half of full nat, rewriting source addresses/ports
  • DNAT is stateful, and does the other half, rewriting target addresses/ports.

(So yes, full stateless NAT can be imitated with two stateful NATs)

Note that you also need ip forwarding on for most of this, and some distros have it off by default, in which case you need to:

echo 1 > /proc/sys/net/ipv4/ip_forward

...or, actually, figure out which configuration file controls the system's doing so at bootup.

SNAT and masquerading

In practice, SNAT is often used to assist sharing an internet connection in a home network.

In this case, there is one gateway host that does the SNATting. It takes connections from the local net, rewrites the source IP so that the response will come back to it. Since it remembers where the connection came from, the data it gets back can easily be sent where the connection actually originated (which is a DNAT step implicit in every SNAT: The packet's destination is rewritten back to the local net (iptables remembered this) and the packet gets back to the original node).

Where PUBLIC_IP is the SNATting node's outside, public IP:

iptables -t nat -A POSTROUTING -o ppp0 -j SNAT --to-source $PUBLIC_IP


Masquerade, also called dynamic NAT (sometimes dNAT, but usually not since it's too easily confused with DNAT) is a variation on using SNAT this way. It automatically picks the outgoing interface's IP as the new source IP, which helps when that IP is assigned dynamically. The following is similar to the SNAT rule above.

iptables -t nat -A POSTROUTING -o ppp0 -j MASQUERADE
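The "remembers where the connection came from" part of SNAT can be modelled as two lookup tables. This is a toy sketch, not how netfilter is implemented (which, among other things, tries to keep the original source port when it can); the class name, addresses and starting port are made up.

```python
import itertools


class ToySnat:
    """Toy model of stateful source NAT with per-connection port mapping."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)
        self._out = {}    # (src_ip, src_port, dst_ip, dst_port) -> public port
        self._back = {}   # (public port, dst_ip, dst_port) -> (src_ip, src_port)

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """Rewrite the source of a packet leaving the local net."""
        key = (src_ip, src_port, dst_ip, dst_port)
        if key not in self._out:
            port = next(self._ports)
            self._out[key] = port
            self._back[(port, dst_ip, dst_port)] = (src_ip, src_port)
        return (self.public_ip, self._out[key], dst_ip, dst_port)

    def inbound(self, src_ip, src_port, dst_ip, dst_port):
        """Rewrite the destination of a reply (the implicit DNAT step)."""
        original = self._back.get((dst_port, src_ip, src_port))
        if original is None:
            return None   # not a reply to anything we translated
        return (src_ip, src_port) + original
```

Usage would be calling outbound() for each packet heading out and inbound() for each packet coming back; replies that match no remembered connection have nowhere to go.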

DNAT

Using DNAT by itself means rewriting things that come in.

It is probably mostly used for port forwarding: if you e.g. have a routing DSL modem and want to connect to an internal computer's remote desktop, web server, or whatever else, use rules like:

iptables -t nat -A PREROUTING -d $externalip -p tcp -m tcp --dport 3389 -j DNAT --to-destination $internalip:3389
iptables -t nat -A PREROUTING -d $externalip -p tcp -m tcp --dport 80   -j DNAT --to-destination $internalip:80

...assuming said modem takes such rules somehow, or is connected to a computer that does so for it. The logic's sound.

You can also use DNAT to p


STUN, TURN, ICE


UPnP, IGD, NAT-PMP, etc.


Two different protocols with the same intent: negotiating automatic port forwarding, so that manually configuring modem DNAT for port forwarding becomes unnecessary.

IGD is a device profile defined within UPnP, which is the approach Microsoft pushed; NAT-PMP is Apple's. The latter came later(verify) and at least initially seemed somewhat restricted to Apple products, and (perhaps so) UPnP is better supported.(verify)




Connection errors

Connection refused

TCP

TCP windows

TCP needs its packets to be acknowledged, as part of guaranteed delivery.

This makes it a sliding window protocol with extra bookkeeping at both ends.


tl;dr:

  • if you want throughput, both windows need to be large enough. "Large enough" depends on the physical bandwidth and a typical connection's RTT, see BDP
  • the "just increase all the numbers" approach has its limits, in that this can cause things to stutter more on congestion (verify)
  • the autotuning we now have works well.
you may wish to increase the bounds it works in if you have higher-BDP links. Look at net.core.rmem_max, net.core.wmem_max, net.core.rmem_default, net.core.wmem_default, net.core.optmem_max, net.ipv4.tcp_rmem, net.ipv4.tcp_wmem


  • Some software may have its own too-small buffers, meaning larger windows won't help its speed
(includes old OpenSSH(verify), old Samba(verify))


The below focuses mostly on how the window, and its settings, could limit the throughput of individual TCP connections. Each connection has its own window, so enough separate connections will typically balance each other, and at some point combine to saturate your interface.


The send and receive windows


The transmit window stores packets until it gets a corresponding ACK packet from the receiving end - and not earlier, should retransmission be necessary.

When full, the TCP connection will stall until there is space in the window again.

Too small a transmit window (for a particular end-to-end connection) means the window fills up while waiting out the latency, so the connection stalls and ends up using less than the bandwidth that is actually available.

Too large is less serious: If the transmitting side stores more packets than will be in flight, this takes more (kernel) memory than necessary, and if things stutter, they may stutter slightly harder (why?(verify)), though rarely in a serious way.


The receive window (RWIN) stores incoming packets. The receiving end's network stack (in the kernel) needs to do two things: acknowledge the packet(s), and move the data into the application layer.

When the receive window is very small, it may not move out the data fast enough, which would mean the window fills up and further incoming packets can only be dropped, which from the sender's view means it stops ACKing, resulting in congestion control and retransmission -- which looks like a lower speed, stop-and-go flow, and wastes bandwidth.


Slightly too small can limit transfer speed somewhat, because the receiving end will send ACKs which signal how much space is currently in its RWIN, so that the sending side can tune its transmission speed somewhat.

(In theory, delays in moving data out of the receive window to the app also means RWIN can fill up faster. The TCP standard doesn't say much about how soon to move data, though you can generally assume it's faster than your network is)

(you can use wireshark or such to analyse this. In its case, look for tcp.analysis.window_update)


The proper value for both relates to both typical bandwidth and typical round-trip time.

The window size will typically adjust to a healthy value, though(verify)

Real-world behaviour and speed



When you want to use all physical bandwidth on a no-loss local-ish connection, you want

  • the sending side to spend no time waiting on the required ACKing before it can continue, and
  • the receiving end's window to be large enough to handle the bulk such a sender can send.


You can rarely change bandwidth or delay (...other than dedicating a physical line), so this usually means buffer size tweaks.

Roughly speaking, you'd want that buffer to be the size of the unacknowledged data in flight, which is basically the bandwidth-delay product (BDP):

link_bandwidth * round_trip_time

That multiplication is easily a large number when link speeds are high (e.g. gigabit, fiber), the delay is high (e.g. a satellite's ~500ms), or both (e.g. a congested satellite link). Sometimes there are quantifiable bounds, e.g. the speed of light (and the fact that signals in fiber and copper travel at effectively ~0.7c) says trans-atlantic round-trip time can't be better than around 30ms.
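As a worked example of that multiplication (the function name and the sample links are just for illustration; note the division by 8 to go from bits to bytes):

```python
def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> float:
    """Bandwidth-delay product: bytes that can be in flight, unacknowledged."""
    # bits/s * seconds = bits; /8 for bytes, /1000 because rtt is in ms
    return bandwidth_bits_per_s * rtt_ms / 8000


# Gigabit link across an ocean (~30 ms RTT): a few megabytes in flight.
gigabit_ocean = bdp_bytes(1_000_000_000, 30)

# 100 Mbit/s over a ~500 ms satellite hop: more still, despite the slower link.
satellite = bdp_bytes(100_000_000, 500)

print(gigabit_ocean, satellite)
```

Both results are far above the 64KiB that an unscaled window can describe.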

Setting everything very high may help, but setting things to sane numbers is often better.

In theory very high settings are harmless, but if there is expectable congestion or packet loss, and you do not (or cannot) use Selective ACKs, then you can at best use Cumulative ACKs, which do not deal as well with even small amounts of packet loss. In this case, a well-calculated RWIN will often mean the same speed, but a more stable/predictable RTT (less jitter), which can matter.

As such, RWIN is best tuned to the actual BDP of the connection, and should react to loss/congestion. This is why receive-window auto-tuning is a good idea.


For end users, the internet is a Long-Fat Network, because while bandwidth may be high throughout, end-to-end latency quickly becomes the limiting factor.

This basically means that for high-BDP links, the size of the TCP receive window may keep a single connection slower than it could be. (This is one cause of the "single-connection-is-slow, many-connections-can-saturate" effect, but not the only. Another common cause is that a protocol spends time doing work between actual sends. The higher the speed of sending, the more that matters, relatively speaking)


Window size is communicated. In TCP it is a 16-bit field that originally could only indicate 64K, which now often isn't enough for good performance - see the following table

Without window scaling, assuming the maximum 64KiB,...

  • 250ms latency: 2MBit/s (≈250KB/s)
  • 150ms latency: 3.5MBit/s (≈400KByte/s)
  • 50ms latency: 10MBit/s (≈1.3MByte/s) (e.g. a not so good internet connection)
  • 15ms latency: 35MBit/s (≈4.5MByte/s) (e.g. a good internet connection)
  • 5ms latency: 100MBit/s (≈13MByte/s)
  • 0.5ms latency: 1000MBit/s (≈130MByte/s) (e.g. home LAN)

You can also take that as a rough 'for this speed I need lower than this latency' list.
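The list above is just window size divided by round-trip time. A sketch that recomputes it, ignoring protocol overhead and loss (names are made up):

```python
WINDOW_BYTES = 64 * 1024   # the largest window without window scaling


def max_throughput_mbit(rtt_ms: float, window_bytes: int = WINDOW_BYTES) -> float:
    """Upper bound on one connection's speed: one full window per round trip."""
    bits_per_round_trip = window_bytes * 8
    return bits_per_round_trip / (rtt_ms * 1000)   # bits per ms -> Mbit/s


for rtt in (250, 150, 50, 15, 5, 0.5):
    mbit = max_throughput_mbit(rtt)
    print(f"{rtt:>5} ms: {mbit:8.1f} Mbit/s (about {mbit / 8:.1f} MByte/s)")
```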


Note, by the way, that windows apply per connection, so these speeds apply per connection. Until you saturate a link somewhere, more connections are the easier fix.


http://www.speedguide.net/articles/the-tcp-window-latency-and-the-bandwidth-delay-2678

http://www.kehlet.cx/articles/99.html


For this reason, you typically want to use TCP window scaling, an option negotiated during the TCP handshake that basically lets each side say "I actually mean a power-of-two multiple(verify) of this", which, tl;dr, is necessary for fast transfers when the BDP is larger than 64KiB.
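In other words, the 16-bit window field gets shifted left by a negotiated count. A sketch of the arithmetic, assuming the usual upper limit of 14 on that shift count; the function names are made up.

```python
MAX_SHIFT = 14          # assumed upper limit on the negotiated shift count
MAX_FIELD = 0xFFFF      # largest value the 16-bit window field can hold


def effective_window(field_value: int, shift: int) -> int:
    """Window in bytes that a given field value means under a given shift."""
    return field_value << shift


def needed_shift(bdp_bytes: int) -> int:
    """Smallest shift with which the field can describe at least bdp_bytes."""
    shift = 0
    while effective_window(MAX_FIELD, shift) < bdp_bytes and shift < MAX_SHIFT:
        shift += 1
    return shift


# A 3.75 MB bandwidth-delay product needs the field multiplied by 2**6.
print(needed_shift(3_750_000), effective_window(MAX_FIELD, 6))
```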

Most TCP stacks (linux and otherwise(verify)) implement window scaling.

Some older routers don't understand window scaling very well, which can do more harm than good to your connection. These problems are now fairly rare, so you can usually safely enable this.

Delayed, Cumulative, and Selective ACKs

RWIN and the MSS

Linux tweaking practice


Congestion avoidance



Interesting TCP states

See TCP state diagrams (e.g. via a web search).


For a quick and dirty summary of your own host's connections, per protocol and state (except for unix sockets):

netstat -pna | egrep '^(tcp|udp)' | \
  sed -r 's/[- ]{2,}/ /g' | cut -d ' ' -f 1,6 | sed -r 's/(udp).*/\1/' | \
  sort | uniq -c | sort -rn

Opening

TCP: Possible SYN flooding on port 80. Sending cookies

The SYN flag (in the TCP header) is only used at the start of a new connection, part of the three-way handshake, (SYN request, SYN/ACK response, ACK and the connection is established).


If a client sends only the request, but not the final acknowledgment, the server keeps these half-open connections around. Once the limit of these half-open connections is reached, it has to deny new connections.

This happens for one of two reasons:

  • A SYN flood - sometimes a buggy program, but often someone intentionally sending a mass of such initial packets, effectively a Denial of Service attack.
SYN cookies are a useful way to deal with that. It's a clever way to not queue all requests, yet still deal with real clients completing the handshake. (...and while it does not violate specs it does have some caveats, which is why you would not want this as default behaviour, but it is a very nice fallback while under attack)
  • Very busy servers, seeing a lot of legitimate connections (particularly with very short-lived connections)
you can consider
increasing net.ipv4.tcp_max_syn_backlog, the global max (verify), from the default 512 or 1024 (for memory-restricted machines) to e.g. 4096
increasing net.core.somaxconn, which seems to be the per-port max backlog (verify), from the default 128 to e.g. 1024
Keep in mind specific programs can request (/be compiled with?(verify)) another value


SYN_SENT

Our side, acting as a client, has tried to set up a connection and sent SYN, but has not received the SYN/ACK response.

Since the remote side would often either complete the connection or reject it, this usually means a remote firewall that drops packets instead of rejecting them (...usually to make life a little harder for port scanners and DoS attacks, by having the client network stack leave a connection open for some time(verify))


Closing

Notes:

  • TCP requires both ends to signal the close (via FIN packets)
  • You get TIME_WAIT if the local side closes first, and CLOSE_WAIT if the remote end closes first.
  • A half-closed (only one side closed) connection is a perfectly valid state


CLOSE_WAIT

Technically means:

The other end has sent us a FIN, to signal it considers the connection closed.
The local side has acknowledged this one-directional close.
The local side has not (yet) closed the connection itself.

(In other words, closing is a multi-step exchange, just as establishing a connection is)


So the remote has said "I have no more data to send" (which is what CLOSE means).

In most cases, the protocols will also have just sent some form of "ok bye now (you can close too)" message, and/or the protocol specs say when a close is implied (see e.g. the details of persistent connections in HTTP 1.0 and 1.1).


If neither of those, the receiving side still has a connection open. At TCP level such a one-way connection is perfectly valid, and sometimes useful, so the TCP stack will not close it automatically - there are no related TCP timeouts.

In some cases you may know this is completely useless (e.g. because the protocol on top is query-response), but it's up to that protocol to be proper about connections.


If this is not intentional, then the other side has left, but our local side thinks the connection is merely one-sided, and will only realize it's dead if and when it tries to send on it (and the other side says "that connection doesn't even exist"), which could be microseconds or weeks later.


Seeing some may be okay, because it often means a close happens within predictable time, e.g. because of heartbeat packets, or specced timeouts. In some cases adding some timeout in your code fixes all.

It may be a good idea to increase the maximum open sockets, but this is otherwise fine.


Seeing many sockets in CLOSE_WAIT usually points to them not being cleaned up within sensible time, and you should complain to the respective programmer.
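A loopback sketch of how a socket gets into (and out of) CLOSE_WAIT: one side half-closes, the other sees end-of-stream but can still send, and only leaves CLOSE_WAIT once it closes too. The function name is made up; forgetting the close() on the receiving side is the usual way to pile these up.

```python
import socket


def half_close_demo():
    """Show a one-directional close over loopback TCP."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))     # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    client.sendall(b"query")
    client.shutdown(socket.SHUT_WR)     # sends FIN: "I have no more data to send"

    request = server_side.recv(1024)
    eof = server_side.recv(1024)        # b"" is how the peer's FIN shows up

    # server_side is now in CLOSE_WAIT, and the other direction still works.
    server_side.sendall(b"reply")
    reply = client.recv(1024)

    server_side.close()                 # our own close: this ends CLOSE_WAIT
    client.close()
    listener.close()
    return request, eof, reply
```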


TIME_WAIT

Our side initiated the connection close.

After that finishes, the socket intentionally lingers in the TIME_WAIT state before being removed, officially for twice the Maximum Segment Lifetime (MSL), where the MSL is usually something like 120, 90, 60, or 30 seconds.


The purposes seem to be:

  • avoid duplicate segments being accepted by an unrelated connection
which is unlikely, as for that to happen the new connection would need to be the same socket in the 5-tuple sense: (proto, local_addr, local_port, remote_addr, remote_port).
and in practice we could also use the timestamp, which makes it much less likely
  • (also to discard segments that come in after we close a connection, but that's implied by there not being a connection (verify))
  • avoid the other end sticking in a half-finished teardown for a while (verify)


Accepting a packet to the wrong connection takes a pretty pathological situation (same proto, local_addr, local_port, remote_addr, remote_port, and sequence), but is technically possible, so the network stack shouldn't ignore the possibility.


Good network stacks can easily deal with thousands of such sockets in soon-to-disappear states with barely measurable performance differences(verify), and a socket takes on the order of 1.5KB in memory, so these TIME_WAIT sockets usually have very little impact on resources or performance.


If you expect to see so many short term connections that you can run into some limits, you can:

  • allocate more ports for outgoing connections, via /proc/sys/net/ipv4/ip_local_port_range
...some systems seem to allocate ~4k by default, others ~30k. If the latter is not enough, think about what you're doing.
  • tell the OS to remove these sockets faster
...but be aware that this is a game of weighing probability with likely client behaviour.
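On linux, tcp_fin_timeout is often suggested for this. Note that this is not the MSL itself, and that the kernel's own documentation describes it as the time an orphaned connection stays in FIN_WAIT_2, rather than the length of TIME_WAIT (which is a compile-time constant of 60 seconds in mainline kernels). It can be set with either of: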
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
sysctl -w net.ipv4.tcp_fin_timeout=30
  • tell the OS TIME_WAIT sockets can be reused immediately under certain near-safe conditions
e.g. linux can be told to do so only when the new timestamp is bigger than the most recent timestamp - as that avoids the duplicate-segment case
See net.ipv4.tcp_tw_reuse
however, some load balancers and firewalls are known to expect more compliant behaviour, and may reject such reuse (See also RFC 1122)
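
To get a feel for when the port range becomes the limit, a back-of-envelope sketch. The function names are made up for illustration, and the figures in the usage note (a range of 32768 to 60999, 60 seconds of TIME_WAIT) are example values, not something to rely on for your system.

```python
def ephemeral_port_count(port_range_text: str) -> int:
    """Number of local ports in a range as found in
    /proc/sys/net/ipv4/ip_local_port_range ("low high", whitespace-separated).
    Hypothetical helper for illustration."""
    low, high = (int(part) for part in port_range_text.split())
    return high - low + 1  # both ends are usable, hence the +1


def sustainable_connects_per_second(port_count: int,
                                    time_wait_seconds: float) -> float:
    """Rough upper bound on new outgoing connections per second to a single
    remote address and port, if every closed connection holds its local
    port in TIME_WAIT for time_wait_seconds."""
    return port_count / time_wait_seconds
```

For example, "32768 60999" gives 28232 ports, and with 60 seconds of TIME_WAIT that is roughly 470 new connections per second to one remote address/port before ports run out. Connections to other remote addresses or ports do not share that limit, because the 5-tuple differs.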


Effect on restarting services

If you have a daemon listening on a specific port, and connections on that port were left in TIME_WAIT, this can amount to having to wait before you can restart a server on the same port.

Note that if a listening socket has no connections, the port can be safely closed and re-used immediately by the same sort of server. Take a look at SO_REUSEADDR: basically, if the local_address/port combination you try to bind to is in TIME_WAIT, you can reuse it; if it is in another state (probably 'in use'), it will still fail.
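
A minimal sketch of setting SO_REUSEADDR on a listening socket, which has to happen before bind(). The function name make_listener and the backlog default are made up for illustration.

```python
import socket


def make_listener(host: str, port: int, backlog: int = 128) -> socket.socket:
    """Create a TCP listening socket with SO_REUSEADDR set.

    Hypothetical helper for illustration.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Must be set before bind(); setting it afterwards has no effect
        # on whether this bind() succeeds.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))
        sock.listen(backlog)
    except OSError:
        sock.close()  # don't leak the descriptor if bind/listen fails
        raise
    return sock
```

If another process is still actively listening on that address/port, bind() will still fail with "address already in use".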

See also http://hea-www.harvard.edu/~fine/Tech/addrinuse.html


Delayed ACK

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


TCP_NODELAY and Nagle's Algorithm

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)



ICMP

Meanings of common messages

ICMP (Internet Control Message Protocol) is essentially a meta-protocol for IP that signals some common problems, including physical, routing and configuration problems. The meaning of some common messages:

  • Net Unreachable (from a router) - Signals that there is no applicable route, meaning we don't know where to send a packet to get it to its target network.
    • If the host is up, but cannot be reached from a host outside the net it is on (or only from some), this usually means a routing table somewhere is faulty.
    • If you cannot get onto the internet on your home network, this often means you do not have a default route/gateway, or that gateway is not properly set up. (...which is basically the same case, but in the other direction)
  • Host Unreachable (from a router) - the route to the network the host is in is known, but the host within it cannot be seen by whatever node is returning this. The host may simply have recently gone offline, or may be purposefully unresponsive(verify). One reason is ARP lookup failure(verify). This itself can be caused by bad subnet/routing config, in which case the router thinks it is routing to the correct subnet, but the host does not respond within it(verify).
  • Protocol Unreachable (from the target host(verify))
    • Assuming you're using IP, it often means a very minimal device that has omitted support for either UDP or TCP, and is signaling this.
  • Port Unreachable (from the target host(verify)) - Host is reachable and the protocol responds, but the requested port is not responding, probably either because nothing is listening/accepting on it, or the firewall is blocking this port.

There are more; see e.g. [3]
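
The messages above are all codes of ICMP type 3 (Destination Unreachable). As a small sketch, assuming you already have the raw ICMP message bytes (e.g. from a raw socket or a capture), the first two bytes are the type and code. The function name describe_icmp is made up for illustration.

```python
import struct

DEST_UNREACHABLE = 3  # ICMP type for the "... Unreachable" family

# Codes within type 3, as defined in RFC 792.
UNREACHABLE_CODES = {
    0: "Net Unreachable",
    1: "Host Unreachable",
    2: "Protocol Unreachable",
    3: "Port Unreachable",
    4: "Fragmentation Needed and Don't Fragment was Set",
}


def describe_icmp(message: bytes) -> str:
    """Describe an ICMP message from its first four bytes.

    Hypothetical helper for illustration; only decodes the
    Destination Unreachable codes listed above.
    """
    if len(message) < 4:
        raise ValueError("ICMP header needs at least 4 bytes")
    # Network byte order: type (1 byte), code (1 byte), checksum (2 bytes).
    icmp_type, code, _checksum = struct.unpack("!BBH", message[:4])
    if icmp_type != DEST_UNREACHABLE:
        return f"ICMP type {icmp_type}, code {code}"
    return UNREACHABLE_CODES.get(code, f"Destination Unreachable, code {code}")
```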


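The ICMP Cannot Fragment message (officially 'Fragmentation Needed and Don't Fragment was Set') happens when packets were sent with the 'do not fragment' flag set, but a router would have to fragment them to forward them along its route. Nowadays this is commonly the result of path MTU discovery, which sets that flag on purpose.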

Fragmenting can be a problem when you tunnel things, such as with IPsec, or when your internet connection is PPPoA or PPPoE (PPP over ATM or Ethernet, respectively), because the extra encapsulation leaves less room for payload in each packet.
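
A worked example of that arithmetic, as a sketch. It assumes a 1500-byte Ethernet MTU, the 8 bytes PPPoE adds, and IPv4 and TCP headers without options (20 bytes each). The function names are made up for illustration.

```python
IPV4_HEADER = 20      # bytes, without IP options
TCP_HEADER = 20       # bytes, without TCP options
PPPOE_OVERHEAD = 8    # bytes: 6 for the PPPoE header, 2 for the PPP protocol field


def effective_mtu(link_mtu: int, *overheads: int) -> int:
    """Largest IP packet that still fits after encapsulation overhead."""
    return link_mtu - sum(overheads)


def tcp_mss(mtu: int) -> int:
    """Largest TCP payload per packet for IPv4 without options."""
    return mtu - IPV4_HEADER - TCP_HEADER
```

With these assumptions, effective_mtu(1500, PPPOE_OVERHEAD) is 1492 and tcp_mss(1492) is 1452. A full 1500-byte packet sent with 'do not fragment' set would then trigger the ICMP message described above at the PPPoE hop.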