Networking notes - Layer 3 (IP mostly)


For other network related things, see:


Also:

Sockety things

There is a POSIX socket API, evolved from an earlier Berkeley API.

Technically it was a standardization of that earlier API, which grew up around early TCP, generalized to allow a wider set of protocols.

Named sockets

'Named sockets' is a somewhat ambiguous term, in that some people use it to refer to abstract sockets, some to named pipes.


Local sockets / unix domain sockets

POSIX calls it PF_LOCAL

BSD calls it PF_UNIX, which is #define'd to the same thing.


Both ends of such a socket connect, in code, to what looks to your code like a file on the filesystem.

On the code side it presents a fairly normal socket interface - you can e.g. send() and recv()

Security is mostly the filesystem permissions on that socket file, checked when you connect() to it.


Since the API behind this is essentially its own tiny subsystem, it fundamentally cannot be network-routed. Since it does not involve IP, there are no ports.


Note this is distinct from localhost IP. That is a special case within what is still the network stack (it is also not routable, but only because the network stack refuses to route it). Local sockets are a simpler system, and can have a little lower overhead than localhost networking.

This makes them useful for local IPC (both because it's fast, and because you can't accidentally open it up to the network via config that hands parameters to listen()).


Note that it is also distinct from named pipes.
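
To make this concrete, here is a minimal sketch in Python (whose socket module mirrors the POSIX calls). The path, payload, and echo behaviour are arbitrary examples, not anything the text above prescribes:

import os, socket, threading

PATH = "/tmp/demo.sock"                  # the "file on the filesystem" both ends use
if os.path.exists(PATH):
    os.unlink(PATH)

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(PATH)                           # creates the socket file; filesystem permissions apply to it
srv.listen(1)

def echo_once():
    conn, _ = srv.accept()
    conn.sendall(conn.recv(1024))        # normal send()/recv() style interface
    conn.close()

threading.Thread(target=echo_once).start()

cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(PATH)                        # note: no IP address or port involved
cli.sendall(b"hello")
print(cli.recv(1024))                    # b'hello'
cli.close()
srv.close()
os.unlink(PATH)                          # regular local sockets leave a file behind; clean it up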


See also:

Abstract sockets

Abstract sockets are local sockets (AF_LOCAL/AF_UNIX, see previous section) but in the special-cased abstract namespace.


The abstract namespace basically lets you bind sockets to names managed by the kernel, rather than to filenames on a filesystem (as with regular local sockets).


The way this is implemented in linux: you create a local socket in the usual AF_LOCAL way, but hand in a path string that starts with a 0x00 (null) byte.

So basically, you bind() or connect() to something like "\x00myname"
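
A minimal sketch of that in Python (Linux only; the name is an arbitrary example). The only real difference from a regular local socket is the leading null byte, and that no file appears on the filesystem:

import socket, threading

NAME = b"\0my.demo.name"                 # leading NUL byte selects the abstract namespace

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(NAME)                           # no socket file is created anywhere
srv.listen(1)

def echo_once():
    conn, _ = srv.accept()
    conn.sendall(conn.recv(1024))
    conn.close()

threading.Thread(target=echo_once).start()

cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(NAME)                        # any process that knows the name can do this
cli.sendall(b"ping")
print(cli.recv(1024))                    # b'ping'
cli.close()
srv.close()                              # the name disappears once the last reference closes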


Upsides:

  • you can avoid existing socket files on filesystem
  • you don't have to clean up socket files (there are none; the kernel removes the name when the last reference closes)

Caveats:

  • no filesystem permission limitations at all now (any process that knows the name can connect)
  • linux-only

This makes them useful for IPC on linux, but only when you trust everything and everyone on the host.


Support: CHECKME

at the very least 2.6 kernels, and enabled.
recent kernels can be assumed to have it(verify), namespaces have been important for a while.


See also:

Named pipes

A special kind of filesystem entry.


Similar to domain sockets in that they are a filesystem thing for local communication.

Different in that they do not present a socket interface at all, just a FIFO-style bytestream, to whatever other process also opens that file.


Created with mkfifo (syscall and utility share the same name)
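
A minimal sketch in Python (the path is an arbitrary example). Note that both sides use plain open/read/write, not send()/recv():

import os, threading

PATH = "/tmp/demo.fifo"
if os.path.exists(PATH):
    os.unlink(PATH)
os.mkfifo(PATH)                          # same operation the mkfifo utility performs

def writer():
    with open(PATH, "wb") as w:          # open() blocks until a reader also opens the FIFO
        w.write(b"hello over a FIFO\n")

threading.Thread(target=writer).start()

with open(PATH, "rb") as r:              # a plain FIFO-style bytestream
    print(r.read())                      # b'hello over a FIFO\n'

os.unlink(PATH)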

Why connections can be on the same port, and binds cannot

IPv4

IPv4 addresses and nets

IPv4 network notation

The following are the same:

  • 192.0.2.0/255.255.255.0
  • 192.0.2.0/24 (CIDR notation)
  • 192.0.2/24
  • 192.0.2.


192.0.2/24 is an example of the CIDR-without-trailing-zeroes. Similarly, 10/8 means 10.0.0.0/8.

192.168. or 127. is another lazy variant, but less powerful as it's basically a start-of-string match, and implicitly octet-boundary-only.

Applications vary in which they support. Usually they stick to the first one or two.


These examples all split masks/subnets at whole octets, but they don't have to: when you refer to a subnet, you say how many (leftmost) bits matter to its definition.

Depending on what you are configuring, you can specify both a (sub)net address and a netmask, such as 192.168.110.168/29. The application you hand this to may require the IP to have the rightmost (host) bits cleared, as in that example (/29 leaves a 3-bit host part, and 168 is 10101000). The rest of the bits are used by things on that net, so don't matter to the net specification.
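
If you want to check such notation programmatically, Python's ipaddress module does these derivations. A small sketch using the example above (strict=False just allows handing in any IP on that net):

import ipaddress

net = ipaddress.ip_network("192.168.110.168/29", strict=False)

print(net.network_address)    # 192.168.110.168  (the host bits happen to already be clear)
print(net.netmask)            # 255.255.255.248
print(net.broadcast_address)  # 192.168.110.175
print(net.num_addresses)      # 8
print(list(net.hosts()))      # the six addresses in between, usable for hosts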

IPv4 special addresses

Main relevant RFCs:

  • RFC3330, 'Special-Use IPv4 Addresses' (refers to most other RFCs mentioned here)
  • RFC1918, 'Address Allocation for Private Internets'
  • RFC1700, 'Assigned Numbers'


Well known special cases (see mainly RFC 1700 for these):

  • 255.255.255.255/32 is for 'limited broadcast', in that it should not be routed between nets. (technically in 240/4, see below)
  • 0.0.0.0/32 is regularly used with a meaning like 'any local IP' (a sort of 'this host'), particularly when configuring what IP(s)/adapters a service should listen on.
  • 127/8: Anything to this network should go to the local host, without going to a network card
It is common convention to configure 127.0.0.1 (typically with a /8 mask) on a virtual loopback device.
This net is typically used for local-only uses such as IPC and some services.
Such local-only uses can also be served via domain sockets, which to code look much like network sockets but are cheaper than putting things on the full network stack, since the kernel knows beforehand it's local-only.


Private networks

The private network ranges are defined by RFC 1918:

  • 10.0.0.0/8, that is, 10.0.0.0 through 10.255.255.255
  • 172.16.0.0/12, that is, 172.16.0.0 through 172.31.255.255
  • 192.168.0.0/16, that is, 192.168.0.0 through 192.168.255.255


These are private in that these will not be routed onto the public internet.

This means you can use these networks at will, with no fear of conflicting with anyone else who uses such networks elsewhere.


To get hosts on IPv4 private nets to talk to the internet, you will need a gateway (which at home will be your modem).

To the rest of the world, you look like you come from your gateway's (e.g. modem's) public IP.


They were defined to delay the exhaustion of public IPv4 addresses: while these ~17 million addresses cannot exist on the public internet, hosts on a private net don't count towards the use of public IP addresses.

Which is why private nets are very commonly used for the home LAN behind your modem, and other cases where you care that hosts can reach the internet, but the internet doesn't need to initiate connections to you directly.


Notes:

  • Some of the above statements have exceptions - see e.g. the subnets used in VPN, what VLANs do to subnets, and more.
  • This is mostly an IPv4 implementation concept, in IPv6 it's a different story.

Other special cases

Other special cases:

  • 240/4 (240.0.0.0-255.255.255.255): "Reserved for future use"
  • 192.0.2/24: intended for examples in documentation and code, and should not be used on the internet (nor should it be confused with 192.168/16)
  • 198.18/15: to be used for bandwidth tests (see RFC2544)


There are various ranges reserved by IANA but not used (and sometimes unlikely to ever be used for the purpose they seem to have been reserved for).

This includes 5/8 (which hamachi used) and various others, apparently including 128/8, 191.255/16, 192.0.0/24, 223.255.255/24 and the already-mentioned 240/4.


Also related:


Multicast
Link-only network

The 169.254/16 link-only ('link-local') network is an unroutable network described by RFC3927. (Microsoft has its own name for it, Automatic Private IP Addressing (APIPA))

These addresses may be assigned when a host configured to use DHCP doesn't get a response. (It may or may not keep trying to reach DHCP while such an IP is configured.)


The idea is roughly that if your DHCP server is broken, hosts on that subnet may still see each other and communicate, in that every computer with this behaviour will end up on the same subnet, and probably not with the same IP (it's a 64K-large range and we pick a number in it randomly).


In practice it isn't too useful because people want internets, but I've seen it used to copy a few files.


See also:

Bogons and martians

Bogons refer to packets that claim to come from networks that are currently not allocated for anyone to use (bogon space).


Martians are packets that claim to come from unroutable nets.


Both are good indicators that the traffic is spoofed, or sometimes that a router is very dumb or malfunctioning.


See also:

Subnetting(, routing)

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Routed networking is largely concerned not with nodes or hosts, but with the network ranges they are on: if routers can put a packet on the net that the target host is on, switches and hosts themselves take care of the last step.


An IP (sub)network is often written/stored as an address combined with a mask, which together imply the range of addresses it is in.


From such a (IP_on_a_subnet, subnet_mask) pair you can derive the size of the net, the network address, the broadcast address, and the addresses that may be hosts.

(When referring to a network, you usually use the network address as the IP to mention in that IP/mask pair, e.g. 192.168.1.0/24, but in a lot of practical cases, any IP in that range (say, 192.168.1.47/24) will work equivalently, exactly because network code can derive the network address with just one bitmask operation)


There are different equivalent ways to refer to a network:

192.168.124.168/255.255.255.248  (a little more reflective of the network stack logic)
192.168.124.168/29               (equivalent, shorter for us to write)

The /number refers to how many leftmost bits are set (Technically, consecutive set bits are not strictly necessary in (sub)networks, but there is more confusion than added value to non-consecutive-bit network masking).

You can also mention them by start and end address, although this is more for reporting than anything else. It can be convenient for human checks when subnets don't start or end on byte edges (anything except /0, /8, /16, /24, and /32). For example, '145.99.239.168 through 145.99.239.175' makes more immediate sense to me than 145.99.239.168/29.


When splitting a given net into subnets, you could think of it as using the bits for three different things: the existing network, the new subnetwork, and the host bits.

Note: this is only an aid to modelling, as there is no hard difference between the network and subnetwork bits, but I'd say it helps.

For example, if you take the private 192.168.0.0/16 network (192.168.0.0 through 192.168.255.255) and want to use it as 256 separate /24 networks (192.168.0.0/24, 192.168.1.0/24, 192.168.2.0/24, etc.), on a bit level you can think of it as:

nnnnnnnn nnnnnnnn ssssssss hhhhhhhh
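
A quick way to sanity-check that split, again using Python's ipaddress module:

import ipaddress

net = ipaddress.ip_network("192.168.0.0/16")
subnets = list(net.subnets(new_prefix=24))   # borrow 8 bits for the subnet part

print(len(subnets))      # 256
print(subnets[0])        # 192.168.0.0/24
print(subnets[1])        # 192.168.1.0/24
print(subnets[255])      # 192.168.255.0/24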


Note that if one host has the right IP but the wrong netmask (such as 192.168.1.1/16 when it should be 192.168.1.1/24), this may go weird. Directed traffic may still work, but that host's broadcasts will probably not.



The earlier note about "routing is towards networks, not nodes" is true mostly because of what routing tables contain. What they contain also implies what sort of address the source/target is. For example, a routing table may look something like:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
145.99.239.168  0.0.0.0         255.255.255.248 U     0      0        0 eth0
192.168.0.0     0.0.0.0         255.255.255.0   U     0      0        0 eth1
5.0.0.0         0.0.0.0         255.0.0.0       U     0      0        0 ham0
127.0.0.0       0.0.0.0         255.0.0.0       U     0      0        0 lo
0.0.0.0         192.168.2.1     0.0.0.0         UG    0      0        0 eth0

When deciding to what interface a packet should go, the packet's destination IP is tested against each entry (look at destination/genmask). The most specific rule that applies (longest matching prefix) wins; since tables like the above are listed most-specific first, that is effectively the first matching rule.


The last entry is often the default gateway, in an "anything that wasn't matched by the above is routed here" sort of rule.

This is handy in home networks, because that will include all of the public internet, and you know that your gateway will be doing the right thing (NAT) on your behalf.
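
To illustrate that matching logic (this is a toy model, not how the kernel implements it), here is the table above as data, with a longest-prefix match in Python:

import ipaddress

# (destination, genmask, interface) rows, as in the table above
routes = [
    ("145.99.239.168", "255.255.255.248", "eth0"),
    ("192.168.0.0",    "255.255.255.0",   "eth1"),
    ("5.0.0.0",        "255.0.0.0",       "ham0"),
    ("127.0.0.0",      "255.0.0.0",       "lo"),
    ("0.0.0.0",        "0.0.0.0",         "eth0 via 192.168.2.1"),  # the default gateway
]

def pick_route(dst):
    dst = ipaddress.ip_address(dst)
    matches = [(dest, mask, iface) for dest, mask, iface in routes
               if dst in ipaddress.ip_network(f"{dest}/{mask}")]
    # the most specific (longest prefix) match wins
    dest, mask, iface = max(matches, key=lambda r: ipaddress.ip_network(f"{r[0]}/{r[1]}").prefixlen)
    return iface

print(pick_route("192.168.0.7"))   # eth1
print(pick_route("8.8.8.8"))       # eth0 via 192.168.2.1  (only the default route matched)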


See also:

VLANs

NAT

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


NAT itself refers to translating addresses.

Which can be a somewhat confusing subject.

  • source NAT is stateful, and does half of full nat, rewriting source addresses/ports
a.k.a. one-to-many NAT, mainly used for outgoing connections to public services.
  • destination NAT is stateful, and does the other half, rewriting target addresses/ports.

(So yes, full stateless NAT can be imitated with two stateful NATs)

  • In the same context, Full NAT rewrites packets with the aim of making whole networks appear as other networks
stateless in the sense that there is no need to remember connections, unlike source NAT and destination NAT


Source NAT and destination NAT are in some contexts called SNAT and DNAT (e.g. varied *nix networking), but other things have been acronym'd SNAT, and multiple things called DNAT. Also, what linux calls SNAT and DNAT are both types of static NAT, not dynamic NAT. Aren't acronyms just so useful? (In fact, there are distinct things called dynamic NAT.)

Note that you also need ip forwarding on for most of this, and some distros have it off by default, in which case you need to:

echo 1 > /proc/sys/net/ipv4/ip_forward

...or, better, figure out which configuration file controls this at bootup (commonly /etc/sysctl.conf or a file under /etc/sysctl.d/, with a net.ipv4.ip_forward = 1 line).



Source NAT and masquerading

In practice, Source NAT is mostly used to share an internet connection in a home network which internally has a private (IPv4) network.


In that case there is one host that does the translation (which is very typically also that network's gateway).

It takes connections from the local net that it notes are bound to go outside, rewrites the source IP so that the response will come back to the IP that that gateway has on said outside (where outside = internet).


It remembers that connection, so that once a packet does get back, it knows which local/private address to move that packet to.

Note that there is technically a destination-NAT step implicit in every source-NAT situation: the packet's destination is rewritten back to the local net (iptables remembered this) and the packet gets back to the original node. Yet the way this is usually configured, this is implied by the source-NATting.


e.g. in linux,

iptables -t nat -A POSTROUTING -o ppp0 -j SNAT --to $PUBLIC_IP

...where PUBLIC_IP is the SNATting node's outside, public IP.


Masquerade is an older variant of using SNAT for this, which automatically picks the interface's IP as the new source IP. The following is similar to the SNAT rule above:

iptables -t nat -A POSTROUTING -o ppp0 -j MASQUERADE

Destination NAT

Using Destination NAT (linux abbreviates this DNAT) by itself means rewriting things that come in.

It is probably mostly used for port forwarding: if you e.g. have a routing DSL modem and want to connect to an internal computer's remote desktop, web server, or whatever else, use rules like:

iptables -t nat -A PREROUTING -d $externalip -p tcp -m tcp --dport 3389 -j DNAT --to-destination $internalip:3389 
iptables -t nat -A PREROUTING -d $externalip -p tcp -m tcp --dport 80   -j DNAT --to-destination $internalip:80 

...assuming said modem takes such rules somehow, or is connected to a computer that does this for it - the logic is the same either way.



STUN, TURN, ICE

https://dyte.io/blog/webrtc-102-demystifying-ice/



See also:

UPnP, IGD, NAT-PMP, etc.

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Two different protocols with the same intent: negotiating automatic port forwarding, so that manually configuring modem DNAT for port forwarding becomes unnecessary.

UPnP's IGD (Internet Gateway Device) profile is Microsoft's approach; NAT-PMP is Apple's. The latter came later(verify) and at least initially seemed somewhat restricted to Apple products, and (perhaps because of that) UPnP is better supported.(verify)


See also:

IPv4 multicast

IP-level implementation of the more general multicast idea


This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


IGMP

Internet Group Management Protocol (IGMP) is how devices and multicast-aware switches make IPv4 multicast actually go.



ICMP

Meanings of common messages

ICMP (Internet Control Message Protocol) is essentially a meta-protocol for IP that signals some common problems, including physical, routing and configuration problems.


The meaning of some common messages that users see via ICMP (there are a bunch more):

  • Net Unreachable (from a router)
basically means a router along the way has no applicable route for the address,
meaning it doesn't know where to send a packet to get it to its target network.
If you know the host is up, but you get this error, it usually means a routing table somewhere is faulty - and some (or all) hosts elsewhere can't get to it
If you get this on basically all addresses, on your home network, this often means you do not have a default route/gateway set, or that gateway is not properly set up. (...which is basically the same case, but in the other direction)
  • Host Unreachable (from a router) - the route to the network the host is in is known, but the host within it cannot be seen by whatever node is returning this.
The host may simply have recently gone offline, may be purposefully unresponsive(verify). One reason is ARP lookup failure(verify).
This itself can be caused by bad subnet/routing config, where the router thinks it is routing to the correct subnet, but the host does not respond within it(verify).
  • Protocol Unreachable (from the target host(verify))
Assuming you're using IP, it often means a very minimal device that has omitted support for either UDP or TCP, and is signaling this.
  • Port Unreachable (from the target host(verify))
Host is reachable and the protocol responds, but the requested port is not responding
probably either because nothing is listening/accepting on it, or the firewall is blocking this port.


The ICMP 'Fragmentation Needed' message (often reported as something like 'cannot fragment') happens when packets were sent with the 'do not fragment' flag set, but a router knows (according to the applicable route) it has to fragment.

Unnecessary fragmenting can easily happen e.g. when you tunnel things, such as with IPSEC, or your internet connection is PPPoA or PPPoE (PPP over ATM or Ethernet, respectively).

To minimize unnecessary fragmenting, things like Path MTU Discovery find the largest usable MTU on a path, basically by trying a number of sizes (with the 'do not fragment' flag set) and seeing when this message stops happening.

TCP

TCP windows

TCP needs its packets to be acknowledged, as part of guaranteed delivery.

There is a maximum amount of unacknowledged data (correlates to unacknowledged packets) that can be in flight at one time.

This makes it a sliding window protocol with extra bookkeeping at both ends.

And has some implications.


tl;dr:

  • if you want throughput, both windows need to be large enough. "Large enough" depends on the physical bandwidth and a typical connection's RTT, see BDP
  • the "just increase all the numbers" approach has its limits, in that this can cause things to stutter more on congestion (verify)
  • the autotuning we now have works well.
you may wish to increase the bounds it works in if you have higher-BDP links. Look at net.core.rmem_max, net.core.wmem_max, net.core.rmem_default, net.core.wmem_default, net.core.optmem_max, net.ipv4.tcp_rmem, net.ipv4.tcp_wmem


  • Some software may use its own too-small buffers, meaning larger windows won't help their speed
(includes old OpenSSH(verify), old Samba(verify))


The below focuses mostly on how the window, and its settings, could limit the throughput of individual TCP connections. Each connection has its own window, so enough separate connections will typically balance each other, and at some point combine to saturate your interface.


The send and receive windows

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

The transmit window stores packets until it gets a corresponding ACK packet from the receiving end - and not earlier, in case retransmission is necessary.

When full, the TCP connection will stall until there is space in the window again.

Too small a transmit window (for a particular end-to-end connection) means the window fills up due to higher latency, the connection stalls, and throughput ends up lower than the bandwidth there actually is.

Too large is less serious: If the transmitting side stores more packets than will be in flight, this takes more (kernel) memory than necessary, and if things stutter, they may stutter slightly harder (why?(verify)), though rarely in a serious way.


The receive window (RWIN) stores incoming packets. The receiving end's network stack (in the kernel) needs to do two things: acknowledge the packet(s), and move the data into the application layer.

When the receive window is very small, it may not move out the data fast enough, which would mean the window fills up and further incoming packets can only be dropped, which from the sender's view means it stops ACKing, resulting in congestion control and retransmission -- which looks like a lower speed, stop-and-go flow, and wastes bandwidth.


Slightly too small can limit transfer speed somewhat, because the receiving end will send ACKs which signal how much space is currently in its RWIN, so that the sending side can tune its transmission speed somewhat.

(In theory, delays in moving data out of the receive window to the app also means RWIN can fill up faster. The TCP standard doesn't say much about how soon to move data, though you can generally assume it's faster than your network is)

(you can use wireshark or such to analyse this. In its case, look for tcp.analysis.window_update)


The proper value for both relates to both typical bandwidth and typical round-trip time.

The window size will typically adjust to a healthy value, though(verify)

Real-world behaviour and speed

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Some relatively fixed things


Latency has some quantifiable lower bounds because of physics, particularly the speed of light (signals in fiber and copper in practice travel at around 0.7 times the speed of light) and thereby distance, and then there's switching overhead.

Such are the reasons that round-trip time over an ocean can't do much better than ~30ms, and satellite uplinks will have an RTT of at least 250ms (for typical orbits).


Also, broadband often seems to add 5-10ms to the RTT just getting down the street over a different medium.(verify)



Other quantification

The bandwidth-delay product (BDP) is literal:

link_bandwidth * round_trip_time

That multiplication is easily a large number when

link speeds are high (e.g. gigabit, fiber),
the delay is high (e.g. satellite's ~500ms),
and especially both (e.g. satellite)
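
For a feel of the scale, a small sketch (the link speeds and RTTs are assumed example figures):

# BDP = how much data you would need in flight to keep the pipe full
def bdp_bytes(bandwidth_bits_per_s, rtt_s):
    return bandwidth_bits_per_s * rtt_s / 8

print(bdp_bytes(100e6, 0.015) / 1024)   # 100 Mbit at 15 ms RTT   -> ~183 KiB
print(bdp_bytes(1e9,   0.030) / 1024)   # 1 Gbit at 30 ms RTT     -> ~3662 KiB (~3.6 MiB)
print(bdp_bytes(20e6,  0.500) / 1024)   # 20 Mbit at 500 ms RTT   -> ~1221 KiB (~1.2 MiB)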



For end users, the internet is a Long-Fat Network (meaning highish BDP), because while bandwidth may be high throughout, end-to-end latency quickly becomes a limiting factor.



TCP windows, RTT, and bandwidth

The TCP window is roughly the amount of unacknowledged data, considered still in flight.

If this is limited, that limits the speed. For the same sized window, higher latency means slower acknowledgment, and you can imagine that means lower transfer speed.

For high-BDP links, the size of the TCP receive window easily keeps a connection's transfer slower than it could be.

Window size is communicated.

In TCP it originally was a 16-bit field used as-is, meaning a maximum of 64KiB, which roughly means

  • 250ms latency: 2MBit/s (≈250KB/s)
  • 150ms latency: 3.5MBit/s (≈400KByte/s)
  • 50ms latency: 10MBit/s (≈1.3MByte/s) (e.g. a not so good internet connection)
  • 15ms latency: 35MBit/s (≈4.5MByte/s) (e.g. a good internet connection)
  • 5ms latency: 100MBit/s (≈13MByte/s)
  • 0.5ms latency: 1000MBit/s (≈130MByte/s) (e.g. home LAN)


Notes:

  • you can take that roughly as a "for this speed I need lower than this latency" list.
  • each TCP connection has a window.
This is one of a few reasons why any single connection may hit a limit, but many such connections can saturate some link (yours, or another)
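
That list is just the window divided by the round-trip time; a quick sketch that reproduces it:

# With a fixed 64 KiB window, at most one window's worth of data
# can be unacknowledged (in flight) per round trip.
WINDOW = 64 * 1024   # bytes, roughly the pre-scaling TCP maximum

for rtt_ms in (250, 150, 50, 15, 5, 0.5):
    bytes_per_s = WINDOW / (rtt_ms / 1000.0)
    print(f"{rtt_ms:>5} ms RTT -> {bytes_per_s * 8 / 1e6:7.1f} Mbit/s "
          f"({bytes_per_s / 1e6:.2f} MB/s)")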


The above is why you typically want to use TCP window scaling, which is a convention/trick applied to the TCP handshake that lets you say "I actually mean a (two-power) multiple(verify) of what I say as the window size."

Most TCP stacks (linux and otherwise(verify)) implement window scaling.

Some older routers don't understand window scaling very well, which can do more harm than good to your connection. These problems are now fairly rare, so you can usually safely enable this, and you won't see slowdown due to small TCP window sizes.



So is high window size always better?

Setting buffer/window sizes very high may help maximum speed, but a setting sensible for a particular link is often better.

In theory very high settings are harmless, but if there is congestion or packet loss, and you do not (or cannot) use Selective ACKs, then you can at best use Cumulative ACKs - which do not deal as well with even small amounts of packet loss.

In such a case, a well-calculated RWIN will often mean the same speed as a larger one - but with more stable/predictable RTT jitter(verify), which can matter.

As such, RWIN is best tuned to the actual BDP of the connection, and react to loss/congestion. This is why receive-window auto-tuning is a good idea.



Delayed, Cumulative, and Selective ACKs

RWIN and the MSS

Linux tweaking practice

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Congestion avoidance

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


TCP slow start

https://en.wikipedia.org/wiki/TCP_congestion_control#Slow_start

Interesting TCP states

See TCP state diagrams: (google search).


For a quick and dirty summary of yours, per protocol (except for unix sockets):

netstat -pna | egrep '^(tcp|udp)' | \
  sed -r 's/[- ]{2,}/ /g' | cut -d ' ' -f 1,6 | sed -r 's/(udp).*/\1/' | \
  sort | uniq -c | sort -rn

Opening

TCP handshake
SYN_SENT

The SYN flag (in the TCP header) is only used at the start of a new connection, as part of the three-way handshake (which in the 'actually makes a connection' case consists of a SYN request, a SYN/ACK response, and an ACK, after which the connection is established).


Our side, acting as a client, has tried to set up a connection and sent a SYN, but has not received the SYN/ACK response.

Since the remote side would often either complete the connection or reject it, this usually means a remote firewall that drops packets instead of rejecting them (...usually to make life a little harder for port scanners and DoS attacks, by having the client network stack leave a connection open for some time(verify))


SYN_RECEIVED (a.k.a. SYN_RECV, SYN_RCVD)
TCP: Possible SYN flooding on port 80. Sending cookies
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


As mentioned above, TCP's handshake is c→s:SYN, c←s:SYN+ACK, c→s:ACK.

If a client sends only the initial SYN request, but does not send that third step's ACK, then the server keeps these half-open connections around.


A networking stack will place a limit on connections in that state, and once reached, denies new connections.

This typically happens for one of two reasons:

  • Very busy servers, seeing a very high amount of legitimate connections (particularly when most connections are also very short-lived)
you can consider
increasing net.ipv4.tcp_max_syn_backlog, the global max (verify), from the default 512 or 1024 (for memory-restricted machines) to e.g. 4096
increasing net.core.somaxconn, which seems to be the per-port max backlog (verify), from the default 128 to e.g. 1024
Keep in mind specific programs can request (/be compiled with?(verify)) another value
  • A SYN flood - sometimes a buggy program, but often someone intentionally sending a mass of such initial packets
...in the hope of effectively being a Denial of Service attack
SYN cookies are a useful way to deal with that.
It's a clever way to not queue all requests, yet still deal with real clients completing the handshake. (...and while it does not violate specs, it does have some caveats, which is why you would not want this as default behaviour, but it is a very nice fallback while under attack)
note that overly enthusiastic port scans can also trigger this

Closing

Notes:

  • Closing a connection is a cooperative thing between both sides.
in other words, TCP requires both ends to agree for the connection to be completely closed
  • You get TIME_WAIT if the local side closes first, and CLOSE_WAIT if the remote end closes first.
  • A half-open (only one side closed) connection is a perfectly valid state


CLOSE_WAIT

See e.g. this diagram


CLOSE_WAIT is a specific stage in that.

Basically:

The other end has sent us a FIN, to signal it considers the connection closed - an "I have no more data to send"
Our side has acknowledged this one-directional close.
Our side has not (yet) decided to close from this side.


In most real-world cases, most protocols on top

will say how to explicitly agree to close (e.g. might have sent data signifying a close to the server) and now send some form of "okay, bye now, you can close too" message

AND/OR

will, via protocol specs, say when a close should be done by implication (see e.g. details to HTTP 1.0 and 1.1 (non)-persistent connections).


...if not, then our side still has a connection open.

At TCP level such a one-way connection is perfectly valid, sometimes even useful.

So the TCP stack will not close it automatically - there are no related TCP timeouts.


If intentional, both sides will still have the socket, and we can continue sending on it in the leftover direction.

If this is not intentional, then the other side has left, but our local side thinks the connection is merely half-open, and will realize it's dead only once we next try to send on it, at which point the other side's network stack says "erm, that socket doesn't even exist", which might be microseconds or weeks later.


In some cases you may know a specific CLOSE_WAIT connection will not communicate, such as when the protocol on top is purely query-response (like HTTP 1.x), but it's up to that protocol to be proper about connection closes. It is effectively a minor flaw in the protocol specs to not have specified that that connection close happen as early as sensible for it.


Seeing a small fraction of CLOSE_WAIT sockets may be okay, simply because even when the protocol deals with it some way or other (specced timeouts, heartbeat packets, etc), it may take some time. Even when well defined and short, having a lot of connections means you'll always have a few in this state, and that's expected.

You may consider increasing the maximum open sockets, but this is otherwise fine.


Seeing many sockets in CLOSE_WAIT usually points to them not being cleaned up within sensible time, and you should complain to the respective programmer.


TIME_WAIT

A stage in a closing TCP connection. See e.g. this diagram


Our side initiated the connection close.

After that, the socket is intentionally kept in the TIME_WAIT state before being removed, officially for twice the Maximum Segment Lifetime (MSL), where the MSL is usually something like 120, 90, 60, or 30 seconds.


The purposes seem to be:

  • avoid duplicate segments being accepted into an unrelated connection
that takes a relatively pathological situation:
same socket, in the 5-tuple sense: (proto, local_addr, local_port, remote_addr, remote_port).
where at least one of those port choices is likely to come from the host's ephemeral ports pool (so comes from an incremental or random choice in a range of thousands or tens of thousands)
in practice the same sequence number as well(verify)
making it statistically very unlikely to happen
...but is technically possible to happen accidentally, and a little likelier to do purposefully, so the network stack plays it rather safe by default.
  • (also to discard segments that come in after we close a connection, but that's implied by there not being a connection (verify))
  • avoid the other end sticking in a half-finished teardown for a while (verify)



Good network stacks can easily deal with thousands of such sockets in soon-to-disappear states with barely measurable performance differences(verify), and a socket takes on the order of 1.5KB in memory, so these TIME_WAIT sockets usually have very little impact on resources or performance.


If you expect to see so many short term connections that you can run into some limits, you can:

  • allocate more ports via /proc/sys/net/ipv4/ip_local_port_range
...some systems seem to allocate ~4k by default, others ~30k. If the latter is not enough, think about what you're doing.
  • tell the OS to remove these sockets faster
...but be aware that this is a game of weighing probability with likely client behaviour.
On linux, you can set the time a socket will linger in this state (note: this is not the MSL itself) by setting tcp_fin_timeout:
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
sysctl -w net.ipv4.tcp_fin_timeout=30
  • tell the OS TIME_WAIT sockets can be reused immediately under certain near-safe conditions
e.g. linux can be told to do so only when the new timestamp is bigger than the most recent timestamp - as that avoids the duplicate-segment case
See net.ipv4.tcp_tw_reuse
however, some load balancers and firewalls are known to expect more compliant behaviour, and may reject such reuse (See also RFC 1122)


Effect on restarting services

If you have a daemon listening on a specific port, and that was left in TIME_WAIT, this amounts to having to wait until you can restart a server on the same port.

Note that if a listening socket has no connections, the port can be safely closed and re-used immediately by the same sort of server. Take a look at SO_REUSEADDR (basically, if the local_address/port combination you try to bind to is in TIME_WAIT, you can reuse it. If it is in another state (probably 'in use'), it will still fail).
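
The usual server-side pattern looks like the following sketch (Python; the listen address and port are arbitrary examples) - set SO_REUSEADDR before bind():

import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)   # allow re-bind while the old socket is in TIME_WAIT
srv.bind(("0.0.0.0", 8080))                                 # would otherwise fail with 'Address already in use'
srv.listen(16)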

See also http://hea-www.harvard.edu/~fine/Tech/addrinuse.html


TCP_NODELAY and Nagle's Algorithm

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Without Nagle's algorithm, each TCP socket write() becomes a packet on the network (or, if larger than a single packet can be, multiple packets).


This means data is sent as soon as possible, which is great for interactive things.

For example, remote shells (like SSH) and game netcode probably want Nagle off.


The worst case behaviour is sort of bad, though. If you manage to write() single bytes, then every TCP packet will contain 1 byte of payload, wrapped in over forty bytes of networking headers, and each packet needs its own acknowledgement packet as well.

If these packets happen at the rate that you hit your keyboard at, then no one much will notice or care, but if you happen to send a lot of data with small writes, the overhead adds up.

However, if a program is sending things in unnecessarily small chunks, not having Nagle is somewhat inefficient:

  • it can lead to unnecessarily high NIC interrupt load (and implied CPU load, probably kernel time)
which is one among a few reasons you may never reach the top NIC speed
  • you can more easily congest (e.g. oldschool POTS modems could congest more easily even when you're barely sending any (user) data. ...but this is rarely relevant anymore)
  • you can slow down just your own connection due to the limited amount of packets in flight (read up on ACKing strategies, TCP windows)
which also lowers your transfer speeds
  • switches doing more routing decisions (leads to slightly higher latency), more packet overhead (sizewise), and some other overheads

Note that most of these effects are small.


With Nagle's algorithm, smaller writes are coalesced into larger packets - also meaning fewer packets (and fewer ACKs).

Nagle is basically a fix for network code that uses smaller writes than it should, and can actually increase bandwidth for moderately bulky transfers, but increases latency for interactive stuff.

It's on by default.


It does bother many realtime systems, which is why you might want to turn it off.

It can also interact badly with delayed-ACK mechanisms - it apparently causes high latency in some circumstances.(verify)
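
Turning it off is a per-socket option, TCP_NODELAY. A minimal sketch in Python (it sets up a throwaway local listener purely so the snippet runs on its own):

import socket, threading

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))              # port 0: let the OS pick a free port
srv.listen(1)
threading.Thread(target=lambda: srv.accept()).start()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)   # Nagle off: small writes go out immediately
cli.connect(srv.getsockname())
print(cli.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))   # nonzero means it is set
cli.close()
srv.close()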




More specifically

Nagle's algorithm says you may not send a runt if you currently have unacknowledged data in flight.

That definition means it must wait until either

all data in flight is ACKnowledged (this also acts as a sort of timeout), or
it has enough data that it couldn't cram more into a single packet (relates to the MSS)

This means a mass of small writes generally get coalesced into a single packet, and isolated small writes get written soon enough.


Note that:

  • a one-sided transfer, e.g. a download, will have its last packet delayed longer than the rest (until the ACKs happen)
benchmarkers beware.
  • Nagle is not necessary for any app which is aware it should send large packets
e.g. those that send things in larger chunks
  • Nagle is counterproductive for latency-sensitive things
if small writes must arrive as soon as possible, the app will probably disable it.
TCP_NODELAY disables Nagle's algorithm
  • a good example of the previous is remote shells
yes, they may send single bytes for keypresses
Yet it's the fastest way to get that key acknowledged and on the screen,
and the latest network device that cared about rate of packets at typing speed was roughly the 14.4K POTS modem.(verify)


Note that even in the cases where it helps in general, the last packet will tend to sit on the sending side longer than the rest - until the previous gets ACKed(verify). Which is soon enough (longer with delayed ACKs, see below).






See also:


UDP

https://en.wikipedia.org/wiki/Internet_Group_Management_Protocol

Connection errors

Connection refused

IPv6

IPv6 multicast

MLD

Multicast Listener Discovery (MLD) is basically the IPv6 equivalent to IGMP

https://en.wikipedia.org/wiki/Multicast_Listener_Discovery


IPv6 use

If it's existed for 20 years, are we using it yet?

IPv6 address notation

IPv6 addresses are 128-bit numbers. RFC 1884 specified that everything should recognize the following forms:

  • full form: eight chunks of 16-bit hexadecimal numbers, like FE12:4567:3C23:4984:0011:0000:0000:0006
  • abbreviated form, based on the two rules that:
    • leading zeroes in a block can be removed. 0000 can be 0, 0123 can be 123, etc.
    • :: means "fill with 0000 blocks" which can appear at any point (beginning, middle, end), but at most once (and then typically/always the leftmost block of zeroes(verify)), to avoid ambiguities
  • alternative form: a convenience form in which you specify the last two 16-bit blocks as four decimal numbers (for specifying IPv4 addresses in IPv6 notation) while leaving the rest as hex:
0:0:0:0:0:FFFF:12.127.65.3 
  which can be abbreviated as
::FFFF:12.127.65.3
  and is also just another way of writing
::FFFF:C7F:4103
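
You can check those abbreviation rules with Python's ipaddress module, which prints the compressed form:

import ipaddress

full = ipaddress.ip_address("FE12:4567:3C23:4984:0011:0000:0000:0006")
print(full)                                                  # fe12:4567:3c23:4984:11::6

mapped = ipaddress.ip_address("::FFFF:12.127.65.3")          # the IPv4-in-IPv6 convenience form
print(mapped)                                                # ::ffff:c7f:4103
print(mapped == ipaddress.ip_address("::ffff:c7f:4103"))     # True - the same address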

IPv6 special addresses and ranges

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)
IP/Network           Purpose
(abbreviated, CIDR)  
-------------------- --------------
::/8                 Unassigned, but contains special addresses:
::1/128              * loopback
::/128               * unspecified/any  (all zeroes, like 0.0.0.0 in IPv4)
::/96                * IPv4 compatible addresses (obsolete)
::ffff:0:0/96        * IPv4-mappable addresses

FE80::/9             Private ranges:
FE80::/10            * link-local unicast 
                       (similar to IPv4's 169.254.0.0/16)
FEC0::/10            * site-local unicast
                       (similar to IPv4's 10/8, 172.16/12 and 192.168/16)

FC00::/7             Unique Local Addresses, similar in purpose to FEC0::/10
                     This includes two ranges, FC00::/8 and FD00::/8, 
                     which can be used for /48 subnets 
                     that are only site-routable
                     (40 bits are to be filled in randomly, to avoid 
                      collision should such nets ever merge)
2002::/16            6to4


FF00::/8             Multicast

200::/7              Reserved for NSAP address allocation

...and a number of unassigned ranges.



See also:


IPv6 addresses may be 128-bit, but the amount of things you hand out is actually more like 48-bit

IPv6 general use addresses and ranges

IPv4-mapped addresses, 6to4, and Teredo