Tuesday, August 23, 2011

The Problem with PMTUD

Before we get down and dirty into the problems with PMTUD, let's quickly go over what PMTUD is.

PMTUD stands for Path Maximum Transmission Unit Discovery and it is a protocol/algorithm defined in RFC 1191 that determines the best packet size for IPv4 datagrams flowing between any two given hosts.[1] In this way, it attempts to optimize traffic through the Internet by using the largest possible datagram that doesn't require intermediary routers to fragment the traffic (since fragmentation and, more importantly, reassembly are expensive operations for routers to perform).

It works like this: if I'm a host (say, an FTP server) that wants to send the largest packet I possibly can to another host (say, an FTP client), I will start with what I know, which is the MTU of my underlying Ethernet interface (normally 1500 bytes). I will then send this packet out with the Don't Fragment (DF) bit set in the IPv4 header. If a router in the path would need to fragment this packet in order to send it along to the next hop (because the MTU of its outbound interface is, say, 1450 bytes), it will drop the packet and send back an ICMP reply indicating that fragmentation was needed but the DF bit was set.[2] In this ICMP reply, it also specifies what the MTU should be (in this case, 1450 bytes). The originating host will then resend the data in a packet of this size. This can happen multiple times if, for example, another router further down the path has an even smaller MTU on its outbound interface. Eventually the packet gets to the destination host at the optimal size for the path, and all subsequent 'large' packets will also be this size.
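The exchange described above can be sketched as a small simulation. This is only an illustrative model, not a real network stack: the function name discover_path_mtu and the list of per-hop MTUs are assumptions made for the example, and each "ICMP reply" is simply the MTU of the first link that is too small.

```python
def discover_path_mtu(local_mtu, hop_mtus):
    """Simulate RFC 1191-style discovery over a list of outbound-link MTUs.

    Returns (path_mtu, attempts), where attempts lists each packet size tried.
    Hypothetical helper for illustration; every packet is sent with DF set.
    """
    size = local_mtu
    attempts = []
    while True:
        attempts.append(size)
        # Find the first hop whose outbound link cannot carry the packet.
        blocking = next((mtu for mtu in hop_mtus if mtu < size), None)
        if blocking is None:
            return size, attempts  # the packet reached the destination
        # That router drops the packet and reports its next-hop MTU
        # (ICMP type 3, code 4); the host retries at the reported size.
        size = blocking


path_mtu, attempts = discover_path_mtu(1500, [1500, 1450, 1500, 1400])
print(path_mtu)   # 1400
print(attempts)   # [1500, 1450, 1400]
```

Each retry corresponds to one dropped packet and one ICMP message, which is why the whole scheme depends on those messages actually reaching the sender.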

Nowadays, PMTUD often comes into play when we're talking about tunneling traffic. Since tunnels require that some extra headers be inserted, they have an effective MTU that is the original Ethernet MTU less the size of the additional headers.[3]
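As a quick worked example of that subtraction, here is the GRE case from footnote 3. The names tunnel_mtu, ETHERNET_MTU and GRE_OVERHEAD are made up for this sketch, and the 24-byte figure assumes a plain GRE header with no optional fields over IPv4.

```python
ETHERNET_MTU = 1500
GRE_OVERHEAD = 24  # outer IPv4 header (20 bytes) + basic GRE header (4 bytes)


def tunnel_mtu(link_mtu, overhead):
    """Effective MTU inside a tunnel: the link MTU minus the added headers."""
    if overhead >= link_mtu:
        raise ValueError("overhead must be smaller than the link MTU")
    return link_mtu - overhead


print(tunnel_mtu(ETHERNET_MTU, GRE_OVERHEAD))  # 1476
```

Other encapsulations add different overheads, so the same function would be called with whatever header size applies to the tunnel in question.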

There are a few common causes for not being able to get the ICMP replies necessary for PMTUD to work. Overzealous network administrators will configure their firewalls to drop all ICMP, since certain ICMP messages are considered security threats. Routers are sometimes (mis)configured with PMTUD disabled and so will simply drop the packet without sending the required ICMP message.[4]

As we can see, the ICMP messages that PMTUD requires to work can be kept from the originating host due to various factors beyond the host's control, and often beyond the control of even the network administrators responsible for the network the host is directly attached to.

So, what's the problem with PMTUD? The real problem with PMTUD is that it fails closed. This is anathema to the Robustness Principle, and it is a problem well understood by the IETF, as evidenced by RFC 2923.

PMTUD, as documented in RFC 1191, fails when the appropriate ICMP messages are not received by the originating host. The upper-layer protocol continues to try to send large packets and, without the ICMP messages, never discovers that it needs to reduce the size of those packets. Its packets are disappearing into a PMTUD black hole.

The most common symptom of hitting a PMTUD black hole is that some, but not all traffic seems to make it between hosts. A TCP connection may be established for a file transfer and then stall out. If the ICMP messages are only being dropped in one direction, the same file transfer in the other direction might work just fine.

RFC 1191 doesn't speak to what the host implementation should do if it fails to receive the ICMP messages the protocol depends on. It leaves that as an exercise for the upper-layer protocol implementor. Is this the fault of the RFC 1191 contributors? Probably. They should have dictated that, by default, when exceeding an implementation-specific timeout, upper-layer protocols MUST cease to set the DF bit in packets that they send.

RFC 2923 basically says as much by suggesting that TCP do this as a fix for the issue.

TCP should notice that the connection is timing out. After several timeouts, TCP should attempt to send smaller packets, perhaps turning off the DF flag for each packet.
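A rough sketch of the fallback that RFC 2923 suggests might look like the following. It is a toy model under stated assumptions, not how any particular stack implements it: the function name, the max_timeouts threshold of 3, and the fallback_size of 576 bytes are all hypothetical choices for illustration, and clearing DF is modeled as the packet always getting through because routers may then fragment it.

```python
def send_with_blackhole_detection(size, path_mtu, icmp_delivered,
                                  max_timeouts=3, fallback_size=576):
    """Return a list of (size, df, outcome) tuples, one per send attempt.

    icmp_delivered says whether "fragmentation needed" messages reach the
    sender. All parameters are illustrative assumptions.
    """
    log = []
    df = True
    timeouts = 0
    while True:
        if size <= path_mtu or not df:
            # Either the packet fits, or routers are allowed to fragment it.
            log.append((size, df, "delivered"))
            return log
        if icmp_delivered:
            # Normal PMTUD: the router's ICMP reply tells us the right size.
            log.append((size, df, "icmp"))
            size = path_mtu
            continue
        # Black hole: the packet vanished and no ICMP came back.
        log.append((size, df, "timeout"))
        timeouts += 1
        if timeouts >= max_timeouts:
            # Give up on discovery: shrink the packet and clear DF.
            size = min(size, fallback_size)
            df = False
            timeouts = 0


for attempt in send_with_blackhole_detection(1500, 1400, icmp_delivered=False):
    print(attempt)
# (1500, True, 'timeout')
# (1500, True, 'timeout')
# (1500, True, 'timeout')
# (576, False, 'delivered')
```

With icmp_delivered=True the same call needs only two attempts, which is the difference between working PMTUD and a connection that stalls until the timeouts accumulate.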

IPv6 takes the problem completely out of the hands of the routers by failing hard open. Again, from RFC 2923:

Note that, under IPv6, there is no DF bit -- it is implicitly on at all times. Fragmentation is not allowed in routers, only at the originating host. Fortunately, the minimum supported MTU for IPv6 is 1280 octets, which is significantly larger than the 68 octet minimum in IPv4. This should make it more reasonable for IPv6 TCP implementations to fall back to 1280 octet packets, when IPv4 implementations will probably have to turn off DF to respond to black hole detection.

RFC 2923 opines that:

This creates a market disincentive for deploying TCP implementation with PMTUD enabled.

However, host networking stacks did implement PMTUD, but allowed it to fail closed. Since RFC 2923, host networking stacks have implemented variations of its suggestions. The Linux networking stack, as of 2.6.17 mainline, implements the TCP MTU probing described in RFC 4821, although it is not enabled by default.[5] The Windows networking stack has implemented PMTUD black hole detection, and recently made it the default in Windows Server 2008, Vista and Windows 7.[6]

So, this leaves us still living with PMTUD black holes. I often see this when we connect our cloud networks, via an IPSec VPN, to our customers' corporate networks. Many of these customer networks are large and managed by various different groups; often there is a router or firewall that blocks all ICMP or simply doesn't participate in PMTUD. The hosts tend to be heterogeneous, consisting of various versions of Windows and different *NIXes, and so some of them will detect PMTUD black holes and others won't. This makes debugging difficult.

As a workaround, routers and tunnel endpoints have implemented MSS (Maximum Segment Size) clamping, which utilizes a part of TCP whereby the largest acceptable payload size (the MSS) is advertised to the remote TCP connection endpoint as an option during the TCP connection handshake. The MSS is orthogonally bidirectional, meaning the MSS for one half of the connection can be different from the MSS of the other.[7]

Normally the MSS is set by the originating host based on the MTU of its outbound interface. However, an intermediary router/firewall can modify the MSS based on the knowledge it has about the MTU of its interfaces. If the outbound interface toward the destination has an MTU too small for the MSS it sees in the TCP SYN packet, it will rewrite the MSS option to a smaller value.[8] In this way, intermediary routers attempt to fix PMTUD issues for TCP connections.[9]
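The arithmetic behind clamping is simple enough to sketch. The helper names below are made up for the example, and the 40 bytes of overhead assume an IPv4 header and a TCP header with no options; a real router would rewrite the option inside the SYN segment rather than call a function like this.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one unfragmented packet of this MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def clamp_mss(syn_mss, egress_mtu):
    """MSS a router would write into a SYN leaving via egress_mtu.

    The advertised value is only ever lowered, never raised.
    """
    return min(syn_mss, mss_for_mtu(egress_mtu))


print(mss_for_mtu(1500))      # 1460, what a host on Ethernet advertises
print(clamp_mss(1460, 1476))  # 1436, clamped to fit a GRE tunnel
print(clamp_mss(1200, 1476))  # 1200, already small enough, left alone
```

Because the clamp is applied to the SYN in each direction separately, the two halves of a connection can end up with different MSS values.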

If you suspect that you might be the victim of a PMTUD black hole, the easiest way to validate this is by forcing the MTU of the failing host down to some relatively low value like 1300.[10] Doing so will cause TCP to negotiate an MSS appropriate to that size. Retry your failing file transfer (or other network operation), and if it succeeds, you have most likely hit a PMTUD black hole. At this point, it's best to get your network administrator involved to help track down exactly where the black hole is, whether it can be fixed or not, and whether or not there is a router between the two networks that supports MSS clamping to work around the issue.




[1] IPv6 has its own RFC for PMTUD, RFC 1981. This article only deals with IPv4 PMTUD.
[2] ICMP type 3, code 4: Destination Unreachable, Fragmentation required and DF flag set
[3] For example, GRE requires 24 bytes for its headers. Therefore, the effective MTU of a GRE tunnel (over Ethernet) will be 1476.
[4] IPSec VPN tunnels pose an additional, more insidious, problem with PMTUD. If the variation of IPSec is policy based, the policy may only allow traffic from certain subnets through the tunnel. If an intermediary router with a smaller MTU exists on the other side of the VPN from a host and that router's interface address is not in the allowable subnets dictated by the IPSec policy, any ICMP fragmentation-needed messages sent by that router back to the host on the other end of the IPSec VPN will be dropped.
[5] Can be enabled by setting /proc/sys/net/ipv4/tcp_mtu_probing to '1'
[6] Can be enabled by setting the registry value HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\EnablePMTUBHDetect to '1'.
[7] This is distinguished from the MTU which is the same in both directions. This distinction becomes very important when debugging PMTUD blackholes.
[8] The router may actually have better knowledge than just the destination interface MTU. It may have engaged in PMTUD on its own and may know the PMTU to the destination host and can modify the MSS to be based on the PMTU instead. This is, accordingly, referred to as "Clamping MSS to PMTU". This only works, of course, if PMTUD isn't broken for that segment as well.
[9] This breaks when, somewhere on the path after the destination interface, the MTU shrinks, since the negotiated MSS will no longer fit through the smaller pipe. In the case of IPSec VPNs across the Internet, however, the path MTU between the two endpoints is almost always going to be at least 1500, so carving off the IPSec overhead and adjusting the MSS appropriately should allow traffic to pass.
[10] For example, on a Linux machine with interface 'eth0' this can be done with the command ifconfig eth0 mtu 1300. If you have a Windows 7 machine with an interface called 'Local Area Connection', you would use netsh interface ipv4 set subinterface "Local Area Connection" mtu=1300 store=persistent and then reboot.