Although we had a workaround and a final solution, I was still not completely satisfied. I wondered which Kubernetes component was actually dropping the DNS server’s answers and sending the ICMP messages, and soon realized that I had no idea how networking in Kubernetes actually works. Luckily, I found an excellent guide on the internet. Through it I learned that Kubernetes builds on standard networking components of the Linux kernel such as physical interfaces, virtual Ethernet devices (e.g. the ENIs mentioned above) and bridges. So it seemed to me that the kernel itself had to be responsible for what we had observed.
Then I thought it would be really cool to actually see the path an IP packet, in this case the DNS server’s answer, takes through the Linux kernel, from the network interface it arrives at to the process that, well, processes it. And lo and behold, after some googling I found that a tool exists to do exactly that: pwru, which stands for “Packet, where are you?”. It uses eBPF to instrument the kernel’s network stack and traces the path a certain packet (matched by criteria like IP address, source/destination port and so on) takes through the kernel. The output is a list of the kernel functions that were called while this packet was being processed. Of course, to really make sense of the output, you should be at least somewhat familiar with the internals of the kernel’s network stack. I certainly am not, but I thought that the names of the kernel functions might give a hint anyway, so I decided to give the tool a spin and try to analyze our problem with it.
I ran it twice on the cluster node that hosted the pod/container in which I ran the dig command, filtering for the packet that contained the DNS server’s answer. The first run was with an MTU of 1500, so the packet was dropped. Then I increased the MTU to 9001, which meant that the packet made it back to dig. The two listings below show the (edited) output of pwru (the exact command was pwru --all-kmods --output-meta 'udp and src port 9053').
Listing 1. Output of pwru (edited), MTU = 1500 octets in the cluster
...
ip_forward netns=4026531992 mark=0x0 ifindex=5 proto=8 mtu=1500 len=2379
__icmp_send netns=4026531992 mark=0x0 ifindex=5 proto=8 mtu=1500 len=2379
...
Listing 2. Output of pwru (edited), MTU = 9001 octets in the cluster
...
ip_forward netns=4026531992 mark=0x0 ifindex=3 proto=8 mtu=9001 len=2379
...
ip_output netns=4026531992 mark=0x0 ifindex=3 proto=8 mtu=9001 len=2379
...
netif_rx netns=4026533409 mark=0x0 ifindex=3 proto=8 mtu=9001 len=2379
...
ip_rcv netns=4026533409 mark=0x0 ifindex=3 proto=8 mtu=9001 len=2379
...
udp_rcv netns=4026533409 mark=0x0 ifindex=3 proto=8 mtu=9001 len=2359
...
skb_consume_udp netns=0 mark=0x0 ifindex=0 proto=8 mtu=0 len=2351
In both cases the trace started with the same 22 kernel functions, of which I show only the last one in the listings, ip_forward. After this function the traces looked very different, so it seemed to be the one worth looking at in more detail. As I said before, I am by no means familiar with the Linux network stack, but when I looked at the source code of this function I found that I could follow it quite easily nonetheless. In line 136 the size of the packet that is about to be forwarded is checked against the MTU of the route for this packet (the function ip_exceeds_mtu). If the packet is larger than the MTU, the exact same ICMP message that I saw in the network trace (Destination Unreachable (Fragmentation needed)) is sent (by the function icmp_send) and the packet is dropped. This matches the traces above: the DNS answer was 2379 octets long, which exceeds an MTU of 1500 but fits comfortably into an MTU of 9001. So pwru helped me pin the source of our problem down to a single line in the kernel code, which I think is really cool :-)
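To make that decision a bit more concrete, here is a minimal, self-contained userspace sketch in C of the check that ip_forward delegates to ip_exceeds_mtu, fed with the packet length and MTU values from the two traces above. This is only an illustration of the logic as I understand it, not actual kernel code, and it assumes the DF (“Don’t Fragment”) bit was set on the DNS answer, which the “Fragmentation needed” reply implies.

/* Toy model of the MTU check done by ip_forward() via ip_exceeds_mtu().
 * NOT kernel code: a userspace illustration using the packet length and
 * MTU values seen in the pwru traces. */
#include <stdbool.h>
#include <stdio.h>

/* A packet only has to be refused if it is larger than the route MTU
 * AND carries the "Don't Fragment" (DF) bit; otherwise the kernel
 * could simply fragment it. */
static bool exceeds_mtu(unsigned int pkt_len, unsigned int mtu, bool df_set)
{
    if (pkt_len <= mtu)
        return false;
    if (!df_set)
        return false;   /* kernel would fragment instead of dropping */
    return true;
}

static void forward(unsigned int pkt_len, unsigned int mtu, bool df_set)
{
    if (exceeds_mtu(pkt_len, mtu, df_set)) {
        /* In the kernel this is where icmp_send() emits
         * "Destination Unreachable (Fragmentation needed)"
         * and the packet is dropped. */
        printf("len=%u mtu=%u -> ICMP Fragmentation needed, packet dropped\n",
               pkt_len, mtu);
    } else {
        printf("len=%u mtu=%u -> packet forwarded\n", pkt_len, mtu);
    }
}

int main(void)
{
    /* The DNS answer in the traces was 2379 octets long. */
    forward(2379, 1500, true);  /* Listing 1: dropped */
    forward(2379, 9001, true);  /* Listing 2: forwarded */
    return 0;
}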
In the trace for the second case (MTU set to 9001 octets) ip_forward was followed by a lot more functions, of which I show only a few in the listing, namely those that I thought indicated that the packet was actually delivered to dig. But I didn’t investigate this case further, so I won’t go into any details here.