Constantin Wiemer's Blog

A blog about my programming projects, computer security and systems programming

Packet, Where Are You: DNS Service Discovery for Kerberos Failed Because of Wrong MTU Size in Kubernetes Cluster

Tue 03 October 2023

The problem

Recently I had a very tricky network problem at work that involved Kerberos, DNS and Kubernetes. This article is a writeup of the analysis and the measures we took to solve the issue.

In my team at work we’re running most of our applications in Kubernetes clusters on AWS (the EKS service). One of our applications uses Kerberos authentication against our corporate Active Directory. As we didn’t want to statically configure any KDCs (Key Distribution Centers, basically the server part of Kerberos) for the realms, Kerberos uses DNS service discovery to find the KDCs. This means it queries the DNS server(s) for SRV records with the service name _kerberos._tcp.REALM.EXAMPLE.COM to dynamically get a list of KDCs for a specific realm.
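
What the client asks for under the hood is simply an SRV record. The snippet below is only an illustration of that lookup using the third-party dnspython package (it is not what the Kerberos library itself does internally); the realm name is the placeholder from above.

    import dns.resolver  # third-party package "dnspython" (2.x)

    # Ask the configured DNS server(s) for the SRV record that Kerberos DNS
    # service discovery relies on. The realm name is a placeholder.
    answers = dns.resolver.resolve("_kerberos._tcp.REALM.EXAMPLE.COM", "SRV")
    for rr in answers:
        # Each record names one KDC host plus the port, priority and weight to use.
        print(rr.priority, rr.weight, rr.port, rr.target)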

The problem first manifested itself in our application being unable to access a certain web service that is protected by Kerberos authentication. The error message (returned by the GSSAPI that our application uses) was "Cannot find KDC for requested realm". This hinted at a problem with the DNS service discovery described before. However, the application could access other web services that also use Kerberos for authentication, so Kerberos and DNS seemed to work in general. But then we remembered that our company uses two Kerberos realms (for whatever reason) and we saw that the failing web service belonged to one realm while the working ones belonged to the other.

To confirm that DNS service discovery was indeed the problem, we ran dig commands like dig +noall +answer SRV _kerberos._tcp.REALM.EXAMPLE.COM in a container in the Kubernetes cluster for both realms, and as we had expected, dig could resolve one realm but not the other. We first suspected of course that there might be a problem on the server side, that is with the Active Directory or the DNS server. But then we ran the same dig commands on a machine outside of the cluster and, to our surprise, dig could resolve both realms. This meant that the problem had to be somehow related to Kubernetes, which left us quite puzzled.

To get an idea of what was really going on, I ran tcpdump on the cluster nodes. This showed three interesting things, which can be seen in the screenshot below. [1]

Figure 1. Network trace showing the DNS request and answer and the ICMP Destination Unreachable (Fragmentation needed) message (cluster node = 10.1.79.36, DNS server = 10.1.70.201, pod = 192.168.168.115)
  1. The DNS server uses UDP (and not TCP) and its answers are each sent in one UDP datagram and therefore also in one IP packet. These IP packets have the Don’t fragment flag set.

  2. The size of the answer to the query for the second realm (2318 octets) exceeds the MTU size that was configured on all network interfaces in the cluster (1500 octets, the typical size for Ethernet).

  3. ICMP Destination Unreachable (Fragmentation needed) messages are sent from the cluster nodes back to the DNS server (this behaviour is reproduced in the sketch below).
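
This behaviour is easy to reproduce outside the cluster, similar in spirit to the home-grown Scapy setup mentioned in footnote [1]. The sketch below (Python with Scapy, needs root) plays the role of the DNS server and sends an oversized UDP datagram with the Don’t Fragment flag set towards the pod. The addresses are the ones from Figure 1; the destination port and the padding payload are made up for illustration.

    from scapy.all import ICMP, IP, UDP, Raw, sr1

    # An IP packet with the Don't Fragment flag set and a payload larger than a
    # 1500-octet MTU, standing in for the oversized DNS answer.
    oversized = (
        IP(dst="192.168.168.115", flags="DF")   # pod IP from Figure 1
        / UDP(sport=53, dport=33333)            # arbitrary example port
        / Raw(load=b"A" * 2300)
    )

    # If a hop on the path only supports an MTU of 1500 octets, we expect an ICMP
    # Destination Unreachable (type 3) / Fragmentation needed (code 4) in return.
    reply = sr1(oversized, timeout=2)
    if reply is not None and reply.haslayer(ICMP):
        print(reply[ICMP].type, reply[ICMP].code)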

The solution

So it seemed pretty clear what the root cause of the problem was and how we could fix it: we just needed to increase the MTU size of the network interfaces in the cluster. But how do you do that? There is a DaemonSet called aws-node running in each EKS cluster as part of the AWS VPC CNI. It’s responsible for managing the Elastic Network Interfaces (ENIs) on the cluster nodes and for assigning IP addresses to pods. As it turns out, this DaemonSet has an environment variable AWS_VPC_ENI_MTU to configure the MTU size to use for the ENIs. So it seemed we could just set this variable to a higher value, like 9001 octets (the jumbo frame size used in AWS VPCs). And this is what we ended up doing eventually. But as the clusters are managed by another group in our company, it took us a while to get this variable changed permanently. Therefore we had to implement a workaround first.
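
In our case that step was done by the group managing the clusters, but for reference, the sketch below shows how the setting could be applied with the Kubernetes Python client, assuming a kubeconfig with the necessary permissions on the kube-system namespace.

    from kubernetes import client, config

    config.load_kube_config()  # assumes a kubeconfig with access to kube-system
    apps = client.AppsV1Api()

    # Strategic merge patch: set AWS_VPC_ENI_MTU on the aws-node container of the
    # AWS VPC CNI DaemonSet. Containers and env entries are merged by name, so
    # only this one variable is added or updated.
    patch = {"spec": {"template": {"spec": {"containers": [
        {"name": "aws-node", "env": [{"name": "AWS_VPC_ENI_MTU", "value": "9001"}]}
    ]}}}}
    apps.patch_namespaced_daemon_set(name="aws-node", namespace="kube-system", body=patch)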

For this workaround one of my colleagues was really helpful. He pointed out that we wouldn’t run into the problem if we used TCP instead of UDP for the DNS service discovery: the network stack of the OS partitions the data to be sent into TCP segments that each fit into one IP packet, so the DNS server’s answer that was larger than the MTU of the network interface would simply be split into two TCP segments. I hadn’t thought of this before, but when I tried it using dig’s +tcp option it of course worked. Unfortunately for us, we couldn’t force the Kerberos client (or the resolver of the OS or CoreDNS or any other component that is involved in DNS) to use TCP. [2]

But then my colleague had another good idea: could we run dig with the +tcp option somewhere in the cluster periodically to resolve the second realm, so that the answer would already be in the cache when the Kerberos client comes along and tries to resolve it? We tried it out and it actually worked, [3] so we implemented it with a sidecar container in our application’s pod.
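
The sidecar essentially boils down to a loop like the one below (shown here as a Python sketch; a plain shell loop around dig works just as well). The realm name is again a placeholder and the 10-second interval is the one from footnote [3].

    import subprocess
    import time

    # Periodically resolve the SRV record over TCP so that the answer is (usually)
    # still in the DNS cache when the Kerberos client asks for it.
    while True:
        subprocess.run(
            ["dig", "+tcp", "+noall", "+answer", "SRV", "_kerberos._tcp.REALM.EXAMPLE.COM"],
            check=False,  # don't crash the sidecar if a single lookup fails
        )
        time.sleep(10)  # shorter than the 30-second TTL of the cached answer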

A more detailed analysis

Although we had a workaround and a final solution, I was still not completely satisfied. I wondered which Kubernetes component was actually dropping the DNS server’s answers and sending the ICMP messages, and soon realized that I had no idea how the networking in Kubernetes actually works. Luckily I found an excellent guide on the internet. Through this guide I learned that Kubernetes uses standard network components of the Linux kernel like physical interfaces, virtual Ethernet devices (e.g. the ENIs mentioned above) and bridges. So it seemed to me that the kernel itself had to be responsible for the things we had observed.

Then I thought it would be really cool to actually be able to see the path an IP packet, in this case the DNS server’s answer, takes through the Linux kernel, from the network interface it arrives at to the process that, well, processes it. And lo and behold, after some googling I found that a tool exists to do exactly that: pwru, which stands for Packet, where are you?. It uses eBPF to instrument the kernel’s network stack and traces the path a certain packet (matched by criteria like IP address, source/destination port and so on) takes through the kernel. The output you get is a list of kernel functions that were called when this packet was processed. Of course, to really make sense of the output, you should be at least somewhat familiar with the internals of the kernel’s network stack. I am certainly not, but I thought that maybe the names of the kernel functions would give a hint anyway, so I decided to give the tool a spin and try to analyze our problem with it.

I ran it twice on the cluster node that hosted the pod / container in which I ran the dig command, and filtered for the packet that contained the DNS server’s answer. The first run was with an MTU size of 1500, so the packet was dropped. Then I increased the MTU size to 9001, which means that the packet made it back to dig. The two listings below show the (edited) output of pwru (the exact command was pwru --all-kmods --output-meta 'udp and src port 9053').

Listing 1. Output of pwru (edited), MTU = 1500 octets in the cluster
...
              ip_forward netns=4026531992 mark=0x0 ifindex=5 proto=8 mtu=1500 len=2379
             __icmp_send netns=4026531992 mark=0x0 ifindex=5 proto=8 mtu=1500 len=2379
...
Listing 2. Output of pwru (edited), MTU = 9001 octets in the cluster
...
              ip_forward netns=4026531992 mark=0x0 ifindex=3 proto=8 mtu=9001 len=2379
...
               ip_output netns=4026531992 mark=0x0 ifindex=3 proto=8 mtu=9001 len=2379
...
                netif_rx netns=4026533409 mark=0x0 ifindex=3 proto=8 mtu=9001 len=2379
...
                  ip_rcv netns=4026533409 mark=0x0 ifindex=3 proto=8 mtu=9001 len=2379
...
                 udp_rcv netns=4026533409 mark=0x0 ifindex=3 proto=8 mtu=9001 len=2359
...
         skb_consume_udp netns=0 mark=0x0 ifindex=0 proto=8 mtu=0 len=2351

In both cases the trace started with the same 22 kernel functions, of which I show only the last in the listings, ip_forward. After this function the traces looked very different, so it seemed to me that it might be the one worth looking at in more detail. As I said before, I am by no means familiar with the network stack of Linux, but when I looked at the source code of this function I found that I could follow it quite easily nonetheless. In line 136 the size of the packet that is about to be forwarded is checked against the MTU of the route for this packet (the function ip_exceeds_mtu). If the packet is larger than the MTU, the exact same ICMP message that I saw in the network trace (Destination Unreachable (Fragmentation needed)) is sent (by the function icmp_send) and the packet is dropped. So pwru helped me trace the source of our problem down to a single line of kernel code, which I think is really cool :-) [4]
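
In heavily simplified form, the decision made at this point boils down to the following model (this is just an illustration of the check described above, not the kernel code; the packet length is the one from the network trace).

    def forwarding_drops_packet(packet_len: int, route_mtu: int, dont_fragment: bool) -> bool:
        # A forwarded IPv4 packet is only rejected when it does not fit into the
        # route's MTU *and* carries the Don't Fragment flag; without DF the kernel
        # could simply fragment it.
        return packet_len > route_mtu and dont_fragment

    # The DNS answer from the trace: 2318 octets with DF set.
    print(forwarding_drops_packet(2318, 1500, True))   # True  -> ICMP message, packet dropped
    print(forwarding_drops_packet(2318, 9001, True))   # False -> packet forwarded to the pod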

In the trace for the second case (MTU set to 9001 octets) ip_forward was followed by a lot more functions, of which I only show a few in the listing, namely those that I thought indicated that the packet was actually delivered to dig. But I didn’t investigate this case further, so I’m not going to go into any details here.

Conclusion

Although I was annoyed at first because we had to debug yet another problem with Kubernetes (my team is in the process of migrating some applications from dedicated servers to Kubernetes and this is not the first problem we have had), in the end I really enjoyed it and of course learned a lot. First of all, this problem reminded me that issues at the application level can be caused by things deep down in the network. I think I haven’t even thought about MTU sizes in years, let alone configured them. I wouldn’t have imagined that such a low-level thing could cause Kerberos authentication to fail.

Then this problem was a good opportunity to learn more about Kubernetes networking, which I had basically no clue about before. Finally, I added a new tool to my toolbox: pwru. I think it might be useful in the future, as I tend to be the guy at work who gets called when the network is causing trouble.

I hope you liked this article. Feel free to reach out to me if you have questions or comments.


1. To tell the truth, the screenshot doesn’t show the network trace from one of the cluster nodes. Instead it’s a trace taken on a machine outside of the cluster, running a simple home-grown DNS server (that I had already written for another project, in Python using Scapy), which I used to reproduce the problem more easily. For some reason unknown to me, on the cluster node I only saw the ICMP messages but not the DNS requests and answers.
2. Actually, there might have been a way, but we didn’t pursue this option further.
3. In reality it doesn’t work 100% of the time because there is a race condition. The cached answer (I assume it’s CoreDNS that does the caching but I haven’t really verified) has an associated TTL (30 seconds in our case) and it can happen that the Kerberos client tries to resolve the realm after the answer has expired in the cache but before the next periodic run of dig (every 10 seconds). But we deemed this solution good enough as a workaround.
4. Of course this is not a bug in the kernel but a result of the packet being too big and/or the MTU size too small, but I think you know what I mean.

