I spent the last couple of days building out the production infrastructure for my new project since the MVP code was ready to go live. It was also my first foray into putting Azure Kubernetes Service (AKS) into production. While it went much more smoothly than AWS's EKS offering, I still ran into my fair share of bumps and bruises.

This problem, however, was easily the most frustrating: even after spending hours on calls and screen-sharing sessions with the support team, it took a couple of days to chase it down.


Problem

I deployed our codebase and we were seeing requests that took about 50ms on regular VMs take over 5s when running on AKS.

Keep in mind, this was an off-the-shelf AKS cluster with just a few unrelated changes so far:

  • I customized the node sizes, but otherwise made no changes to kube-system.
  • I just created a namespace and deployed the app to it to verify that everything was working correctly.

The 5-second delay was intermittent, but when it happened, it was always at least 5 seconds, never less. We quickly narrowed it down to DNS resolution intermittently taking about 5 seconds.
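One quick, illustrative way to observe this from inside the cluster is to time a glibc lookup in a throwaway pod (the image and the name being resolved are just examples; getent goes through the glibc resolver, which is what exhibits the timeout):

kubectl run dnstest -it --rm --restart=Never --image=debian:bookworm-slim -- \
    bash -c 'for i in 1 2 3 4 5; do time getent hosts kubernetes.default; done'

Most iterations return in milliseconds; the affected ones stall for almost exactly 5 seconds, which is the resolver's default timeout.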


Solution

We ended up with two solutions to this but ultimately opted for the second one as it had fewer drawbacks for this project.

Option 1

Set the dnsPolicy to Default instead of ClusterFirst.

dnsPolicy: Default

This changes the pod to inherit the DNS configuration of the node it runs on (the node's /etc/resolv.conf) instead of routing lookups through the cluster's DNS service.
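For context, here is a minimal sketch of where that field lives in a pod spec (the pod name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  dnsPolicy: Default        # inherit the node's resolv.conf, bypassing cluster DNS
  containers:
    - name: app
      image: my-app:1.0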

We opted not to use this as it breaks other functionality like internal service discovery and load balancing (cluster service names no longer resolve), while also requiring us to manage any unique DNS configuration (like search domains) at the pod level instead of the cluster level.

Option 2

Enable single-request-reopen in the dnsConfig.

dnsConfig:
  options:
    - name: single-request-reopen

This overrides the DNS resolver settings for the container so that, if it doesn't receive both the A and AAAA replies on the same socket, it closes the socket and opens a new one before retrying, instead of sitting and waiting for the lost packet (see below for the root cause and why this fixes it).
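As with Option 1, here is a minimal sketch of where this sits in a pod spec (the pod name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  dnsPolicy: ClusterFirst   # the default; dnsConfig options are merged on top of it
  dnsConfig:
    options:
      - name: single-request-reopen
  containers:
    - name: app
      image: my-app:1.0

You can confirm the option took effect by checking the container's resolv.conf:

kubectl exec my-app -- cat /etc/resolv.conf
# the options line should now include single-request-reopen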

We opted for this because it was the least invasive solution for us. The one caveat is that it doesn't work for Alpine-based images (musl's resolver ignores this option), but we don't use Alpine-based images in this project.


Background

If you search for this behavior, one of the first things you'll find is a GitHub issue that was opened in 2017 and is still active today, outlining these same symptoms along with various workarounds, none of which is guaranteed to fix it.

What has collectively been narrowed down over the last few years is the actual root cause, which is eloquently explained in this Weave blog post: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts. In short: glibc sends the A and AAAA queries in parallel from the same socket, the two UDP packets race through conntrack's NAT handling, one of them loses the race and gets dropped, and the resolver then waits out its default 5-second timeout before retrying. That is why the delays are always at least 5 seconds.
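If you want to confirm you're hitting this race on your own nodes, the dropped packets show up in conntrack's insert_failed counter (this assumes the conntrack CLI is available on the node; the numbers below are illustrative):

sudo conntrack -S
# cpu=0 found=1243 invalid=12 insert=0 insert_failed=37 drop=37 ...

A steadily growing insert_failed count while the delays occur is a strong sign you're looking at the same problem.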