Troubleshooting Azure DNS Timeouts

We recently completed a large “Lift and Shift” of our on-premises application hosting into the Microsoft Azure cloud. In this post, I will describe one of the problems we encountered, troubleshot, and resolved on the new Windows Virtual Machines that were now hosting our C# applications.

The Problem

Our applications would generate network-related errors at random when running in the new environment. We could go days without seeing an error from any particular application, but across the network as a whole there were always some errors, with no rhyme or reason to them.

Examples

When connecting to our MySQL instances, we might have seen:

MySqlConnector.MySqlException (0x80004005): Unable to connect to any of the specified MySQL hosts.
  at MySqlConnector.Core.ServerSession.OpenTcpSocketAsync
  at MySqlConnector.Core.ServerSession.ConnectAsync

When making HTTP requests we might have seen:

System.Net.WebException: The remote name could not be resolved: 'example.com'
  at System.Net.HttpWebRequest.EndGetResponse

or

System.Net.Http.HttpRequestException: No such host is known. (example.com:443)
 ---> System.Net.Sockets.SocketException (11001): No such host is known.
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync

These errors would occur for host names that absolutely existed and were readily accessible otherwise. The specific host did not matter: these errors occurred for hosts we controlled as well as for hosts controlled by third parties.

The main trouble was that the problem was just so rare, yet persistent; specific, yet pernicious. How could we reproduce it? Was our application code doing something wrong, or was it the C# framework, or could any request from those VMs reproduce it? Was it DNS, or was it the request after DNS resolved? If it was DNS, was it specific to the way we had set up our DNS hosts or virtual network? If it was the request, was it the source server, the network, or the destination server? Was it load-related, and if so, was the load on the source VM, on some network bottleneck, or in Azure DNS itself?

Troubleshooting

We were eventually able to reproduce the errors outside of our running applications by using PowerShell scripts on the Azure VMs (the problem never reproduced outside of Azure). The first versions of the script needed to run for a few days before we could know whether we had managed to reproduce the problem.

By testing many variants of the script, we were able to narrow the problem down to DNS resolution itself. I was then able to get the “Mean-Time-to-Reproduce” down to “randomly between two and six hours” by having the script clear the DNS cache between each attempt.

This is the simplest version of the PowerShell script we used to reproduce the problem:

Write-Host "Start: $(Get-Date -Format "o")"
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
do {
    Start-Sleep -Seconds 1
    Clear-DnsClientCache
    $time = (Measure-Command {
        try {
            $result = [System.Net.Dns]::GetHostAddresses("azure.microsoft.com")
        } catch {
            Write-Host $_
            $result = $False
        }
    }).TotalSeconds
} while ($result);

Write-Host "$time-second timeout after running for $($stopwatch.Elapsed.TotalMinutes) minutes at $(Get-Date -Format "o")"

When it finally encountered the error, the DNS resolution would time out after about 11 seconds. The script output looked something like this:

Start: 2023-01-04T18:39:37.8532468+00:00
Exception calling "GetHostAddresses" with "1" argument(s): "No such host is known"
12.1075537-second timeout after running for 65.1770462016667 minutes at 2023-01-04T19:44:48.8129523+00:00

We now knew the problem wasn’t specific to our application code.

From there, we discovered that we could run this script and reproduce the problem on an entirely fresh Azure Windows VM in an entirely fresh VNet using Azure’s default DNS provider. This meant it wasn’t (necessarily) a problem with our “Lift and Shift” network or our own Faithlife DNS configuration. Further, we could create two reproduction environments, one using Azure’s built-in DNS and one using Google’s 8.8.8.8 DNS. The Google environment did not reproduce the problem.
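
For anyone reproducing this, a quick way to confirm from inside a VM which resolver it is actually using is a check like the following sketch (my illustration, not part of our original tooling). The Azure-provided default resolver is delivered over DHCP as the virtual IP 168.63.129.16 (the same address the packet-capture filter below targets), while the Google environment should show 8.8.8.8:

# List the IPv4 DNS servers configured on each network interface (run inside the VM)
Get-DnsClientServerAddress -AddressFamily IPv4 |
    Where-Object { $_.ServerAddresses } |
    Select-Object InterfaceAlias, ServerAddresses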

This narrowed the problem down decisively: it occurred on a default-configuration Microsoft Windows VM on a default-configuration Microsoft Azure Virtual Network trying to use default-configuration Microsoft Azure DNS to resolve azure.microsoft.com.

At this point, we escalated our Azure support ticket. The new rep spotted something interesting once we had reproduced the problem and sent them a packet capture: the DNS request that failed happened to use UDP source port 65330 when talking to Azure DNS. If you dig deep into the VNet documentation, you will discover that UDP port 65330 is reserved for Azure’s own use.
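
To see the root cause directly, you can send the same DNS query to Azure’s DNS virtual IP (168.63.129.16) from two different source ports and compare. The sketch below is my illustration rather than anything from our original troubleshooting; the Send-DnsQuery helper is hypothetical, it hand-builds a minimal DNS query with .NET’s UdpClient, and it assumes it is being run on an Azure VM whose dynamic port range still allows 65330:

# Send a minimal DNS "A" query for $name to Azure DNS (168.63.129.16) from a chosen
# UDP source port. On an Azure VM, the query sent from port 65330 should get no reply.
function Send-DnsQuery([int]$sourcePort, [string]$name) {
    $id = Get-Random -Maximum 65535
    $header = [byte[]](
        ($id -shr 8), ($id -band 0xFF),   # transaction ID
        0x01, 0x00,                       # standard query, recursion desired
        0x00, 0x01,                       # one question
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
    $question = @()
    foreach ($label in $name.Split('.')) {
        $bytes = [System.Text.Encoding]::ASCII.GetBytes($label)
        $question += [byte]$bytes.Length
        $question += $bytes
    }
    $question += 0x00, 0x00, 0x01, 0x00, 0x01   # end of name, QTYPE A, QCLASS IN
    $packet = [byte[]]($header + $question)

    $udp = New-Object System.Net.Sockets.UdpClient($sourcePort)
    $udp.Client.ReceiveTimeout = 5000
    try {
        $udp.Connect("168.63.129.16", 53)
        [void]$udp.Send($packet, $packet.Length)
        $remote = New-Object System.Net.IPEndPoint([System.Net.IPAddress]::Any, 0)
        [void]$udp.Receive([ref]$remote)
        "Source port ${sourcePort}: got a reply"
    } catch {
        "Source port ${sourcePort}: no reply ($($_.Exception.Message))"
    } finally {
        $udp.Close()
    }
}

Send-DnsQuery 50000 "azure.microsoft.com"   # expected: got a reply
Send-DnsQuery 65330 "azure.microsoft.com"   # expected: no reply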

The final script to reproduce the problem looks like this:

> dns-timeouts.sh
#!/usr/bin/env bash

# Azure subscription ID to use
subscription="$1"

# Azure location (e.g. "westus3")
location="$2"

# When not provided, sets up a new reproduction environment.
# When provided, fetches results of previous run.
suffix="$3"

set -e +H

if test -z "${subscription}" -o -z "${location}"; then
  echo "Usage: \`${0} subscription location [suffix]\`"
  exit 1
fi

if test -z "${suffix}"; then
    suffix=$RANDOM
    create=1
else
    create=0
fi

resource_group="dns-rg-$suffix"
azure_vnet="dns-vnet-azure-$suffix"
google_vnet="dns-vnet-google-$suffix"
azure_vm="dns-vm-azure-$suffix"
google_vm="dns-vm-google-$suffix"
storage_account="dnssa$suffix"
packet_capture="dns-capture-$suffix-$RANDOM"

if test $create -eq 1; then

    echo "Creating networks..."

    az group create --subscription "$subscription" --location "$location" \
    --name "$resource_group"

    azure_subnet_id=$(az network vnet create --subscription "$subscription" --location "$location" --resource-group "$resource_group" \
    --name "$azure_vnet" --subnet-name test \
    -otsv --query "newVNet.subnets[0].id" | tr -d '\r')

    google_subnet_id=$(az network vnet create --subscription "$subscription" --location "$location" --resource-group "$resource_group" \
    --name "$google_vnet" --subnet-name test --dns-servers "8.8.8.8" \
    -otsv --query "newVNet.subnets[0].id" | tr -d '\r')

    echo "Creating VMs..."

    # If you get an error about the subnet ID passed to `--subnet` being invalid, try setting `export MSYS_NO_PATHCONV=1` to disable Git Bash path conversion
    az vm create --subscription "$subscription" --resource-group "$resource_group" --location "$location" \
        --name "$azure_vm" --computer-name "dns-a-$suffix" --subnet "$azure_subnet_id" \
        --image Win2022AzureEditionCore --size Standard_B1s \
        --admin-username azureuser --admin-password "12345abcde!ABCDE" --public-ip-address "" --nsg "" --public-ip-sku Basic \
        --nic-delete-option delete --os-disk-delete-option delete --data-disk-delete-option delete

    az vm create --subscription "$subscription" --resource-group "$resource_group" --location "$location" \
        --name "$google_vm" --computer-name "dns-g-$suffix" --subnet "$google_subnet_id" \
        --image Win2022AzureEditionCore --size Standard_B1s \
        --admin-username azureuser --admin-password "12345abcde!ABCDE" --public-ip-address "" --nsg "" --public-ip-sku Basic \
        --nic-delete-option delete --os-disk-delete-option delete --data-disk-delete-option delete

    echo "Installing Network Watcher..."

    az vm extension set --subscription "$subscription" --resource-group "$resource_group" --vm-name "$azure_vm" --publisher Microsoft.Azure.NetworkWatcher --name NetworkWatcherAgentWindows --version 1.4

    echo "Creating Storage Account..."

    az storage account create --subscription "$subscription" --resource-group "$resource_group" --location "$location" \
        --name "$storage_account" \
        --sku Standard_RAGRS \
        --kind StorageV2

    echo "Starting packet capture..."

    az network watcher packet-capture create --subscription "$subscription" --resource-group "$resource_group" --vm "$azure_vm" --name "$packet_capture" --storage-account "$storage_account" \
      --filters "[{\"protocol\":\"UDP\", \"remoteIpAddress\":\"168.63.129.16\"}]"

    echo "Running test..."

    # The RunPowerShellScript command will "time out" after about an hour, but the script will keep running on the VM until it exits.
    test_command='
Write-Host "Start: $(Get-Date -Format "o")"
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
do {
    Start-Sleep -Seconds 1
    Clear-DnsClientCache
    $time = (Measure-Command {
        try {
            $result = [System.Net.Dns]::GetHostAddresses("azure.microsoft.com")
        } catch {
            Write-Host $_
            $result = $False
        }
    }).TotalSeconds
} while ($result);

Write-Host "$time-second timeout after running for $($stopwatch.Elapsed.TotalMinutes) minutes at $(Get-Date -Format "o")"
'

    az vm run-command invoke --subscription "$subscription" --resource-group "$resource_group" \
        --name "$azure_vm" \
        --command-id RunPowerShellScript --no-wait --scripts "$test_command"

    az vm run-command invoke --subscription "$subscription" --resource-group "$resource_group" \
        --name "$google_vm" \
        --command-id RunPowerShellScript --no-wait --scripts "$test_command"

    # Wait for completion. To fetch results, run script again.
    echo "Test started. Please run \`${0} $subscription $location $suffix\` to retrieve results after letting script run for at least five hours."

else

    echo "Stopping packet capture..."
    az network watcher packet-capture stop --subscription "$subscription" --name "$packet_capture" --location "$location"

    echo "Fetching results..."

    # Enumerates the directory where Azure keeps the output of previous RunPowerShellScript commands
    result_command='
Get-ChildItem -Path C:\Packages\Plugins\Microsoft.CPlat.Core.RunCommandWindows\1.1.15\Status `
  | Get-Content | ConvertFrom-Json `
  | % {
    $a = @{
      Step = $_.status.name;
      Timestamp = $_.timestampUTC
    };

    $_.status.substatus `
     | where { $_.name -eq "StdErr" -or $_.name -eq "StdOut" } `
     | Foreach { $a[$_.name] = $_.formattedMessage.message };

    [pscustomobject] $a;
  } `
  | Sort-Object -Descending -Property Timestamp | Select-Object -Skip 0 -First 3 | Format-List'

    echo ""
    echo "Azure DNS:"
    printf '%b\n' "$(az vm run-command invoke --subscription "$subscription" --resource-group "$resource_group" \
        --name "$azure_vm" \
        --command-id RunPowerShellScript --scripts "$result_command" --query "value[].message | join('', @)")"

    echo ""
    echo "Google DNS:"
    printf '%b\n' "$(az vm run-command invoke --subscription "$subscription" --resource-group "$resource_group" \
        --name "$google_vm" \
        --command-id RunPowerShellScript --scripts "$result_command" --query "value[].message | join('', @)")"

    echo ""
    echo "To clean up, \`az group delete --yes --no-wait --subscription $subscription --resource-group $resource_group\`."
fi

Resolution

So this was the situation:

  • When you make a DNS request, Windows will choose a “Dynamic Port” to use for your traffic.
    • The Microsoft-provided default range for Azure Windows images is ports 49152 through 65535, which gives 16,384 ports to choose from, including port 65330 (you can inspect the range on a given VM with the sketch after this list).
    • When a DNS request times out and retries, it will re-use the same dynamic “Source Port” until it gives up entirely (after 11 seconds).
  • So, one out of every 16,384 DNS requests would randomly fail because Azure blocked the request entirely when it used port 65330. At one request per second, that’s a “Mean-Time-to-Reproduce” of roughly four and a half hours (which matches what we were seeing).
  • Meanwhile, Google DNS kept working because requests routed outside of the VNet used TCP, not UDP.
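
Here is a small PowerShell sketch (again, my illustration rather than our original tooling) that reports the current UDP dynamic port range on a machine and whether it still includes the reserved port 65330; it reads the same Get-NetUDPSetting properties the Puppet check below relies on:

# Report the UDP dynamic port range and whether it includes the Azure-reserved port 65330
Get-NetUDPSetting | ForEach-Object {
    $start = $_.DynamicPortRangeStartPort
    $end   = $start + $_.DynamicPortRangeNumberOfPorts - 1
    [pscustomobject]@{
        Range         = "$start-$end"
        Includes65330 = ($start -le 65330 -and $end -ge 65330)
    }
}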

Which finally brings us to the fix: tell Windows not to use the port that Microsoft doesn’t want us using in Azure. You do so by shrinking the dynamic port range so that it stops short of the reserved port: starting at 49152 with 65330 - 1 - 49152 = 16177 ports, the range ends at port 65328, instead of the full 16,384 ports running all the way up to 65535.

We rolled this fix out to all of our Windows VMs using Puppet:

exec { 'Disable UDP/65330':
  command  => 'Set-NetUDPSetting -DynamicPortRangeStartPort 49152 -DynamicPortRangeNumberOfPorts 16177',
  unless   => 'exit [int](Get-NetUDPSetting | % { $_.DynamicPortRangeStartPort + $_.DynamicPortRangeNumberOfPorts -gt 65330 })',
  provider => powershell,
}
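
For machines not managed by Puppet, the same change can be applied directly from an elevated PowerShell session; this is simply the Set-NetUDPSetting call that the Puppet resource wraps:

# Shrink the UDP dynamic port range to 49152-65328 so Windows never picks the
# Azure-reserved port 65330 as a DNS source port (requires an elevated session)
Set-NetUDPSetting -DynamicPortRangeStartPort 49152 -DynamicPortRangeNumberOfPorts 16177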

We’ve suggested that Microsoft consider correcting their base Windows images so that they do not default to using Azure-incompatible ports or, if that is infeasible, improving the Azure documentation on Windows VMs.

Posted by Michael Sterling on February 03, 2023