We recently completed a large “Lift and Shift” of our on-premises application hosting into the Microsoft Azure cloud. In this post, I will describe one of the problems we encountered, troubleshot, and resolved on the new Windows Virtual Machines that were now hosting our C# applications.
Our applications would generate network-related errors at random when running in the new environment. Any individual application could go days without an error, yet across the network as a whole there were always some errors, with no rhyme or reason to them.
When connecting to our MySQL instances, we might have seen:
MySqlConnector.MySqlException (0x80004005): Unable to connect to any of the specified MySQL hosts.
at MySqlConnector.Core.ServerSession.OpenTcpSocketAsync
at MySqlConnector.Core.ServerSession.ConnectAsync
When making HTTP requests we might have seen:
System.Net.WebException: The remote name could not be resolved: 'example.com'
at System.Net.HttpWebRequest.EndGetResponse
or
System.Net.Http.HttpRequestException: No such host is known. (example.com:443)
---> System.Net.Sockets.SocketException (11001): No such host is known.
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync
These errors occurred for host names that absolutely existed and were otherwise readily accessible. The specific host did not matter: the errors occurred for hosts we controlled as well as for hosts controlled by third parties.
The main trouble was that the problem was just so rare, yet persistent; specific, yet pernicious. How could we reproduce it? Was our application code doing something wrong, or was it the C# framework, or could any request from those VMs reproduce it? Was it DNS, or was it the request after DNS resolved? If it was DNS, was it specific to the way we set up our DNS hosts or virtual network? If it was the request, was it the source-server, was it the network, or was it the destination server? Was it load-related, on the source VM or on some network bottleneck or in Azure DNS itself?
We were eventually able to reproduce the errors outside of our running applications by using PowerShell scripts on our Azure VMs (the problem never reproduced outside of Azure). With the first versions of the script, we had to let it run for a few days before we could know whether we had managed to reproduce the problem.
By testing many variants of the script, we narrowed the problem down to DNS resolution itself. I was then able to get the “Mean Time to Reproduce” down to “randomly between two and six hours” by having the script clear the DNS cache between attempts.
This is the simplest version of the PowerShell script we used to reproduce the problem:
Write-Host "Start: $(Get-Date -Format "o")"
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
do {
Start-Sleep -Seconds 1
Clear-DnsClientCache
$time = (Measure-Command {
try {
$result = [System.Net.Dns]::GetHostAddresses("azure.microsoft.com")
} catch {
Write-Host $_
$result = $False
}
}).TotalSeconds
} while ($result);
Write-Host "$time-second timeout after running for $($stopwatch.Elapsed.TotalMinutes) minutes at $(Get-Date -Format "o")"
When it finally encountered the error, the DNS resolution would time out after about 11 to 12 seconds. The script output looked something like this:
Start: 2023-01-04T18:39:37.8532468+00:00
Exception calling "GetHostAddresses" with "1" argument(s): "No such host is known"
12.1075537-second timeout after running for 65.1770462016667 minutes at 2023-01-04T19:44:48.8129523+00:00
We now knew the problem wasn’t specific to our application code.
From there, we discovered that we could reproduce the problem by running this script on an entirely fresh Azure Windows VM in an entirely fresh VNet using Azure’s default DNS provider. This meant it wasn’t (necessarily) a problem with our “Lift and Shift” network or our own Faithlife DNS configuration. Further, we could create two reproduction environments, one using Azure’s built-in DNS and one using Google’s 8.8.8.8 DNS. The Google environment did not reproduce the problem.
This narrowed the problem down decisively: it occurred on a default-configuration Microsoft Windows VM on a default-configuration Microsoft Azure Virtual Network trying to use default-configuration Microsoft Azure DNS to resolve azure.microsoft.com.
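For a quick spot check from inside a single VM, a comparison along these lines can also be run directly in PowerShell. This is an illustrative sketch rather than the exact test we used (our comparison used two separate VNets, as described above); it assumes the VM can reach Azure’s recursive resolver at its well-known virtual IP 168.63.129.16 as well as Google’s 8.8.8.8.
# Illustrative check (not the exact test we ran): query the same name through
# Azure's VNet resolver and Google Public DNS, timing each lookup.
foreach ($server in "168.63.129.16", "8.8.8.8") {
    $elapsed = Measure-Command {
        try {
            Resolve-DnsName -Name "azure.microsoft.com" -Server $server -DnsOnly -ErrorAction Stop | Out-Null
            $status = "resolved"
        } catch {
            $status = $_.Exception.Message
        }
    }
    Write-Host ("{0}: {1} after {2:N1} seconds" -f $server, $status, $elapsed.TotalSeconds)
}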
At this point, we escalated our Azure support ticket. Once we reproduced the problem and sent the new rep a packet capture, they spotted something interesting: the DNS request that failed happened to use UDP source port 65330 when talking to Azure DNS. If you dig deep into the VNet documentation, you will discover that UDP port 65330 is reserved for Azure’s own use.
The final script to reproduce the problem looks like this:
#!/usr/bin/env bash

# Azure subscription ID to use
subscription="$1"
# Azure location (e.g. "westus3")
location="$2"
# When not provided, sets up a new reproduction environment.
# When provided, fetches results of previous run.
suffix="$3"

set -e +H

if test -z "${subscription}" -o -z "${location}"; then
  echo "Usage: \`${0} subscription location [suffix]\`"
  exit 1
fi

if test -z "${suffix}"; then
  suffix=$RANDOM
  create=1
else
  create=0
fi

resource_group="dns-rg-$suffix"
azure_vnet="dns-vnet-azure-$suffix"
google_vnet="dns-vnet-google-$suffix"
azure_vm="dns-vm-azure-$suffix"
google_vm="dns-vm-google-$suffix"
storage_account="dnssa$suffix"
packet_capture="dns-capture-$suffix-$RANDOM"

if test $create -eq 1; then
  echo "Creating networks..."
  az group create --subscription "$subscription" --location "$location" \
    --name "$resource_group"
  azure_subnet_id=$(az network vnet create --subscription "$subscription" --location "$location" --resource-group "$resource_group" \
    --name "$azure_vnet" --subnet-name test \
    -otsv --query "newVNet.subnets[0].id" | tr -d '\r')
  google_subnet_id=$(az network vnet create --subscription "$subscription" --location "$location" --resource-group "$resource_group" \
    --name "$google_vnet" --subnet-name test --dns-servers "8.8.8.8" \
    -otsv --query "newVNet.subnets[0].id" | tr -d '\r')

  echo "Creating VMs..."
  # If you get an error about `--subnet_id` being invalid, try setting `export MSYS_NO_PATHCONV=1` to disable path expansion
  az vm create --subscription "$subscription" --resource-group "$resource_group" --location "$location" \
    --name "$azure_vm" --computer-name "dns-a-$suffix" --subnet "$azure_subnet_id" \
    --image Win2022AzureEditionCore --size Standard_B1s \
    --admin-username azureuser --admin-password "12345abcde!ABCDE" --public-ip-address "" --nsg "" --public-ip-sku Basic \
    --nic-delete-option delete --os-disk-delete-option delete --data-disk-delete-option delete
  az vm create --subscription "$subscription" --resource-group "$resource_group" --location "$location" \
    --name "$google_vm" --computer-name "dns-g-$suffix" --subnet "$google_subnet_id" \
    --image Win2022AzureEditionCore --size Standard_B1s \
    --admin-username azureuser --admin-password "12345abcde!ABCDE" --public-ip-address "" --nsg "" --public-ip-sku Basic \
    --nic-delete-option delete --os-disk-delete-option delete --data-disk-delete-option delete

  echo "Installing Network Watcher..."
  az vm extension set --subscription "$subscription" --resource-group "$resource_group" --vm-name "$azure_vm" \
    --publisher Microsoft.Azure.NetworkWatcher --name NetworkWatcherAgentWindows --version 1.4

  echo "Creating Storage Account..."
  az storage account create --subscription "$subscription" --resource-group "$resource_group" --location "$location" \
    --name "$storage_account" \
    --sku Standard_RAGRS \
    --kind StorageV2

  echo "Starting packet capture..."
  az network watcher packet-capture create --subscription "$subscription" --resource-group "$resource_group" \
    --vm "$azure_vm" --name "$packet_capture" --storage-account "$storage_account" \
    --filters "[{\"protocol\":\"UDP\", \"remoteIpAddress\":\"168.63.129.16\"}]"

  echo "Running test..."
  # The RunPowerShellScript command will "time out" after about an hour, but the script will keep running on the VM until it exits.
  test_command='
    Write-Host "Start: $(Get-Date -Format "o")"
    $stopwatch = [system.diagnostics.stopwatch]::StartNew()
    do {
      Start-Sleep -Seconds 1
      Clear-DnsClientCache
      $time = (Measure-Command {
        try {
          $result = [System.Net.Dns]::GetHostAddresses("azure.microsoft.com")
        } catch {
          Write-Host $_
          $result = $False
        }
      }).TotalSeconds
    } while ($result);
    Write-Host "$time-second timeout after running for $($stopwatch.Elapsed.TotalMinutes) minutes at $(Get-Date -Format "o")"
  '
  az vm run-command invoke --subscription "$subscription" --resource-group "$resource_group" \
    --name "$azure_vm" \
    --command-id RunPowerShellScript --no-wait --scripts "$test_command"
  az vm run-command invoke --subscription "$subscription" --resource-group "$resource_group" \
    --name "$google_vm" \
    --command-id RunPowerShellScript --no-wait --scripts "$test_command"

  # Wait for completion. To fetch results, run script again.
  echo "Test started. Please run \`${0} $subscription $location $suffix\` to retrieve results after letting script run for at least five hours."
else
  echo "Stopping packet capture..."
  az network watcher packet-capture stop --subscription $subscription --name $packet_capture --location $location

  echo "Fetching results..."
  # Enumerates the directory where Azure keeps the output of previous RunPowerShellScript commands
  result_command='Get-ChildItem -Path C:\Packages\Plugins\Microsoft.CPlat.Core.RunCommandWindows\1.1.15\Status `
    | Get-Content | ConvertFrom-Json `
    | % { $a = @{ Step = $_.status.name; Timestamp = $_.timestampUTC }; $_.status.substatus `
    | where { $_.name -eq "StdErr" -or $_.name -eq "StdOut" } `
    | Foreach { $a[$_.name] = $_.formattedMessage.message }; [pscustomobject] $a; } `
    | Sort-Object -Descending -Property Timestamp | Select-Object -Skip 0 -First 3 | Format-List'

  echo ""
  echo "Azure DNS:"
  printf '%b\n' "$(az vm run-command invoke --subscription "$subscription" --resource-group "$resource_group" \
    --name "$azure_vm" \
    --command-id RunPowerShellScript --scripts "$result_command" --query "value[].message | join('', @)")"
  echo ""
  echo "Google DNS:"
  printf '%b\n' "$(az vm run-command invoke --subscription "$subscription" --resource-group "$resource_group" \
    --name "$google_vm" \
    --command-id RunPowerShellScript --scripts "$result_command" --query "value[].message | join('', @)")"
  echo ""
  echo "To clean up, \`az group delete --yes --no-wait --subscription $subscription --resource-group $resource_group\`."
fi
So this was the situation: every time Windows resolved a host name through Azure DNS, it chose a random UDP source port from its dynamic range of 49152 through 65535. Roughly one lookup in every sixteen thousand happened to land on port 65330, which Azure reserves for its own use, so that query went unanswered and the resolution eventually timed out with “No such host is known.”
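To see the range Windows draws those source ports from, you can inspect the dynamic UDP port settings directly. The sketch below assumes a stock Windows Server image, where this is expected to report a start port of 49152 and 16384 ports, a range that includes the reserved port 65330.
# Sketch: show the dynamic (ephemeral) UDP port range used for outbound source ports.
Get-NetUDPSetting | Format-List DynamicPortRangeStartPort, DynamicPortRangeNumberOfPorts

# The same information is available through netsh.
netsh int ipv4 show dynamicport udp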
Which finally brings us to the fix: tell Windows not to use the port that Microsoft doesn’t want us using in Azure. You do so by giving it a “range” of 65330 - 1 - 49152 = 16177 ports to choose from instead of the full 16384.
We rolled this fix out to all of our Windows VMs using Puppet:
exec { 'Disable UDP/65330':
  # Shrink the dynamic UDP port range to 49152-65328 so Windows never picks port 65330.
  command  => 'Set-NetUDPSetting -DynamicPortRangeStartPort 49152 -DynamicPortRangeNumberOfPorts 16177',
  # Puppet runs the exec only if this check exits non-zero, i.e. while the current range still includes port 65330.
  unless   => 'exit [int](Get-NetUDPSetting | % { $_.DynamicPortRangeStartPort + $_.DynamicPortRangeNumberOfPorts -gt 65330 })',
  provider => powershell,
}
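If you are not managing the machine with Puppet, the same change can be applied by hand from an elevated PowerShell session. This is a sketch of the equivalent one-off commands, not part of our Puppet rollout.
# Manual equivalent of the Puppet resource above: shrink the dynamic UDP port range so
# its highest port (49152 + 16177 - 1 = 65328) falls below the Azure-reserved port 65330.
Set-NetUDPSetting -DynamicPortRangeStartPort 49152 -DynamicPortRangeNumberOfPorts 16177

# Verify the change: this should now print True.
Get-NetUDPSetting | ForEach-Object { $_.DynamicPortRangeStartPort + $_.DynamicPortRangeNumberOfPorts -le 65330 }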
We’ve suggested that Microsoft consider correcting their base Windows images so that they do not default to using Azure-incompatible ports or, if that is infeasible, improving the Azure documentation on Windows VMs.
Posted by Michael Sterling on February 03, 2023