Yesterday I realized that my primary public DNS server was not reachable from the Internet. At first I thought it was a firewall rule problem, since I had recently migrated the single ISA Server 2004 virtual machine to a new ISA 2006 high-availability array, but I was wrong: nothing was wrong with the ISA configuration, nor with the ifconfig on the published BIND server.
It took me almost a day to realize that the random behavior of the publishing service was probably due to the NLB driver running in multicast mode. When it processes UDP requests such as ordinary DNS lookups, it creates an association between the client and an NLB node: this is called “client affinity”. When the affinity is established between a resolver and the GE1FW02 node, the requests seem to run into a timeout while being served.
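To picture why the failures look random from the outside, here is a minimal sketch of affinity by client IP. It is only an illustration: the real NLB driver uses its own internal hashing, not SHA-256, and the `pick_node` helper is a name I made up. Only the node names come from my setup.

```python
import hashlib

# Node names from the array described above.
NODES = ["GE1FW01", "GE1FW02"]


def pick_node(client_ip: str, nodes=NODES) -> str:
    """Map a client IP to one node; the same IP always lands on the same node.

    ASSUMPTION: SHA-256 is a stand-in for whatever NLB really uses.
    """
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:4], "big") % len(nodes)
    return nodes[index]


if __name__ == "__main__":
    # Documentation-range addresses standing in for external resolvers.
    for ip in ("198.51.100.7", "198.51.100.8", "203.0.113.20"):
        print(ip, "->", pick_node(ip))
```

If only one node misbehaves, every resolver that happens to map to it fails consistently while the others keep working, which would look like random behavior when seen from many clients.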
I have temporarily worked around this problem by publishing the BIND server with a reverse DNS proxy rule, that is, by making ISA Server rewrite the IP packet header so that its own internal IP appears as the source address for all the lookups coming from the External network. But that’s not all… I want to know whether the problem really shows up only when DNS lookups are served by the second NLB node. I plan to test this by stopping the first node, obviously without causing any disruption to external Internet clients. I’ll post the results later…
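A test like this is easier to judge with numbers, so here is a rough sketch of a probe that sends a handful of DNS lookups over UDP and counts how many get no reply. It is an assumption-laden sketch, not the tool I actually used: `build_query` and `probe_udp` are hypothetical helper names, the server address and zone name are placeholders, and any reply at all is counted as a success without being validated.

```python
import socket
import struct


def build_query(name: str, query_id: int = 1) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (standard query, recursion not requested),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def probe_udp(server: str, name: str, attempts: int = 10,
              timeout: float = 2.0) -> int:
    """Send `attempts` UDP queries and return how many got no reply."""
    failures = 0
    for i in range(attempts):
        query = build_query(name, query_id=i + 1)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(query, (server, 53))
                sock.recvfrom(512)  # any reply counts as success
            except OSError:  # includes socket.timeout
                failures += 1
    return failures


# Example call (placeholder address and zone, replace with real ones):
# print(probe_udp("192.0.2.53", "example.com"))
```

Since the affinity appears to be tied to the client, a single external host would probably always hit the same node, so the probe should be run from several different source addresses before drawing conclusions.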
I’ve just shut down GE1FW01 and, with the same publishing rule I started with, everything worked great. After turning the first NLB node back on I experienced the same random behavior. Even after redefining the DNS Server protocol on ISA Server to allow queries only over TCP connections, it did not work until I redefined the publishing rule as a DNS proxy rule… Well, I have no more time to spend looking for a better solution now! It works as I wish, but in a way I do not like very much. I hope I will someday have the time to better understand the problem, but I know it’s due to the NLB virtualization 🙁