High availability of services with ZABBIX and DNS failover

This blog was born only for testing WordPress some years ago, thus there is no reason to maintain it, but from time to time I like to post here about some change I make in our infrastructure, or about some product or technology I discover to be interesting, more to remind me when I did or read something than to actually inform someone out there, so please excuse me for the fuzzy style of the contents!

Today I put in production a procedure to make inbound Internet traffic automatically fail over a secondary ISP link, by using the strong-tested ZABBIX monitoring platform.

Our primary NOC uses two independent and full-redundant links (two-node firewall, two routers, etc.) in order to access the Internet, and all production-grade services (DNS, mail, IM, web, etc.) are continuously accessible on the public IP addresses of both the links.

Until today, when a connection failure occurred, all clients in our internal networks were immediately able to continue browsing by using the failover link, thanks to a simple source-based routing rule applied by our pfSense cluster, whereas all clients from the Internet couldn’t access the services through the secondary path until the RRs in our DNS zones were manually changed to reply the resolvers with the public IP address in the range of our secondary ISP.

I evaluated a couple of good external DNS failover services: Dynect Active Failover, DNS Made Easy’s service. The first was too expensive for our needs and the second was missing the ICMP ping check we wanted to use.
Then I gave a try to the failover host support of the TinyDNS package for pfSense. It works pretty well, but it would need two public IPs (one from each ISP range) to publish the djbdns service for the dynamic-updating zone, and at this time the range from our secondary provided is exhausted.

So it come the idea to run the dynamic zone on the same DNS servers we use for our public zones, but who might update the RRs in a reliable way? I was pretty confident in the link failure detection of pfSense, which I still use to redirect outbound Internet traffic, but I didn’t like the idea of trusting any other link failure detection script or daemon runnig inside my network… until I had a flash: ZABBIX has been reliably notifying me link failures and recoveries for several months by now. Maybe I could configure it to run the nsupdate(1) command against our primary DNS server each time such an event is triggered!

In fact it has been pretty trivial to configure a new custom media type “script” (named “nsupdate_HA“) and execute it as an “operation” from the action performed when the trigger “link failure” is generated, as shown in this screenshot.

From now on, the hostname of each server publishing a “mission-critical” service can be stored as a CNAME pointing to an A-type record in the ha.valsania.it zone, which is automatically set to the right available public IP address. I measured that the reaction time to a link state change is around 40 seconds: this will definitely make me sleep better at night!


UPDATE:
maybe it can be useful for someone to take a look at the simple shell script I wrote to accept input from ZABBIX, or maybe someone can suggest some improvements!
Three arguments are expected (the recipient, the subject and the body of the message), but we only read the 2nd to know what’s happening, in order to execute proper failover and failback actions.