High availability of services with ZABBIX and DNS failover

This blog was born only for testing WordPress some years ago, so there is no real reason to maintain it. Still, from time to time I like to post here about a change I make to our infrastructure, or about a product or technology I find interesting, more to remind myself when I did or read something than to actually inform anyone out there. So please excuse the fuzzy style of the contents!

Today I put into production a procedure that makes inbound Internet traffic automatically fail over to a secondary ISP link, using the well-tested ZABBIX monitoring platform.

Our primary NOC uses two independent and fully redundant links (two-node firewall, two routers, etc.) to access the Internet, and all production-grade services (DNS, mail, IM, web, etc.) are continuously accessible on the public IP addresses of both links.

Until today, when a connection failure occurred, all clients in our internal networks could immediately continue browsing over the failover link, thanks to a simple source-based routing rule applied by our pfSense cluster. Clients on the Internet, however, couldn’t reach the services through the secondary path until the RRs in our DNS zones were manually changed to answer resolvers with a public IP address in the range of our secondary ISP.

I evaluated a couple of good external DNS failover services: Dynect Active Failover and DNS Made Easy’s service. The first was too expensive for our needs and the second was missing the ICMP ping check we wanted to use.
Then I tried the failover host support of the TinyDNS package for pfSense. It works pretty well, but it would need two public IPs (one from each ISP range) to publish the djbdns service for the dynamically updated zone, and at this time the range from our secondary provider is exhausted.

So came the idea of running the dynamic zone on the same DNS servers we use for our public zones, but what could update the RRs in a reliable way? I was pretty confident in the link failure detection of pfSense, which I still use to redirect outbound Internet traffic, but I didn’t like the idea of trusting yet another link failure detection script or daemon running inside my network… until I had a flash: ZABBIX has been reliably notifying me of link failures and recoveries for several months now. Maybe I could configure it to run the nsupdate(1) command against our primary DNS server each time such an event is triggered!

In fact, it was pretty trivial to configure a new custom media type of type “script” (named “nsupdate_HA”) and run it as an “operation” of the action performed when the “link failure” trigger fires, as shown in this screenshot.
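
For readers who have not used nsupdate(1) before, here is a minimal Python sketch of the kind of dynamic update such a media-type script could send. It is not the script used here: the server name, zone, record name, TTL, address and key file path are all placeholder assumptions, and the zone is assumed to accept TSIG-signed updates.

```python
import subprocess


def build_nsupdate_batch(server, zone, name, new_ip, ttl=30):
    """Return nsupdate(1) input that replaces the A record of `name`."""
    lines = [
        f"server {server}",
        f"zone {zone}",
        f"update delete {name} A",              # drop the current address
        f"update add {name} {ttl} A {new_ip}",  # publish the new one
        "send",
    ]
    return "\n".join(lines) + "\n"


def apply_update(batch, key_file):
    """Feed the batch to nsupdate on stdin, signing with a TSIG key file."""
    subprocess.run(["nsupdate", "-k", key_file], input=batch, text=True, check=True)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    batch = build_nsupdate_batch(
        server="ns1.example.net",
        zone="ha.example.net",
        name="www.ha.example.net.",
        new_ip="203.0.113.10",
    )
    print(batch, end="")
    # apply_update(batch, "/path/to/tsig.key")  # uncomment on a real setup
```

The delete and the add go out in a single update request, and the short TTL is assumed so that cached answers expire quickly after a switch.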

From now on, the hostname of each server publishing a “mission-critical” service can be stored as a CNAME pointing to an A-type record in the ha.valsania.it zone, which is automatically set to the right available public IP address. I measured that the reaction time to a link state change is around 40 seconds: this will definitely make me sleep better at night!


UPDATE:
Maybe it is useful for someone to take a look at the simple shell script I wrote to accept input from ZABBIX, or maybe someone can suggest some improvements!
Three arguments are expected (the recipient, the subject and the body of the message), but we only read the second one to know what’s happening, in order to execute the proper failover or failback action.
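
The dispatch described above can be sketched in Python as follows. This is an illustration, not the shell script mentioned in this update: it assumes the ZABBIX action puts the trigger status (PROBLEM or OK) in the subject as a separate word, and the two addresses are documentation placeholders.

```python
import sys

PRIMARY_IP = "198.51.100.10"    # placeholder: address in the primary ISP range
SECONDARY_IP = "203.0.113.10"   # placeholder: address in the secondary ISP range


def target_ip(subject):
    """Map a notification subject to the address the HA record should hold.

    Assumes the subject carries the trigger status as a separate word:
    PROBLEM means the primary link is down, OK means it has recovered.
    Returns None when neither word is present.
    """
    words = subject.upper().replace(":", " ").split()
    if "PROBLEM" in words:
        return SECONDARY_IP   # failover
    if "OK" in words:
        return PRIMARY_IP     # failback
    return None


if __name__ == "__main__" and len(sys.argv) == 4:
    # ZABBIX passes recipient, subject and body; only the subject is used.
    ip = target_ip(sys.argv[2])
    print(f"would point the HA record at {ip}" if ip else "nothing to do")
```

Returning None for an unrecognised subject keeps the sketch from touching DNS on a message it does not understand.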

3 thoughts on “High availability of services with ZABBIX and DNS failover”

  1. Thanks for sharing your script. I had considered this approach a couple months back but took a different path.

    Zabbix also has Actions which can be used to call external scripts and pass arguments. This combines nicely with PowerDNS when using a native MySQL backend (with MySQL replication to redundant name servers). Zabbix monitors site IPs and, when it detects an outage or unacceptable response time, triggers an action script. In my setups the action script verifies the outage from other Zabbix server locations, then modifies the MySQL DNS records, which replicate and almost instantly change the DNS resolutions from geographically distributed PowerDNS servers. I’ve used this approach to provide automated DNS failover, failback, and load balancing for several large clients in the past (and this is the combination I settled on using as the engine for my startup company http://www.cnamefailover.com).

  2. Yes, I thought about using the “remote command” operation type too, but it can only make a ZABBIX agent launch a script, which is not what I was looking for (even if I could run an agent on the server itself).
    Obviously my need is the mirror image of yours: in my case the ZABBIX server and the authoritative DNS servers sit behind my ISP-failover-enabled firewalls, and I aimed to run the nsupdate command using as few pieces as possible, considering every additional piece a source of unreliability.
    With this approach I only have two possible points of failure: the ZABBIX server process and the dynamic DNS server… both of which I’m pretty confident in! 🙂

    Your service seems to be pretty well designed. The only thing I don’t fully understand is the update method: you change the zone data by updating the RR directly in the MySQL database, and then you rely on standard DNS NOTIFY and AXFR to replicate to the other name servers?

  3. Andrea – the DNS updates flow through MySQL replication. PowerDNS has a “native” mode that can be used with the MySQL backend. When Zabbix detects an outage, it updates the RR directly in one of two MySQL master servers. Those masters have PowerDNS running in “native” mode. Each of my masters then has several slave servers that pull a slave-only MySQL replication feed, with PowerDNS running locally on them and querying their slave database in native mode. The system does not depend on NOTIFY or AXFR, but on changes to the database backend being propagated through MySQL replication to the various name servers. I then configured PowerDNS to key off the priority field with a custom resolver query… an A record with a value equal to or greater than 0 gets resolved by PowerDNS, an A record with a value less than 0 is not resolved by PowerDNS. One of my Zabbix servers simply updates the priority field column in the PowerDNS record to place it in or out of service (I created triggers that fire when sites go down and when they come back up). The updates flow to all of the DNS servers through replication almost instantly.

    The whole system depends on keeping MySQL replication working smoothly, which is easy with Zabbix monitoring. I set up Zabbix to monitor replication status closely on all DNS servers. I’ve worked in high-volume environments pushing tremendous amounts of data through MySQL replication farms, and I trust it completely in this DNS scenario with its much smaller replication volumes.

    Thanks,
    -Scott
