Monitoring Keepalived with SNMP on Ubuntu 14.04

Introduction

Using keepalived in combination with a couple of HAProxy instances is a convenient yet powerful way of ensuring high availability of services.

[Figure: Load balancer pair in normal state]

Up until now, I’ve considered it enough to monitor the VMs where the services run, and the general availability of a HAProxy listener on the common address. The drawback is that it’s hard to see at a glance whether the site is being served by the intended master or by the backup load balancer. The network map above shows the intended result (achieved by the end of this article), with the color of the lines between nodes giving contextual information about the state of the running services.

Monitoring state changes could naïvely be achieved by continuously tailing the syslog and searching for “entered the MASTER state”. This would be a pretty resource-intensive way of solving the issue, though. A less amateurish way to go about it would be to use keepalived’s built-in capability of running scripts on state changes, but there are a number of situations in which you can’t be sure that the scripts are able to run, so that’s not really what we want to do either.

Fortunately, keepalived supports SNMP, courtesy of the original author of the SNMP patch for keepalived, Vincent Bernat. In addition to tracking state changes, it potentially allows us to pull out all kinds of interesting statistics from keepalived, as long as we have a third machine from which to monitor things. Let’s set it up.

The deed

First of all, snmpd must be installed on the load balancers:

$ sudo apt update; sudo apt install snmpd

Next, let’s create a very basic SNMP listener. We begin by backing up our default snmpd configuration.

$ sudo mv /etc/snmp/snmpd.conf /etc/snmp/snmpd.conf.orig

Re-create an empty /etc/snmp/snmpd.conf and add the following lines:

# Keepalived requires agentx for SNMP.
master agentx
# Only accept SNMP queries from the Zabbix server.
rocommunity public zabbixserver.mydomain

# Yell on auth errors.
authtrapenable 1
trapcommunity public
trap2sink zabbixserver.mydomain

Restart snmpd:

$ sudo service snmpd restart
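
One reader reports in the comments below that snmpd on CentOS did not bring up agentx when a directive line carried a trailing comment. A minimal sketch of a check for such lines follows; find_inline_comments is a hypothetical helper name, not part of net-snmp, and it only flags the pattern without judging whether your snmpd version tolerates it.

```shell
# List directive lines that carry a trailing "#" comment.
# $1 is the path of the configuration file to inspect.
find_inline_comments() {
    # Match lines whose first non-blank character is not "#"
    # but which contain a "#" further along the line.
    grep -nE '^[[:space:]]*[^#[:space:]][^#]*#' "$1"
}
```

Running find_inline_comments /etc/snmp/snmpd.conf prints the offending lines with their line numbers, and prints nothing (with a non-zero exit status) when every comment sits on its own line.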

Now let’s edit the startup options for keepalived to ask it to actually speak SNMP with us. The file is called /etc/default/keepalived and for this purpose only needs to contain this single line:

DAEMON_ARGS=" -x"   # ..or " --snmp"

And finally we need to tell keepalived to throw traps. Add the following clause to the top of /etc/keepalived/keepalived.conf (this example assumes that you already have a valid keepalived configuration):

global_defs {
    enable_traps
}

Now let’s restart keepalived and see what happens.

$ sudo service keepalived restart
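
Before moving on, it can be worth confirming that keepalived actually registered with snmpd. The sketch below searches log text for the message that keepalived 1.2.13 writes on a successful connection (it appears in the log excerpt in the comments); other versions may word it differently, and agentx_connected is a hypothetical helper name.

```shell
# Succeeds if the log text on stdin shows keepalived attaching to snmpd
# as an AgentX subagent. The message text is taken from a keepalived
# 1.2.13 log and is an assumption for any other version.
agentx_connected() {
    grep -q 'AgentX subagent connected'
}

# Example use on a load balancer (the log path is an assumption):
#   sudo cat /var/log/syslog | agentx_connected && echo "registered with snmpd"
```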

Monitoring

After putting the relevant MIB (KEEPALIVED-MIB, available from Vincent Bernat’s GitHub site, or from Zabbix Share linked elsewhere in this article) in one of the MIB directories on the monitoring server and restarting the monitoring server’s instance of snmpd, we should be able to test the features available. Since I’ve only installed and configured SNMP on the backup server so far, let’s check that one:

$ snmpwalk -v2c -cpublic testmachine2 KEEPALIVED-MIB::vrrpInstanceState
KEEPALIVED-MIB::vrrpInstanceState.1 = INTEGER: backup(1)
KEEPALIVED-MIB::vrrpInstanceState.2 = INTEGER: backup(1)

OK, that looks like it should. Now let’s shut down HAProxy on the master server, which should cause keepalived to fail over:

$ sudo service haproxy stop

And let’s check on the monitoring server again:

$ snmpwalk -v2c -cpublic testmachine2 KEEPALIVED-MIB::vrrpInstanceState
KEEPALIVED-MIB::vrrpInstanceState.1 = INTEGER: master(2)
KEEPALIVED-MIB::vrrpInstanceState.2 = INTEGER: master(2)

That works just fine. Starting the HAProxy service on the master server again made the backup server return to its initial backup state, so both failover and failback are visible over SNMP.
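
The manual check above can be wrapped in a small script for ad hoc testing. This is only a sketch that assumes the snmpwalk output format shown above; vrrp_states and all_in_state are hypothetical helper names.

```shell
# Turn snmpwalk output for vrrpInstanceState (on stdin) into
# "<index> <state>" pairs, one per line.
vrrp_states() {
    sed -n 's/^KEEPALIVED-MIB::vrrpInstanceState\.\([0-9][0-9]*\) = INTEGER: \([a-z]*\)([0-9]*)$/\1 \2/p'
}

# Succeeds only if every VRRP instance on stdin reports the state in $1.
all_in_state() {
    expected=$1
    states=$(vrrp_states)
    # Fail when there is no data at all, e.g. because the query timed out.
    [ -n "$states" ] || return 1
    # Fail when any line names a different state.
    ! printf '%s\n' "$states" | grep -qv " $expected\$"
}
```

For example, snmpwalk -v2c -cpublic testmachine2 KEEPALIVED-MIB::vrrpInstanceState | all_in_state backup succeeds while the backup node is idle and fails once it has taken over.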

My monitoring server is already configured as an SNMP trap sink, so let’s just see what happened in my SNMP trap log when I killed the master instance of HAProxy:

15:19:30 2016/10/14 ZBXTRAP testmachine2
PDU INFO:
  errorindex                     0
  errorstatus                    0
  receivedfrom                   UDP: [testmachine2]:53447->[zabbixserver.mydomain]:162
  notificationtype               TRAP
  messageid                      0
  version                        1
  requestid                      1067853884
  community                      public
  transactionid                  23400

VARBINDS:
  DISMAN-EXPRESSION-MIB::sysUpTimeInstance type=67 value=Timeticks: (36857) 0:06:08.57
  SNMPv2-MIB::snmpTrapOID.0      type=6  value=OID: KEEPALIVED-MIB::vrrpInstanceStateChange
  KEEPALIVED-MIB::vrrpInstanceName type=4  value=STRING: "SITEINSTANCE1"
  KEEPALIVED-MIB::vrrpInstanceState type=2  value=INTEGER: 2
  KEEPALIVED-MIB::vrrpInstanceInitialState type=2  value=INTEGER: 1
  KEEPALIVED-MIB::routerId.0     type=4  value=STRING: "testmachine2"

The relevant line is the KEEPALIVED-MIB::vrrpInstanceState varbind, whose value is 2. This corresponds to the master(2) state we saw from snmpwalk earlier, which means that we’re technically done.
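
To pull the interesting fields out of such a record by hand, something like the following could be used. It is a sketch that assumes the varbind layout shown in the log excerpt above and an instance name without spaces; trap_summary is a hypothetical helper name.

```shell
# Print "<instance name> <state> <initial state>" for one trap record
# read from stdin.
trap_summary() {
    awk '
        # The trailing space keeps vrrpInstanceStateChange from matching.
        /KEEPALIVED-MIB::vrrpInstanceName /         { gsub(/"/, "", $NF); name = $NF }
        /KEEPALIVED-MIB::vrrpInstanceState /        { state = $NF }
        /KEEPALIVED-MIB::vrrpInstanceInitialState / { initial = $NF }
        END { if (name != "") print name, state, initial }
    '
}
```

Fed the record above, this should reduce it to the instance name followed by the current and initial state numbers.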

What remains is to catch the trap in my monitoring application and act on it. I use Zabbix as my preferred monitoring tool. Thanks to Stephen E. Fritz, there are pre-made templates for keepalived on Zabbix Share, which we can use to gather statistics.

We now have two ways of keeping track of what keepalived is up to: Polling the load balancer with regular SNMP queries, we can create graphs and trends of the uptime and various traffic data from keepalived, and when hit by an SNMP trap, we can easily trigger notification events.

A Zabbix 3.2 SNMP trigger example

I’ve created an SNMP Trap type item, with Type of information: Log. The key for the item is snmptrap[KEEPALIVED-MIB::vrrpInstanceStateChange]. This is complemented by a trigger with a Problem expression where I compare the vrrpInstanceState and the vrrpInstanceInitialState provided by the trap to determine the nature of the event:

{Template SNMP Traps Keepalived:snmptrap[KEEPALIVED-MIB::vrrpInstanceStateChange].str(KEEPALIVED-MIB::vrrpInstanceState type=2  value=INTEGER: 2)}=1 and {Template SNMP Traps Keepalived:snmptrap[KEEPALIVED-MIB::vrrpInstanceStateChange].str(KEEPALIVED-MIB::vrrpInstanceInitialState type=2  value=INTEGER: 1)}=1

To reset the trigger, I’ve set up a corresponding OK Event that closes the problem when the values of vrrpInstanceState and of vrrpInstanceInitialState are equal.
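
The decision logic of the trigger and its OK event can be restated outside Zabbix as a small function, which may help when reasoning about which traps open or close a problem. This is an illustrative sketch, not Zabbix syntax; classify_state_change is a hypothetical name and the UNKNOWN branch is my own addition for combinations the trigger described here does not cover.

```shell
# Classify a state change the way the trigger above does.
# $1 = vrrpInstanceState, $2 = vrrpInstanceInitialState
# (1 = backup, 2 = master, as seen in the snmpwalk output earlier).
classify_state_change() {
    if [ "$1" -eq 2 ] && [ "$2" -eq 1 ]; then
        echo PROBLEM      # a configured backup is currently master
    elif [ "$1" -eq "$2" ]; then
        echo OK           # instance is back in its configured state
    else
        echo UNKNOWN      # not covered by the trigger described here
    fi
}
```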

When I shut down the HAProxy service on the master to trigger a failover, the following line shows up in the Zabbix Problems screen:

[Figure: Zabbix telling us that something is rotten.]
[Figure: Load balancer pair in failover state]

To return to the at-a-glance view mentioned at the start of this article: what do we see when we’ve failed over? As in the image at the start of the article, I’ve configured a bright blue “Color OK” value for the line representing the backup load balancer’s connection state, indicating that the services on that server are running and ready to take over if the master node fails. In the failover image the change is easy to spot: green lines turn red, and the previously blue link to the backup load balancer turns green.

This article has illustrated one of my favourite things about the Unix philosophy: by passing simple text content between programs that each do one thing well, we have created a whole that is greater than the sum of its parts.

2 comments

  1. Hi, I used your howto. On CentOS the config needs to be without comments; with comments, agentx does not work.

    Client step
    1)Open FW
    firewall-cmd --add-service=snmp --permanent
    firewall-cmd --reload
    2)Install snmp
    yum install net-snmp

    3) Change config (my config)
    vi /etc/snmp/snmpd.conf
    #agentAddress udp:127.0.0.1:161

    master agentx
    rocommunity public ZABBIX_SERVER_HOSTNAME

    authtrapenable 1
    trapcommunity public
    trap2sink ZABBIX_SERVER_HOSTNAME

    4) Restart SNMP
    systemctl restart snmpd.service

    5) Change startup option
    vi /etc/sysconfig/keepalived
    # Options for keepalived. See `keepalived --help' output and keepalived(8) and
    # keepalived.conf(5) man pages for a list of all options. Here are the most
    # common ones :
    #
    # --vrrp -P Only run with VRRP subsystem.
    # --check -C Only run with Health-checker subsystem.
    # --dont-release-vrrp -V Dont remove VRRP VIPs & VROUTEs on daemon stop.
    # --dont-release-ipvs -I Dont remove IPVS topology on daemon stop.
    # --dump-conf -d Dump the configuration data.
    # --log-detail -D Detailed log messages.
    # --log-facility -S 0-7 Set local syslog facility (default=LOG_DAEMON)
    #

    KEEPALIVED_OPTIONS="--snmp -D"

    6) Add traps (I added this at the top of the file)
    vi /etc/keepalived/keepalived.conf
    global_defs {
    enable_traps
    }

    7) Restart and check keepalived
    systemctl restart keepalived.service

    Feb 10 13:02:49 ts4000zkdblb02 Keepalived[8533]: Stopping Keepalived v1.2.13 (11/05,2016)
    Feb 10 13:02:49 ts4000zkdblb02 systemd: Cannot add dependency job for unit microcode.service, ignoring: Unit is not loaded properly: Invalid argument.
    Feb 10 13:02:49 ts4000zkdblb02 systemd: Stopping LVS and VRRP High Availability Monitor…
    Feb 10 13:02:49 ts4000zkdblb02 systemd: Starting LVS and VRRP High Availability Monitor…
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived[9431]: Starting Keepalived v1.2.13 (11/05,2016)
    Feb 10 13:02:49 ts4000zkdblb02 systemd: PID file /var/run/keepalived.pid not readable (yet?) after start.
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived[9432]: Starting Healthcheck child process, pid=9433
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived[9432]: Starting VRRP child process, pid=9434
    Feb 10 13:02:49 ts4000zkdblb02 systemd: Started LVS and VRRP High Availability Monitor.
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived_healthcheckers[9433]: Netlink reflector reports IP 10.253.50.11 added
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived_healthcheckers[9433]: Netlink reflector reports IP fe80::250:56ff:fe8b:24d4 added
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived_healthcheckers[9433]: Registering Kernel netlink reflector
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived_healthcheckers[9433]: Registering Kernel netlink command channel
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived_healthcheckers[9433]: Starting SNMP subagent
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived_vrrp[9434]: Netlink reflector reports IP 10.253.50.11 added
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived_vrrp[9434]: Netlink reflector reports IP fe80::250:56ff:fe8b:24d4 added
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived_vrrp[9434]: Registering Kernel netlink reflector
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived_vrrp[9434]: Registering Kernel netlink command channel
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived_vrrp[9434]: Registering gratuitous ARP shared channel
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived_vrrp[9434]: Starting SNMP subagent
    Feb 10 13:02:49 ts4000zkdblb02 Keepalived_vrrp[9434]: NET-SNMP version 5.7.2 AgentX subagent connected
