Using Information About Our Network to Remove Monitoring Noise

Our team adds new checks and alerts every week so that we can stay ahead of new issues. We try very hard to make sure that each alert is configured and tested such that it provides timely and credible evidence of a real problem. Sometimes, though, when things go wrong we are inundated with alerts, which actually hinders and confuses our problem identification and resolution.

A real world example

A server with two 10 Gigabit network connections experiences a hardware failure and spontaneously reboots. Our Campfire room is filled with alerts not just for the host being down, but also for the switch (ports) the host is connected to.

We monitor the switch ports because we want to know that they are at the correct speed, that there are no individual failures, and that no “foreign” devices have been plugged into the network. In the case of a host failure, the information about the switch ports is secondary to the information about the host—but it represents 2x the volume of alert data we receive.

In cases like this we need to make our monitoring system more aware of the dependencies that exist between these checks so that we can eliminate the noise. To do so we use a number of open source technologies:

Link Layer Discovery Protocol

The Link Layer Discovery Protocol (LLDP) is a vendor-neutral link layer protocol in the Internet Protocol Suite used by network devices for advertising their identity, capabilities, and neighbors on an IEEE 802 local area network, principally wired Ethernet.

(Via http://en.wikipedia.org/wiki/Link_Layer_Discovery_Protocol.)

A more human-readable description is that LLDP is a special protocol that allows us to find out which switches (and switch ports) a given server is plugged into.

First we had to configure our switches to support LLDP. We did so using a basic global configuration entry:

protocol lldp
 advertise management-tlv system-description system-name

After the switches are configured we can collect information from each host through the use of lldpctl. Here’s some sample output from lldpctl:

-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    eth0, via: LLDP, RID: 1, Time: 16 days, 18:45:59
  Chassis:
    ChassisID:    mac 00:01:e8:8b:0a:c1
    SysName:      zk100-switch1
    SysDescr:     Dell Force10 Real Time Operating System...
  Port:
    PortID:       ifname TenGigabitEthernet 0/23
    PortDescr:    Not received
-------------------------------------------------------------------------------
Interface:    eth1, via: LLDP, RID: 2, Time: 16 days, 18:46:03
  Chassis:
    ChassisID:    mac 00:01:e8:8b:0a:82
    SysName:      zk100-switch2
    SysDescr:     Dell Force10 Real Time Operating System...
  Port:
    PortID:       ifname TenGigabitEthernet 0/23
    PortDescr:    Not received

As you can see we get information on each connected interface. If the interfaces are in a port-channel (bonded) we would get information about the port-channel too.

What’s nice is that lldpctl can present the output in multiple ways. By passing in ‘-f keyvalue’ we get the same information, but formatted in a way that we can easily parse:

lldp.eth0.via=LLDP
lldp.eth0.rid=1
lldp.eth0.age=16 days, 18:49:22
lldp.eth0.chassis.mac=00:01:e8:8b:0a:c1
lldp.eth0.chassis.name=zk100-switch1
lldp.eth0.chassis.descr=Dell Force10 Real Time Operating System...
lldp.eth0.port.ifname=TenGigabitEthernet 0/23
lldp.eth0.port.descr=Not received
lldp.eth1.via=LLDP
lldp.eth1.rid=2
lldp.eth1.age=16 days, 18:49:26
lldp.eth1.chassis.mac=00:01:e8:8b:0a:82
lldp.eth1.chassis.name=zk100-switch2
lldp.eth1.chassis.descr=Dell Force10 Real Time Operating System...
lldp.eth1.port.ifname=TenGigabitEthernet 0/23
lldp.eth1.port.descr=Not received
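
To illustrate how little work the keyvalue format needs, here is a minimal Ruby sketch that turns lines like the ones above into a per-interface summary. The helper name `parse_lldp_keyvalue` and the shape of the hash it returns are our own choices for illustration; they are not part of lldpctl.

```ruby
# Hypothetical helper (not part of lldpctl): turn `lldpctl -f keyvalue`
# output into { "eth0" => { "chassis.name" => "...", ... }, ... }.
def parse_lldp_keyvalue(text)
  text.each_line.with_object({}) do |line, result|
    key, value = line.chomp.split("=", 2)  # split on the first "=" only
    next if key.nil? || value.nil?         # skip blank or malformed lines

    prefix, interface, field = key.split(".", 3)
    next unless prefix == "lldp" && field

    (result[interface] ||= {})[field] = value
  end
end

# A few lines taken from the sample output above.
sample = <<~OUTPUT
  lldp.eth0.chassis.name=zk100-switch1
  lldp.eth0.port.ifname=TenGigabitEthernet 0/23
  lldp.eth1.chassis.name=zk100-switch2
  lldp.eth1.port.ifname=TenGigabitEthernet 0/23
OUTPUT

neighbors = parse_lldp_keyvalue(sample)
neighbors.each do |interface, fields|
  puts "#{interface} -> #{fields['chassis.name']} (#{fields['port.ifname']})"
end
```

Splitting on only the first "=" keeps any value that happens to contain an equals sign intact.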

Gathering the information provided by LLDP

So how do we gather this data on every server so that we can use it to construct the correct service dependency? Since we use Chef for configuration management we can leverage an Ohai plugin!

Ohai is a tool that is used to detect certain properties about a node’s environment and provide them to the chef-client during every Chef run.

So every time the chef-client runs, we can gather the data and make it available. As if that weren’t easy enough, John Dewey posted just such an Ohai plugin that we can use:


#
# Cookbook Name:: ohai
# Plugin:: lldp
#
# "THE BEER-WARE LICENSE" (Revision 42):
# <[email protected]> wrote this file. As long as you retain this notice you
# can do whatever you want with this stuff. If we meet some day, and you think
# this stuff is worth it, you can buy me a beer in return John-B Dewey Jr.
#

provides "linux/lldp"

lldp Mash.new

def hashify h, list
  if list.size == 1
    return list.shift
  end

  key    = list.shift
  h[key] ||= {}
  h[key] = hashify h[key], list
  h
end

begin
  cmd = "lldpctl -f keyvalue"
  status, stdout, stderr = run_command(:command => cmd)

  stdout.split("\n").each do |element|
    key, value = element.split("=", 2) # split on the first "=" only, so values containing "=" survive
    elements = key.split(/\./)[1..-1].push value

    hashify lldp, elements
  end

  lldp
rescue => e
  Chef::Log.warn "Ohai lldp plugin failed with: '#{e}'"
end

Now every one of our Chef node (server) objects has an lldp attribute which we can use to build the correct service dependency.

Since we manage our Nagios configuration via Chef we just need to add a few lines to the service_dependency ERB template:


<% @hardware_nodes.each do |hardware_node| %>
  <% if hardware_node[:lldp] && hardware_node['hostname'] != node['hostname'] %>
    <% hardware_node[:lldp].each do |int, data| %>
      <% if data['port']['ifname'] and data['chassis']['name'] %>
        define servicedependency {
          host_name  <%= hardware_node['hostname'] %>
          service_description  Check_MK
          dependent_service_description  Interface <%= data['port']['ifname'] %>
          dependent_host_name  <%= data['chassis']['name'] + "." + hardware_node[:domain] %>
          notification_failure_criteria     w,c
        }
      <% end %>
    <% end %>
  <% end %>
<% end %>

Which results in service dependency entries like this:

define servicedependency {
        host_name       cats-02
        service_description     Check_MK
        dependent_service_description   Interface TenGigabitEthernet 0/34
        dependent_host_name     zk100-switch2.sc-chi-int.37signals.com
        notification_failure_criteria            w,c
}

In the above example we declare that the switch port check on zk100-switch2 depends on the Check_MK service on cats-02. The outcome of this dependency is that instead of getting an alert for both the host and its associated switch ports being down, we only get alerted that the host is down. (We know/expect the switch ports to be down too.)
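
To make the effect concrete, here is a rough Ruby model of the decision. It is our own simplification for illustration, not Nagios code: the `suppress_notification?` helper and the state symbols are invented names. The `w,c` value of notification_failure_criteria means that notifications for the dependent service are held back while the master service is in a WARNING or CRITICAL state.

```ruby
# Simplified model of notification_failure_criteria (not actual Nagios code).
# Letters follow the Nagios option codes: o=OK, w=WARNING, u=UNKNOWN, c=CRITICAL.
STATE_CODES = { ok: "o", warning: "w", unknown: "u", critical: "c" }.freeze

# Hypothetical helper: true when the dependent service's notification
# should be held back because the master service is in a listed state.
def suppress_notification?(master_state, failure_criteria)
  listed = failure_criteria.split(",").map(&:strip)
  listed.include?(STATE_CODES.fetch(master_state))
end

criteria = "w,c" # as in the servicedependency entry above

# Check_MK on cats-02 is CRITICAL: is the switch port alert held back?
puts suppress_notification?(:critical, criteria)
# Check_MK on cats-02 is OK: is a switch port alert still held back?
puts suppress_notification?(:ok, criteria)
```

In this model a switch port that goes down while the host is healthy still alerts as usual, which is the behaviour we want.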

Additional Configuration

We also needed to set “soft_state_dependencies=1” in our Nagios configuration:

This option determines whether or not Nagios will use soft state information when checking host and service dependencies. Normally Nagios will only use the latest hard host or service state when checking dependencies. If you want it to use the latest state (regardless of whether its a soft or hard state type), enable this option.

(Via http://nagios.sourceforge.net/docs/3_0/configmain.html.)
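
One rough way to picture the option, again as our own simplified Ruby model rather than Nagios code (the `ServiceStatus` struct and `state_for_dependency_check` are names made up for this sketch): while the master service is still in a soft state, the dependency check looks at either its last hard state or its current state, depending on the setting.

```ruby
# Simplified model of soft_state_dependencies (not actual Nagios code).
# ServiceStatus and state_for_dependency_check are hypothetical names.
ServiceStatus = Struct.new(:current_state, :state_type, :last_hard_state)

def state_for_dependency_check(status, soft_state_dependencies)
  if status.state_type == :soft && !soft_state_dependencies
    status.last_hard_state # option off: ignore the unconfirmed soft state
  else
    status.current_state   # option on, or the state is already hard
  end
end

# Check_MK on a host that has just failed: CRITICAL, but still being retried.
check_mk = ServiceStatus.new(:critical, :soft, :ok)

puts state_for_dependency_check(check_mk, false) # soft_state_dependencies=0
puts state_for_dependency_check(check_mk, true)  # soft_state_dependencies=1
```

In this model, with the option off the master service still counts as OK while it is being retried, so the switch port alerts would not be held back during that window.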

Here’s the difference it makes in our Campfire when we get these alerts:

Without Service Dependency

sc-chi Interface TenGigabitEthernet 0/10 CRITICAL zk100-switch1.sc-chi-int.37signals.com PROBLEM CRIT - (down)(!!) MAC: 00:01:e8:8b:08:2c, 1GBit/s, in: 478.74kB/s(0.4%), out: 0.00B/s(0.0%).

sc-chi Interface TenGigabitEthernet 0/10 CRITICAL zk100-switch2.sc-chi-int.37signals.com PROBLEM CRIT - (down)(!!) MAC: 00:01:e8:8b:08:2c, 1GBit/s, in: 403.11kB/s(0.3%), out: 0.00B/s(0.0%)

sc-chi Interface TenGigabitEthernet 0/6 CRITICAL zk100-switch1.sc-chi-int.37signals.com PROBLEM CRIT - (down)(!!) MAC: 00:01:e8:8b:08:2c, 1GBit/s, in: 0.00B/s(0.0%), out: 0.00B/s(0.0%).

sc-chi Interface TenGigabitEthernet 0/6 CRITICAL zk100-switch2.sc-chi-int.37signals.com PROBLEM CRIT - (down)(!!) MAC: 00:01:e8:8b:08:2c, 1GBit/s, in: 0.00B/s(0.0%), out: 0.00B/s(0.0%).

sc-chi cats-02 DOWN PROBLEM DOWN CRITICAL - 10.99.22.37: Host unreachable @ 10.99.22.1. rta nan, lost 100%

With Service Dependency

sc-chi cats-02 DOWN PROBLEM DOWN CRITICAL - 10.99.22.37: Host unreachable @ 10.99.22.1. rta nan, lost 100%

Much Better!

Taylor wrote this on Jul 15 2013. There are 3 comments.

Devon

on 15 Jul 13

In reference to:

“A server with two 10 Gigabit network connections …”

Does your datacenter actually provide you 10 gigabit links? Or are you just using a network card capable of 10 gigabit, but you’re on an actual 100/1000 megabit link?

Taylor

on 16 Jul 13

@Devon,

We manage all of our own infrastructure so the datacenter only provides space, power and cooling, and connections to the meet me room. We utilize multiple 1 Gigabit links from a number of providers for our Internet connections.

In the text you referenced I was referring to our local network. We use multiple 10 Gigabit top of rack switches and we connect every server to two switches for redundancy. This allows us to have a reasonable level of fault tolerance and to conduct maintenance on the network without user facing interruption.

Jaime

on 19 Jul 13

Hi Taylor,

Do you use any third-party (external) monitoring solution as well?

I’m looking for outstanding options for a benchmarking study.

Thank you! Jaime.co