Service Monitoring

A post about monitoring service improvements on the Mythic Beasts blog made me think about all of the services that we monitor within the OldElvet family of systems. We use Icinga (previously we used Nagios) to monitor our systems and, for the most part, we follow the traditional style of monitoring general server health such as CPU usage, disk usage and system load.

That helps us to monitor what is going on within the system, but there are a number of other things that we monitor to ensure that the services on the system are being good network citizens. These checks run less often: every 6 to 24 hours, compared with every 5 minutes for general server health. Examples include:

  • Email blacklisting – as good citizens we try (and succeed) not to send out spam, but there is always a risk that something goes wrong, and getting a heads-up sooner rather than later is a good thing. As such, we run a DNS RBL lookup for all of our externally visible IP addresses.

    define command {
        command_name    check_bl
        command_line    /usr/lib/nagios/plugins/check_bl -H $HOSTADDRESS$ -B sbl-xbl.spamhaus.org
    }
    
  • Email backscatter – backscatter is a particularly cruel form of spam where you end up filling other people's inboxes with bounces for spam that spammers tried to send to you using forged sender addresses.

    define command {
        command_name    check_backscatter
        command_line    /usr/lib/nagios/plugins/check_bl -H $HOSTADDRESS$ -B ips.backscatterer.org
    }
    

    The main way to eliminate backscatter is to reject mail as it arrives at your systems rather than accepting it and then bouncing it a few seconds later.

    SPF is another good way of reducing backscatter back to yourself. By carefully controlling which servers are allowed to send email from your domains, you can pretty much eliminate bounce-back spam. SPF has some downsides with traditional email forwarding, but if you can avoid systems that rely on forwarding it works fine.

  • TLS/SSL certificate validity checks – there have been a number of high-profile cases where certificates have expired, taking down various cloud services and similar systems. It is quite easy to add a check that connects to your TLS/SSL systems and warns if the certificate expiry date is approaching.

    define command {
        command_name    check_simap_cert
        command_line    /usr/lib/nagios/plugins/check_imap -p 993 -H '$HOSTADDRESS$' -D 30
    }
    
  • NTP (Network Time Protocol) – time drift and lost/broken upstream servers. This has been less of an issue recently, but a few years ago it was fairly common for NTP servers to stop working or to drift, and a check of time consistency provides a good guard against that.

    define command {
        command_name    check_ntp_time_over_ssh
        command_line    /usr/lib/nagios/plugins/check_by_ssh -t 50 -H $HOSTADDRESS$ -p 22 -l auser -i /etc/akey -C "/usr/lib/nagios/plugins/check_ntp_time -H $ARG1$ -w 0.5 -c 5"
    }
    
  • APT security update availability – applying patches is a bore at times, but it does help to keep your systems secure. A simple check can ensure that your monitoring system warns you when new versions are available.

    define command {
        command_name    check_apt_over_ssh
        command_line    /usr/lib/nagios/plugins/check_by_ssh -t 50 -H $HOSTADDRESS$ -p 22 -l auser -i /etc/akey -C "/usr/lib/nagios/plugins/check_apt -t 30"
    }
    

    The existing tools do not take held and forbidden package versions into account but for the most part they do help to keep you honest with security updates.

  • WordPress update availability – this helps to warn about any pending updates and is probably more important than general OS security updates.

    See https://binfalse.de/software/nagios/check_wp-php/

These checks and more help to give confidence that the system isn’t slipping into disrepair.
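
For anyone curious what the blacklist and backscatter checks above boil down to, here is a minimal Python sketch of a DNS blocklist lookup. It is not the code of the check_bl plugin; the function names `dnsbl_query_name` and `is_listed` are made up for illustration, it only handles IPv4, and it assumes the usual DNSBL convention that a listed address resolves to an A record while an unlisted one does not resolve at all.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name that a blocklist lookup queries for an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    # DNSBLs are queried with the octets reversed, followed by the zone name.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the blocklist zone returns an A record for the address.

    `resolve` is injectable so the logic can be exercised without the network.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # Assumption: any resolution failure is treated as "not listed".
        # A real check would want to tell NXDOMAIN apart from a DNS outage.
        return False
    return True
```

For example, `dnsbl_query_name("192.0.2.10", "ips.backscatterer.org")` gives `10.2.0.192.ips.backscatterer.org`, and `is_listed` simply asks whether that name resolves. Some blocklists also use special return codes to signal query errors, which this sketch ignores.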