Dependencies - Services

Service dependencies allow you to suppress notifications and active checks of services based on the status of one or more other services. Why would you want this? Doesn't this already happen when a host goes down?

Yes and No.

Yes, when a host goes down (non-OK state), Nagios suppresses notifications for its child services.

No, when a host goes down (non-OK state), Nagios continues to actively execute the checks on its child services.

However, there are many scenarios where a host does not go down, yet we still want to stop checks from executing and notifications from being sent.

NRPE Services

You are monitoring Windows servers using check_nrpe, and the agent on the Windows machine is NSClient++. What happens when the NSClient++ agent suddenly crashes? The Windows host isn't down, so the checks will continue to execute and the notifications will be sent.

A way to prevent this from generating a lot of alerts is to implement a service dependency where all the services that rely on check_nrpe depend on an "NRPE master service".

First, here are the definitions that the dependencies will be based on:

define host {
    use windows-server
    host_name host1
    alias host1
    address 10.25.14.51
    }

define host {
    use windows-server
    host_name host2
    alias host2
    address 10.25.14.52
    }

define hostgroup {
    hostgroup_name nrpe_hosts_windows
    alias nrpe_hosts_windows
    members host1,host2
    }

define command {
    command_name check_nrpe_status
    command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30
    }

define command {
    command_name check_nrpe
    command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$ -a $ARG2$
    }

define service {
    use local-service
    hostgroup_name nrpe_hosts_windows
    service_description NRPE Status
    check_command check_nrpe_status
    max_check_attempts 2
    check_interval 1
    retry_interval 1
    }

define service {
    use local-service
    hostgroup_name nrpe_hosts_windows
    service_description CPU Load
    check_command check_nrpe!CheckCPU!warn=80 crit=90 time=1m time=5m time=15m ShowAll
    max_check_attempts 4
    check_interval 1
    retry_interval 1
    }

define service {
    use local-service
    hostgroup_name nrpe_hosts_windows
    service_description Disk Usage - C:
    check_command check_nrpe!CheckDriveSize!ShowAll MinWarn=10G MinCrit=5G Drive=C:
    max_check_attempts 4
    check_interval 1
    retry_interval 1
    }


One of these services is called NRPE Status. When NSClient++ is healthy, it simply returns a response like I (0,4,1,105 2014-04-28) seem to be doing fine... with an OK status. If NSClient++ fails to respond, check_nrpe will return a response like CHECK_NRPE: Socket timeout after 30 seconds. with a CRITICAL status. When this happens, we don't want any other NRPE-based checks to run, or to be notified about them, until the original problem is resolved.

The Most Important Settings

One of the most overlooked settings for services is the combination of check_interval, retry_interval and max_check_attempts. If you look at the service examples above, you'll notice that:

  • NRPE Status
    • max_check_attempts 2
  • CPU Load,Disk Usage - C:
    • max_check_attempts 4

What this means is that the NRPE Status check will go "down" after 2 check attempts and the CPU Load and Disk Usage - C: services will go "down" after 4 check attempts.

Why is this important?

It means that the NRPE Status service is guaranteed to go down BEFORE the other services, and at this point the service dependency will take effect. If they all had the same max_check_attempts value, then it is very possible that one of the other services could go down BEFORE the NRPE Status service, and then notifications would be sent.

Of course, check_interval and retry_interval also need to be taken into consideration. In this case they both have the value of 1 to keep the example simple.
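As a rough worked example of why the lower max_check_attempts matters, the timeline below uses the values from the service definitions above. It assumes the two nagios.cfg settings shown are left at their defaults, and it ignores scheduling latency and the 30 second plugin timeout, so treat the figures as approximate rather than exact.

```
# nagios.cfg (assumed to be left at the defaults)
# interval_length=60 means check_interval and retry_interval are in minutes
interval_length=60
# soft_state_dependencies=0 means dependencies only react to HARD states
soft_state_dependencies=0

# NRPE Status (max_check_attempts 2, retry_interval 1)
#   first failed check        -> SOFT CRITICAL, attempt 1 of 2
#   retry 1 minute later      -> HARD CRITICAL, attempt 2 of 2
#   time from first failure to HARD state: (2 - 1) x 1 = 1 minute
#
# CPU Load, Disk Usage - C: (max_check_attempts 4, retry_interval 1)
#   first failed check        -> SOFT CRITICAL, attempt 1 of 4
#   3 retries, 1 minute apart -> HARD CRITICAL, attempt 4 of 4
#   time from first failure to HARD state: (4 - 1) x 1 = 3 minutes
```

With check_interval 1, the NRPE Status check should notice the crash within about a minute, so it reaches a HARD state roughly two minutes after the crash at the latest, while the dependent services need at least three minutes. Once NRPE Status is in a HARD CRITICAL state the dependency applies, and the dependent services should not get to run their remaining retries.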

Single Host

A service dependency definition for a single host is a good starting example. The definition is as follows:

define servicedependency {
    host_name host1
    service_description NRPE Status
    dependent_service_description CPU Load,Disk Usage - C:
    inherits_parent 1
    execution_failure_criteria u,c,p
    notification_failure_criteria u,c,p
    dependency_period 24x7
    }


What does this mean?
  • host_name host1
    • The host being depended upon
  • service_description NRPE Status
    • This is the service that acts as the "master service", the service that is being depended upon
  • dependent_service_description CPU Load,Disk Usage - C:
    • These are the services that are dependent on the NRPE Status service
  • execution_failure_criteria u,c,p
    • When NRPE Status service is in the UNKNOWN (u), CRITICAL (c) or PENDING (p) state, the dependent services CPU Load, Disk Usage - C: will NOT be executed
    • If you watch these services in Nagios, you'll see that they keep being scheduled for the next check interval defined in the service. When this time is reached, the check will NOT be executed and will simply be rescheduled. This leaves the service in the same state it was in BEFORE the "master service" went down
  • notification_failure_criteria u,c,p
    • When NRPE Status service is in the UNKNOWN (u), CRITICAL (c) or PENDING (p) state, the dependent services CPU Load, Disk Usage - C: will NOT have notifications sent out (if they were already in a state for notifications to be sent)
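The two failure criteria directives do not have to match. Besides u, c and p, they also accept o (OK), w (WARNING) and n (none). As a hedged sketch of a variation on the definition above, the following keeps the dependent checks running but still suppresses their notifications. Including w is only an illustrative assumption here, since the NRPE Status check in this example is only described as returning OK or CRITICAL.

```
define servicedependency {
    host_name host1
    service_description NRPE Status
    dependent_service_description CPU Load,Disk Usage - C:
    inherits_parent 1
    execution_failure_criteria n           ; n = none, dependent checks are always executed
    notification_failure_criteria w,u,c    ; no notifications while NRPE Status is WARNING, UNKNOWN or CRITICAL
    dependency_period 24x7
    }
```

With this variation the dependent services keep updating their state in Nagios while the "master service" is down, at the cost of checks that are likely to time out.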


Multiple Hosts - Individually Defined Using host_name Directive

Now let's make the second host (host2) use the same dependencies. You can just add the host to the host_name directive, separating the hosts with a comma:

define servicedependency {
    host_name host1,host2
    service_description NRPE Status
    dependent_service_description CPU Load,Disk Usage - C:
    inherits_parent 1
    execution_failure_criteria u,c,p
    notification_failure_criteria u,c,p
    dependency_period 24x7
    }


What does this mean?
  • host_name host1,host2
    • The hosts being depended upon
    • The multiple hosts have NO relationship with each other
  • The remaining directives remain the same
  • That was a simple way to make multiple hosts have the same dependencies
  • ALL hosts MUST have the same named services
    • If host2 did not have the CPU Load or Disk Usage - C: service then Nagios would fail to start

Multiple Hosts - Defined Using hostgroup_name Directive

The previous example allowed the dependency to be applied to multiple hosts. However, this can become an administrative overhead: each time you add a new host to be monitored, you need to update the dependency to include the new host.

Using a hostgroup instead is a much simpler way to achieve this. Considering you need the same named services on each host, it's more than likely you'll be using hostgroups to apply services to multiple hosts, as in this example.

I defined a hostgroup called nrpe_hosts_windows earlier, so here we use the hostgroup_name directive:

define servicedependency {
    hostgroup_name nrpe_hosts_windows
    service_description NRPE Status
    dependent_service_description CPU Load,Disk Usage - C:
    inherits_parent 1
    execution_failure_criteria u,c,p
    notification_failure_criteria u,c,p
    dependency_period 24x7
    }


What does this mean?
  • hostgroup_name nrpe_hosts_windows
    • Any host that is a member of this hostgroup will get the service dependency
    • The multiple hosts have NO relationship with each other
  • The remaining directives remain the same
  • Less administrative overhead applying service dependencies this way
  • ALL hosts MUST have the same named services
    • If host2 did not have the CPU Load or Disk Usage - C: service then Nagios would fail to start