HARD and SOFT States

A monitoring system that sent a notification the moment it detected a problem would send a LOT of notifications. In reality there is often a mundane reason why a check failed, such as the service simply timing out while sending its response. If the service responds correctly the next time it is checked, everything is OK and there was never a reason to send a notification.

Nagios checks a host or service every X minutes. If a check detects a state change, Nagios re-checks Z minutes apart until Y check attempts in total have been made (the check that detected the change counts as attempt 1).

    check_interval X
    retry_interval Z
    max_check_attempts Y


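The max_check_attempts directive lets you define the point at which Nagios should treat a failure as a real problem. While Nagios is still re-checking and has not yet reached that number of attempts, the host or service is in a SOFT state. Once the attempts are used up, the state becomes HARD.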
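The effect of these directives on timing can be worked out with a little arithmetic. The following is a minimal sketch, assuming all intervals are in minutes as described above; the function names are hypothetical and are not part of Nagios.

```python
def minutes_until_hard(retry_interval: int, max_check_attempts: int) -> int:
    """Minutes between the first failed check and the check that makes the state HARD."""
    # Attempt 1 is the regular check that failed. Each of the remaining
    # (max_check_attempts - 1) attempts is a re-check retry_interval minutes later.
    return (max_check_attempts - 1) * retry_interval


def worst_case_detection(check_interval: int, retry_interval: int,
                         max_check_attempts: int) -> int:
    """Upper bound, in minutes, from the actual failure to the HARD state."""
    # A failure that happens just after a good check goes unnoticed for
    # almost a full check_interval before the first failed check runs.
    return check_interval + minutes_until_hard(retry_interval, max_check_attempts)


# Hypothetical values: check every 10 minutes, retry every 2, allow 3 attempts.
print(minutes_until_hard(2, 3))        # 4
print(worst_case_detection(10, 2, 3))  # 14
```

Raising max_check_attempts or retry_interval filters out more transient failures, at the cost of a longer delay before a real problem is reported.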

So with real numbers:

    check_interval 5
    retry_interval 1
    max_check_attempts 5

Here is what happens:

  • 13:10:10
    • Nagios HOST check for host1 executed
    • Result = OK
    • HARD state
    • Attempt 1/5
    • Next scheduled check 13:15:10
  • 13:10:30
    • host1 dies
    • Nagios does not know about this yet
  • 13:15:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 1/5
    • No notification sent
    • Next scheduled check 13:16:10
  • 13:16:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 2/5
    • No notification sent
    • Next scheduled check 13:17:10
  • 13:17:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 3/5
    • No notification sent
    • Next scheduled check 13:18:10
  • 13:18:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 4/5
    • No notification sent
    • Next scheduled check 13:19:10
  • 13:19:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • HARD state
    • Attempt 5/5
    • Notification SENT 
    • Next scheduled check 13:24:10
    • Next notification 13:49:10
  • 13:24:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • HARD state
    • Attempt 5/5
    • No notification sent, as the next notification is not due until 13:49:10
    • Next scheduled check 13:29:10
  • 13:27:20
    • host1 recovers
    • Nagios does not know about this yet
  • 13:29:10
    • Nagios HOST check for host1 executed
    • Result = OK
    • HARD state
    • Attempt 1/5
    • Next scheduled check 13:34:10


Basically the logic is HARD OK > SOFT > HARD non-OK > HARD OK.

There is no SOFT state when transitioning back to HARD OK.
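The timeline and transition rule above can be sketched as a small state machine. This is an illustration of the simplified model described in this article, not Nagios's actual implementation; the CheckState class and its process method are hypothetical names, and repeat notifications are left out.

```python
from dataclasses import dataclass


@dataclass
class CheckState:
    max_check_attempts: int
    status: str = "OK"        # result of the most recent check
    state_type: str = "HARD"  # "SOFT" or "HARD"
    attempt: int = 1

    def process(self, result: str) -> bool:
        """Apply one check result; return True when the first problem notification is due."""
        if result == "OK":
            # Simplification from the article: an OK result goes straight
            # back to HARD OK, with no SOFT step in between.
            self.status, self.state_type, self.attempt = "OK", "HARD", 1
            return False
        was_hard_problem = self.status != "OK" and self.state_type == "HARD"
        if self.status == "OK":
            self.attempt = 1       # first failure after OK starts the count
        elif not was_hard_problem:
            self.attempt += 1      # still SOFT: count this re-check
        self.status = result
        if self.attempt >= self.max_check_attempts:
            self.state_type = "HARD"
            # Only the SOFT -> HARD transition triggers the notification here.
            return not was_hard_problem
        self.state_type = "SOFT"
        return False


state = CheckState(max_check_attempts=5)
for result in ["CRITICAL"] * 6 + ["OK"]:
    notify = state.process(result)
    print(f"{result:8} {state.state_type} "
          f"attempt {state.attempt}/{state.max_check_attempts} notify={notify}")
```

Fed the same sequence of results as the timeline, the sketch stays SOFT for attempts 1 through 4, turns HARD and notifies on attempt 5, stays HARD without a new notification on the next failed check, and returns to HARD OK at attempt 1 on recovery.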