A monitoring system that sent a notification the moment it detected a problem would send a LOT of notifications. In reality there is often a mundane reason why a check failed, such as the service simply timing out before it sent a response. If the service responds correctly the next time it is checked and everything is OK, there was really no reason to send a notification.
Nagios checks a host or service every X minutes. If there is a state change, it re-checks Z minutes apart until Y attempts in total have been made (the check that detected the change counts as attempt 1).
check_interval X
retry_interval Z
max_check_attempts Y
The max_check_attempts directive allows you to define when you think Nagios should treat this as a real problem. While Nagios is still determining this, the host or service is in a SOFT state; once the attempts are used up, it is in a HARD state. So with real numbers:
check_interval 5
retry_interval 1
max_check_attempts 5
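As a rough sanity check on these numbers, the delay before Nagios treats a failure as real can be worked out with a little arithmetic. This is a minimal sketch, not anything Nagios provides: the function names are hypothetical, the intervals are assumed to be in minutes, and check execution time and timeouts are ignored.

```python
def minutes_until_hard(retry_interval, max_check_attempts):
    """Minutes from the first failed check to the check that makes the state HARD."""
    # Attempt 1 is the check that first fails; the remaining attempts are retries.
    return (max_check_attempts - 1) * retry_interval


def worst_case_minutes(check_interval, retry_interval, max_check_attempts):
    """Upper bound from the actual failure to the HARD state.

    Assumes the failure happens just after a successful regular check,
    so a full check_interval passes before Nagios notices anything.
    """
    return check_interval + minutes_until_hard(retry_interval, max_check_attempts)


if __name__ == "__main__":
    print(minutes_until_hard(1, 5))     # 4
    print(worst_case_minutes(5, 1, 5))  # 9
```

With the values above, a failure is confirmed 4 minutes after it is first detected, and at most about 9 minutes after it actually happens.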
Here is what happens:
- 13:10:10
- Nagios HOST check for host1 executed
- Result = OK
- HARD state
- Attempt 1/5
- Next scheduled check 13:15:10
- 13:10:30
- host1 dies
- Nagios does not know about this yet
- 13:15:10
- Nagios HOST check for host1 executed
- Check timed out as host is down
- Result = CRITICAL
- SOFT state
- Attempt 1/5
- No notification sent
- Next scheduled check 13:16:10
- 13:16:10
- Nagios HOST check for host1 executed
- Check timed out as host is down
- Result = CRITICAL
- SOFT state
- Attempt 2/5
- No notification sent
- Next scheduled check 13:17:10
- 13:17:10
- Nagios HOST check for host1 executed
- Check timed out as host is down
- Result = CRITICAL
- SOFT state
- Attempt 3/5
- No notification sent
- Next scheduled check 13:18:10
- 13:18:10
- Nagios HOST check for host1 executed
- Check timed out as host is down
- Result = CRITICAL
- SOFT state
- Attempt 4/5
- No notification sent
- Next scheduled check 13:19:10
- 13:19:10
- Nagios HOST check for host1 executed
- Check timed out as host is down
- Result = CRITICAL
- HARD state
- Attempt 5/5
- Notification SENT
- Next scheduled check 13:24:10
- Next notification 13:49:10
- 13:24:10
- Nagios HOST check for host1 executed
- Check timed out as host is down
- Result = CRITICAL
- HARD state
- Attempt 5/5
- No Notification SENT as Next notification is at 13:49:10
- Next scheduled check 13:29:10
- 13:27:20
- host1 recovers
- Nagios does not know about this yet
- 13:29:10
- Nagios HOST check for host1 executed
- Result = OK
- HARD state
- Attempt 1/5
- Next scheduled check 13:34:10
Basically the logic is HARD OK > SOFT > HARD non-OK > HARD OK. There is no SOFT state when transitioning back to HARD OK.
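The transition logic can be sketched as a small state machine. This is an illustrative model of the behaviour described above, not Nagios code: the `CheckTracker` class and `simulate` function are hypothetical names, recovery notifications and repeat notifications (which depend on the notification interval) are not modelled, and any non-OK result is treated the same way.

```python
from dataclasses import dataclass


@dataclass
class CheckTracker:
    """Hypothetical model of one host or service; not part of Nagios itself."""
    max_check_attempts: int
    state: str = "OK"
    state_type: str = "HARD"
    attempt: int = 1

    def process(self, result: str) -> bool:
        """Apply one check result; return True if a problem notification would go out."""
        if result == "OK":
            # Recovery returns straight to HARD OK, with no SOFT step.
            self.state, self.state_type, self.attempt = "OK", "HARD", 1
            return False
        if self.state == "OK":
            # First failed check after an OK state: attempt 1, normally SOFT.
            self.state, self.attempt = result, 1
            self.state_type = "HARD" if self.max_check_attempts <= 1 else "SOFT"
            return self.state_type == "HARD"
        self.state = result
        if self.state_type == "SOFT":
            self.attempt += 1
            if self.attempt >= self.max_check_attempts:
                # Attempts used up: the problem is now treated as real.
                self.state_type = "HARD"
                return True
        # Already HARD non-OK: repeat notifications are not modelled here.
        return False


def simulate(results, max_check_attempts):
    """Return (result, state_type, attempt, notified) for each check result in order."""
    tracker = CheckTracker(max_check_attempts)
    trace = []
    for result in results:
        notified = tracker.process(result)
        trace.append((result, tracker.state_type, tracker.attempt, notified))
    return trace


if __name__ == "__main__":
    # One OK check, six failed checks, then a recovery, as in the timeline above.
    for row in simulate(["OK"] + ["CRITICAL"] * 6 + ["OK"], max_check_attempts=5):
        print(row)
```

Fed the same sequence of results as the timeline, the sketch stays SOFT for attempts 1 to 4, goes HARD and notifies on attempt 5, stays HARD without notifying on the next failed check, and returns to HARD OK at attempt 1 on recovery.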