Our poor choice of words: "Server Down" alerts are now "No Data" alerts
October 18, 2011
What we’ve been referring to as “Server Down Notifications” suffer from two problems:
- False Positives – The last thing you need is an alert from Scout at 3am telling you that your server is down when it isn’t.
- An overaggressive email subject – The subject line of these emails looks like “Server is DOWN”. However, this isn’t accurate. We just know that Scout hasn’t received data from this server. That doesn’t necessarily mean the server is down.
We’re making two changes to these alerts:
- The email subject now states “Server isn’t reporting”. This is really all we know.
- These alerts used to fire when the agent didn’t report for five minutes. This was overly aggressive. While it doesn’t happen often, things can go wrong between your server and ours that can cause a reporting pothole. Alerts are now sent when the agent hasn’t reported for 30 minutes. It’s important to tell you when monitoring isn’t working, but not important enough to risk sending a false positive because of a short reporting outage.
So, what do we suggest to verify that a server is down? An external monitoring service like Pingdom is a good option. Services like Pingdom can alert you if a server can’t be pinged and/or isn’t accepting SSH/TCP connections. Failed external checks and a lack of internal metrics from Scout often indicate that a server really is dead.
We’re confident (1) keeping our finger off the alarm button a bit longer and (2) calling these alerts what they really are will give you a more comfortable experience. We’re developers too and we know the heartburn a 3am SMS causes.