I recently read an interesting post on monitoring theory by Ted Dziuba. The post colorfully classifies different monitors by their required reaction.
Dziuba's monitor classifications:
- Neither Actionable nor Informative
- Informative, not Actionable
- Actionable and Informative
I'd add a tentative fourth, to complete the permutations: Actionable, not Informative. This is the worst of all, the equivalent of your boss screaming "It's DOWN! It's DOWN! Stuff is BROKEN!"
Why monitoring theory is new in 2011
This perspective on monitoring theory comes out of real-world experience dealing with software systems. The first instinct when systems start failing is to add monitors to every circumstantial piece of evidence. Load average spiked? Monitor and alert. Replication delay? Monitor and alert. This naive monitoring and alerting will keep your system up, but it creates an unsustainable environment for developers and system administrators. With so many alerts flying around, pretty soon your team is either ignoring them or severely sleep-deprived and grumpy.
This is where Dziuba's theory comes into play: identify the required reaction, classify the monitor, and alert or persist with the right tool, so that your system gets the correct amount of attention at the correct time. Monitoring is not new, and disgruntled sysadmins are not new, but there is something about the devops movement that is changing monitoring.
As development and operations align, it is easier to picture sysadmins as bona fide users, and users have the right to an optimal user experience.
What you need to do, when
Here's a little more detail about each level of monitor: the use-case for investigating, and the likely frequency. Like user features, monitoring should be approached with use-cases. Developers and sysadmins are users, and their use-cases and experiences should be defined before solutions are implemented.
Neither Actionable nor Informative
Frequency: monthly, quarterly
Purpose: Review for capacity planning, roadmapping
Informative, not Actionable
Frequency: Daily, weekly
Purpose: Take preventative actions
(Actionable, not Informative)
Purpose: Figure out what is wrong, and then fix it (suck)
Actionable and Informative
Purpose: Execute the predefined steps to correct the error
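The four levels above can be written down as a small lookup, which is a handy way to force yourself to classify a monitor before wiring it to anything. This is only a sketch: the `MonitorClass` type and `classify` function are names I made up for illustration, and the frequency is left empty where no frequency is listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MonitorClass:
    name: str
    purpose: str
    frequency: Optional[str]  # None where no frequency is listed above


# Keyed by (actionable, informative). Hypothetical structure, not Dziuba's.
_CLASSES = {
    (False, False): MonitorClass(
        "Neither Actionable nor Informative",
        "Capacity planning, roadmapping",
        "monthly, quarterly",
    ),
    (False, True): MonitorClass(
        "Informative, not Actionable",
        "Take preventative actions",
        "daily, weekly",
    ),
    (True, False): MonitorClass(
        "Actionable, not Informative",
        "Figure out what is wrong, and then fix it",
        None,
    ),
    (True, True): MonitorClass(
        "Actionable and Informative",
        "Execute the predefined steps to correct the error",
        None,
    ),
}


def classify(actionable: bool, informative: bool) -> MonitorClass:
    """Return the monitor class for the given pair of answers."""
    return _CLASSES[(actionable, informative)]
```

Calling `classify(True, False)` returns the dreaded fourth category, which is a good hint that the monitor needs more context before it is allowed to wake anyone up.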
Communicating and persisting alerts
There are dozens of tools to help with monitoring, and almost as many different forms of communication. Here are a few methods for recording monitor alerts. The key distinction is how much (or how strongly) a specific medium will demand attention. These are listed in order from least attention-demanding to most.
- Log to file
- Graph (munin, collectd, graphite)
- Record (dashboard, for later viewing)
- Twitter feed
- Digest Email
- Acknowledged Page/SMS
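Since the list above is ordered, it maps naturally onto an ordered enum. Here is a minimal sketch, assuming you want to compare channels by how much attention they demand. The `Channel` name and the numeric values are my own invention; only the ordering comes from the list.

```python
from enum import IntEnum


class Channel(IntEnum):
    """Alert channels, ordered from least to most attention-demanding."""
    LOG_FILE = 1
    GRAPH = 2
    DASHBOARD = 3
    TWITTER_FEED = 4
    DIGEST_EMAIL = 5
    PAGE_SMS = 6


def is_louder(a: Channel, b: Channel) -> bool:
    """True if channel a demands more attention than channel b."""
    return a > b
```

With this in place, a rule like "nothing louder than a digest email for informational monitors" becomes a one-line comparison against `Channel.DIGEST_EMAIL`.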
High-level monitoring objectives
As Dziuba notes, a lot of this is subjective. In one system, maybe a queue with 10 items is a problem. In another, maybe 1000 items is fine. The high-level monitoring use-cases give a good picture of the high-level monitoring objectives:
- What do you want to keep running?
- How do you deterministically monitor the health of those systems?
- What secondary and tertiary monitors might give some insight into the system, and how can those be effectively monitored without going crazy?
Monitor specific questions
Here are a few questions that might narrow down the correct alerting tool for a particular monitor.
- If X happens, is user experience necessarily broken?
- If X happens, is user experience necessarily degraded?
- If X happens, will doing Y fix the problem?
- If you go back to sleep, will X likely return to normal state?
If user experience is significantly degraded, and there are concrete actions that will correct the issue, go ahead and demand a reaction.
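The questions above can be turned into a rough decision function. Only the first branch (degraded experience plus a concrete fix means demand a reaction) comes straight from the text; the other two branches, and the names `choose_reaction`, `"notify"` and `"record"`, are assumptions I'm making for the sake of a complete example.

```python
def choose_reaction(ux_broken: bool, ux_degraded: bool,
                    known_fix: bool, self_heals: bool) -> str:
    """Map the four monitor questions to a rough level of reaction."""
    impacted = ux_broken or ux_degraded

    # From the text: users are hurting and there are concrete steps to take.
    if impacted and known_fix:
        return "demand_reaction"

    # Assumption: users are hurting, nothing predefined to do, and it
    # will not recover on its own, so tell someone without paging.
    if impacted and not self_heals:
        return "notify"

    # Assumption: everything else is kept for later viewing.
    return "record"
```

For example, `choose_reaction(False, True, True, False)` comes back as `"demand_reaction"`, while a condition that degrades nothing and heals itself is just recorded.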
Here are a few concrete scenarios we can put to test:
Replication delay outside threshold
Probably will fix itself, this time, but you should know in general what your delay is. Let's graph this in general, maybe jabber if it is 15 minutes behind, and email if it is 1 hour behind.
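As a sketch, the replication rule might look like this. The 15 minute and 1 hour thresholds come from the paragraph above; the function name, the string labels, and the choice to keep the jabber message when escalating to email are my assumptions.

```python
from typing import List

JABBER_AFTER_SECONDS = 15 * 60   # 15 minutes behind
EMAIL_AFTER_SECONDS = 60 * 60    # 1 hour behind


def replication_reactions(delay_seconds: float) -> List[str]:
    """Return the reactions to take for a given replication delay."""
    reactions = ["graph"]  # always graph, so the usual delay is known
    if delay_seconds >= JABBER_AFTER_SECONDS:
        reactions.append("jabber")
    if delay_seconds >= EMAIL_AFTER_SECONDS:
        # Assumption: email is sent in addition to jabber, not instead of it.
        reactions.append("email")
    return reactions
```

A delay of 20 minutes gives `["graph", "jabber"]`; nothing in this rule ever pages anyone, which matches the expectation that the delay will probably fix itself.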
Chef run failed
Thanks to convergence we can be pretty sure this will sort itself out. In specific circumstances, this could be an issue, so maybe Jabber the error so it's on the radar.
NTP Time Update failed
In all but the most finicky systems this should be fine; it'll correct itself soon. No alerting.
Backend processor CPU/Load Average
This is what processors do. Graph, no alerts.
Digest email of processing errors for the past day
These may be actionable or they may not. The digest is a pretty good way to reduce inbox traffic, but monitors are either actionable or they are not. If they are actionable, demand more attention. If they are merely informational, a web dashboard is probably better.