DevOps UX: Rewriting a Sensu check for the operator

tl;dr: Taking operator use-cases into consideration when writing monitoring tools yields output that is more actionable and more pleasant to use.


Digging in

Let’s examine a Sensu check that searches journald for matching log messages and alerts based on how many it finds.

Here’s the original output, which makes my head hurt (especially when I groggily wake up from REM sleep to address a PagerDuty alert).


Executing 'journalctl --no-pager -a MESSAGE_ID=72395a954c4e4449921c42a5faab671c --since=-10minutes'
Match: -- Logs begin at Thu 2016-03-03 02:42:33 UTC, end at Thu 2016-03-17 19:39:37 UTC. --
Ignoring: -- Logs begin at Thu 2016-03-03 02:42:33 UTC, end at Thu 2016-03-17 19:39:37 UTC. --
Match: Mar 17 19:32:20 endpoint091c557c.xhios.blark.io chef_solo_slow_converge[14314]: Orphaned binding detected: b517d0d6ad3d45f4b5eeba63be6f454a, this will cause slow convergence to eventually back up.
Match: Mar 17 19:33:11 endpointac9c557c.xhios.blark.io chef_solo_slow_converge[14314]: Orphaned binding detected: 473199549e5844f097ae750c299837ec, this will cause slow convergence to eventually back up.
CheckJournal WARNING: 2 matches found for .* in `journalctl --no-pager -a MESSAGE_ID=71395a954c4e4449921c42a5faab671c --since=-10minutes` (threshold 1)


Oww! Brain pain! We can do way better.

Let’s think through some of the ways a user would interact with this check:

  • Responding to an escalation and understanding what the situation is
  • Remediating the issue
  • Debugging the monitor to fine-tune it

That’s about all the use-cases I can think of, and the original check didn’t serve any of them well. One culprit is the “--verbose” flag. This flag is generally used to request extra output, but it doesn’t consider the role of the requester, so it is almost guaranteed to spew out info that is useless to whoever happens to be reading it.

With the specific use-cases in mind, it is straightforward to improve this check by adding flags that tailor the output to what the operator needs in each scenario.

Here’s how we’ll improve the check output:

  • --show-matches shows the matching log messages for someone who is addressing the issue
  • --debug shows debugging info for someone who is trying to debug the monitor itself
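
To make the idea concrete, here is a minimal sketch of how role-based flags might shape the output. It is written in Python purely for illustration; this post doesn’t show the check’s real implementation. Only the two flag names come from the list above. The function names `build_parser` and `render`, the program name, and the wording of the debug lines are all assumptions.

```python
import argparse


def build_parser():
    """Flags are named for who needs the output, not for how much there is."""
    parser = argparse.ArgumentParser(prog="check-journal")  # hypothetical name
    parser.add_argument("--show-matches", action="store_true",
                        help="list matching log messages (for the responder)")
    parser.add_argument("--debug", action="store_true",
                        help="show how the check ran (for whoever tunes it)")
    return parser


def render(summary, matches, command, args):
    """Build the check output: summary first, then only what was asked for."""
    lines = [summary]
    if args.show_matches and matches:
        lines.append("")
        lines.extend(matches)
    if args.debug:
        # Hypothetical debug format; the point is that it is opt-in.
        lines.append("")
        lines.append(f"debug: executed {command!r}")
        lines.append(f"debug: {len(matches)} matching lines")
    return "\n".join(lines)


if __name__ == "__main__":
    args = build_parser().parse_args(["--show-matches"])
    print(render("summary line goes here",
                 ["first matching message", "second matching message"],
                 "journalctl --no-pager", args))
```

With no flags at all, `render` returns only the summary line, which is what a groggy responder wants to see first. Everything else is opt-in.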

A few other quick fixes:

  • Parse the message from the journal, removing unnecessary journal context (date, hostname, process ID)
  • Move the alerting thresholds into the main message (rather than trailing it), and include both the warning and critical thresholds
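
Those two fixes could look roughly like the sketch below, again in Python for illustration. It assumes journalctl’s default short output format (timestamp, hostname, then `ident[pid]:` before the message), and it assumes a threshold is breached when the count is greater than or equal to it. The names `strip_journal_context`, `status_for` and `summary_line` are hypothetical.

```python
import re

# Assumes journalctl's default "short" output format:
#   Mar 17 19:32:20 some.host ident[1234]: message text
JOURNAL_PREFIX = re.compile(
    r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+"  # date and time
    r"\S+\s+"                                            # hostname
    r"[^\s:]+:\s+"                                       # ident[pid]:
)


def strip_journal_context(line):
    """Drop date, hostname and process ID; lines without a prefix pass through."""
    return JOURNAL_PREFIX.sub("", line, count=1)


def status_for(count, warn, crit):
    """Assumption: a threshold is breached when count >= threshold."""
    if count >= crit:
        return "CRITICAL"
    if count >= warn:
        return "WARNING"
    return "OK"


def summary_line(count, pattern, warn, crit):
    """Put both thresholds in the main message instead of trailing it."""
    status = status_for(count, warn, crit)
    return (f"CheckJournal {status}: {count} matches found for "
            f"{pattern} in journal (warn {warn}/crit {crit})")


if __name__ == "__main__":
    raw = ("Mar 17 19:33:11 endpointac9c557c.xhios.blark.io "
           "chef_solo_slow_converge[14314]: Orphaned binding detected: "
           "473199549e5844f097ae750c299837ec, this will cause slow "
           "convergence to eventually back up.")
    print(summary_line(2, "Orphaned binding", 1, 10))
    print()
    print(strip_journal_context(raw))
```

Anchoring the regular expression to the start of the line means journal header lines such as “-- Logs begin at ...” are left untouched rather than mangled.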

And here’s the final output:


CheckJournal WARNING: 2 matches found for Orphaned binding in journal (warn 1/crit 10)

Orphaned binding detected: a417d0d6ad3d45f4b5eeba63be6f454a, this will cause slow convergence to eventually back up.
Orphaned binding detected: 473199549e5844f097ae750c299837ec, this will cause slow convergence to eventually back up.

Results

We’ve gone from roughly 110 words to 40 words, with no loss of useful data. And thus, our alerting is one little bit more user-friendly and actionable than it was at the start of the day.

“Always rising, never steeply”