Sensu is awesome for monitoring various systems, servers and applications. At Pantheon we’ve used it exclusively for almost two years, issuing hundreds of checks every minute, for everything from server ping times to remaining Pingdom credits to average API response latency.
When Sensu alerts on some condition, an engineer will commonly acknowledge the alert and do something to fix the issue, e.g. restart a service that is leaking memory. After this monitor-alert-fix loop has occurred several times, some scenarios allow for automatic remediation.
Remediation isn’t as good as a root-cause fix, but it allows us to monitor aggressively, notice abnormalities, and then resolve them without too much manual intervention.
One production example was the disturbing tendency of certain public-cloud instances to drop their private network interface. Alerting on this was trivial, but it still required on-call escalation and had customer-facing impact. With a little remediation magic, Sensu could automatically restart the network interface, in addition to logging and alerting if the issue persisted.
The remediator.rb handler was created to address this need. A testament to Sensu’s flexibility, the handler required no modifications to the core client/server code.
Here’s a check definition that includes remediation actions after certain conditions:
Here, the “check_something” check will run every 60 seconds on application_servers. If the check enters a warning severity (i.e. exits with code 1), the remediator.rb handler will run the “light_remediation” check after the first and second warning occurrences. Subsequently, the remediator.rb handler will trigger the “medium_remediation” check for the third through tenth warnings. If the check exits with a critical status (exit code 2), the “heavy_remediation” check will be triggered at each critical failure.
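As a rough sketch of how rules like these could be evaluated, the Ruby below writes the check as a hash mirroring its JSON definition and picks the remediations that match a given occurrence count and severity. The key names ("remediation", "occurrences", "severities"), the "3-10" and "1+" range notation, and the helper names are assumptions made for illustration, not necessarily what remediator.rb uses.

```ruby
# Hypothetical check definition as a Ruby hash mirroring the JSON config.
# The layout of the "remediation" section is an assumption for illustration.
def example_check
  {
    'name' => 'check_something',
    'interval' => 60,
    'subscribers' => ['application_servers'],
    'handlers' => ['remediator'],
    'remediation' => {
      'light_remediation'  => { 'occurrences' => [1, 2],    'severities' => [1] },
      'medium_remediation' => { 'occurrences' => ['3-10'],  'severities' => [1] },
      'heavy_remediation'  => { 'occurrences' => ['1+'],    'severities' => [2] }
    }
  }
end

# True when the occurrence count satisfies one condition: an exact
# number (2), an inclusive range ("3-10"), or an open-ended range ("1+").
def occurrence_match?(condition, count)
  case condition.to_s
  when /\A(\d+)\z/
    count == Regexp.last_match(1).to_i
  when /\A(\d+)-(\d+)\z/
    count.between?(Regexp.last_match(1).to_i, Regexp.last_match(2).to_i)
  when /\A(\d+)\+\z/
    count >= Regexp.last_match(1).to_i
  else
    false
  end
end

# Names of the remediation checks whose severity and occurrence rules
# both match the current event.
def remediations_for(check, occurrences, severity)
  check.fetch('remediation', {}).select do |_name, rule|
    rule['severities'].include?(severity) &&
      rule['occurrences'].any? { |cond| occurrence_match?(cond, occurrences) }
  end.keys
end
```

With this sketch, remediations_for(example_check, 3, 1) returns ["medium_remediation"], while an eleventh consecutive warning matches nothing and falls through to ordinary alerting.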
The remediator.rb handler takes advantage of two lesser-known aspects of Sensu: unpublished checks and the publish-check API. Typically, checks are configured to run on a given interval. Setting the “publish” attribute to false prevents the check from being scheduled automatically on clients. This allows for the definition of commands that are not really ‘checks’ per se, but arbitrary commands for remediation.
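To make the idea of an unpublished check concrete, here is a minimal sketch, again using Ruby hashes that mirror the JSON configuration. Only the “publish” attribute comes from the description above; the command paths and the scheduled_check_names helper are hypothetical, and the filter merely imitates the effect of the attribute rather than reproducing Sensu's scheduler.

```ruby
# Hypothetical definitions: one regular check and one remediation "check".
# The command paths are placeholders, not real scripts.
def example_checks
  {
    'check_something' => {
      'command' => '/bin/check_something',
      'interval' => 60,
      'subscribers' => ['application_servers']
    },
    'light_remediation' => {
      'command' => '/bin/something',
      'subscribers' => [],
      'publish' => false # never scheduled; only runs when explicitly requested
    }
  }
end

# Names of the checks that would be published on their interval.
# Anything with "publish" set to false is left out.
def scheduled_check_names(checks)
  checks.reject { |_name, definition| definition['publish'] == false }.keys
end
```

Here only "check_something" would be scheduled; "light_remediation" stays dormant until something asks for it by name.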
Once defined, the Sensu API makes it trivial to trigger the remediation “check”. The handler makes use of this API, simply parsing the desired conditions from the check definition and triggering the applicable action.
This may or may not fit your needs or solve your use cases, but if nothing else it shows Sensu’s flexibility. As Confucius said:
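A request to run a remediation check might look roughly like the sketch below. It assumes the Sensu API is reachable over plain HTTP, exposes a POST endpoint at /request, and accepts a JSON body with "check" and "subscribers" keys; the endpoint path has varied between Sensu versions, so treat it, along with the helper names, as assumptions to verify against your installation.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build (but do not send) the HTTP request asking Sensu to run a check.
# The '/request' path and the body layout are assumptions; confirm them
# against the API documentation for your Sensu version.
def build_check_request(api_base, check_name, subscribers)
  uri = URI.join(api_base, '/request')
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate('check' => check_name, 'subscribers' => subscribers)
  request
end

# Send the request and return the HTTP response object.
def trigger_remediation(api_base, check_name, subscribers)
  uri = URI.join(api_base, '/request')
  request = build_check_request(api_base, check_name, subscribers)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

A handler could then call something like trigger_remediation('http://localhost:4567', 'light_remediation', ['app01.example.com']), where the host, port and subscriber name are placeholders. Targeting a single machine this way presumes that each client subscribes to a channel named after itself.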
“A monitoring system that bends will not break”