Remediation with Sensu

Sensu is awesome for monitoring various systems, servers and applications. At Pantheon we’ve used it exclusively for almost two years, issuing hundreds of checks every minute, for everything from server ping times to remaining Pingdom credits to average API response latency.

When Sensu alerts on some condition, an engineer will commonly acknowledge the alert and do something to fix the issue, e.g. restart a service that is leaking memory. After this monitor-alert-fix loop has occurred several times, some scenarios lend themselves to automatic remediation.

Remediation isn’t as good as a root-cause fix, but it allows us to monitor aggressively, notice abnormalities, and then resolve them without too much manual intervention.

One production example was the disturbing tendency of certain public-cloud instances to drop their private network interface. Alerting on this was trivial, but it still required on-call escalation and had customer-facing impact. With a little remediation magic, Sensu could automatically restart the network interface, in addition to logging and alerting if the issue persisted.

The remediator.rb handler was created to address this need. A testament to Sensu’s flexibility, the handler required no modifications to the core client/server code.

Here’s a check definition that includes remediation actions after certain conditions:
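A sketch of what such a definition might look like, built as a Ruby hash and printed as JSON. The check name, interval and subscription follow the description in this post; the handler name and remediation rules are taken from the log output quoted in the comments; the command is a hypothetical placeholder, not the original one.

```ruby
require 'json'

# Illustrative check definition. Only the remediation rules, the handler name,
# the interval and the subscription are grounded in the post and its comments.
CHECK_DEFINITION = {
  'checks' => {
    'check_something' => {
      'command'     => '/path/to/check_something.rb', # hypothetical placeholder
      'interval'    => 60,
      'subscribers' => ['application_servers'],
      'handlers'    => ['remediator'],
      'remediation' => {
        'light_remediation'  => { 'occurrences' => [1, 2],   'severities' => [1] },
        'medium_remediation' => { 'occurrences' => ['3-10'], 'severities' => [1] },
        'heavy_remediation'  => { 'occurrences' => ['1+'],   'severities' => [2] }
      }
    }
  }
}.freeze

# Print the JSON form that would go into a Sensu configuration file.
puts JSON.pretty_generate(CHECK_DEFINITION)
```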

Here, the “check_something” check will run every 60 seconds on application_servers. If the check enters a warning severity (i.e. exit code 1), the remediator.rb handler will run the “light_remediation” check on the first and second warning occurrences. Subsequently, the handler will trigger “medium_remediation” for the third through tenth warnings. If the check exits with a critical status (exit code 2), “heavy_remediation” will be triggered at each critical failure.
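To make the matching rules concrete, here is a minimal sketch of how the occurrence and severity conditions could be evaluated. It is not the actual remediator.rb source; the method names `occurrence_match?` and `remediations_for` are made up for illustration, and the condition syntax (an integer, "N-M", or "N+") is inferred from the rules shown in the log output in the comments.

```ruby
# Returns true when `count` satisfies one occurrence condition:
# an integer (exact match), "N-M" (inclusive range) or "N+" (N or more).
def occurrence_match?(condition, count)
  case condition.to_s
  when /\A(\d+)\z/
    count == Regexp.last_match(1).to_i
  when /\A(\d+)-(\d+)\z/
    count.between?(Regexp.last_match(1).to_i, Regexp.last_match(2).to_i)
  when /\A(\d+)\+\z/
    count >= Regexp.last_match(1).to_i
  else
    false
  end
end

# Returns the names of the remediation checks whose severity and
# occurrence conditions both match the current event.
def remediations_for(rules, occurrences, severity)
  rules.select do |_name, rule|
    rule['severities'].include?(severity) &&
      rule['occurrences'].any? { |cond| occurrence_match?(cond, occurrences) }
  end.keys
end

RULES = {
  'light_remediation'  => { 'occurrences' => [1, 2],   'severities' => [1] },
  'medium_remediation' => { 'occurrences' => ['3-10'], 'severities' => [1] },
  'heavy_remediation'  => { 'occurrences' => ['1+'],   'severities' => [2] }
}.freeze

p remediations_for(RULES, 2, 1)    # => ["light_remediation"]
p remediations_for(RULES, 3, 1)    # => ["medium_remediation"]
p remediations_for(RULES, 11, 1)   # => []
p remediations_for(RULES, 1212, 2) # => ["heavy_remediation"]
```

The last call uses the same values as the event in the comments (1212 occurrences, severity 2), which selected “heavy_remediation” there as well.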

The remediator.rb handler takes advantage of two lesser-known aspects of Sensu: unpublished checks and the publish-check API. Typically, checks are configured to run on a given interval. Setting the “publish” attribute to false prevents the check from being scheduled automatically, so it runs only when explicitly requested. This allows for the definition of commands that are not ‘checks’ per se, but arbitrary commands for remediation.
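As an illustration, an unpublished remediation “check” might be defined roughly like this. The command is the test command one reader mentions in the comments; the empty subscriber list is an assumption, and `scheduled?` is a made-up helper that only models the rule described above, not Sensu’s own scheduler.

```ruby
require 'json'

# Illustrative remediation "check" that is never scheduled on an interval.
REMEDIATION_CHECKS = {
  'checks' => {
    'heavy_remediation' => {
      'command'     => '/bin/touch /tmp/remedation', # test command from the comments
      'subscribers' => [],    # assumption: the target is chosen at request time
      'publish'     => false  # keep this check off the regular schedule
    }
  }
}.freeze

# Models the rule above: a check is scheduled unless "publish" is false.
def scheduled?(check)
  check.fetch('publish', true) != false
end

REMEDIATION_CHECKS['checks'].each do |name, check|
  puts "#{name}: scheduled=#{scheduled?(check)}"
end
puts JSON.pretty_generate(REMEDIATION_CHECKS)
```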

Once defined, the Sensu API makes it trivial to trigger the remediation “check”. The handler makes use of this API, simply parsing the desired conditions from the check definition and triggering the appropriate action.
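A rough sketch of that API call, using only Ruby’s standard library. It builds a POST to the Sensu API’s /request endpoint with the check name and the target subscribers; the base URL `http://localhost:4567` and the helper name `build_remediation_request` are assumptions for illustration, and the check and client names are the ones from the log in the comments. The request is built but not sent, so the snippet runs without a Sensu server.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Builds (but does not send) the POST /request call that asks Sensu to run
# `check_name` on clients subscribed to `client_name`.
def build_remediation_request(api_base, check_name, client_name)
  uri = URI.join(api_base, '/request')
  request = Net::HTTP::Post.new(uri.request_uri, 'Content-Type' => 'application/json')
  request.body = JSON.generate('check' => check_name, 'subscribers' => [client_name])
  [uri, request]
end

uri, request = build_remediation_request(
  'http://localhost:4567',   # assumed API address
  'heavy_remediation',
  'testserver.local.net'
)
puts "#{request.method} #{uri} #{request.body}"

# To actually send it (requires a reachable Sensu API):
#   response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
#   puts response.code
```

The log in the comments shows the API answering such a request with a 202 and an “issued” timestamp.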

This may or may not fit your needs or solve your use cases, but if nothing else it shows Sensu’s flexibility. As Confucius said:

“A monitoring system that bends will not break”

7 thoughts on “Remediation with Sensu”

  1. This is a great tool, but I have one issue:
    it triggers the service restart on all subscribed clients, not just the ones affected, which is not optimal and may cause disruptions.
    How can it be triggered on the faulty client only?

  2. Hey,

    It should restart the service on only the client that triggers the event. The way I do that is to have each client subscribe to a unique queue, i.e. its hostname. Then, I can trigger a check on that specific client. That works with a few hundred nodes, not sure about the overhead for thousands.

  3. Good catch! I just updated to change ‘handler’ to ‘handlers’, which allows for an array containing the handler names. Cheers

  4. Hope you can help. The check now appears to be working but it doesn’t actually run on the affected server:

    {"timestamp":"2014-04-02T11:56:23.168189-0400","level":"info","message":"handling event","event":{"client":{"name":"testserver.local.net","address":"10.1.2.3","subscriptions":["default","test"],"timestamp":1396454165},"check":{"command":"/usr/bin/ruby /etc/sensu/community/plugins/processes/check-procs.rb -p jenntest -W 1","interval":60,"subscribers":["test"],"handlers":["remediator"],"remediation":{"light_remediation":{"occurrences":[1,2],"severities":[1]},"medium_remediation":{"occurrences":["3-10"],"severities":[1]},"heavy_remediation":{"occurrences":["1+"],"severities":[2]}},"name":"check_with_remediation","issued":1396454183,"executed":1396454183,"output":"CheckProcs CRITICAL: Found 0 matching processes; cmd /jenntest/\n","status":2,"duration":0.104,"history":["2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2"]},"occurrences":1212,"action":"create"},"handler":{"type":"pipe","command":"/etc/sensu/community/handlers/remediation/sensu.rb","name":"remediator"}}
    {"timestamp":"2014-04-02T11:56:23.282658-0400","level":"info","message":"handler output","handler":{"type":"pipe","command":"/etc/sensu/community/handlers/remediation/sensu.rb","name":"remediator"},"output":"REMEDIATION: Evaluating remediation: testserver.local.net {\"light_remediation\"=>{\"occurrences\"=>[1, 2], \"severities\"=>[1]}, \"medium_remediation\"=>{\"occurrences\"=>[\"3-10\"], \"severities\"=>[1]}, \"heavy_remediation\"=>{\"occurrences\"=>[\"1+\"], \"severities\"=>[2]}} #=1212 sev=2\n"}
    {"timestamp":"2014-04-02T11:56:23.282813-0400","level":"info","message":"handler output","handler":{"type":"pipe","command":"/etc/sensu/community/handlers/remediation/sensu.rb","name":"remediator"},"output":"REMEDIATION: Matchdata: #\n"}
    {"timestamp":"2014-04-02T11:56:23.282856-0400","level":"info","message":"handler output","handler":{"type":"pipe","command":"/etc/sensu/community/handlers/remediation/sensu.rb","name":"remediator"},"output":"REMEDIATION: Triggering remediation check 'heavy_remediation' for [\"testserver.local.net\"]\n"}
    {"timestamp":"2014-04-02T11:56:23.282892-0400","level":"info","message":"handler output","handler":{"type":"pipe","command":"/etc/sensu/community/handlers/remediation/sensu.rb","name":"remediator"},"output":"REMEDIATION: Recieved API Response (202): {\"issued\":1396454183}, exiting.\n"}

    I have the check doing a /bin/touch /tmp/remedation but no file is created on the test server.

  5. One thing that might not be obvious at first is that it sends the remediation to the client using a “subscription” named after the client name. So, in your case if your system does not have a subscription named “testserver.local.net” it will not get the remediation check request.
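For the client in the log above, that would mean adding its own name to its subscriptions. A minimal sketch, assuming the client name, address and existing subscriptions shown in the log; `with_own_subscription` is a hypothetical helper, not part of Sensu.

```ruby
require 'json'

# Returns a copy of the client config with the client's own name added
# as a subscription, so per-client check requests can reach it.
def with_own_subscription(client_config)
  client = client_config.fetch('client')
  subscriptions = (client.fetch('subscriptions', []) + [client.fetch('name')]).uniq
  { 'client' => client.merge('subscriptions' => subscriptions) }
end

# Values taken from the event in the log above.
config = {
  'client' => {
    'name'          => 'testserver.local.net',
    'address'       => '10.1.2.3',
    'subscriptions' => %w[default test]
  }
}

puts JSON.pretty_generate(with_own_subscription(config))
```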
