December 02, 2015
Here's a behind-the-scenes rundown of how we ensure our apps are in peak condition in
Ed. note: this is largely unchanged from 2016, so I've updated this post with our 2018 stack.
|Log Monitoring||ELK Stack||$150/mo|
|Server & Service Monitoring||Pingdom Server Monitor||$200/mo|
|Scheduled Job Monitoring||Deadman's Snitch||$15/mo|
There's no single, do-everything tool that completely monitors a modern-day Rails stack. If there were, it'd be the software equivalent of the Homer Simpson-designed car. There are simply too many specialized things to put into a single monitoring app.
However, there's good news: a number of best-of-breed services play well together to give you great monitoring coverage of your Rails apps and infrastructure.
When picking a monitoring solution, you can typically choose between two options:
The upsides of open source: free to install and more customizable. The downsides: generally more difficult to use and fairly complex to maintain.
Most of the monitoring services we use are SaaS. We typically only use open source options when the paid, hosted option is cost-prohibitive. Monitoring software is complicated, and keeping your own stack running can be a time sink. The last thing we want is unreliable software monitoring our apps.
Here are the primary areas we monitor:
I'll cover each area in detail below.
This is the basic building block of monitoring. Whether you are hosting a personal blog or are on the stability team at Facebook, you need this. Uptime monitoring tells you if your app is down, but gives no details beyond that.
I'm not aware of a good open source option for this. It's also not an area I'd spend a lot of time investigating: running a geographically distributed network of servers to monitor uptime would be complicated. Paid options are very affordable.
We use Pingdom, the Kleenex of uptime monitoring. Pingdom starts at $15/mo for 10 checks. We've found the UI to be a bit heavyweight for our needs, but the service has been nothing but reliable over the years. While there are many other options, I've been hesitant to swap out Pingdom for anything else for this basic building block.
We check two primary controller-actions in each of our Rails apps:
When it comes to tracking down performance issues, application monitoring gives you the most value with the least effort. Finding an application monitoring tool can be confusing: lots of tools claim to do application monitoring.
So, let's define application monitoring: it is the ability to point to a line of code when there is a performance problem. With that definition, the scope is much narrower.
We don't know of a widely-used open source solution for Rails app monitoring.
On larger teams, this is typically used most by developers, as it ties directly to code they have written. Folks on the DevOps side are more concerned with higher-level performance metrics than application code.
Choosing between Scout and New Relic?
Scout digs through performance data for you, identifying slow database queries, N+1s, sources of memory bloat, and more. It's a jobs-to-be-done approach to monitoring.
Logs are the lowest common denominator of monitoring.
In most modern setups, you are likely using multiple application servers served behind a load balancer. This means if there is an issue you need to track down, you'd need to find it on the right server. To solve this problem, send your logs to a central service.
There are a couple of options, but we use the ELK stack (ElasticSearch, Logstash, and Kibana).
Both developers and the devops team will likely use this tool, so it's important that both are comfortable with it. Our customers tell us Papertrail is the easiest option if you're getting going with log monitoring.
This is the only part of our monitoring stack that is open source, solely because our bill using a paid service would eclipse our hosting bill. A monitoring application like ours does a lot of logging.
We use the Lograge gem in our Rails apps to generate more readable, structured log files.
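To illustrate why single-line, structured request logs are easier to search in a central service, here is a minimal plain-Ruby sketch. It is not Lograge's implementation: the helper names (`key_value_line`, `json_line`) and the sample request fields are assumptions made up for this example.

```ruby
require "json"

# Renders a request summary as one "key=value key=value" line,
# which is easy to grep and easy for a log shipper to parse.
def key_value_line(fields)
  fields.map { |key, value| "#{key}=#{value}" }.join(" ")
end

# Renders the same summary as a single JSON object per line.
def json_line(fields)
  JSON.generate(fields)
end

# Hypothetical request summary; the field names are illustrative only.
request = { method: "GET", path: "/servers", status: 200, duration: 58.33 }

puts key_value_line(request)
puts json_line(request)
```

Either form keeps one request on one line, so a query such as "all requests with status=500" works the same way across every application server.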
Ensure the servers hosting your app and the services running on them are behaving. There are a number of options in the server monitoring space, probably more than in any other area.
Common use cases of server monitoring:
The standby is Nagios. However, of all the open source monitoring solutions I've mentioned in this post, I'd say Nagios is the most difficult to install, use, and maintain. If you go the open source path, I'd suggest trying Sensu.
Paid services often bundle several useful services together. For example, when you use Pingdom Server Monitor, you also get StatsD and AWS monitoring. With many open source tools, you need to combine several unrelated pieces of software to get charts, monitoring, alerting, and custom metrics. This makes monitoring more brittle.
Besides Pingdom Server Monitor, Datadog is another common option. Pricing is around $15/server/mo.
On larger teams, this is most frequently used by devops. For smaller teams, it's important that your developers are comfortable with the server monitoring tool as well (some server monitoring tools have very poor user experiences).
Exception Monitoring tools make it easy to track exceptions down to a line-of-code, saving you valuable development time hunting down bugs. They also aggregate similar errors together to decrease noise when things are going wrong.
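As a rough illustration of how aggregating similar errors cuts noise, the sketch below groups exceptions by a simple fingerprint: the exception class plus the top backtrace frame. This is an assumption made for the example, not how Sentry, Rollbar, or any other service actually groups errors, and `fingerprint` and `group_errors` are hypothetical helper names.

```ruby
# Builds a grouping key from the exception class and the top backtrace frame.
# Exceptions raised from the same line with the same class share a key.
def fingerprint(error)
  [error.class.name, Array(error.backtrace).first]
end

# Returns a hash of fingerprint => number of occurrences.
def group_errors(errors)
  errors.group_by { |error| fingerprint(error) }.transform_values(&:size)
end
```

With grouping like this, a thousand identical timeouts during an outage show up as one entry with a count of 1,000 rather than a thousand separate notifications.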
Sentry has an open source option (and a paid service - see below).
You can also just use the Exception Notification Gem to notify you of exceptions via email. This doesn't scale well: you'll be overwhelmed with emails during peak outage periods (like a database server going offline).
Scout also integrates with Sentry and Rollbar, providing a single view of your app health:
Pricing starts around $30/mo.
We classify exceptions into 2 areas:
1. Bugs in code that need to be fixed (this is where Sentry, Honeybadger, etc. are most useful).
2. Transient errors that can occur but are typically only problems if they occur at a high rate and/or over an extended period of time (ex: database timeout errors).
For (2), we use StatsD (see Custom Metrics below) and set alerting thresholds on error rates.
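The idea of alerting on the rate of transient errors, rather than on each one, can be sketched in plain Ruby. In practice the threshold lives in the metrics service, so this is only a local illustration: `ErrorRateMonitor`, its threshold, and its window length are hypothetical and chosen for the example.

```ruby
# Counts transient errors in a sliding time window and reports whether
# the count has exceeded a threshold.
class ErrorRateMonitor
  def initialize(threshold:, window_seconds:)
    @threshold = threshold
    @window_seconds = window_seconds
    @timestamps = []
  end

  # Records one transient error at the given time.
  def record(now = Time.now)
    @timestamps << now
    prune(now)
  end

  # True when more than `threshold` errors fall inside the window ending at `now`.
  def alert?(now = Time.now)
    prune(now)
    @timestamps.size > @threshold
  end

  private

  # Drops timestamps older than the window.
  def prune(now)
    cutoff = now - @window_seconds
    @timestamps.reject! { |time| time < cutoff }
  end
end
```

A single database timeout never trips the alert here; a burst of them inside the window does, which matches how we treat this class of error.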
Every app has key indicators that ensure things are working. For example, with Scout, we monitor the number of active servers and watch for large drops. These can indicate network issues between a customer's datacenter and ours.
StatsD is a terrific, lightweight tool for custom metrics. If you are logging numbers, it often makes sense to put those numbers into StatsD.
Graphite is the standard dashboard solution that can accept StatsD metrics. The downside is alerting isn't included.
Both Pingdom Server Monitor's and Datadog's monitoring agents accept StatsD metrics - this allows you to view+alert on most of your metrics (server, services, and StatsD) from a single service.
Librato is a more general hosted metrics service.
StatsD is a lightweight protocol (can even report metrics via bash), so you'll have universal support for custom metrics across languages and frameworks.
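To show how small the protocol is, here is a minimal Ruby sketch that sends a metric with nothing but the standard library. Each metric is a plain-text line of the form `name:value|type` (`c` for counters, `g` for gauges, `ms` for timings) sent over UDP. The helper names and metric names are made up for this example, and the host and port are assumptions: 8125 is the port StatsD conventionally listens on, but your agent may differ.

```ruby
require "socket"

# Formats one metric in the StatsD line protocol: "<name>:<value>|<type>".
def statsd_line(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Sends a single metric over UDP. This is fire-and-forget: there is no
# reply and no error if nothing is listening on the other end.
def send_metric(name, value, type, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(statsd_line(name, value, type), 0, host, port)
ensure
  socket&.close
end
```

For example, `send_metric("active_servers", 42, "g")` would report a gauge, and `send_metric("errors.db_timeout", 1, "c")` would increment a counter. In a real app you would more likely use an existing StatsD client gem than hand-rolled sockets.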
Are your backups really running? Is your one-per-day billing script completing successfully? That's where monitoring scheduled jobs comes into play.
I'm not aware of any open source options, but paid options are very affordable, starting at $5/mo.
We use Deadman's Snitch, which starts at just $5/mo.
Deadman's Snitch is easy to set up: just hit an assigned URL when a job completes.
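A minimal Ruby sketch of that check-in pattern follows. The wrapper name `run_with_checkin` and the injectable `http_get` parameter are assumptions made for this example, and the URL in the usage note is a placeholder for whatever check-in URL the service assigns to your job.

```ruby
require "net/http"
require "uri"

# Runs the job in the block, then requests the check-in URL.
# If the job raises, the check-in is skipped, so the missing ping
# is what triggers the alert on the monitoring side.
def run_with_checkin(checkin_url, http_get: ->(uri) { Net::HTTP.get_response(uri) })
  result = yield
  http_get.call(URI(checkin_url))
  result
end
```

Usage would look like `run_with_checkin("https://example.invalid/your-assigned-token") { run_nightly_backup }`, where `run_nightly_backup` is a hypothetical job method. The `http_get` parameter exists only so the wrapper can be exercised without a network call.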
Here's our current monitoring stack at Scout entering 2016:
We're happy to share more details. Ping us at firstname.lastname@example.org. Share your suggestions in the comments below.