December 11, 2016
I follow a simple rule before configuring a monitoring alert: if I receive this alert at 3am, will I act on it?
If not, it shouldn't be an alert.
Few performance-related alerts meet this criterion. For example, if our app is running 25% slower, it's not worth a hasty 3am fix, but it is worth a first-thing-in-the-morning effort.
That's the drive behind a feature we'll make available soon: the Digest Email. Available in daily or weekly editions, the Digest Email summarizes your app performance and directs you to bottlenecks with ease.
At a frequency of your choice (daily or weekly), we'll crunch the numbers on your app's performance (both web endpoints and background jobs). Performance is compared to the previous week, and highlights are mentioned in the email.
To start, there are three specific areas we're focusing on.
It's easy to just grab endpoints with large changes in their mean response time between today and last week. However, that adds significant noise: a rarely used endpoint, like UsersController#forgot_password, may vary widely in response time. Is it worth the development effort if response times are bouncing between 100 ms and 500 ms? Frequently, the answer is no.
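To make the noise problem concrete, here is a minimal sketch of one way to filter out low-traffic endpoints before comparing mean response times week over week. It is not Scout's algorithm: the `EndpointStats` type, the `significant_changes` function, and the `min_requests` and `min_change` thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class EndpointStats:
    name: str        # e.g. "UsersController#forgot_password"
    requests: int    # number of requests seen in the period
    mean_ms: float   # mean response time in milliseconds


def significant_changes(current, previous, min_requests=500, min_change=0.25):
    """Return (name, relative_change) pairs worth a look, largest change first.

    Hypothetical rule: an endpoint must have at least `min_requests` in both
    periods, and its mean must have moved by at least `min_change` (a fraction).
    """
    prev_by_name = {e.name: e for e in previous}
    flagged = []
    for cur in current:
        prev = prev_by_name.get(cur.name)
        if prev is None or prev.mean_ms <= 0:
            continue  # nothing to compare against
        if min(cur.requests, prev.requests) < min_requests:
            continue  # too little traffic: the mean is mostly noise
        change = (cur.mean_ms - prev.mean_ms) / prev.mean_ms
        if abs(change) >= min_change:
            flagged.append((cur.name, change))
    return sorted(flagged, key=lambda item: abs(item[1]), reverse=True)


this_week = [
    EndpointStats("UsersController#forgot_password", 12, 500.0),
    EndpointStats("PostsController#index", 40000, 250.0),
]
last_week = [
    EndpointStats("UsersController#forgot_password", 9, 100.0),
    EndpointStats("PostsController#index", 38000, 200.0),
]
print(significant_changes(this_week, last_week))
```

With these made-up numbers, the rarely used endpoint is skipped despite its fivefold swing, while the busy endpoint's 25% slowdown is reported.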
Scout works hard to identify significant trends. Some of the approaches our algorithms apply:
To make tracking down the source of trends easier:
What if an endpoint is fine for 90% of users, but it becomes extremely slow for a small subset of users? The small percentage of users experiencing performance problems is frequently made up of high-paying power users who are pushing your app the hardest. For example, a controller action that renders all employees at a startup will load quickly, while that same endpoint would fall over if that company were Apple.
Additionally, these very slow outliers can trigger frustrating capacity problems, and in a worst-case scenario, momentary downtime. It's far more difficult to determine the application capacity you need to serve your app when response times vary widely (Little's Law describes only the average number of requests in flight, so it understates the capacity needed to absorb bursts of slow requests).
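To illustrate why a wide distribution complicates capacity planning: Little's Law relates the average number of requests in flight to the arrival rate multiplied by the mean response time. The sketch below uses made-up numbers (the sample durations, the arrival rate, and the function names are assumptions) to show how a small share of slow requests pulls the mean, and with it the estimated concurrency, far from what the typical request suggests.

```python
import statistics


def average_in_flight(arrival_rate_per_s: float, mean_response_ms: float) -> float:
    """Little's Law: average requests in flight = arrival rate * mean time in system."""
    return arrival_rate_per_s * (mean_response_ms / 1000.0)


def summarize(durations_ms, arrival_rate_per_s):
    """Compare a concurrency estimate based on the median with one based on the mean."""
    median_ms = statistics.median(durations_ms)
    mean_ms = statistics.fmean(durations_ms)
    return {
        "median_ms": median_ms,
        "mean_ms": mean_ms,
        "in_flight_from_median": average_in_flight(arrival_rate_per_s, median_ms),
        "in_flight_from_mean": average_in_flight(arrival_rate_per_s, mean_ms),
    }


# Made-up sample: 90% of requests take 100 ms, 10% take 5 seconds.
sample = [100] * 90 + [5000] * 10
print(summarize(sample, arrival_rate_per_s=50))
```

At 50 requests per second, sizing from the typical 100 ms request suggests about 5 requests in flight, while the 590 ms mean suggests about 29.5. Even that larger figure is only an average, so it says nothing about the moments when several slow requests arrive together.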
We highlight endpoints that are triggering these slow outliers, but that's not all. We also identify any significant bottlenecks (example: a slow ActiveRecord query).
Bonus: if you've set up our GitHub integration, you'll see who last touched any expensive code paths.
Our subject line is dynamic, changing with your aggregate app performance. Here's an example:
If performance isn't changing, it's important to know that too:
Also, we display a friendly emoticon when things are going well:
It's a nice, friendly reward.
The goal: if things haven't changed, there's no need to open the email. If we think there's something worth investigating, we'll draw your attention.
We're limiting the number of recipients as we tune our algorithms based on your feedback. Enable the Digest Email in your user settings to ensure you'll be in our first access group.
Most app performance issues don't warrant immediate, one-off alerts, but they do warrant a holistic per-day or per-week review.
The Scout Digest Email aims to address this while identifying the source of issues.