October 27, 2017
Before we talk performance, let's talk entropy. Entropy refers to the idea that everything in the universe eventually moves from order to disorder; it's the measurement of that change.
Like entropy, the performance of a Rails app will trend toward disorder. An N+1 database query here, a forgotten pagination implementation there, a missing index somewhere else. This performance debt builds over time, and suddenly...we've got a slow app.
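For example, the classic N+1 pattern and its fix look something like this (a minimal sketch using hypothetical `Post` and `Comment` models):

```ruby
# N+1: one query to load the posts, plus one COUNT query per post.
posts = Post.order(created_at: :desc).limit(20)
posts.each { |post| puts "#{post.title}: #{post.comments.count} comments" }

# Eager loading collapses this to two queries; `size` uses the loaded
# association instead of issuing another COUNT per post.
posts = Post.order(created_at: :desc).limit(20).includes(:comments)
posts.each { |post| puts "#{post.title}: #{post.comments.size} comments" }
```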
Where do you start knocking down this performance debt? Surely, not everything is slow, right? Let's perform a Rails performance audit.
In 10 minutes or less, you'll have a good idea of where your app stands and where to focus your efforts by following this 5-point performance audit. At each step of the audit, I'll work through the analysis on a real production app so you can see an audit applied.
Performance in almost all web apps - including Rails - follows an 80/20 rule: most of your performance problems will be contained within a small amount of the application code. This is a great thing: most of the time, you don't need to litter your code base with performance hacks. You don't need to optimize everything.
It's your job - with the help of production performance monitoring tools - to identify these performance hot spots. In this post, we'll use Scout to perform the audit.
Can you get by without production monitoring? I wouldn't recommend it: the production version of your app will behave very differently from your code-reloading, trivial-database, no-traffic, single-user development app.
You're here because your app is running slow, so let's start by looking at your response times. In Scout, this is front-and-center when viewing your app. For this audit, change the timeframe to 7 days vs. the default timeframe. Many apps have seasonal trends - like higher throughput during the business day - and the longer timeframe will reveal the general profile of the app:
Each element of the stacked bar represents a layer of your stack (ex: database, external HTTP calls, Ruby). Added together, the layers show the average response time across all requests to the app.
How's the example app look?
Pretty good!
This is likely an app used during the business day as traffic and response times peak around 12pm.
While response times increase from 40 ms to 70 ms during these peak periods, we can't yet conclude there is a scaling problem: customers may be using heavier controller-actions that aren't used during off periods. Let's note this and dig in later.
Typically, there are just a couple of layers in the stack responsible for most of the time your app spends responding to web requests. We can narrow this down by looking at the timeseries chart. This is a hint at where we'll be focusing our time.
There's a special metric on the response time chart: QueueTime. This measures the time from when a request hits a load balancer until it is first processed by your application server (ex: Puma):
You'll want this value to remain under 20 ms. If this exceeds 20 ms, you have a capacity issue somewhere in your stack (most commonly at the app server or database).
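If you're curious where this metric comes from: it typically relies on the load balancer stamping each request with an `X-Request-Start` header, which the agent compares to the current time when the request reaches your app. Here's a rough, illustrative Rack middleware sketch, assuming a millisecond timestamp in that header (Scout records this for you; the code below is only to show the mechanics):

```ruby
# Illustrative only. Header units vary by load balancer (seconds, ms, or
# microseconds); milliseconds are assumed here, e.g. "t=1509100000123".
class QueueTimeLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    if (header = env["HTTP_X_REQUEST_START"])
      start_ms = header.delete("^0-9").to_f        # strip the "t=" prefix
      queue_ms = (Time.now.to_f * 1000) - start_ms
      Rails.logger.info("queue_time=#{queue_ms.round}ms") if queue_ms.positive?
    end
    @app.call(env)
  end
end

# Hypothetical placement in config/application.rb:
# config.middleware.insert_before 0, QueueTimeLogger
```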
How's the example app look?
We've gained a good picture of our app's performance profile. Now, it's time to look at the response time numbers and determine how much trouble we're in. Beneath the timeseries chart in Scout, you'll find a series of sparklines. The numbers here represent averages over the given time period (7 days, in our case):
We'll look at the "response time - mean" sparkline first. Here are some rules of thumb on response times:
| Mean Response Time | Classification |
|---|---|
| < 50 ms | Fast |
| < 300 ms | Normal |
| > 300 ms | Slow |
If you're just serving JSON content for an API server, response times should be smaller...perhaps 100 ms is slow in your case.
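If it helps to see those thresholds in one place, here's a tiny helper encoding the rules of thumb above (the 100 ms cutoff for API-only apps is the looser guideline just mentioned, not a hard rule):

```ruby
# Classify a mean response time (in ms) per the rules of thumb above.
def classify_mean_response_time(mean_ms, api_only: false)
  slow_threshold = api_only ? 100 : 300
  if mean_ms < 50
    :fast
  elsif mean_ms < slow_threshold
    :normal
  else
    :slow
  end
end

classify_mean_response_time(40)                  # => :fast
classify_mean_response_time(250, api_only: true) # => :slow
```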
How's the example app look?
Response times are fast...but the mean response time doesn't tell the whole story.
A single, fast, high-throughput controller action can drastically lower an app's mean response time. The mean is a great place to start, but it doesn't provide the entire picture. For a broader view of your app's performance, you'll want the 95th percentile response time as well:
The 95th percentile response time says that 95% of requests have response times at or below this number. Conversely, 5% of response times are above this threshold. You'll want the 95th percentile response time to be no greater than 4x the mean response time. If this ratio is greater than 4:1, your app may have some controller-actions triggering significantly longer response times.
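To sanity-check the 4:1 rule against your own numbers, a quick script like the following works. Note it uses a simple nearest-rank percentile and hypothetical sample data, not necessarily the exact math your monitoring tool uses:

```ruby
# Nearest-rank percentile over an array of response times (in ms).
def percentile(values, pct)
  sorted = values.sort
  sorted[((pct / 100.0) * (sorted.length - 1)).round]
end

durations_ms = [38, 42, 45, 51, 60, 75, 90, 120, 150, 162]  # sample data
mean = durations_ms.sum.to_f / durations_ms.length
p95  = percentile(durations_ms, 95)

puts "mean=#{mean.round} ms, p95=#{p95} ms, ratio=#{(p95 / mean).round(1)}:1"
puts "Look for slow outlier actions" if p95 / mean > 4
```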
How's the example app look?
The 95th percentile response time (162 ms) is 3.2x the mean response time (51 ms). This falls within the 4:1 ratio, but it's close enough to the threshold that there may be some slow controller-actions within the app.
As a general rule, the greater the throughput, the more difficult performance work becomes. As throughput grows, the underlying services are generally more complex and there are more business tradeoffs to consider when doing performance work (ex: what to do when an endpoint is slow for a single, high-paying customer?).
Generally:
| Requests Per-Minute | Scale |
|---|---|
| < 50 rpm | Small |
| 50-500 rpm | Average |
| 500+ rpm | Large |
How's the example app look?
Our mean throughput is 240 rpm with spikes up to 350 rpm. This is an average application in terms of throughput. There's likely a decent number of knobs we can turn.
We now have a solid 10,000-foot view of our app's performance characteristics and health. These are the questions we asked:

- Which layers of the stack (database, Ruby, external HTTP calls, etc.) account for most of the response time?
- Does QueueTime exceed 20 ms, indicating a capacity problem?
- Is the mean response time fast, normal, or slow?
- Is the 95th percentile response time more than 4x the mean?
- Is the app's throughput small, average, or large?

If you're a developer like me, you're anxious to start digging into code and fixing these problems. In my next post in this series, we'll dive into your endpoints to identify where to focus your time. Use the form below to be notified via email: