February 07, 2016
High monitoring overhead is a silent killer: your app's requests take longer, throughput capacity shrinks, end users' requests start stacking up in a request queue, you react by provisioning more servers, and finally, more servers == more $$$.
So how does Scout's overhead compare with the competition? To find out, we set up a suite of benchmarks comparing Scout's overhead to New Relic.
To ensure fair results, every part of these tests is open-source - from the Rails app we're benchmarking to the Rails log files generated by the benchmarks. We encourage you to analyze the raw data, try these benchmarks on your own, and let us know if you come to a different conclusion.
App monitoring overhead varies based on (1) instruments used and (2) available resources on the application server.
With that in mind, we benchmarked agent overhead in the following scenarios:
In these benchmarking tests, our metric of comparison will be response time. We're benchmarking a Rails 4.2.5 application running Ruby 2.2.3.
I've put the results below. Beneath this summary, you'll find details and analysis on each benchmark. The percentages below represent the increase in response time when each agent is installed. Lower is better:
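|Benchmark||APM Agent||Response Time||Overhead|
|21 database queries and 20 view partials per controller-action||New Relic||80.4 ms||44.5%|
|21 database queries and 20 view partials per controller-action||Scout||56.8 ms||2.2%|
|1k database queries and 1k view partials per controller-action||New Relic||3,871.0 ms||37.7%|
|1k database queries and 1k view partials per controller-action||Scout||2,992.8 ms||4.0%|
|1 database query and no view partials per controller-action||New Relic||3.71 ms||32.0%|
|1 database query and no view partials per controller-action||Scout||3.06 ms||8.8%|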
Every Rails app is a special snowflake, but this benchmark closely approximates a controller-action in a high-traffic Rails app, based on data we've collected from the apps we are monitoring.
This test hits 100 endpoints in a Rails app, with each controller-action conducting 21 database queries and rendering 20 view partials. Additionally, the test forces 0.4% of requests to take longer than 2,000 ms, since New Relic and Scout both collect extended details on slow requests. This test (like all of the others) hits 100 unique endpoints rather than a single endpoint because New Relic and Scout aggregate metrics by endpoint, and we want to test the performance of that aggregation.
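The post doesn't spell out how the 0.4% of slow requests are chosen, so here is one minimal way it could be done, assuming a deterministic rule (every 250th request) rather than random sampling. The constant and method names are hypothetical and are not taken from the benchmark app.

```ruby
# Hypothetical values: 1 request in 250 is 0.4% of traffic.
SLOW_REQUEST_INTERVAL = 250
SLOW_REQUEST_SECONDS  = 2.0

# True for every 250th request, so exactly 0.4% of requests are slowed down.
def slow_request?(request_number)
  (request_number % SLOW_REQUEST_INTERVAL).zero?
end

# Extra delay to add on top of the normal work for a given request.
# A real controller would call sleep(extra_delay_seconds(n)) before rendering.
def extra_delay_seconds(request_number)
  slow_request?(request_number) ? SLOW_REQUEST_SECONDS : 0.0
end

total = 10_000
slow  = (1..total).count { |n| slow_request?(n) }
puts format("%.1f%% of requests are slow", slow * 100.0 / total)
```

A deterministic rule keeps the share of slow requests identical across the no-agent, New Relic, and Scout runs, which random sampling would only approximate.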
|APM Agent||Response Time||||||Overhead|
|None||55.6 ms||106.4 ms||2,174.1 ms|
|New Relic||80.4 ms||149.5 ms||2,263.5 ms||44.5%|
|Scout||56.8 ms||102.7 ms||2,168.7 ms||2.2%|
New Relic's overhead is roughly 20X Scout's (44.5% vs. 2.2%). I was curious whether a specific area of New Relic's instrumentation was responsible for the overhead. However, looking at the New Relic Stackprof output, the stack samples are spread fairly evenly - their seven most expensive method calls are each greater than 0.5% of stack samples.
When analyzing the Stackprof output, note that New Relic and Scout both do their processing and reporting work in a background thread. That work is reflected in the Stackprof output.
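To make the overhead figures concrete, here is a small sketch of the arithmetic, assuming overhead is the relative increase of the first response-time column over the no-agent baseline. It uses the rounded values from the table above, so the results land close to the published percentages but can differ in the last digit.

```ruby
# Response times (ms) copied from the first column of the table above.
BASELINE_MS = 55.6
AGENT_MS = { "New Relic" => 80.4, "Scout" => 56.8 }

# Overhead = relative increase in response time over the no-agent baseline.
def overhead_pct(agent_ms, baseline_ms)
  (agent_ms - baseline_ms) / baseline_ms * 100.0
end

overheads = AGENT_MS.transform_values { |ms| overhead_pct(ms, BASELINE_MS) }

overheads.each do |agent, pct|
  puts format("%-10s %.1f%% overhead", agent, pct)
end

# How many times larger New Relic's overhead is than Scout's.
puts format("ratio: %.1fx", overheads["New Relic"] / overheads["Scout"])
```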
Our representative endpoint benchmark hit controller-actions that made 21 database calls and rendered a partial 20 times per endpoint. In this test, we simulate hitting controller-actions with 1,001 database calls and 1,000 view partials (a 50x increase in instrumented calls). APM agents instrument database calls and view partial rendering time, so we want to see the overhead when they are required to do a lot of instrumentation. This will ramp up the average response time to around 2 seconds, which is also the default threshold at which New Relic and Scout collect additional data on slow requests.
|APM Agent||Response Time||||||Overhead|
|None||2,811.1 ms||3,761.1 ms||5,922.3 ms|
|New Relic||3,871.0 ms||5,090.0 ms||7,057.7 ms||37.7%|
|Scout||2,992.8 ms||4,054.4 ms||6,216.4 ms||4.0%|
Both Scout and New Relic handle the increase in instrumentation gracefully: the overhead doesn't increase linearly as a percentage of the response time vs. the number of instrumented method calls.
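As a rough check on that claim, the sketch below compares how much the number of instrumented calls grew between the two benchmarks with how much each agent's overhead percentage changed. It only uses figures quoted in this post; the hash layout and names are ours.

```ruby
# Instrumented calls per request (database calls + view partials).
CALLS = { representative: 21 + 20, heavy: 1_001 + 1_000 }

# Overhead percentages as reported in the two benchmark tables.
OVERHEAD_PCT = {
  "New Relic" => { representative: 44.5, heavy: 37.7 },
  "Scout"     => { representative: 2.2,  heavy: 4.0 }
}

call_growth = CALLS[:heavy].to_f / CALLS[:representative]
puts format("instrumented calls grew %.1fx", call_growth)

# If overhead scaled linearly with instrumented calls, these factors
# would be close to call_growth instead of staying in the low single digits.
overhead_growth = OVERHEAD_PCT.transform_values do |pct|
  pct[:heavy] / pct[:representative]
end

overhead_growth.each do |agent, growth|
  puts format("%-10s overhead percentage changed by %.2fx", agent, growth)
end
```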
Basically the opposite of testing endpoints doing lots of work, this tests an endpoint doing very little work. This controller-action conducts a single database call and renders text straight from the controller (no views).
Fast controller actions are important to optimize for, since a few milliseconds of additional time can constitute a significant percentage increase.
|APM Agent||Response Time||||||Overhead|
|None||2.82 ms||8.87 ms||125.8 ms|
|New Relic||3.71 ms||12.63 ms||193.5 ms||32.0%|
|Scout||3.06 ms||9.92 ms||116.6 ms||8.8%|
As this is such a fast controller-action with a 95th-percentile response time under 13 ms in all of our tests, any overhead will naturally appear high. For comparison, a bare-bones StatsD instrumentation tracking throughput, response time, and response codes amounts to 0.7% of the total Stackprof samples during the benchmark. Scout amounts to 3.2% of the samples.
Scout and New Relic do provide knobs that can be tuned to decrease instrumentation if an app is largely composed of very fast endpoints.
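For a sense of what the bare-bones StatsD instrumentation mentioned above could look like, here is a minimal Rack middleware sketch. It is not the code used in the benchmark: the class name, metric names, and the default host and port are assumptions, and it talks to StatsD directly over UDP using the plain `name:value|type` line format rather than a client gem.

```ruby
require "socket"

# Hypothetical middleware: reports throughput, response time, and
# response codes to a StatsD daemon over UDP.
class StatsdMiddleware
  def initialize(app, host: "127.0.0.1", port: 8125, socket: UDPSocket.new)
    @app = app
    @host = host
    @port = port
    @socket = socket
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0

    send_metric("app.requests:1|c")                               # throughput counter
    send_metric(format("app.response_time:%.2f|ms", elapsed_ms))  # timer
    send_metric("app.status.#{status}:1|c")                       # per-status counter
    [status, headers, body]
  end

  private

  # Fire-and-forget: a metrics failure should never break the request.
  def send_metric(payload)
    @socket.send(payload, 0, @host, @port)
  rescue SystemCallError
    nil
  end
end
```

In a Rails app this would be registered with something like `config.middleware.use StatsdMiddleware`. Because it does three small UDP sends per request and no per-query or per-partial instrumentation, it serves as a useful lower bound when judging agent overhead.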
To run the tests, we provisioned 3 instances on Digital Ocean:
The utility server runs siege to benchmark the application server performance. Siege was run against a set of 100 endpoints using the same concurrency level as unicorn workers for a minimum of ten minutes. For example:
siege -v -f urls.txt -c 30 -b -t 10M
Database query response times can vary enough to skew test results. To prevent this, database rows are cached in the Rails process's memory and fetched from there. Before recording benchmarks, each test begins with a one-minute siege to warm the cache so the larger initial query times aren't included in the results.
We used Ruby 2.2.3 and Rails 4.2.5. The application source can be accessed on GitHub. There is a dedicated branch for each testing variation, which makes it easy to ensure changes were applied consistently. For example, here are the changes to enable Scout on the representative endpoint benchmark.
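The idea of caching rows in process memory can be illustrated with a few lines of plain Ruby. This is only a sketch of the general technique, not the benchmark app's implementation; `RowCache` and the fake query block are hypothetical.

```ruby
# Hypothetical in-process cache: each worker process keeps its own copy.
class RowCache
  def initialize
    @rows = {}
  end

  # Returns the cached row for key. The block (standing in for the real
  # database query) only runs on a cache miss.
  def fetch(key)
    @rows.fetch(key) { @rows[key] = yield }
  end
end

cache = RowCache.new
queries_run = 0

# The first call misses and runs the "query"; the second is served from memory.
2.times do
  cache.fetch(:user_1) do
    queries_run += 1
    { id: 1, name: "example" }
  end
end

puts "queries run: #{queries_run}"
```

The warm-up siege plays the role of the first call here: it pays the cost of the cache misses so the recorded run mostly sees hits.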
To simulate a real-world application, the application responds to 100 different endpoints. This adds diversity - APM agents aggregate metrics by endpoint, so we want to ensure we test that aggregation. Each endpoint actually does the same work - we care about diversity but want to keep tests consistent.
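Siege reads its targets from a file with one URL per line (the `-f urls.txt` flag in the command above). A file like that could be generated with a short script such as the one below. The host name and path pattern are placeholders, since the real routes live in the benchmark app's repository.

```ruby
# Placeholder host and route pattern; the benchmark app's real routes may differ.
BASE_URL = "http://app.example.com"
ENDPOINT_COUNT = 100

# Builds one URL per endpoint so every request lands on a distinct
# controller-action and the agents have 100 endpoints to aggregate.
def benchmark_urls(base_url, count)
  (1..count).map { |n| "#{base_url}/endpoint_#{n}" }
end

puts benchmark_urls(BASE_URL, ENDPOINT_COUNT)
```

Saved as a script, its output can be redirected into `urls.txt` and passed to siege.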
The most recent agent versions were used at the time of the benchmarks:
Logs of each test run were stored. We used the lograge gem to make parsing metrics from the log file easier. The numbers shared in these benchmarks were generated by parsing the log file of each test run.
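As an illustration of that parsing step, here is a minimal sketch that pulls the `duration=` field out of lograge's key=value lines and summarizes it. It is not the script used for this post; the sample lines are made up, and the nearest-rank percentile method is our choice.

```ruby
# Extracts the total request time (ms) from one lograge key=value line,
# or nil when the line has no duration field.
def duration_ms(line)
  match = line.match(/\bduration=(\d+(?:\.\d+)?)/)
  match && match[1].to_f
end

# Nearest-rank percentile: the smallest sample with at least pct% of
# all samples at or below it.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Summarizes every line that carries a duration; returns nil if none do.
def summarize(lines)
  durations = lines.map { |line| duration_ms(line) }.compact
  return nil if durations.empty?

  {
    count: durations.size,
    mean:  durations.sum / durations.size,
    p95:   percentile(durations, 95),
    max:   durations.max
  }
end

# Made-up sample lines in lograge's default format.
sample = [
  "method=GET path=/endpoint_1 format=html controller=EndpointsController action=show status=200 duration=10.0 view=6.0 db=3.0",
  "method=GET path=/endpoint_2 format=html controller=EndpointsController action=show status=200 duration=20.0 view=12.0 db=6.0",
  "method=GET path=/endpoint_3 format=html controller=EndpointsController action=show status=200 duration=30.0 view=18.0 db=9.0",
  "method=GET path=/endpoint_4 format=html controller=EndpointsController action=show status=200 duration=40.0 view=24.0 db=12.0",
  "Started GET \"/endpoint_5\" (a non-lograge line that is skipped)"
]

p summarize(sample)
```

For a real run, the same method could be fed `File.foreach` on the stored log file for each test.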
Stackprof middleware, installed via the Stackprof gem, is disabled by default but can be enabled to gather CPU samples across the different agents. Stackprof was disabled during the benchmark runs to eliminate any variability caused by its overhead and by persisting data to disk.
Each agent is tested against its factory-set defaults. While some agents provide configuration options to turn off specific areas of instrumentation, the reality is few developers do this or understand the impact of what the settings do.
We threw out a benchmark if it didn't meet the following criteria: