July 25, 2013
Netflix tracks CPU Steal Time closely. In fact, if steal time exceeds their chosen threshold, they shut down the virtual machine and restart on a different physical server.
If you deploy to a virtualized environment (for example, Amazon EC2), steal time is a metric you'll want to watch. If this number is high, performance can suffer significantly. What is steal time? What causes high steal time? When should you be worried (and what should you do)?
Steal time is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor.
Your virtual machine (VM) shares resources with other instances on a single host in a virtualized environment. One of the resources it shares is CPU cycles. If your VM is one of four equally sized VMs on a physical server, its CPU usage isn't capped at 25% of all CPU cycles: it can be allowed to use more than its proportional share (memory usage, by contrast, does have hard limits).
When you run the Linux top command, you'll see a real-time view of key performance metrics. One of the lines is for the CPU:
Two metrics you might already have some experience with are %id (percent idle) and %wa (percent I/O wait). If %id is low, the CPU is working hard and doesn't have much excess capacity. If %wa is high, the CPU is ready to run but is waiting on I/O to complete (like fetching rows from a database table stored on disk).
%st, or percent steal time, is the last CPU metric displayed.
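To make the metric concrete, here is a minimal sketch of how a steal percentage can be computed by hand. It assumes a Linux guest, where /proc/stat exposes cumulative CPU counters and steal is the eighth value on the aggregate cpu line. The function names (parse_cpu_line, read_cpu_counters, steal_percent) are made up for this illustration and are not part of top or any library.

```python
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Return the numeric counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def read_cpu_counters(path: str = "/proc/stat") -> List[int]:
    """Read the current counters; the first line of /proc/stat is the aggregate."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


def steal_percent(before: List[int], after: List[int]) -> float:
    """Percentage of elapsed CPU time counted as steal between two samples."""
    # Counter order: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already included in user/nice, so only the first eight are summed.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    if len(deltas) < 8:
        return 0.0  # kernel does not report a steal counter
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total
```

In practice you would call read_cpu_counters() twice, a few seconds apart, and pass both samples to steal_percent(). The counters are cumulative since boot, so a single reading on its own says little about what is happening right now.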
You've purchased tickets to the latest Hollywood blockbuster. There are two lines and one ticket booth:
If we applied a CPU steal time-like metric to the ticketing process, it would look like this:
If you have a long-running background computational task on an underutilized physical server, it may get access to more than its share of CPU cycles for a while. Later on, when the other VMs need their share of CPU cycles, the long-running task will run slower. This might not be a deal-breaker for a long-running task: it might take a bit longer, or it might even finish faster (since it was able to use more resources earlier).
However, for web apps, this can bring things to a halt. For tasks that need to be performed in real time, like rapidly serving many web requests, a 4x decrease in performance can cause major backups in request queues, which can lead to outages.
There are two possible causes:
The catch: you can't tell which case your situation falls under by just watching the impacted instance's CPU metrics. This is easiest to tell when you have multiple, identical servers performing the same roles, each residing on a different host:
%st (CPU steal time percentage) increased on every virtual server? This means your virtual machines are using more CPU. You need to increase the CPU resources for your VMs.
%st (CPU steal time percentage) increased dramatically on only a subset of servers? This means the physical servers may be oversold. Move the VM to another physical server.
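The comparison across identical servers can be sketched in a few lines. This is only an illustration: the function name classify_steal_increase, the host names, and the 5-point cutoff for what counts as an "increase" are all assumptions, not values from any monitoring tool.

```python
from typing import Dict, List, Tuple


def classify_steal_increase(
    baseline: Dict[str, float],
    current: Dict[str, float],
    min_increase: float = 5.0,  # assumed cutoff, in percentage points
) -> Tuple[str, List[str]]:
    """Compare per-server %st readings against a baseline.

    Returns ("all", hosts) when every server rose, ("subset", hosts) when
    only some did, and ("none", []) when no server rose by min_increase.
    """
    affected = sorted(
        host
        for host, steal in current.items()
        if steal - baseline.get(host, 0.0) >= min_increase
    )
    if not affected:
        return ("none", [])
    if len(affected) == len(current):
        # Every VM is seeing more steal: the VMs likely need more CPU.
        return ("all", affected)
    # Only some hosts are affected: those physical servers may be oversold.
    return ("subset", affected)
```

For example, with a baseline of about 1-2% on three hypothetical web servers, a jump to 25% on just one of them would come back as ("subset", ["web2"]), pointing at that VM's host rather than at your own CPU usage.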
A general rule of thumb: if steal time is greater than 10% for 20 minutes, the VM is likely running slower than it should.
When this happens:
Scout's CPU Usage Plugin reports key CPU metrics, including steal time. You can create a trigger to alert you of spikes in steal time.
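If you want to experiment with the rule of thumb outside of a monitoring tool, the "greater than 10% for 20 minutes" check can be sketched as follows. This is not how Scout's triggers are implemented; the function name sustained_high_steal and the one-sample-per-minute default are assumptions made for the example.

```python
from typing import Sequence


def sustained_high_steal(
    samples: Sequence[float],
    interval_seconds: int = 60,  # assumed spacing between %st samples
    threshold: float = 10.0,     # rule of thumb: greater than 10%
    window_minutes: int = 20,    # rule of thumb: for 20 minutes
) -> bool:
    """True if every sample in the most recent window exceeds the threshold."""
    needed = (window_minutes * 60) // interval_seconds
    if needed <= 0 or len(samples) < needed:
        return False  # not enough history to cover the full window
    return all(sample > threshold for sample in samples[-needed:])
```

Requiring every sample in the window to exceed the threshold keeps a brief spike from triggering an alert. A single dip below 10% resets the check, which may be too strict for noisy data; averaging over the window is a reasonable alternative.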
In a virtual environment, CPU cycles are shared across virtual machines on the server. If your VM displays a high %st in top (steal time), this means CPU cycles are being taken away from your VM to serve other purposes. You may be using more than your share of CPU resources or the physical server may be oversold. Move the VM to another physical server. If steal time remains high, try giving the VM more CPU resources.