Published on May 6, 2013
Notes on measuring latency from Gil Tene's 2013 QCon talk.

Video here: How Not To Measure Latency

Intro and motivation

A nice simple definition - how long something took. One event will look different from different contexts.

Computing is not continuous. Computers start and stop for many reasons. Glitches and noise have a big effect and are non-random. Often this sort of thing will look like a periodic freeze (e.g. naive garbage collection).

Response times are not uniformly or normally distributed. Their distributions have multiple modes - ‘ok’ or ‘bad’ or ‘impossibly terrible’.

It’s important to measure individual events over time, rather than just looking at some sort of average. Averaging will hide hiccups and multimodal behaviour, and doesn’t correspond well to how the events are perceived by a client.

Advice on analysis

There’s more point in measuring something if you know what you will do with the information - just as it is a good idea to know what action you will take for the likely answers before asking a question). For latency, this means we should be clear on our requirements. What is a pass and what is a fail? What level of failure is bad enough to wake you up (or at least remind you to get some sleep) at 3am?

Some steps for requirements elicitation:

  • “What is your latency requirement?”
  • Answers are often in ms. Translate “average” to “typical” where necessary.
  • “What is the worst you will accept?”
  • Negotiate closer to a true worst case by giving extreme examples
  • Clarify the middle ground - “How often is it ok to respond in 1 second?”
  • Agree SLA based on percentiles.

The output is a step function to use in testing and monitoring.

The system performance can be displayed on a histogram with percentiles on the x axis and time taken on the y axis. The step function can be displayed on the same chart.

Once you know the step function you can do refactoring and optimisation while still fulfilling requirements.

Some requirements examples:

  • A pacemaker’s main requirement is “Must never be slower than x”
  • An Olympic race’s main requirement is “Must occasionally be extremely fast”
  • A trading platform’s main requirement is “Must be typically very fast” - but this example is more complicated as you also need to manage risk and assure customers - the whole spectrum is interesting.

Advice on tech

Measurements of latency must not be coordinated with the operation under test. For example, don’t allow a load test to back off while waiting for a slow response.

Any measurement of maximum response time will typically be correct regardless of bugs or coordination in the test. Thus it can be used as a sanity check - how many sigmas is the maximum from the mean? (although, it may well be best not to bother with mean and standard deviation at all, with the possible exception of marketing). Always meaure the maximum and minimum times.

Problems with the measuring system can also be detected by running it against a stub system with known performance charateristics.

Some tools: