How NOT to measure latency
notes date: 2016-08-31
source date: 2016-03-26
- Chart showing latencies of 25%ile, 50%ile, 75%ile, 90%ile, 95%ile
- note: this hides the 5% worst requests
- why not show the max? what are you hiding (from)?
- because it obscures all the other values? can it be its own chart?
- Is 99%ile a good enough indicator?
- if someone’s loading a single web page, what are the chances they experience something worse than the 99%ile on at least one of its requests?
- even a single google.com page load triggers ~30 requests
- What HTTP response time metric is more representative of user experience?
- If a typical user session is 5 page loads averaging 40 resources per page (200 requests), only ~0.003% of visitors never experience something worse than the 95%ile
- even at the 99.9%ile, ~18% of visitors still see something worse at least once
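A quick back-of-the-envelope check of these figures (assuming each request independently hits or misses the percentile): with 5 × 40 = 200 requests per session, the fraction of visitors who never see anything worse than percentile p is p^200.

```python
# Chance a session of n independent requests never experiences
# latency worse than a given percentile p: p ** n.
n = 5 * 40  # 5 page loads x 40 resources per page = 200 requests

for p in (0.95, 0.99, 0.999):
    never_worse = p ** n
    print(f"{p:.1%}ile: {never_worse:.4%} of visitors never see worse; "
          f"{1 - never_worse:.1%} see it at least once")
```

0.95^200 ≈ 0.0035% lines up with the ~0.003% figure above, and 1 − 0.999^200 ≈ 18% with the 99.9%ile figure.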
- The main reason %iles aren’t measured to “enough 9s” is tooling: we’re forced to choose a roll-up interval, and percentiles reported per interval can’t be meaningfully combined into a longer-window percentile
- HdrHistogram
- Coordinated Omission: a measurement skew that drops exactly the samples that would have been worse than the samples actually taken
- Load testing
- load generator throws requests at a certain rate
- if you fail to throw a request at the expected rate (because you’re waiting for responses to previous ones), you’re coordinating with the service in a way that avoids measuring it at a time when it would underperform
- Unavoidable in many tech stacks: TCP is a synchronous protocol
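A toy simulation of the effect (all numbers invented): a service that answers in 1 ms but freezes for a full second. A closed-loop generator that waits for each response before sending the next records the stall exactly once; the open-loop schedule shows that nearly half the intended requests would actually have suffered.

```python
# Service answers in 1 ms, except it is frozen for [500, 1500) ms.
# Plan: one request every 1 ms for 2 seconds (2000 intended sends).
interval_ms = 1
stall_start, stall_end = 500, 1500

def service_time(send_ms):
    # a request sent during the stall waits until the stall ends
    if stall_start <= send_ms < stall_end:
        return (stall_end - send_ms) + 1
    return 1

# Closed-loop ("coordinated") generator: waits for each response,
# so only ONE sample ever observes the stall.
closed = []
t = 0.0
while t < 2000:
    lat = service_time(t)
    closed.append(lat)
    t += lat  # next request only after the response arrives

# Open-loop generator: sends stay on schedule, so every intended
# send during the stall records its true waiting time.
open_loop = [service_time(s) for s in range(0, 2000, interval_ms)]

def frac_over(xs, thresh=100):
    return sum(x > thresh for x in xs) / len(xs)

print(f"closed-loop: {len(closed)} samples, "
      f"{frac_over(closed):.2%} over 100 ms")
print(f"open-loop:   {len(open_loop)} samples, "
      f"{frac_over(open_loop):.2%} over 100 ms")
```

The closed-loop run both loses half its samples and reports ~0.1% of requests over 100 ms, while the intended workload would have seen ~45%.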
- Monitoring code
- `startTime = now(); slowOperation(); endTime = now(); log(endTime - startTime)`
- an event that occurs outside of this timing window is not measured
- all operations occurring before startTime or after endTime
- any request which does not perform slowOperation
- e.g., whole system freezes
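A minimal demonstration of this blind spot (the 200 ms `sleep` stands in for a whole-system freeze): the stall falls between two timing windows, so the logged samples stay tiny.

```python
import time

samples = []

def timed(op):
    # the usual startTime / op / endTime measurement pattern
    start = time.perf_counter()
    op()
    end = time.perf_counter()
    samples.append(end - start)

timed(lambda: None)   # fast operation, measured
time.sleep(0.2)       # simulated whole-system freeze: unmeasured
timed(lambda: None)   # fast operation, measured

# both samples are tiny; the 200 ms stall left no trace in the log
print(max(samples))
```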
- How Cassandra measures read latency on the server
- (it’s not clear there’s a better way)
- Load testing
- Response Time vs Service Time
- Service Time: how long it takes to make a coffee
- Response Time: how long it takes to get from the back of the line to having our coffee
- Coordinated Omission usually turns a metric you think of as Response Time into one that only represents a component of the Service Time
- Sanity Check
- Service Time looks the same on either side of saturation, but Response Time past saturation should build up a backlog (and grow over time)
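The sanity check can be sketched with a toy single-server queue (fixed 1 ms service time, so capacity is 1000 req/s, both numbers invented): below saturation, response time equals service time; above it, each response waits behind a backlog that grows over time.

```python
# Toy single-server queue with fixed service time: response time
# is queueing wait plus service time.
def response_times(arrival_interval_ms, n=5000, service_ms=1.0):
    free_at = 0.0  # when the server next becomes idle
    out = []
    for i in range(n):
        arrive = i * arrival_interval_ms
        start = max(arrive, free_at)   # wait in line if server busy
        free_at = start + service_ms
        out.append(free_at - arrive)   # response = wait + service
    return out

under = response_times(1.25)  # 800 req/s: 80% of capacity
over = response_times(0.8)    # 1250 req/s: 125% of capacity

print("80% load:  first", under[0], "ms, last", under[-1], "ms")
print("125% load: first", over[0], "ms, last", over[-1], "ms")
```

At 80% load every response takes 1 ms throughout; at 125% load the service time is still 1 ms per request, but response time climbs past a second by the end of the run.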
- Sustainable Throughput: The throughput achieved while safely maintaining service levels
- Don’t bother measuring latency close to saturation: you will never (want to) run a production service at saturation anyway