How NOT to measure latency
notes date: 2016-08-31
source date: 2016-03-26
- Chart showing latencies of 25%ile, 50%ile, 75%ile, 90%ile, 95%ile
- note: this hides the 5% worst requests
- why not show the max? what are you hiding (from)?
- because it obscures all the other values? can it be its own chart?
- Is 99%ile a good enough indicator?
- if someone’s loading a single web page, what are the chances they experience something worse than the 99%ile on at least one of its requests?
- even a single google.com page load triggers ~30 requests
- What HTTP response time metric is more representative of user experience?
- If a typical user session is 5 page loads averaging 40 resources per page (200 requests), only ~0.003% of visitors never experience something worse than the 95%ile
- even at the 99.9%ile, ~18% of visitors still see something worse at least once
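A quick back-of-the-envelope check of these figures (assuming each request independently hits or misses the percentile): with 5 × 40 = 200 requests per session, the fraction of visitors who never see anything worse than percentile p is p^200.

```python
# Chance a session of n independent requests never experiences
# latency worse than a given percentile p: p ** n.
n = 5 * 40  # 5 page loads x 40 resources per page = 200 requests

for p in (0.95, 0.99, 0.999):
    never_worse = p ** n
    print(f"{p:.1%}ile: {never_worse:.4%} of visitors never see worse; "
          f"{1 - never_worse:.1%} see it at least once")
```

0.95^200 ≈ 0.0035% lines up with the ~0.003% figure above, and 1 − 0.999^200 ≈ 18% with the 99.9%ile figure.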
- The main reason %iles aren’t measured to “enough 9s” is tooling: we’re forced to choose a roll-up interval, and percentiles reported per interval can’t be meaningfully combined into a longer-window percentile
- HdrHistogram
- Coordinated Omission: a measurement skew that drops exactly the samples that would have been worse than the samples actually taken
- Load testing
- load generator throws requests at a certain rate
- if you fail to throw a request at the expected rate (because you’re waiting for responses to previous ones), you’re coordinating with the service in a way that avoids measuring it at a time when it would underperform
- Unavoidable in many tech stacks: TCP is a synchronous protocol
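A toy simulation of the effect (all numbers invented): a service that answers in 1 ms but freezes for a full second. A closed-loop generator that waits for each response before sending the next records the stall exactly once; the open-loop schedule shows that nearly half the intended requests would actually have suffered.

```python
# Service answers in 1 ms, except it is frozen for [500, 1500) ms.
# Plan: one request every 1 ms for 2 seconds (2000 intended sends).
interval_ms = 1
stall_start, stall_end = 500, 1500

def service_time(send_ms):
    # a request sent during the stall waits until the stall ends
    if stall_start <= send_ms < stall_end:
        return (stall_end - send_ms) + 1
    return 1

# Closed-loop ("coordinated") generator: waits for each response,
# so only ONE sample ever observes the stall.
closed = []
t = 0.0
while t < 2000:
    lat = service_time(t)
    closed.append(lat)
    t += lat  # next request only after the response arrives

# Open-loop generator: sends stay on schedule, so every intended
# send during the stall records its true waiting time.
open_loop = [service_time(s) for s in range(0, 2000, interval_ms)]

def frac_over(xs, thresh=100):
    return sum(x > thresh for x in xs) / len(xs)

print(f"closed-loop: {len(closed)} samples, "
      f"{frac_over(closed):.2%} over 100 ms")
print(f"open-loop:   {len(open_loop)} samples, "
      f"{frac_over(open_loop):.2%} over 100 ms")
```

The closed-loop run both loses half its samples and reports ~0.1% of requests over 100 ms, while the intended workload would have seen ~45%.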
- Monitoring code
- `startTime = now(); slowOperation(); endTime = now(); log(endTime - startTime)`
- an event that occurs outside of this timing window is not measured
- all operations occurring before startTime or after endTime
- any request which does not perform slowOperation
- e.g., whole system freezes
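A minimal demonstration of this blind spot (the 200 ms `sleep` stands in for a whole-system freeze): the stall falls between two timing windows, so the logged samples stay tiny.

```python
import time

samples = []

def timed(op):
    # the usual startTime / op / endTime measurement pattern
    start = time.perf_counter()
    op()
    end = time.perf_counter()
    samples.append(end - start)

timed(lambda: None)   # fast operation, measured
time.sleep(0.2)       # simulated whole-system freeze: unmeasured
timed(lambda: None)   # fast operation, measured

# both samples are tiny; the 200 ms stall left no trace in the log
print(max(samples))
```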
- How Cassandra measures read latency on the server
- (it’s not clear there’s a better way)
- Load testing
- Response Time vs Service Time
- Service Time: how long it takes to make a coffee
- Response Time: how long it takes to get from the back of the line to having our coffee
- Coordinated Omission usually turns a metric you think of as Response Time into one that only represents a component of the Service Time
- Sanity Check
- Service Time looks the same on either side of saturation, but Response Time past saturation should build up a backlog (and grow over time)
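The sanity check can be sketched with a toy single-server queue (fixed 1 ms service time, so capacity is 1000 req/s, both numbers invented): below saturation, response time equals service time; above it, each response waits behind a backlog that grows over time.

```python
# Toy single-server queue with fixed service time: response time
# is queueing wait plus service time.
def response_times(arrival_interval_ms, n=5000, service_ms=1.0):
    free_at = 0.0  # when the server next becomes idle
    out = []
    for i in range(n):
        arrive = i * arrival_interval_ms
        start = max(arrive, free_at)   # wait in line if server busy
        free_at = start + service_ms
        out.append(free_at - arrive)   # response = wait + service
    return out

under = response_times(1.25)  # 800 req/s: 80% of capacity
over = response_times(0.8)    # 1250 req/s: 125% of capacity

print("80% load:  first", under[0], "ms, last", under[-1], "ms")
print("125% load: first", over[0], "ms, last", over[-1], "ms")
```

At 80% load every response takes 1 ms throughout; at 125% load the service time is still 1 ms per request, but response time climbs past a second by the end of the run.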
- Sustainable Throughput: The throughput achieved while safely maintaining service levels
- Don’t bother measuring latency close to saturation: you will never (want to) run a production service at saturation anyway