Visualizing System Latency
notes date: 2016-10-23
source links:
source date: 2010-05-28
Introduction to Latency
- Latency is time spent waiting
- “It has a direct impact on performance when induced by a synchronous component of an application request. This makes interpretation straightforward—the higher the latency, the worse the performance. Such a simple interpretation is not possible for many other statistics types that are commonly examined for performance analysis, such as utilization, IOPS (I/O per second), and throughput. Those statistics are often better suited for capacity planning and for understanding the nature of workloads.”
Latency Heat Maps
- Usual objective: to examine the distribution over time
- prior art for using heatmaps on disk I/O latency: taztool (1995)
- time on x-axis, latency on y-axis
- a color-shaded matrix of pixels, where each pixel represents via its color the number of operations finishing in a given time-range and latency-range
- darker colors for more operations, lighter colors for fewer
- to be effective, the time-range and latency-range bins must be large enough for multiple operations to get grouped together, making patterns apparent
- the relation between shade darkness and number of operations should be non-linear (superlinear in operations per shade step); with a linear mapping, a dominant mode with many operations at one latency washes out the shading, making weaker modes hard to distinguish from one another
- Latency deviating from the norm is particularly important to examine, especially occurrences of high latency. Since these may represent only a small fraction of the workload—perhaps less than 1 percent—the color shade may be very light and difficult to see.
- Outlier values can be illuminating (worst cases) but can force the y-axis to rescale, compressing the bulk of the data
- One automatic approach is to drop a percentage (say, 0.1 percent) of the highest-latency I/O from the display, when desired
- Keeping full event data
- Retaining full event data lets you recalculate heatmap from scratch if you wish
- But storing full event data can be prohibitively costly, and recalculating heatmaps over long time ranges prohibitively slow
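The binning, non-linear shading, and outlier trimming above can be sketched in a few lines of Python. This is a minimal illustration, not the Analytics implementation; the function name, bin sizes, and the cube-root shading rule are my own assumptions:

```python
def latency_heatmap(events, time_bin, latency_bin, trim_pct=0.1):
    """Bucket (timestamp, latency) events into a matrix of counts.

    events      : list of (timestamp, latency) pairs (same units throughout)
    time_bin    : width of each column (x-axis, time range)
    latency_bin : height of each row (y-axis, latency range)
    trim_pct    : drop this percent of the highest-latency events so
                  outliers do not rescale the y-axis and compress the bulk
    """
    # Trim the top trim_pct% latencies before anything else.
    events = sorted(events, key=lambda e: e[1])
    keep = max(1, int(len(events) * (1 - trim_pct / 100.0)))
    events = events[:keep]

    # Each pixel (cell) counts operations in one time/latency bin.
    counts = {}
    for t, lat in events:
        cell = (int(t // time_bin), int(lat // latency_bin))
        counts[cell] = counts.get(cell, 0) + 1

    # Non-linear shading: darkness grows as the cube root of the count,
    # so sparsely populated cells (e.g. rare high-latency operations,
    # perhaps under 1 percent of the workload) stay visible next to
    # heavily populated cells instead of fading to near-white.
    peak = max(counts.values())
    shades = {cell: (n / peak) ** (1 / 3) for cell, n in counts.items()}
    return counts, shades
```

A renderer would then map each shade in `[0, 1]` onto a color ramp, darker for more operations.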
Heat Map Explained
- DTrace: in-kernel collection and summarization of raw event data, passes it out to user space.
- Analytics: tool to render and record DTrace data.
- Figure 1: NFS latency when enabling SSD-based cache
- Before enabling the SSD cache, reads either hit in the DRAM cache or had to go to disk. The pattern is a small number of very fast DRAM cache hits (a bold but thin line at the very bottom of the heatmap, 0-21 microsec if you zoom in) and a cloud of slower disk reads (2-10 millisec), with the rotational and seek latency of random disk I/O probably accounting for that variance
- After enabling the SSD cache, disk is consulted only on a miss in both the DRAM and SSD caches. The cloud of operations taking > 2 millisec is much lighter; those operations now occupy a thick band of SSD hits between 21 microsec and 2 millisec
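The three latency bands read off the Figure 1 heatmap can be summarized as a simple classifier. The thresholds come from the notes above; the function itself is a hypothetical sketch, not part of DTrace or Analytics:

```python
def classify_latency(latency_us):
    """Map an NFS read latency (in microseconds) to its likely source,
    using the bands visible in the Figure 1 heatmap:
      DRAM cache hit : <= 21 us (thin bold line at the bottom)
      SSD cache hit  : > 21 us and < 2 ms (thick band, post-SSD-enable)
      disk read      : >= 2 ms (rotational + seek latency dominates)
    """
    if latency_us <= 21:
        return "DRAM"
    elif latency_us < 2000:
        return "SSD"
    else:
        return "disk"
```

Bucketing a trace's latencies through such a classifier gives the hit ratios behind the heatmap's visual shift: after enabling the SSD cache, operations migrate from the "disk" band into the "SSD" band.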