Performance Matters
notes date: 2021-09-07
source links:
source date: 2019-09-14
- Suppose you're building an image search app: you take pictures of things, and it finds matches.
- prototype: app sends to server. server spawns 2 threads. thread 1 compresses & canonicalizes and saves (if not already saved). thread 2 extracts features, looks up matches in an index, then sends results back to the user
- You ship it, and find out it’s slow
- What would it have been like in the 1980s?
- Much slower: internet was slower, computers were slower
- Performance used to be easy
- Moore’s Law: transistors and clock speed basically went up at the same exponential rate.
- “Performance improvement in the 1980s: just chill and wait for faster hardware!”
- “The chance that you in eighteen months were going to squeeze out double the performance–pretty slender–so you might as well just sip mojitos on the beach.”
- Around 2000, we see transistor counts still increasing, but clock speed plateauing.
- Technically Moore’s law still seems to hold; it just no longer predicts performance.
- The limiting factor is the end of Dennard scaling, i.e., heat dissipation: if we keep increasing clock speed, the chips get too hot.
- So transistor counts keep going up but now we’re adding more cores
- So if you have a program that’s not parallel, it’s not any faster
- [6:50] Every app store update is either for bug fixes or performance
- A typical performance evaluation
- From original program A, produce an alternative A'
- Measure their performance. A takes 90sec, A' takes 87.5sec.
- 2.8% faster! That’s great, right?
- What about variance?
- So run it 30x, and find it’s still faster.
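- A minimal sketch of that naive methodology (my own illustration, not from the talk): time a command repeatedly and report the mean and standard deviation. This is exactly the kind of comparison the talk goes on to question.

```cpp
// measure.cpp -- run a command N times and report mean and stddev of wall-clock time.
// Hypothetical harness for illustration; build with: g++ -O2 -o measure measure.cpp
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " \"<command>\" [runs]\n";
        return 1;
    }
    const int runs = (argc > 2) ? std::atoi(argv[2]) : 30;
    std::vector<double> secs;
    for (int i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        std::system(argv[1]);                               // run the program under test
        auto stop = std::chrono::steady_clock::now();
        secs.push_back(std::chrono::duration<double>(stop - start).count());
    }
    double mean = 0.0;
    for (double s : secs) mean += s;
    mean /= secs.size();
    double var = 0.0;
    for (double s : secs) var += (s - mean) * (s - mean);
    var /= (secs.size() > 1 ? secs.size() - 1 : 1);          // sample variance
    std::cout << "mean " << mean << " s, stddev " << std::sqrt(var)
              << " s over " << runs << " runs\n";
    return 0;
}
```

- Usage would be something like `./measure ./A 30` versus the same for A'; the next bullets explain why even this can mislead.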
- Why is A' faster than A? We assume it’s because the code itself is intrinsically faster.
- Changing the code can lead to knock-on effects unrelated to your intended change.
- “Layout biases measurement”: Mytkowicz et al., “Producing Wrong Data Without Doing Anything Obviously Wrong!” (ASPLOS 2009)
- The layout of code and data in memory has a big performance impact. So the measurements can be biased by that.
- Link order changes function addresses
- Environment variable size: moves the program stack
- These layout changes can have an effect larger than the impact of `-O3` over `-O0` (±40%)
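- A quick way to see the environment-size effect for yourself (my sketch, not from the paper): print the address of a stack variable, then rerun with a padded environment.

```cpp
// stack_shift.cpp -- show that the size of the environment moves the stack.
// The environment strings sit at the top of the initial stack, so adding a few
// kilobytes of environment pushes every stack frame to a different address.
#include <cstdio>

int main() {
    int local = 0;                                   // any stack-allocated variable
    std::printf("stack local lives at %p\n", static_cast<void*>(&local));
    return local;
}
```

- Compare something like `setarch $(uname -m) -R ./stack_shift` against `setarch $(uname -m) -R env PAD=$(head -c 4096 /dev/zero | tr '\0' x) ./stack_shift` (ASLR disabled so its per-run randomization doesn't hide the shift); the printed address typically moves by roughly the added environment size.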
- Why could this happen?
- Modern processors are insanely complicated.
- Data and instructions are mapped into cache sets based on their addresses. With a lucky layout, the hot code and data don’t collide in the same sets, so evict-and-refill happens far less often while computing the same result.
- Maybe the branch predictor; Maybe the TLB; Maybe the branch target predictor; Maybe the prefetcher
- [13:41] The problem here is that almost anything can push an optimal layout into a bad one: e.g., adding a call to malloc, upgrading libc, or running from a different directory (even running as a different user changes layout, because the environment variables change size).
- Basically: layout is very brittle
- Is it possible to eliminate the effect of layout?
- Tool called Stabilizer
- Randomizes function addresses, stack frame sizes, heap allocations, over and over again, during execution
- If you use Stabilizer on A and A'
- You’ll tend to see a broader distribution of performance observations, because the randomizations introduce a lot more variability
- “When you were running that program 30 times [before], it was sort of like you were gonna do a survey of 30 people, but you just asked one person.”
- So now you’ll see two bell curves, and the question is still “Is A' faster than A?”, which is amenable to null hypothesis significance testing.
- Stabilizer generates a new random layout every 1/2 second, and total execution time is the sum of all periods. And, by the Central Limit Theorem, the sum of a sufficient number of independent, identically distributed random variables is approximately normally distributed (no matter the underlying distribution).
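- In symbols (my notation, not the slides’): each half-second period contributes its own runtime, and the total is their sum, so the CLT applies to the total.

```latex
% Runtime under Stabilizer, split into the n randomized-layout periods:
T = X_1 + X_2 + \cdots + X_n
% If the X_i are roughly i.i.d. with mean \mu and variance \sigma^2, the CLT gives
T \approx \mathcal{N}(n\mu,\; n\sigma^2) \quad \text{for large } n
% i.e., total runtime is approximately normal regardless of how the individual
% periods are distributed, which is what the t-test used below assumes.
```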
- [21:45] Case studies evaluating LLVM’s optimizations with Stabilizer, on each benchmark across the whole benchmark suite
- Using Student’s t-test, aiming for p ≤ 0.05
- `-O2` is statistically significantly different from `-O1` on all benchmarks, and better on most; on some the difference is significant but tiny, and on a few it is much worse.
- “It turns out that compiler people run these same benchmarks, and overfit the way that they do these optimizations. And some of them lead to layout changes and it wasn’t actually the code change. And so we can actually distill out this effect”
- So `-O2` probably really is better than `-O1`.
- `-O3` vs `-O2` is statistically significantly different on about 3/4 of benchmarks, but the effect sizes are tiny, and some are worse.
- But counting benchmarks (“significant on such-and-such fraction of benchmarks”) is a bad way to do the analysis. Instead you pool the results and perform an Analysis of Variance on the pools to find out whether the groups differ.
- For the null hypothesis `-O3` = `-O2`, we get a p-value of 26.4%, meaning they don’t appear to differ.
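- For reference, the standard one-way ANOVA test statistic (textbook form, not from the slides):

```latex
% One-way ANOVA: k groups (optimization levels), n_j runs in group j,
% N = n_1 + \cdots + n_k total runs
F =
\frac{\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2 / (k - 1)}
     {\sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 / (N - k)}
% \bar{x}_j = mean of group j, \bar{x} = grand mean.  A large F (small p) means the
% group means differ by more than the within-group variation would explain;
% here p = 26.4%, so no evidence that -O3 and -O2 differ.
```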
- [25:45] So what do we do for our image search application?
- Normally we’d use a profiler. It captures number of calls to each function and runtime for each function.
- This lets us figure out what code is frequently run and/or runs for a long time.
- But these were developed in an era before parallel/async/concurrent execution
- What we want is something more like a ‘causal profiler’
- e.g., something that figures out “if i could speed up Component by this much, it would speed up Program by this much”
- Suppose we invented some magic for use in performance experiments that let us speed up a Component by various percentages and measure the results
- What do we do instead? A virtual speedup: rather than actually speeding up one component, find every component that may run concurrently with it and slow those down whenever it runs; relatively speaking, that has the same effect as speeding the component up.
- Sometimes optimizing things makes your program run slower. The intuition: if one part of your program gets faster, it can increase contention on a lock, which slows down the part that actually limits the program.
- [31:45] “What do we care about in [the image search app]? We care about two things: we care about how long it takes between a request and a response, right, aka latency? Traditional profilers don’t do this at all. It’s just total runtime. Oh! Let me get on my soapbox for a moment. Traditional profilers are about end-to-end runtime. You know how your servers are about end-to-end runtime? Or your browser? Like, if only your browser could quit faster. Right. So again, it was all about, like, here’s a program and I run it in console, and it does something and it’s done and that’s all I cared about. That’s not really today.”
- Okay, we also care about throughput
- So for our causal profiler, we want to add progress points / milestones in program execution. For a given program and workload, we tally up how many times these milestones are reached over the run.
- What if we care about latency? We add latency progress points: one at the beginning for a started transaction and one at the end for a finished transaction, and we subtract the two counts to figure out how many transactions are in flight within the system at a given moment.
- Little’s Law says that latency = (transactions in flight) / throughput; we already have throughput and the in-flight count, so we get latency
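- A minimal sketch of what adding progress points looks like with coz. The macro names (`COZ_PROGRESS`, `COZ_BEGIN`, `COZ_END`) and the `coz run --- <program>` invocation are the ones documented in the coz README (check your installed `coz.h` and the README for exact build flags); the workload below is a made-up stand-in, not the image search app.

```cpp
// progress_demo.cpp -- throughput and latency progress points for coz.
// Build with debug info (coz needs it):  g++ -g -O2 progress_demo.cpp -o progress_demo
// Profile with:                          coz run --- ./progress_demo
#include <coz.h>     // COZ_PROGRESS / COZ_BEGIN / COZ_END macros
#include <cmath>
#include <cstdio>

// Stand-in for one "transaction" of real work.
static double do_work(int i) {
    double x = 0.0;
    for (int k = 0; k < 100000; ++k) x += std::sin(i + k);
    return x;
}

int main() {
    double sink = 0.0;
    for (int i = 0; i < 5000; ++i) {
        COZ_BEGIN("request");   // latency progress point: a transaction entered the system
        sink += do_work(i);
        COZ_END("request");     // latency progress point: the transaction left the system
        COZ_PROGRESS;           // throughput progress point: one unit of work completed
    }
    std::printf("%f\n", sink);  // keep the work from being optimized away
    return 0;
}
```

- Coz combines the begin/end counts with throughput via Little’s Law (per the bullet above) to report latency.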
- `coz-profiler`, a causal profiler that ships with Debian/Ubuntu
- [35:00] So we didn’t build the image search app, but we found OSS that could be combined to build it, drawing on the Parsec suite of ready-made parallel applications: `ferret` (image comparison), `dedup` (compression), plus `sqlite` (storage)
- We looked at each of these and uncovered a number of surprising optimization opportunities
- Ferret
- Coz identified 3 specific lines as program bottlenecks, in 3 different sections of the application
- Ferret is pipelined, with 4 different stages (segmentation → feature extraction → indexing → ranking).
- So the approach was to reassign underutilized threads from the nonbottlenecked stage (feature extraction) to the bottlenecked stages. When ranking got boosted from 16 to 22 threads, that stage realized a 21% increase in throughput (which is what Coz had predicted)
- Dedup
- The image deduper works by cutting images up into the subparts that are identical across images versus the parts distinctive to only one image, and storing these subparts separately.
- It throws all these pieces into a hash function.
- The hash table is accessed concurrently, with each bucket having its own lock.
- Coz found that one particular bucket was important. So we first tried making many, many more buckets (aiming for fewer collisions), but it had no effect.
- So instead, looked at the hash function.
- “Well, you’re like, how can the hash function be broken. Like, we’ve been using hash functions since before Knuth wrote about them. Welllll, turns out people like to roll their own, because it’s fun.”
- The histogram of items per bucket revealed 2.3% bucket utilization. Fixing the hash function brought that to 82% bucket utilization.
- Blocks per bucket dropped from 96.7 to 2.09, roughly a 96% speedup in bucket traversal, which translated into about a 9% speedup in dedup overall; that again matched what Coz had predicted.
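- A way to spot this kind of problem yourself (my sketch, not dedup’s actual code): hash a batch of keys, count how many land in each bucket, and compare utilization and chain length for a weak hash versus a decent one (here 64-bit FNV-1a).

```cpp
// bucket_histogram.cpp -- compare bucket utilization for two hash functions.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Deliberately poor hash: only looks at the first byte, so almost every key collides.
static size_t bad_hash(const std::string& s) {
    return s.empty() ? 0 : static_cast<unsigned char>(s[0]);
}

// Reasonable general-purpose hash: 64-bit FNV-1a over the whole string.
static size_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ull;
    for (unsigned char c : s) { h ^= c; h *= 1099511628211ull; }
    return static_cast<size_t>(h);
}

template <typename Hash>
static void report(const char* name, Hash hash,
                   const std::vector<std::string>& keys, size_t nbuckets) {
    std::vector<size_t> count(nbuckets, 0);
    for (const auto& k : keys) ++count[hash(k) % nbuckets];
    size_t used = 0, longest = 0;
    for (size_t c : count) { if (c) ++used; if (c > longest) longest = c; }
    std::printf("%-6s  utilization %6.2f%%  avg chain %8.1f  longest chain %zu\n",
                name, 100.0 * used / nbuckets,
                used ? static_cast<double>(keys.size()) / used : 0.0, longest);
}

int main() {
    const size_t nbuckets = 1 << 16;
    std::vector<std::string> keys;
    for (int i = 0; i < 100000; ++i) keys.push_back("chunk-" + std::to_string(i));
    report("bad", bad_hash, keys, nbuckets);
    report("fnv1a", fnv1a, keys, nbuckets);
    return 0;
}
```

- Low utilization means long per-bucket chains, so every lookup traverses (and holds the per-bucket lock over) a long list, which is the effect described above.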
- SQLite
- SQLite sets up a table of function pointers (a vtable-like structure) at compile time based on flags. This keeps the code sane, but it results in indirect calls at runtime. Indirect calls aren’t slow per se, but they are overhead.
- An indirect call to an unlock routine takes about as long as the unlock itself, so it roughly doubles the cost (which adds up).
- Making direct calls resulted in a 25% speedup for this workload.
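- A toy model of the pattern (mine, not SQLite’s actual code): a table of function pointers selected by configuration, so every lock/unlock goes through an indirect call; replacing those with direct calls (or letting the compiler inline them) is the shape of the fix.

```cpp
// indirect_calls.cpp -- pluggable method table vs. direct calls to tiny functions.
#include <cstdio>

struct MutexMethods {
    void (*lock)(int*);
    void (*unlock)(int*);
};

// Stand-ins for the trivial single-threaded "lock"/"unlock" routines.
static void noop_lock(int* m)   { ++*m; }
static void noop_unlock(int* m) { --*m; }

// Filled in at startup based on configuration, so the calls below are indirect.
static MutexMethods methods;

int main(int argc, char**) {
    methods = { noop_lock, noop_unlock };   // imagine this choice depends on flags
    int m = argc;
    for (long i = 0; i < 100000000; ++i) {
        methods.lock(&m);    // indirect: load the function pointer, then call through it
        methods.unlock(&m);  // for functions this small, call overhead ~ the work itself
    }
    std::printf("%d\n", m);
    return 0;
}
```

- Measuring the gap honestly takes care (an optimizer may devirtualize a table it can see through), but the structural point stands: when the callee is nearly free, the indirection is a large fraction of the cost.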
- [41:40] Other optimizations made using these techniques