A Taxonomy of Tech Debt

2018-06-03

https://engineering.riotgames.com/news/taxonomy-tech-debt

Bill “LtRandolph” Clark

2018-04-10

Metrics

Impact

The first axis is the most obvious: the impact of the debt. This takes the form of player-facing issues (bugs, missing features, unexpected behavior), and developer-facing issues (slower implementation, workflow issues, random useless shit to remember). It’s worth noting that “developer” in this case can be anyone of any discipline.

Fix Cost

If we decide to fix an issue in our code or data, it will require someone’s measurable time to fix. If it’s a deeply rooted assumption that affects every line of code in the game, it may take weeks or months of engineering time. If it’s a dumb error in a single function, it may be fixable in a matter of minutes. Regardless of the time to implement a fix, though, we also must consider the risk of actually deploying that fix. Even a system I consider “wrong” can still be used as a tool to make a great game.

Contagion

If this tech debt is allowed to continue to exist, how much will it spread?
If a piece of tech debt is well-contained, the cost to fix it later compared to now is basically identical. You can weigh how much impact it has today when determining when a fix makes sense. If, on the other hand, a piece of tech debt is highly contagious, it will steadily become harder and harder to fix. What’s particularly gross about contagious tech debt is that its impact tends to increase as more and more systems become infected by the technical compromise at its core.
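To make the three axes concrete, here is a minimal sketch of how a debt item might be recorded and how contagion changes the "fix now vs. fix later" math. The 1-to-5 scales, the `DebtItem` type, the `projected_fix_cost` function, and the compounding growth model are all assumptions made up for illustration; the text above doesn't prescribe a formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    """One piece of tech debt scored on the three axes (assumed 1..5 scales)."""
    name: str
    impact: int     # 1 = barely noticeable, 5 = constant pain
    fix_cost: int   # 1 = minutes, 5 = months of engineering time
    contagion: int  # 1 = well-contained, 5 = spreads everywhere


def projected_fix_cost(item: DebtItem, periods: int,
                       growth_per_point: float = 0.25) -> float:
    """Toy model: fix cost compounds each period in proportion to contagion.

    A contagion score of 1 gives a growth rate of zero, so waiting costs
    nothing extra. Higher scores make the fix steadily more expensive.
    """
    rate = growth_per_point * (item.contagion - 1)
    return item.fix_cost * (1 + rate) ** periods


# Hypothetical examples: same cost today, very different cost after waiting.
contained = DebtItem("ugly parser internals", impact=2, fix_cost=3, contagion=1)
spreading = DebtItem("leaky API everyone copies", impact=2, fix_cost=3, contagion=4)
```

Under this toy model the contained item costs the same no matter how long you wait, while the spreading one gets more expensive every period. That is the whole argument for treating contagion as its own axis rather than folding it into impact.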

Types of Debt

Local Debt

As far as the rest of the game is concerned, the local system […] works pretty reliably. No one needs to keep the debt in mind as they develop around the system. But if anyone opens the lid and looks inside, they’ll be horrified, disgusted, or completely confused by what they see.
In general, local debt is defined by a low contagion score. If the impact is higher than the cost to fix, it tends to get fixed by a good citizen before too long.

MacGyver Debt

MacGyver debt is named after the TV show from the mid-80s. Angus MacGyver would solve problems using his Swiss Army knife, duct tape, and whatever else was on hand. His solutions often involved attaching two unlikely pieces; in the context of tech debt, this means two conflicting systems are “duct-taped” together at their interface points throughout the codebase.
The biggest cost to MacGyver debt tends to be the intellectual cost of switching modes when crossing boundaries.
The relative contagion of the new system vs. the old system is the key metric to keep an eye out for.
When considering whether to fix MacGyver debt, try to find ways to make the (global) better system more desirable at a local level. If a time-pressured engineer making greedy optimizations during their day to day work chooses to move forward towards the desired end state, then you’re well on your way.
The other approach that can work is to do brute-force large-scale refactors.

Foundational Debt

Foundational debt is when some assumption lies deep in the heart of your system and has been baked into the way the entire thing works.
A hilariously stupid piece of real-world foundational debt is the measurement system referred to as United States Customary Units.
Foundational debt tends to index highly on all three axes. The high cost encourages sticking with the janky system, which is often the right call, but the high impact and high contagion mean that fixing egregious foundational debt can have a huge payoff.
The most common strategy for fixing foundational debt that I’ve observed […] is to stand up the new system alongside the old one. If possible, I recommend then converting the foundational debt to MacGyver debt by slowly porting systems over to using the new system with conversion operations available to cross between new and old. This allows you to start reaping the benefits in targeted areas easily while limiting exposure to risk. Sometimes such a conversion isn’t possible, though. In that case, creating a compile time (or if possible, loading time) switch can help build confidence in the new system without forcing you to go all-in.
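As a rough illustration of that strategy, the sketch below borrows the customary-units example: an old system that thinks in inches, a new one that thinks in meters, conversion operations at the boundary, and a load-time switch. Every name here (`LegacyLength`, `Length`, `to_new`, `to_legacy`, the `USE_NEW_UNITS` environment variable) is hypothetical and invented for this example.

```python
import os
from dataclasses import dataclass

METERS_PER_INCH = 0.0254  # exact by definition of the international inch


@dataclass(frozen=True)
class LegacyLength:
    """Old foundational assumption: lengths are in inches."""
    inches: float


@dataclass(frozen=True)
class Length:
    """New system: lengths are in meters."""
    meters: float


def to_new(old: LegacyLength) -> Length:
    """Conversion operation for crossing from old code into new code."""
    return Length(old.inches * METERS_PER_INCH)


def to_legacy(new: Length) -> LegacyLength:
    """Conversion operation for crossing from new code into old code."""
    return LegacyLength(new.meters / METERS_PER_INCH)


def legacy_perimeter(width: LegacyLength, height: LegacyLength) -> LegacyLength:
    """Unported code path that still works in inches."""
    return LegacyLength(2 * (width.inches + height.inches))


def perimeter(width: Length, height: Length) -> Length:
    """Ported code path that works in meters."""
    return Length(2 * (width.meters + height.meters))


def use_new_system(env=os.environ) -> bool:
    """Load-time switch (hypothetical variable name)."""
    return env.get("USE_NEW_UNITS", "0") == "1"


def perimeter_any(width: Length, height: Length, env=os.environ) -> Length:
    """Callers speak the new units; the switch picks the implementation."""
    if use_new_system(env):
        return perimeter(width, height)
    # Old path: convert at the boundary, run legacy code, convert back.
    return to_new(legacy_perimeter(to_legacy(width), to_legacy(height)))
```

The conversion functions are the duct tape: they let ported and unported systems coexist while you migrate one area at a time. The switch lets you compare both paths on the same input before committing to the new one.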

Data Debt

Data debt starts with a piece of tech debt from one of the other categories.
But then a ton of content (art, scripts, sounds, etc.) gets built on top of that code deficiency. Before too long, fixing the initial tech debt becomes extremely risky and it becomes painfully hard to tell what you’ll break if you try to fix anything.
My favorite real-world example for understanding data debt is DNA. The genome of an organism is slowly built up over millions of years through lossy copies (mutations), transcription errors, and evolutionary pressure. Some copy errors are useless but benign, others are harmful, and others confer powerful advantages. Attempting to figure out what any piece of DNA actually does is incredibly difficult.
In general, data debt indexes high on cost to fix since it makes changes hard to evaluate. More worryingly, it’s almost always extraordinarily contagious due to a few properties of data (as opposed to code).
- First, it’s generally acceptable to create a new piece of data with a copy/paste of an existing piece of data. […] Any issues with an existing piece of data are propagated out to its descendants.
- Second, data is rarely subjected to technical review akin to code reviews. This makes it difficult to notice and halt the spread of bad practices even if they’re widely known.
- Finally, fixing any issue in the data typically requires a human being with eyes and a brain to verify; a compiler and formal logic won’t cut it.
When fixing data debt, I’ve observed two main approaches.
- The first I call the “do it right checkbox.” This means making a toggle between the old “broken” behavior and the new “fixed” behavior for data creators. Ideally you make the fixed version default while you make sure old content uses the broken version. Then, like with MacGyver debt, you can do a slow and steady replacement to get things onto the new version. This has a permanent cost of adding more and more crap to your editing UI.
- The second approach is the “just fix the damn thing” approach[…]. This means fixing the bug and then trying to repair all the data that’s meaningfully affected. Several techniques can make this less terrifying. First is doing a lot of greps and regex searching to try to understand the theoretical impact. Second is a bunch of targeted testing. Finally, you can prepare a toggle to enable reverting to the old behavior once the fix ships in case you missed something worse than the bug you’re fixing.
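Here is a minimal sketch of the “do it right checkbox,” assuming a made-up piece of content (`SpellData`) and a made-up legacy bug where a bonus gets truncated to an integer. The field name `use_fixed_scaling` and the loader are hypothetical. The point is only the shape: new content defaults to the fix, and old content that predates the field keeps the broken behavior until someone ports it.

```python
from dataclasses import dataclass


@dataclass
class SpellData:
    name: str
    base_damage: float
    bonus_ratio: float
    use_fixed_scaling: bool = True  # the "checkbox"; new content gets the fix


def load_spell(raw: dict) -> SpellData:
    """Load authored data. Content saved before the checkbox existed has no
    'use_fixed_scaling' key, so it stays on the old behavior."""
    return SpellData(
        name=raw["name"],
        base_damage=raw["base_damage"],
        bonus_ratio=raw["bonus_ratio"],
        use_fixed_scaling=raw.get("use_fixed_scaling", False),
    )


def damage(spell: SpellData, bonus_stat: float) -> float:
    bonus = spell.bonus_ratio * bonus_stat
    if spell.use_fixed_scaling:
        return spell.base_damage + bonus
    # Hypothetical legacy bug, preserved on purpose: the bonus is truncated.
    return spell.base_damage + int(bonus)


def still_on_old_behavior(spells: list) -> list:
    """Names of content left to port during the slow and steady replacement."""
    return [s.name for s in spells if not s.use_fixed_scaling]


old_spell = load_spell({"name": "old_bolt", "base_damage": 10.0, "bonus_ratio": 0.5})
new_spell = SpellData("new_bolt", base_damage=10.0, bonus_ratio=0.5)
```

The same flag, flipped globally instead of per item, is roughly what the revert toggle in the “just fix the damn thing” approach would look like.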