Unlearning descriptive statistics
If you’ve ever used an arithmetic mean, a Pearson correlation or a standard deviation to describe a dataset, I’m writing this for you. Better numbers exist to summarize location, association, and spread: numbers that are easier to interpret and that don’t act up with wonky data and outliers.
The average
The arithmetic mean is one of many measures of central tendency. One particularly useful feature of the mean is that, whenever we lack outside information like a scientific theory, it is our best possible guess for what to expect in the future.
Because the mean is so closely tied to the expected value, it’s a great number to use if you’re an economist, a gambler, or an economist gambler.
Often, however, we’re not interested in what we can expect but rather in what is typical, and these are two very different concepts. For example, some statistics about Homo sapiens: a mode of 2 legs, a median of 2 legs, a mean of 1.9 legs and an ordinal center of 1 leg.
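To make that concrete, here is a minimal sketch in Python with made-up leg counts (the exact proportions are invented for illustration):

```python
from statistics import mean, median, mode

# Hypothetical leg counts: most people have 2, a few have 1 or 0.
legs = [2] * 95 + [1] * 3 + [0] * 2

print(mode(legs))    # 2    - the most common value
print(median(legs))  # 2    - the middle value
print(mean(legs))    # 1.93 - corresponds to nobody in the dataset
```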
The problem with the arithmetic mean is that it does not correspond to anything or anyone, it just blends everything together. The median, on the other hand, can be interpreted as a typical sort of value.
For colors, months, countries, brand names and any other kind of data that is not quantitative and has no order to it, there is no median, and instead the most common values (including the mode) and least common values are a good way to indicate what’s typical and what’s not.
First caveat: multimodal distributions have more than one central or typical value, and they are trickier to describe. […] These local maxima (modes) are useful statistics, but often it pays to prod a little deeper and see why there’s more than one peak in the first place. In the case of human height […]: a typical adult woman is roughly 165 cm tall, a typical adult man roughly 175 cm. Once you split the data by gender, the bimodal distribution disappears.
Second caveat: continuing with our dataset of adult human height, note that there are more women than men on this planet. However, the typical adult human is not therefore a 170 cm tall woman. The median of a dataset with two or more dimensions is not accurately represented by the median of each individual dimension. What you want is a genuinely multidimensional center, such as the geometric median or the centerpoint. With very many dimensions, however, the concept of a central value becomes less and less useful.
The spread
The standard deviation measures how spread out different values are. Here’s how it works:
- you subtract the mean from each value
- you square each deviation from the mean
- you sum up the squared deviations and divide the sum by n
- you take the square root of the average squared deviation
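Those four steps translate directly into code. A quick sketch, using the divide-by-n flavor described above and a toy dataset:

```python
import math

values = [2, 4, 4, 4, 5, 5, 7, 9]
m = sum(values) / len(values)                  # the mean
squared_devs = [(x - m) ** 2 for x in values]  # squared deviations from the mean
variance = sum(squared_devs) / len(values)     # average squared deviation
sd = math.sqrt(variance)                       # standard deviation

print(m, variance, sd)  # 5.0 4.0 2.0
```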
Why would you square something only to take its square root a couple steps later? Well, it’s because we’re not interested in whether a value is above or below the mean, but rather we wish to know how far away it is from the mean in either direction.
We square the distances to the mean to make them positive… but why not just drop the signs and take absolute values instead? Squaring is a mathematical hack: computing the derivative of a squared difference is easy, but computing the derivative of an absolute difference is a pain in the neck, and we need that derivative for maximum likelihood estimation of statistical models.
Easy differentiation is nice, but not terribly relevant when all you want to do is describe the spread of your data.
The standard deviation lacks an easy interpretation. People who are new to statistics often seem to think it represents the average distance of a value from the mean, but it doesn’t. In normally distributed data, the standard deviation is about 25% larger than the mean absolute deviation.
When communicating how far apart values are, use the mean absolute deviation or the median absolute deviation (MAD).
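Both are one-liners. A small sketch with NumPy, on the same toy dataset as before:

```python
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Mean absolute deviation: the average distance of a value from the mean.
mean_abs_dev = np.mean(np.abs(values - values.mean()))

# Median absolute deviation (MAD): the median distance of a value from the median.
median_abs_dev = np.median(np.abs(values - np.median(values)))

print(mean_abs_dev, median_abs_dev)  # 1.5 0.5
```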
An acceptable substitute, also quite easy to interpret, is the interquartile range. […] Half of your data is in between these goal posts. The interquartile range is the measure of spread you will usually see pictured as the box in a boxplot.
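If you want to compute it yourself rather than read it off a boxplot, NumPy's percentile function will do. A quick sketch:

```python
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])
q1, q3 = np.percentile(values, [25, 75])  # the goal posts
iqr = q3 - q1                             # half of the data sits between them
print(q1, q3, iqr)
```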
The location
Statisticians and mathematicians are lazy: instead of devising one statistical method that works for data with a mean of 2 and a variance of 5, and another statistical method for data with a mean of 23 and a variance of 8.7, we shrink and squeeze and stretch the data until it fits the method we already have. These standardized numbers are called pivotal quantities: quantities that make no reference to the mean or variance or any other parameter of a statistical distribution, and they are used a lot in statistics.
One such pivotal quantity is the z-score. To convert a dataset into z-scores, subtract the mean from each value and then divide each value by the standard deviation. This rescales the data to a mean of 0 and a standard deviation of 1. Once in that standardized format, you can run all kinds of statistical tests, in particular Wald tests.
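A quick sketch of the transformation, with made-up test scores:

```python
import numpy as np

scores = np.array([12, 15, 9, 18, 11, 14, 16, 10])
z = (scores - scores.mean()) / scores.std()

# After the transformation the data has mean 0 and standard deviation 1,
# but its shape (skew, outliers and so on) is exactly the same as before.
print(z)
```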
Normalized data is also useful when comparing things. If you took a test and got 15 out of 20 questions right, is that below or above average, and exactly how far above or below?
Z-scores are great for statistical tests. As a basis for comparisons, they are flawed:
- you have to know a lot about statistics to interpret a z-score: that the normal distribution is symmetrical, that ±1 standard deviation corresponds to about 68% of the data and that ±2 standard deviations correspond to about 95% of the data;
- z-scores do not magically turn any data into a normal distribution - if you z-transform data that is skewed, the transformed data will still be skewed.
A more easily interpretable number is the percentile rank. […] Percentile refers to the actual value, percentile rank is the fraction of the data it corresponds to. You can calculate the percentile rank for any value in a dataset.
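SciPy has this built in. A quick sketch with made-up scores:

```python
from scipy.stats import percentileofscore

scores = [12, 15, 9, 18, 11, 14, 16, 10]

# 75.0: a score of 15 sits at the 75th percentile of this made-up list,
# i.e. three quarters of the scores are at or below it.
print(percentileofscore(scores, 15))
```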
As with the median, percentile ranks are immune to skew and kurtosis: regardless of whether most of your data is at the top or the bottom or the middle, the rank will give you a good idea of where in the data a value is located, while the z-score turns to gibberish when the underlying distribution of the data is not normal.
The skew
Data is skewed when it contains a disproportionate amount of small or large values, rather than the data being nicely spread out in both directions around the mean. If you graph the distribution, it will look lopsided, with the bulk of the data on one side and a long tail on the other. Negative skewness means the data is skewed to the left, which means it has a long left tail.
Skewness is a number that is used […] little in statistics.
How can we convey skewness if not through a statistic? For a technical audience, a QQ-plot can communicate how two distributions differ in shape. In every other situation, use a histogram. A histogram organizes the data into an arbitrary number of equal intervals, counts how many points fit in each interval, and plots those counts as a bar chart.
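For instance, a bare-bones histogram with matplotlib; the data here is invented for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Some right-skewed toy data (hypothetical incomes, say).
data = np.random.lognormal(mean=10, sigma=0.5, size=1000)

plt.hist(data, bins=30)  # 30 equal-width intervals; the count in each becomes a bar
plt.show()
```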
The outliers
One of the most basic statistical laws is that crazy things will happen, and more often than you’d think. As a result, there’s almost never a reason to pay particular attention to outliers.
But there are moments when you do need a way to spot anomalies, perhaps to detect fraud or malfunctioning machines.
It is common to look for outliers by identifying values that are more than 3 standard deviations from the mean. […] However, x deviations from the mean is a self-defeating heuristic: it relies on exactly those measures that are inflated or skewed by outliers. Instead, use the median and median absolute deviation as your basic metrics, and pick whatever multiplier is of practical relevance to you.
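Here is a small sketch of that recipe, with a made-up series of sensor readings and an arbitrary multiplier of 5:

```python
import numpy as np

values = np.array([3.1, 2.9, 3.0, 3.2, 2.8, 3.1, 9.7])  # one suspicious reading

center = np.median(values)
mad = np.median(np.abs(values - center))  # median absolute deviation
multiplier = 5                            # pick whatever cutoff suits your problem

outliers = values[np.abs(values - center) > multiplier * mad]
print(outliers)  # [9.7]
```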
Caveat: be warned that just as a geometric median is not the same as the componentwise medians, which we discussed earlier, an observation can be an outlier even when none of its individual facets are outlying. To stick to anatomical examples: it’s not uncommon to have a Y chromosome, and it’s not uncommon to be a woman, but it would be rather special if a woman had a Y chromosome.
Cook’s distance is one of a number of similar metrics that can detect outliers in multidimensional datasets. […] Instead of hunting for outliers per se, we leave out one observation from the model at a time and check whether this single observation affects the model parameters one way or the other, the idea being that something can only count as outlandish if it has an outlandish impact on how we see the world.
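Here is a rough sketch of that leave-one-out idea with an ordinary least-squares line fit. The data is invented, and what it measures is the raw shift in parameters rather than Cook’s distance proper, which also accounts for leverage and residual variance; libraries such as statsmodels will compute the real thing for you.

```python
import numpy as np

# Toy data: y roughly follows 2x plus noise, with one wild observation at the end.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 30.0])

full_fit = np.polyfit(x, y, deg=1)  # slope and intercept using every observation

for i in range(len(x)):
    # Refit the line with observation i left out...
    loo_fit = np.polyfit(np.delete(x, i), np.delete(y, i), deg=1)
    # ...and see how much the parameters move without it.
    shift = np.abs(full_fit - loo_fit).max()
    print(i, round(shift, 2))  # the last observation moves the line the most
```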
The correlation
A relationship between any two variables is an association, an association between two quantities (not gender or color but distance or weight) is a correlation.
Correlations range between -1 and 1, or ±100% if you prefer. Negative correlations simply mean that as one thing goes up, the other goes down.
The Pearson product-moment correlation is a measure of linear association. The most popular flavor of statistical regression builds on Pearson’s correlation by means of the variance-covariance matrix and is known as simple linear regression. It works by drawing lines. Lines are really simple mathematical objects, y = ax + b, which is why we like them so much. Statisticians can do all sorts of crazy things with lines that make them not lines anymore while they get to pretend that they still are. The squiggly curves of polynomial regression still count as linear regressions, for one.
Fundamentally, though, a correlation is still just a line, and not every relationship between two variables can be captured by a linear relationship that states for each additional x, increase y by this amount. Toxins are generally harmless below a certain threshold and then very quickly become dangerous. Cheaper goods sell more, but below a certain price point other factors weigh more heavily on our purchasing decisions.
So you might think, okay, easy fix, we just need a number that reflects nonlinear correlation.
But why do you want a number at all? When describing a dataset, as opposed to running statistical tests, there really isn’t the need to condense data down to a number because you don’t need that number for anything, it’s not the basis for any additional mathematics. Instead, just draw a scatterplot, which shows the relationship in all its messy glory, no matter how bendy or how straight.
Anscombe’s quartet is a famous example of four two-variable datasets that look very different when graphed but that nevertheless have an identical x-y correlation, as well as identical means and standard deviations.
If you do need to emphasize the underlying pattern and don’t care for all of the little dots of a scatterplot, use software to draw a spline over the conditional median or mean of y at every value of x.
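If your software doesn’t do this for you, a crude substitute is to take the median of y within equal-width bins of x and connect the dots. A sketch with invented data; proper smoothers (lowess, splines) live in statsmodels and scipy if you want something less jagged:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical nonlinear relationship with noise.
x = np.random.uniform(0, 10, 500)
y = np.sin(x) + np.random.normal(scale=0.3, size=500)

plt.scatter(x, y, s=5, alpha=0.4)  # the raw relationship, in all its messy glory

# Crude stand-in for a smoother: the median of y within equal-width bins of x.
bins = np.linspace(0, 10, 21)
centers = (bins[:-1] + bins[1:]) / 2
medians = [np.median(y[(x >= lo) & (x < hi)]) for lo, hi in zip(bins[:-1], bins[1:])]
plt.plot(centers, medians)

plt.show()
```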
Still not happy and absolutely want a number? You would do well to shun correlations even so. While statisticians are generally quite good at estimating a correlation from a picture and vice versa, most people are not. There’s a little game called Guess the Correlation; give it a try. Communicate the linear relationship between two variables through its slope instead: the for each additional x, y will increase by this amount idea we mentioned earlier.
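As a sketch of the difference, here are both numbers side by side for a small invented dataset. The slope comes with units you can reason about; the correlation does not:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([3.2, 4.1, 6.0, 6.8, 9.1, 9.9, 12.2, 13.0])

r = np.corrcoef(x, y)[0, 1]                      # Pearson correlation: hard to picture
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # "each extra x adds this much y"

print(round(r, 2), round(slope, 2))
```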
Postscript: why did nobody tell me this?
The discipline we call statistics is a two-headed beast. Descriptive statistics is the attempt to make sense of large amounts of data. Each observation brings its own idiosyncrasies, so we must distill the data down to easier-to-read summaries, charts and comparisons between groups. Inferential statistics then takes these summaries and judges whether they are likely to hold true in general or whether they contain quirks, patterns that are particular to just your data.
Statistics attracts people from many different backgrounds, but above all it attracts mathematicians. Descriptive statistics is a matter of communication, cognition, numeracy, even user experience. It’s quite the challenge to do right, but for those with a mathematical bent, descriptive statistics can be, well, a snoozefest, a discipline that barely rates above elementary arithmetic. Inferential statistics, on the other hand, is a theoretical delight.
The disdain of statisticians for descriptive work has contributed to a peculiar situation where innovations in visualization are generally the work of outsiders and fringe figures like John Tukey, William Cleveland and Edward Tufte. Another consequence is that the descriptive statistics we use so much–the mean, the standard deviation, the correlation–are our go-to numbers not because they are the nicest way to describe a dataset, but because they are useful building blocks for statistical inference.
It would be nice to have numbers that can do double duty, statistics that work equally well for description and inference. But those numbers do not exist. As a result, everything you know about descriptive statistics is biased towards inference. If you want to become truly great at communicating quantitative information, these not-quite-descriptives have got to go.