Probabilistic Counting with HyperLogLog++
The problem we are solving
Given a million objects, how many distinct colors do we see?
Stated formally, our goal is to estimate the cardinality of the set of colors.
Naively: keep a stack or queue of colors seen, and compare every new object against the ones previously seen. In the worst case every object has a different color, so for N distinct colors this is O(N) in memory and O(N) in time per object, hence O(N^2) time overall.
We could use a hashmap for set membership. Still O(N) in memory, but now only O(1) time per object, so O(N) in time overall.
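A minimal sketch of this exact hashmap/set approach, using Python's built-in set:

```python
def exact_distinct_count(objects):
    """Exact distinct counting with a hash set:
    O(1) expected time per object, but O(N) memory for N distinct colors."""
    seen = set()
    for obj in objects:
        seen.add(obj)
    return len(seen)

colors = ["red", "green", "blue", "red", "green", "red"]
print(exact_distinct_count(colors))  # 3
```

The memory cost is the problem HLL++ attacks: the set must hold every distinct value it has ever seen.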
The intuition
The intuition is built on coinflipping. How does coinflipping help?
If you flip a coin N times, what's the likelihood of getting one particular outcome (say, heads) all N times? 2^-N.
The probability of a run of 20 heads is about 1 in a million (2^-20). So the expectation is that you'd flip a coin roughly 20 × 500,000 times before your chance of having seen such a run exceeds 1 in 2. It'd take days to weeks.
Intuition: the length of your longest run roughly tells you how long you've spent flipping coins; a longest run of length X suggests about 2^X flips.
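A quick illustrative simulation of this intuition (a seeded toy experiment, not part of the algorithm): the longest run of heads in N fair flips tends to land near log2(N).

```python
import random

def longest_heads_run(n_flips, rng):
    """Flip a fair coin n_flips times; return the longest run of heads."""
    longest = current = 0
    for _ in range(n_flips):
        if rng.random() < 0.5:      # heads
            current += 1
            longest = max(longest, current)
        else:                       # tails resets the run
            current = 0
    return longest

rng = random.Random(42)             # seeded for reproducibility
n = 2**20                           # ~1 million flips
print(longest_heads_run(n, rng))    # typically close to log2(n) = 20
```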
The algorithm
Take our original object and pass it through a hashing function that emits a bit sequence determined by its color (a consistent, one-way transformation: the same color always produces the same bits).
Look at the right-hand end of the bit sequence and count the number of contiguous 0s. We maintain a single integer, a scorecard, which holds the maximum run length we've seen.
The more objects we look at, the higher we expect the value on that scorecard.
If at a given moment the scorecard value is X, then we've probably seen about 2^X distinct values. (A single scorecard is noisy; with the multi-scorecard refinement below, the estimate is usually within about 2% of the real value.)
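A toy single-scorecard estimator might look like this (a Python sketch; md5 is just a convenient stand-in for the fast non-cryptographic hash a real implementation would use):

```python
import hashlib

def trailing_zeros(x):
    """Count contiguous 0 bits at the right-hand end of x."""
    if x == 0:
        return 64                   # treat an all-zero hash as the max run
    return (x & -x).bit_length() - 1

def estimate_distinct(objects):
    """Single-scorecard estimate: track the max trailing-zero run
    across all hashes, then report 2^max."""
    scorecard = 0
    for obj in objects:
        # 64-bit hash of the object's value
        h = int.from_bytes(hashlib.md5(str(obj).encode()).digest()[:8], "big")
        scorecard = max(scorecard, trailing_zeros(h))
    return 2 ** scorecard

# A power of two, typically in the rough vicinity of the true count
print(estimate_distinct(f"color-{i}" for i in range(100_000)))
```

Note the estimate can only ever be a power of two, which is part of why a single scorecard is noisy.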
Refinements
[7:15] What happens if we just get unlucky, e.g. the first item seen has a color with a run of 20 zeroes at the end?
We keep multiple scorecards. The first bits of the hash value indicate which scorecard we will update, then we update that scorecard as before.
Now we have multiple scorecards. Each scorecard value X predicts about 2^X distinct values; we combine the predictions with their harmonic mean (divide the number of scorecards by the sum of the corresponding probabilities 2^-X), scaled by the number of scorecards and a small bias-correction constant. The number of scorecards is basically what you can tune to get a more precise result.
These scorecard values are small, so you can often store them in only 5 or 6 bits. The set of scorecard values is an HLL sketch.
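Putting the pieces together, a toy multi-scorecard sketch could look like the following (assuming 2^10 scorecards and md5 as a stand-in hash; it omits the small- and large-range corrections that HLL++ adds on top):

```python
import hashlib

P = 10                  # 2^10 = 1024 scorecards ("registers")
M = 1 << P

def _hash64(obj):
    """64-bit stand-in hash; real implementations use e.g. MurmurHash."""
    return int.from_bytes(hashlib.md5(str(obj).encode()).digest()[:8], "big")

def make_sketch(objects):
    registers = [0] * M
    for obj in objects:
        h = _hash64(obj)
        j = h >> (64 - P)                   # first P bits pick the scorecard
        rest = h & ((1 << (64 - P)) - 1)    # remaining bits
        # rank = 1 + number of contiguous trailing zeros in the rest
        rank = (64 - P) if rest == 0 else (rest & -rest).bit_length()
        registers[j] = max(registers[j], rank)
    return registers

def estimate(registers):
    """Harmonic-mean combination with the standard bias constant."""
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in registers)

sketch = make_sketch(f"color-{i}" for i in range(50_000))
print(round(estimate(sketch)))              # close to the true 50,000
```

With 1024 scorecards the relative error is a few percent, and the whole sketch fits in about 1024 × 6 bits.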
The union of two HLL sketches is lossless, meaning that if you need to estimate cardinality over a lot of objects, you can partition the data among several computers, have each computer generate an HLL sketch, and send those sketches to a central computer for merging.
Nitty gritty details
(not in video, added in a comment)
HLL++ struggles with low cardinalities, because powers of two are coarse-grained relative to small counts (the gap between 2^X and 2^(X+1) is large compared to a small estimate). Many implementations therefore start with an exact count, using a sorted list or hashmap, and switch to HLL++ once the unique-item count exceeds a threshold.
Because HLL++ operates in a fixed-size memory space, there are no memory-allocation overheads. Non-cryptographic hash functions are fast (e.g. MurmurHash with 64- or 128-bit output), and counting the number of leading/trailing 0s or 1s is often a single built-in processor instruction.
HLL thrives in situations where the underlying data can’t fit on one computer.
Cardinality estimation is useful in all sorts of situations: distinct session IDs, IP addresses, source-destination tuples.
HLL sketches merge by taking the maximum of each respective scorecard. This works because maximum is associative and commutative, so an N-way merge that takes the maximum on each scorecard is guaranteed to give the same result as a single computer building a single sketch out of all N subsets of the data.
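A minimal sketch of the merge rule, with small toy register lists standing in for real sketches:

```python
def merge(sketch_a, sketch_b):
    """Lossless union of two HLL sketches: elementwise maximum."""
    return [max(a, b) for a, b in zip(sketch_a, sketch_b)]

a = [3, 0, 5, 2]    # toy 4-scorecard sketches
b = [1, 4, 5, 0]
print(merge(a, b))  # [3, 4, 5, 2]
```

Because max is associative and commutative, sketches can be merged in any order (or in a tree across machines) and the result is identical.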
Video based on
Journal Papers
Flajolet, Philippe, et al. “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm.” Discrete Mathematics and Theoretical Computer Science, 2007. https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
Heule, Stefan, Marc Nunkesser, and Alexander Hall. “HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm.” Proceedings of the 16th International Conference on Extending Database Technology. 2013. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
Earlier work
Durand, Marianne, and Philippe Flajolet. “Loglog counting of large cardinalities.” Algorithms-ESA 2003: 11th Annual European Symposium, Budapest, Hungary, September 16-19, 2003. Proceedings 11. Springer Berlin Heidelberg, 2003.
Flajolet, Philippe, and G. Nigel Martin. “Probabilistic counting algorithms for data base applications.” Journal of computer and system sciences 31.2 (1985): 182-209.
Articles
https://towardsdatascience.com/hyperloglog-a-simple-but-powerful-algorithm-for-data-scientists-aed50fe47869
https://engineering.fb.com/2018/12/13/data-infrastructure/hyperloglog/