Logs vs Structured Events
notes date: 2019-03-06
source date: 2019-02-05
[Many] developers have internalized a set of rules and best practices for dealing with output data, in the context of “monolith application development in the early 2000s”.
Monolithic systems constraints/assumptions
- We have a monolith service, or a very small number of services. We can model the system in our heads.
- Logging is done to local disk, which is slow at runtime and expensive at hardware purchase time.
- Log lines are spat out inline with execution. A poorly placed printf can take the whole system down.
- Investigation is rare, and usually means a human reading error logs.
- Logging is of poor utility for understanding internal states or execution paths; you should just read the code or use a debugger. (There are few or no network hops between functions.)
Conclusion: Monolithic logging best practices
- Be stingy in what you log
- Use a debugger to understand internal states of the code
- Logs are a last resort
Distributed systems assumptions/constraints
- We have many services. A representative request will have “many” hops across “many” services.
- We cannot model the system in our heads; it would be a mistake to try. We rely on tooling as the source of truth for those systems.
- You may or may not have access to those services, or the systems your code runs on. There may or may not be a logging facility, or a centralized log aggregator. Your only view of the system is through the instrumentation of your code.
- Disks and system resources are cheap, ephemeral, all but disposable.
- Data services are similarly cheap. We can almost entirely silo application performance off from the cost of writing perf data out.
- Investigation is prohibitively slow and expensive for a human to do by hand. Many of the nodes or processes we need to inspect may no longer even exist, but their past states may still be relevant to us in understanding patterns to the present time.
- Investigation should usually be done in a distributed fashion, across all instantiations of your code, however many there might be, and in real time.
- Investigation requires computation, not just string search. We need to ask questions on the fly involving math, percentiles, breakdowns, and group-bys. And we need access to the raw requests in order to run accurate computations; no pre-aggregates.
- The hardest part usually isn’t debugging the code; it’s figuring out where the code you need to debug even is, or what the errors or outliers have in common from the perspective of the code. Fixing the code itself is often comparatively trivial, once found.
Logs can’t help you here.
- You are only going to get this kind of explorable, ad hoc, computation-friendly data if you take a radically new approach to how you output and aggregate telemetry. You’re going to need to replace your log lines and log levels with a different sort of beast: arbitrarily wide structured events that describe the request and its context, one event per request per service.
- Think of them as events. Yes, you can stash this stream in a file, but why would you? On what disk? Will that work for your serverless functions too? Just stream them over the network to wherever you want to put them.
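For illustration only (every field name here is made up, not a required schema), one such wide event for a single request through a single service might look like:

```python
# A hypothetical wide structured event: one per request, per service.
event = {
    "timestamp": "2019-02-05T17:32:08.123Z",
    "service": "billing-api",
    "trace.trace_id": "a9f1c3d2",
    "request.method": "POST",
    "request.path": "/v1/invoices",
    "request.user_id": "user_18422",          # high-cardinality fields are fine
    "request.header.accept": "application/json",
    "db.query.normalized": "SELECT * FROM invoices WHERE user_id = ?",
    "db.query.duration_ms": 12.4,
    "http.payments_service.duration_ms": 87.1,
    "response.status_code": 201,
    "duration_ms": 131.7,
    "error": None,
}
```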
How to gather and structure your data
- Instead of dribbling little pebbles of log effluvia throughout your code, do this.
- Initialize an empty blob at the beginning, when the request first enters the service.
- Stuff any and all interesting detail about the request into that blob throughout the lifetime of the request.
- Any unique ID, any high-cardinality variable, any headers passed in; every full query, normalized query, and query execution time; every HTTP call out to a remote service and its execution time.
- Then, when the request is about to exit or error, write the blob off to Honeycomb or another service, or to disk somewhere.
- You can see immediately how this method has radically different performance implications and risks than the earlier shotgun-spray approach. No more “oops, I accidentally put a print line INSIDE a for loop.” The write amplification profile is compressed. Most importantly, the incremental cost of capturing more detail about the request per service is nearly zero.
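A minimal sketch of this pattern in Python, under stated assumptions: the service name, field names, and collector URL are invented, and send() is just a plain HTTP POST standing in for whatever transport you actually use (libhoney, an agent, a queue):

```python
import json
import time
import urllib.request

# Assumption: any HTTP endpoint that accepts a JSON event; swap in your real transport.
COLLECTOR_URL = "https://collector.example.com/events"

def send(event: dict) -> None:
    """One network write per request: ship the finished blob, bypass local disk."""
    body = json.dumps(event).encode("utf-8")
    req = urllib.request.Request(
        COLLECTOR_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=2)

def do_work(request):
    # Stand-in for your actual business logic.
    return {"ok": True}

def handle_request(request):
    # 1. Initialize an empty blob when the request first enters the service.
    event = {"service": "billing-api", "request.path": request.path}
    start = time.monotonic()
    try:
        # 2. Stuff any and all interesting detail into the blob as you go:
        #    IDs, high-cardinality fields, queries, timings of calls out.
        event["request.user_id"] = request.user_id
        result = do_work(request)
        event["response.status_code"] = 200
        return result
    except Exception as exc:
        event["error"] = repr(exc)
        event["response.status_code"] = 500
        raise
    finally:
        # 3. One write as the request exits (or errors), not a print per step.
        event["duration_ms"] = (time.monotonic() - start) * 1000
        send(event)
```

The write-amplification point falls out of the structure: however much detail you add to the blob, there is still exactly one write per request per service.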
Distributed systems events best practices
- Best Practices
- Emit a rich record from the perspective of the request as it executes the code.
- Emit a single event per request per service that it hits.
- Bypass local disk entirely, write to a remote service.
- Sample if needed for cost or resource constraints. Practice dynamic sampling (a rough sketch appears after this list).
- Treat this like operational data, not transactional data. Be profligate and disposable.
- Feed this data into a columnar store, Honeycomb, or something similar.
- Now use it every day. Not just as a last resort. Get knee deep in production every single day. Explore. Ask and answer rich questions about your systems, system quality, system behavior, outliers, error conditions, etc. You will be absolutely amazed how useful it is … and appalled by what you turn up.
- Just think:
- No more doing multi-line regexps trying to look for the same request ID or user ID doing five suspicious things in a row.
- No more regexps at all, for fuck’s sake.
- No more bullshit percentiles that were computed at write time by averaging over a bunch of other averages
- No more having to jump around from dashboards to logs trying to vainly eyeball correlate one spike with another. No more wondering why no two tools can agree if anything even exists or not
- Just gather the detail you need to ask the questions when you need them, and store it in a single source of truth. It’s that simple.
- There’s no need to shame people for having learned best practices that worked perfectly well for a long time. You can either let them learn the hard way that this transformation is non-optional, or you can help them learn the easy way that it’s simply much better and easier to invest in this telemetry up front.
- The hardest part seems to be getting people to unlearn all the best practices they once learned for dealing with logs. So just don’t call it logs anymore, if that helps. Call it “structured events”.
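One way to read the dynamic sampling bullet above: keep every interesting event (errors, slow requests), keep only a fraction of the boring healthy traffic, and record the sample rate on each kept event so query-time math can re-weight the counts. A rough sketch, reusing the hypothetical send() helper from the earlier example; the thresholds and rates here are arbitrary:

```python
import random

def sample_and_send(event: dict) -> None:
    # Decide a keep rate per event based on how interesting it is.
    if event.get("error") or event.get("duration_ms", 0) > 1000:
        rate = 1        # errors and slow requests: keep all of them
    elif event.get("response.status_code") == 200:
        rate = 20       # healthy 200s: keep roughly 1 in 20
    else:
        rate = 5        # everything else: keep roughly 1 in 5
    if random.randrange(rate) == 0:
        event["sample_rate"] = rate   # stored so counts can be re-weighted at query time
        send(event)
```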