Parsing logs 230x faster with Rust
A single day of request logs [from RubyGems.org] is usually around 500 gigabytes on disk. We’ve tried some hosted logging products, but at our volume they can typically only offer us a retention measured in hours.
With gzip, the files shrink by about 92%, and with S3’s “infrequent access” and “reduced redundancy” tiers, it’s actually affordable to keep those logs in a bucket: each month’s worth of logs costs about $3.50 per month to store.
Buried in those logs, there are a bunch of stats that I’m super interested in: what versions of Ruby and Bundler are actively using RubyGems.org, or per-version and per-day gem download counts.
is this… big data?
So every day we generate about 500 files that are 85MB on disk, and contain about a million streaming JSON objects that take up 1GB when uncompressed. What we want out of those files is incredibly tiny: a few thousand integers, labelled with names and version numbers.
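To make that concrete, the output boils down to something like the sketch below; the type and field names are stand-ins for illustration, not the actual program’s.

```rust
use std::collections::HashMap;

// A rough sketch of the shape of output we want (names invented for
// illustration): a few thousand counters keyed by gem name/version and
// by the client versions seen in request user agents.
#[derive(Default, Debug)]
struct DailyStats {
    // ("rake", "12.3.1") -> downloads of that gem version today
    gem_downloads: HashMap<(String, String), u64>,
    // "2.5.1" -> requests made by that Ruby version today
    ruby_versions: HashMap<String, u64>,
    // "1.16.2" -> requests made by that Bundler version today
    bundler_versions: HashMap<String, u64>,
}

fn main() {
    let mut stats = DailyStats::default();
    *stats
        .gem_downloads
        .entry(("rake".into(), "12.3.1".into()))
        .or_insert(0) += 1;
    *stats.ruby_versions.entry("2.5.1".into()).or_insert(0) += 1;
    *stats.bundler_versions.entry("1.16.2".into()).or_insert(0) += 1;
    println!("{stats:?}");
}
```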
the slow way
I started by writing a proof-of-concept Ruby script.
Even on my super-fast laptop, my prototype script would take more than 16 hours to parse 24 hours’ worth of logs.
After setting it aside for a while, I noticed that AWS had just announced Glue, their managed Hadoop cluster that runs Apache Spark scripts.
python and glue
With 100 parallel workers, it took 3 wall-clock hours to parse a full day’s worth of logs and consolidate the results.
While 3 real-time hours is pretty great, […] it was using 300 CPU-hours per day of logs […]. That worked out to almost $1,000 per month.
maybe rust?
It turns out serde, the Rust JSON library, is super fast. It tries very hard to not allocate, and it can deserialize the (uncompressed) 1GB of JSON into Rust structs in 2 seconds flat.
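As a rough illustration of what that looks like, serde can borrow string slices straight out of the input buffer instead of allocating; the field names below are assumptions for illustration, not the real log schema.

```rust
use serde::Deserialize;

// Minimal sketch: each log line is one JSON object, deserialized into a struct
// that borrows &str slices from the input instead of allocating Strings.
// (Field names are assumptions, not the real RubyGems.org log schema.)
#[derive(Deserialize, Debug)]
struct LogLine<'a> {
    timestamp: &'a str,
    path: &'a str,
    user_agent: &'a str,
}

fn main() {
    let raw = r#"{"timestamp":"2018-10-23T00:00:00Z","path":"/gems/rake-12.3.1.gem","user_agent":"bundler/1.16.2 rubygems/2.7.6 ruby/2.5.1"}"#;
    let line: LogLine = serde_json::from_str(raw).expect("valid JSON");
    println!("{} {} {}", line.timestamp, line.path, line.user_agent);
}
```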
[nom] could parse a 1GB logfile in just 3 minutes, which felt like a huge win coming from ~30 minutes in Python on Glue.
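To give a flavor of the approach, here’s a minimal nom combinator over an invented line format; the real log layout isn’t shown in this post, so everything below is a stand-in.

```rust
use nom::{
    bytes::complete::{tag, take_until},
    IResult,
};

// A minimal sketch of a nom parser (the line format here is invented):
// skip ahead to the quoted user agent field and return its contents.
fn user_agent(input: &str) -> IResult<&str, &str> {
    let (input, _) = take_until("agent:\"")(input)?;
    let (input, _) = tag("agent:\"")(input)?;
    take_until("\"")(input)
}

fn main() {
    let line = r#"path:/gems/rake-12.3.1.gem agent:"bundler/1.16.2 ruby/2.5.1" status:200"#;
    match user_agent(line) {
        Ok((_rest, agent)) => println!("user agent: {agent}"),
        Err(e) => eprintln!("parse error: {e:?}"),
    }
}
```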
I went and rewrote my Rust program to use the regex crate, and sure enough it got 3x faster. Down to 60 seconds per file, or 30x as fast as Python in Spark in Glue.
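The shape of that version is roughly the sketch below: compile the pattern once, then reuse it for every line. The user agent and line format here are simplified stand-ins.

```rust
use regex::Regex;

fn main() {
    // Compile the pattern once, outside the per-line loop, so each line only
    // pays for matching. (The user agent shape here is a simplified stand-in.)
    let re = Regex::new(r"bundler/(\S+) rubygems/(\S+) ruby/(\S+)").unwrap();

    let lines = [
        r#"path:/gems/rake-12.3.1.gem agent:"bundler/1.16.2 rubygems/2.7.6 ruby/2.5.1""#,
        r#"path:/gems/rails-5.2.0.gem agent:"bundler/1.16.1 rubygems/2.7.3 ruby/2.4.4""#,
    ];
    for line in lines {
        if let Some(caps) = re.captures(line) {
            println!("bundler={} rubygems={} ruby={}", &caps[1], &caps[2], &caps[3]);
        }
    }
}
```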
release mode
Rerunning the exact same Rust program while passing the --release flag to cargo turned on compiler optimizations, and suddenly I could parse a 1GB log file in… 8 seconds.
thanks, rayon. thayon
Rust also has a parallel iteration library, Rayon. With a 5-character change to my program, Rayon ran the program against multiple log files at the same time. I was able to use all 8 cores on my laptop, [though] I only got a 3.3x speedup.
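The change is essentially swapping a sequential iterator for a parallel one. A sketch of what that looks like, where the file names and the parse function are placeholders rather than the actual program:

```rust
use rayon::prelude::*;

// Placeholder for the real per-file work: decompress, parse, count.
fn parse_file(path: &str) -> u64 {
    path.len() as u64
}

fn main() {
    let files = vec!["logs-000.gz", "logs-001.gz", "logs-002.gz"];

    // The whole change from sequential to parallel: `iter()` becomes `par_iter()`,
    // and Rayon spreads the per-file work across all available cores.
    let total: u64 = files.par_iter().map(|f| parse_file(f)).sum();
    println!("total: {total}");
}
```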
wait, how much?
With each log file taking about 23 seconds, and there being about 500 log files per day, it seemed like I would need about 350,000 seconds of Lambda execution time per month (23 seconds × 500 files × ~30 days ≈ 345,000 seconds).
Then, when I went to look up Lambda pricing, I noticed that it has a free tier: 400,000 seconds per month.