Command-line Tools can be 235x Faster than your Hadoop Cluster
Introduction
As I was browsing the web and catching up on some sites I visit periodically, I found a cool article about using Amazon Elastic MapReduce (EMR) and mrjob to compute statistics on win/loss ratios for chess games […].
Since the data volume was only about 1.75GB, containing around 2 million chess games, I was skeptical of using Hadoop for the task.
Since the problem is basically just to look at the result line of each game and aggregate the different outcomes, it seems ideally suited to stream processing with shell commands. I tried this out, and on the same task my laptop got the results in about 12 seconds (a processing speed of about 270MB/sec), while the Hadoop processing ran at about 1.14MB/sec and took about 26 minutes.
One especially under-used approach for data processing is using standard shell tools and commands. The benefits of this approach can be massive, since creating a data pipeline out of shell commands means that all the processing steps can be done in parallel.
An additional point is the batch versus streaming approach to the analysis. Tom mentions at the beginning of the piece that after loading 10,000 games and doing the analysis locally, he starts to run short on memory. This is because all of the game data is loaded into RAM for the analysis. Considering the problem for a moment, though, it can easily be solved with streaming analysis that requires essentially no memory at all. The stream processing pipeline we will create will be over 235 times faster than the Hadoop implementation while using virtually no memory.
Learn about the data
Sample data
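The games are stored in PGN format, and the only part of each game that matters for this analysis is the result header. PGN records it in one of three forms (white win, black win, draw):

```
[Result "1-0"]
[Result "0-1"]
[Result "1/2-1/2"]
```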
Build a processing pipeline
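A minimal sketch of such a pipeline, assuming the games sit in .pgn files in the working directory, simply streams the result lines out of every file and tallies the distinct outcomes:

```bash
# Stream all games, keep only the result lines, and count each distinct outcome.
cat *.pgn | grep "Result" | sort | uniq -c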
This is a very straightforward analysis pipeline, and it gives us the results in about 70 seconds. While we can certainly do better, assuming linear scaling at Hadoop's roughly 1.14MB/sec, the cluster would have taken approximately 52 minutes to process the same data.
So even at this point we already have a speedup of around 47 with a naive local solution. Additionally, the memory usage is effectively zero, since the only data stored is the actual counts, and incrementing three integers costs almost nothing in memory terms. However, looking at `htop` while this is running shows that `grep` is currently the bottleneck, with full usage of a single CPU core.
Parallelize the bottlenecks
This problem of unused cores can be fixed with the wonderful `xargs` command, which will allow us to parallelize the `grep`.
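As a sketch, the parallelized version might look like the following; the `-n4 -P4` values are illustrative and should be tuned to the number of available cores:

```bash
# find emits the file names; xargs fans them out to 4 parallel grep
# processes, 4 files per invocation. grep -h suppresses the file-name
# prefix so the counting step still sees bare result lines.
find . -type f -name '*.pgn' -print0 |
  xargs -0 -n4 -P4 grep -h "Result" |
  sort | uniq -c
```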
This `find | xargs mawk | mawk` pipeline gets us down to a runtime of about 12 seconds, or about 270MB/sec, which is around 235 times faster than the Hadoop implementation.
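A sketch of that final pipeline, reconstructed under the assumption that result lines take the PGN forms shown earlier:

```bash
# Each parallel mawk process tallies white wins, black wins, and draws
# across its share of the files; a second mawk sums the partial totals.
find . -type f -name '*.pgn' -print0 |
  xargs -0 -n4 -P4 mawk '
    /Result/ {
      split($0, a, "-");                    # [Result "1-0"] -> a[1] ends in 1, 0, or 2
      res = substr(a[1], length(a[1]), 1);  # 1 = white win, 0 = black win, 2 = draw
      if (res == 1) white++;
      if (res == 0) black++;
      if (res == 2) draw++;
    }
    END { print white + black + draw, white, black, draw }' |
  mawk '{ games += $1; white += $2; black += $3; draw += $4 }
        END { print games, white, black, draw }'
```

Splitting each result line on the dash means the last character before the dash is enough to classify the game, which keeps the per-line work (and therefore the bottleneck) tiny.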
Conclusion
Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools.