Scalability! But at what COST?
Abstract
We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation.
COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.
1 Introduction
“You can have a second computer once you’ve shown you know how to use the first one.” – Paul Barham
The published work on big data systems has fetishized scalability as the most important feature of a distributed data processing platform. While nearly all such publications detail their system’s impressive scalability, few directly evaluate their absolute performance against reasonable benchmarks.
1.1 Methodology
In this paper we take several recent graph processing papers from the systems literature and compare their reported performance against simple, single-threaded implementations on the same datasets using a high-end 2014 laptop. Perhaps surprisingly, many published systems have unbounded COST (i.e., no configuration outperforms the best single-threaded implementation) for all of the problems to which they have been applied.
2 Basic Graph Computations
2.1 PageRank
PageRank is a computation on directed graphs which iteratively updates a rank maintained for each vertex. In each iteration a vertex’s rank is uniformly divided among its outgoing neighbors, and then set to be the accumulation of scaled rank from incoming neighbors. A damping factor alpha is applied to the ranks, with the lost rank distributed uniformly among all vertices.
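To make the iteration concrete, a minimal single-threaded sketch in Rust might look as follows (the flat edge array, out-degree array, and function name are illustrative assumptions, not the implementation evaluated here):

    // One PageRank iteration over a flat edge array; degrees[v] holds the
    // out-degree of vertex v, and ranks holds the current rank of each vertex.
    fn pagerank_step(degrees: &[u32], edges: &[(u32, u32)], ranks: &[f32], alpha: f32) -> Vec<f32> {
        let n = ranks.len();
        // Every vertex starts with its share of the uniformly redistributed lost rank.
        let mut next = vec![(1.0 - alpha) / n as f32; n];
        for &(src, dst) in edges {
            // Each vertex divides its rank evenly among its outgoing neighbors,
            // scaled by the damping factor alpha.
            next[dst as usize] += alpha * ranks[src as usize] / degrees[src as usize] as f32;
        }
        next
    }

A full computation simply applies this step repeatedly (e.g., twenty iterations, or until the ranks stop changing appreciably).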
2.2 Connected Components
The connected components of an undirected graph are the maximal disjoint sets of vertices such that all vertices within a set are mutually reachable.
In the distributed setting, the most common algorithm for computing connectivity is label propagation. In label propagation, each vertex maintains a label (initially its own ID), and iteratively updates its label to be the minimum of its current label and all its neighbors’ labels. The process propagates the smallest label in each component to all vertices in the component, and the iteration converges once this happens in every component. The updates are commutative and associative, and consequently admit a scalable implementation.
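A minimal single-threaded sketch of label propagation in Rust, assuming the graph is given as n vertices and a flat undirected edge array (the representation and function name are assumptions):

    // Propagate minimum labels across edges until no label changes.
    fn label_propagation(n: usize, edges: &[(u32, u32)]) -> Vec<u32> {
        // Each vertex starts with its own ID as its label.
        let mut label: Vec<u32> = (0..n as u32).collect();
        let mut changed = true;
        while changed {
            changed = false;
            for &(u, v) in edges {
                // Propagate the smaller label across each edge, in both directions.
                let min = label[u as usize].min(label[v as usize]);
                if label[u as usize] != min { label[u as usize] = min; changed = true; }
                if label[v as usize] != min { label[v as usize] = min; changed = true; }
            }
        }
        label
    }

Note that the number of rounds grows with the diameter of the component, which is one source of the wasted work discussed in Section 3.2.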
3 Better Baselines
3.1 Improving graph layout
Up to this point, our single-threaded implementations have enumerated edges in vertex order, whereby all edges for one vertex are presented before moving on to the next vertex.
A single-threaded graph algorithm does not perform explicit communication, but edge ordering can have a pronounced effect on its cache behavior. For example, the edge ordering described by a Hilbert curve, akin to ordering edges (a, b) by the interleaving of the bits of a and b, exhibits locality in both a and b, rather than in just a as with the vertex ordering.
Converting the graph data to Hilbert curve order adds a pre-processing cost. The process amounts to transforming pairs of node identifiers (edges) into integers of twice as many bits, sorting these values, and then transforming back to pairs of node identifiers. Our implementation transforms the twitter_rv graph in 179 seconds using one thread, which can be a performance win even when pre-processing is counted against the running time.
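The encode-sort-decode pipeline is easy to sketch. The following Rust sketch uses the simpler bit-interleaved (Z-order) key that the Hilbert ordering is likened to above, rather than a true Hilbert curve; the function names and edge representation are assumptions:

    // Interleave the bits of a and b into one 64-bit key (a in odd positions,
    // b in even positions), so that nearby (a, b) pairs receive nearby keys.
    fn interleave(a: u32, b: u32) -> u64 {
        let mut key = 0u64;
        for i in 0..32 {
            key |= (((a >> i) & 1) as u64) << (2 * i + 1);
            key |= (((b >> i) & 1) as u64) << (2 * i);
        }
        key
    }

    // Sorting edges by their interleaved key is equivalent to the
    // encode-sort-decode pipeline described in the text.
    fn reorder_edges(edges: &mut [(u32, u32)]) {
        edges.sort_unstable_by_key(|&(a, b)| interleave(a, b));
    }

A true Hilbert curve permutes the bits further to avoid the long jumps of the Z-order curve, but the structure of the transformation is the same.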
3.2 Improving algorithms
The problem of properly choosing a good algorithm lies at the heart of computer science. The label propagation algorithm is used for graph connectivity not because it is a good algorithm, but because it fits within the “think like a vertex” computational model, whose implementations scale well. Unfortunately, in this case (and many others) the appealing scaling properties are largely due to the algorithm’s sub-optimality; label propagation simply does more work than better algorithms.
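One classic better algorithm for connectivity is union-find, which computes components in a single pass over the edges rather than iterating label exchanges to convergence. A minimal Rust sketch with path halving (the representation and names are assumptions):

    // Compute a component representative for each of n vertices in one pass
    // over the edges, using union-find with path halving.
    fn connected_components(n: usize, edges: &[(u32, u32)]) -> Vec<u32> {
        let mut root: Vec<u32> = (0..n as u32).collect();
        // Follow parent pointers to the representative, halving paths as we go.
        fn find(root: &mut [u32], mut x: u32) -> u32 {
            while root[x as usize] != x {
                root[x as usize] = root[root[x as usize] as usize];
                x = root[x as usize];
            }
            x
        }
        for &(u, v) in edges {
            let ru = find(&mut root, u);
            let rv = find(&mut root, v);
            if ru != rv {
                root[ru as usize] = rv; // merge the two components
            }
        }
        // Flatten so each vertex points directly at its representative.
        for i in 0..n {
            let r = find(&mut root, i as u32);
            root[i] = r;
        }
        root
    }

Unlike label propagation, this does work proportional to the number of edges (up to near-constant factors) regardless of component diameter, but it does not fit the “think like a vertex” model.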
4 Applying COST to prior work
4.1 PageRank
4.2 Graph connectivity
5 Lessons learned
To achieve scalable parallelism, big data systems restrict programs to models in which the parallelism is evident. These models may not align with the intent of the programmer, or the most efficient parallel implementations for the problem at hand. Map-Reduce intentionally precludes memory-resident state in the interest of scalability, leading to substantial overhead for algorithms that would benefit from it. Pregel’s “think like a vertex” model requires a graph computation to be cast as an iterated local computation at each graph vertex as a function of the state of its neighbors, which captures only a limited subset of efficient graph algorithms.
The cluster computing environment is different from the environment of a laptop. The former often values high capacity and throughput over latency, with slower cores, storage, and memory.
Finally, the implementation of the system may introduce overheads that conceal the performance benefits of a scalable design. High-level languages may facilitate development, but they can introduce performance issues (garbage collection, bounds checks, memory copies). It is especially common in a research setting to evaluate a new idea with partial or primitive implementations of other parts of the system (serialization, memory management, networking), asserting that existing techniques will improve the performance. Many of these issues might be improved with engineering effort that does not otherwise advance research, but it can be difficult to assess whether a system’s claimed benefits will still manifest once the fat is removed.
There are many good reasons why a system might have a high COST when compared with the fastest purpose-built single-threaded implementation. The system may target a different set of problems, be suited for a different deployment, or be a prototype designed to assess components of a full system. The system may also provide other qualitative advantages, including integration with an existing ecosystem, high availability, or security, that a simpler system cannot provide.
6 Future directions (for the area)
Fundamentally, a part of good research is making sure we are asking the right questions. “Can systems be made to scale well?” is trivially answered (in the introduction) and is not the right question. There is a substantial amount of good research to do, but identifying progress requires being more upfront about existing alternatives. The COST of a scalable system uses the simplest of alternatives, but is an important part of understanding and articulating progress made by research on these systems.