The research roots of finding “optimal routing” networks trace back to the late 1970s. Mathematicians defined special kinds of networks called “expanders“. These are graphs with strong connectivity properties guaranteeing no subset of vertices can be isolated from the rest. In 1976, Leslie Valiant gave one of the earliest discussions of such graphs. Following work on Alon-Boppana on trying to understand the best “possible” expanders, mathematicians (notably, Lubotzky, Phillips, and Sarnak) gave constructions of such optimal expanders. These were intricate designs, used advanced number theory, and only work for specific network sizes and degrees.
Could there be a simpler, general purpose construction? In 1991, Friedman showed that a randomly wired network is, with high probability, nearly as good an expander as the best explicit construction. (A recent mathematical result of 2023 actually shows that random graphs match this bound.) The implication was tantalizing: if you want an optimal network for routing, you could simply wire it at random.
Meanwhile, the networking industry took a different path entirely. Inspired by Clos interconnects in switches, since the mid-1980s, communications networks were built on the fat-tree topology (a folded Clos) with two, three, or more layers of switches. As cloud computing exploded in the late 2000s, fat-trees were scaled up with increasing sophistication. In 2009, nine of us lead by Albert Greenberg published “VL2: A Scalable and Flexible Data Center Network”, which pushed the fat-tree architecture to new heights by introducing flat addressing and — notably — randomized Valiant Load Balancing to spread traffic uniformly across network paths. In 2019, the VL2 paper was awarded the SIGCOMM test of Time award. The VL2 work demonstrated that even within structured topologies, randomization of traffic (if not of topology) improved performance. But the underlying network remained hierarchical, rigid, and complex to cable.
In 2012, researchers at the University of Illinois connected random graphs and data center networks in a proposal called Jellyfish. This work generated much follow-on work. Being based on simple theoretical models and simulations, it had left critical hard problems open. Routing in random graphs is tricky because there are many more diversified paths data can take; cabling is harder because endpoints are chosen randomly; and operations become unpredictable. Building random networks at scale remained an elusive target: routing, cabling, and operations were the three unsolved challenges.
RNG (Resilient Network Graphs) history
In 2023, Giacomo Bernardi (AWS principal scientist) started to investigate whether datacenter routers could be arranged in a flat network following Penrose tiling, a geometrical construction where shapes tessellate without ever quite repeating. Ratul Mahajan, an Amazon Scholar and Professor at the University of Washington, was intrigued. The two spent months exploring the idea, building simulations, and pushing the concept as far as it would go.
By mid-2024, their research was hitting a wall: Penrose tiling was promising on paper, but the simulated network was unreliable, and the efficiency gains fell short. They achieved dramatically better outcomes when they replaced structure with randomness. It became an inside joke: “just be random!”.
But there was still a gap: available theory did not address how to build such flat networks at Amazon’s scale. New models needed to be developed to predict performance, guarantee resilience, and make the design operable. So Bernardi and Mahajan sent a Slack message on an internal channel: “any random graph experts here?”. Seshadhri Comandur — an Amazon Scholar and professor of theoretical computer science — enthusiastically joined the effort.
The team tackled the three blockers head-on. For routing, they developed Spraypoint, a forwarding scheme that exploits the expansion properties of the graph to distribute traffic without overwhelming router memory with forwarding state. For cabling, they developed the ShuffleBox—a passive optical device whose internal wiring combined with randomized ShuffleBox-to-ShuffleBox cabling yields “quasi-random” graphs that behave like truly random graphs. For operations, they designed RNG to work with the exact same routers and optics already deployed in fat-tree data centers, built software tooling that translates the abstract graph into port-by-port installation instructions and diagnostics, and developed models (detailed in the research paper) that predict fabric performance from design parameters — allowing deployments to be validated mathematically before being built physically.
The trio now had a design that worked in theory, but no proof it could work in practice. Matt Rehder, VP of Network Engineering, issued a challenge: “If you want to demonstrate that it works, go build the proposed design in an actual data center.” And so they did with a help of small team. The first RNG data center was built near Dublin, Ireland in 2024.
By 2025, the team had learned so much from the data center experiment that they made a bold decision: tear down the network, perfect the design, and build two more data centers networks — one in Germany, one in Spain. The results were striking: compared to traditional fat-tree networks, RNG uses 69% fewer routers, delivers 33% higher throughput, cuts network power by 40%, and lowers operating costs by 27. In early 2026, RNG became the default design for most newly built Amazon data centers globally.
Relative advantages of RNG over Fat Trees
- Resilience: in an RNG fabric, no single router is more critical than any other. The loss of 1% of routers results in a roughly 1% capacity loss — degradation is proportional and predictable rather than catastrophic. In a fat tree network, losing the wrong spine switch can take down a disproportionate share of capacity.
- Efficiency: because all paths through the network are statistically equivalent, capacity is fungible. There is no “stranded bandwidth” locked behind a particular layer — any available capacity can serve any traffic demand.
- Incremental scalability: unlike fat-trees, which come in fixed sizes dictated by switch radix and layer count, an RNG fabric can be scaled up continuously. You add routers and connections without redesigning the topology or hitting a capacity cliff — the graph simply grows.
Relative limitations of RNG (and mitigations)
- Operational complexity: paths through a random graph are less predictable than in a tree, making troubleshooting harder with conventional tools. We mitigate this with purpose-built diagnostic software that gives operators visibility into traffic distribution and fault localization despite the lack of hierarchical structure.
- Performance guarantees are stochastic rather than deterministic. The worst case performance (for metrics such as number of hops and oversubscription) is known, but for RNG our models are stochastic (i.e., the worst case performance is known with high probability). This is a weaker limitation than it might appear. Fat-tree guarantees are also effectively stochastic once you account for real-world failures, which are frequent at scale. RNG simply makes the stochastic nature explicit and designs for it from the start.
References
- RNG research paper: https://arxiv.org/abs/2604.15261
- About Amazon story: https://www.aboutamazon.com/stories/aws-random-graph-theory-data-center-network-design?&utm_term=36
- Amazon Science story: https://www.amazon.science/blog/how-flat-is-replacing-fat-in-aws-data-center-networks
- Video explainer on Youtube: https://www.youtube.com/watch?v=yDoRYRRPOA0
- Relevant images: https://amazongca.getbynder.com/share/B6E5A14E-AFFB-43AD-83599AFEABCBAB6A/?viewType=grid
- “VL2: A Scalable and Flexible Data Center Network” by Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, Dave A. Maltz, Parveen Patel, and Sudipta Sengupta. SIGCOMM 2009





