Jay's blurblog

Flat Datacenter Networks at Scale by James Hamilton
Monday June 8th, 2026 at 10:53 PM

Perspectives

The research roots of finding “optimal routing” networks trace back to the late 1970s. Mathematicians defined special kinds of networks called “expanders”. These are graphs with strong connectivity properties guaranteeing that no subset of vertices can be isolated from the rest. In 1976, Leslie Valiant gave one of the earliest discussions of such graphs. Following work by Alon and Boppana on understanding the best possible expanders, mathematicians (notably Lubotzky, Phillips, and Sarnak) gave constructions of such optimal expanders. These were intricate designs that used advanced number theory and only worked for specific network sizes and degrees.

Could there be a simpler, general purpose construction? In 1991, Friedman showed that a randomly wired network is, with high probability, nearly as good an expander as the best explicit construction. (A recent mathematical result of 2023 actually shows that random graphs match this bound.) The implication was tantalizing: if you want an optimal network for routing, you could simply wire it at random.
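To make "wire it at random" concrete, here is a minimal Python sketch. It assumes the textbook pairing model for sampling a random regular graph, which is an illustrative choice and not the construction used in RNG. The function names, the seed, and the sizes (64 routers, 4 links each) are all hypothetical. It does not measure spectral expansion directly; it only checks one visible symptom of good expansion, a small hop-count diameter.

```python
import random
from collections import deque


def eccentricity(adj, source):
    """Largest hop count from source via BFS, or None if a node is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values()) if len(dist) == len(adj) else None


def random_regular_graph(n, d, rng):
    """Sample a simple, connected d-regular graph on n nodes (pairing model).

    Every node gets d "stubs" (think: ports). Stubs are shuffled and paired
    up. Samples with self-loops, parallel links, or disconnected parts are
    rejected and redrawn.
    """
    if (n * d) % 2 or d >= n:
        raise ValueError("need n*d even and d < n")
    while True:
        stubs = [node for node in range(n) for _ in range(d)]
        rng.shuffle(stubs)
        adj = {node: set() for node in range(n)}
        simple = True
        for a, b in zip(stubs[::2], stubs[1::2]):
            if a == b or b in adj[a]:
                simple = False
                break
            adj[a].add(b)
            adj[b].add(a)
        if simple and eccentricity(adj, 0) is not None:
            return adj


def diameter(adj):
    """Worst-case hop count between any two nodes."""
    return max(eccentricity(adj, s) for s in adj)


rng = random.Random(2026)  # fixed seed so the sketch is repeatable
graph = random_regular_graph(n=64, d=4, rng=rng)
print("diameter in hops:", diameter(graph))
```

Rejection sampling is wasteful for large degrees, but it keeps the sketch short and makes every accepted graph a valid simple regular graph.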

Meanwhile, the networking industry took a different path entirely. Inspired by Clos interconnects in switches, since the mid-1980s, communications networks were built on the fat-tree topology (a folded Clos) with two, three, or more layers of switches. As cloud computing exploded in the late 2000s, fat-trees were scaled up with increasing sophistication. In 2009, nine of us led by Albert Greenberg published “VL2: A Scalable and Flexible Data Center Network”, which pushed the fat-tree architecture to new heights by introducing flat addressing and, notably, randomized Valiant Load Balancing to spread traffic uniformly across network paths. In 2019, the VL2 paper was awarded the SIGCOMM Test of Time Award. The VL2 work demonstrated that even within structured topologies, randomization of traffic (if not of topology) improved performance. But the underlying network remained hierarchical, rigid, and complex to cable.
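The idea behind Valiant Load Balancing can be shown with a toy Python sketch: each flow is bounced off a randomly chosen intermediate switch before heading to its destination. This is a simplified model with a flat list of switches, not the VL2 implementation. The function name, the switch count, and the flow count are hypothetical.

```python
import random
from collections import Counter


def vlb_path(src, dst, switches, rng):
    """Two-phase path: src -> random intermediate -> dst."""
    candidates = [s for s in switches if s not in (src, dst)]
    mid = rng.choice(candidates)
    return [src, mid, dst]


rng = random.Random(7)  # fixed seed so the sketch is repeatable
switches = list(range(16))

# Skewed demand: every one of the 14,000 flows goes from switch 0 to switch 1.
load = Counter(vlb_path(0, 1, switches, rng)[1] for _ in range(14_000))
print("flows per intermediate, min and max:",
      min(load.values()), max(load.values()))
```

Even though the demand is concentrated on a single source and destination pair, the intermediate hop spreads the flows roughly evenly over the other 14 switches.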

In 2012, researchers at the University of Illinois connected random graphs and data center networks in a proposal called Jellyfish. This work generated much follow-on work. Being based on simple theoretical models and simulations, it had left critical hard problems open. Routing in random graphs is tricky because there are many more diversified paths data can take; cabling is harder because endpoints are chosen randomly; and operations become unpredictable. Building random networks at scale remained an elusive target: routing, cabling, and operations were the three unsolved challenges.

RNG (Resilient Network Graphs) history

In 2023, Giacomo Bernardi (AWS principal scientist) started to investigate whether datacenter routers could be arranged in a flat network following Penrose tiling, a geometrical construction where shapes tessellate without ever quite repeating. Ratul Mahajan, an Amazon Scholar and Professor at the University of Washington, was intrigued. The two spent months exploring the idea, building simulations, and pushing the concept as far as it would go.

By mid-2024, their research was hitting a wall: Penrose tiling was promising on paper, but the simulated network was unreliable, and the efficiency gains fell short. They achieved dramatically better outcomes when they replaced structure with randomness. It became an inside joke: “just be random!”.

But there was still a gap: available theory did not address how to build such flat networks at Amazon’s scale. New models needed to be developed to predict performance, guarantee resilience, and make the design operable. So Bernardi and Mahajan sent a Slack message on an internal channel: “any random graph experts here?”. Seshadhri Comandur — an Amazon Scholar and professor of theoretical computer science — enthusiastically joined the effort.

The team tackled the three blockers head-on. For routing, they developed Spraypoint, a forwarding scheme that exploits the expansion properties of the graph to distribute traffic without overwhelming router memory with forwarding state. For cabling, they developed the ShuffleBox—a passive optical device whose internal wiring combined with randomized ShuffleBox-to-ShuffleBox cabling yields “quasi-random” graphs that behave like truly random graphs. For operations, they designed RNG to work with the exact same routers and optics already deployed in fat-tree data centers, built software tooling that translates the abstract graph into port-by-port installation instructions and diagnostics, and developed models (detailed in the research paper) that predict fabric performance from design parameters — allowing deployments to be validated mathematically before being built physically.

The trio now had a design that worked in theory, but no proof it could work in practice. Matt Rehder, VP of Network Engineering, issued a challenge: “If you want to demonstrate that it works, go build the proposed design in an actual data center.” And so they did, with the help of a small team. The first RNG data center was built near Dublin, Ireland in 2024.

By 2025, the team had learned so much from the data center experiment that they made a bold decision: tear down the network, perfect the design, and build two more data center networks, one in Germany and one in Spain. The results were striking: compared to traditional fat-tree networks, RNG uses 69% fewer routers, delivers 33% higher throughput, cuts network power by 40%, and lowers operating costs by 27%. In early 2026, RNG became the default design for most newly built Amazon data centers globally.

Relative advantages of RNG over Fat Trees

Resilience: in an RNG fabric, no single router is more critical than any other. The loss of 1% of routers results in a roughly 1% capacity loss — degradation is proportional and predictable rather than catastrophic. In a fat tree network, losing the wrong spine switch can take down a disproportionate share of capacity.
Efficiency: because all paths through the network are statistically equivalent, capacity is fungible. There is no “stranded bandwidth” locked behind a particular layer — any available capacity can serve any traffic demand.
Incremental scalability: unlike fat-trees, which come in fixed sizes dictated by switch radix and layer count, an RNG fabric can be scaled up continuously. You add routers and connections without redesigning the topology or hitting a capacity cliff — the graph simply grows.

Relative limitations of RNG (and mitigations)

Operational complexity: paths through a random graph are less predictable than in a tree, making troubleshooting harder with conventional tools. We mitigate this with purpose-built diagnostic software that gives operators visibility into traffic distribution and fault localization despite the lack of hierarchical structure.
Performance guarantees are stochastic rather than deterministic. For fat-trees, the worst-case performance (for metrics such as number of hops and oversubscription) is known exactly, but for RNG our models are stochastic (i.e., the worst-case performance is known only with high probability). This is a weaker limitation than it might appear. Fat-tree guarantees are also effectively stochastic once you account for real-world failures, which are frequent at scale. RNG simply makes the stochastic nature explicit and designs for it from the start.

References

RNG research paper: https://arxiv.org/abs/2604.15261
About Amazon story: https://www.aboutamazon.com/stories/aws-random-graph-theory-data-center-network-design?&utm_term=36
Amazon Science story: https://www.amazon.science/blog/how-flat-is-replacing-fat-in-aws-data-center-networks
Video explainer on Youtube: https://www.youtube.com/watch?v=yDoRYRRPOA0
Relevant images: https://amazongca.getbynder.com/share/B6E5A14E-AFFB-43AD-83599AFEABCBAB6A/?viewType=grid
“VL2: A Scalable and Flexible Data Center Network” by Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, Dave A. Maltz, Parveen Patel, and Sudipta Sengupta. SIGCOMM 2009

JayM

11 hours ago

Atlanta, GA

Trump administration to remove 900 deep sea monitoring instruments that would have studied the collapsing Atlantic current by Adam Kovac
Monday June 8th, 2026 at 4:28 PM

Latest from Live Science

The Ocean Observatories Initiative has been collecting data on physical, chemical, geological and biological conditions in the Atlantic and Pacific Oceans for the past decade

JayM

18 hours ago

Oh jeesh.

Atlanta, GA

Leader’s Week 2026: How I Lead a 50+ Person Org on a 4-Day Workweek by Mirek Stanek
Monday June 8th, 2026 at 4:26 PM

Practical Engineering Management

People often ask me how I manage a 40-person platform engineering organization and an entire R&D site at the same time, run a few side gigs that grow into startups, and still log 1,500 km a year training for running races.

The assumption is usually that I have some secret productivity hack or that I’m just working all the time. But working harder is a trap.

Some time ago, I successfully negotiated a transition to a four-day workweek. My team’s output didn’t drop. The expectations placed on me didn’t shrink. I didn’t step back from leadership—I simply stepped away from the noise.

I made it because I am fundamentally a maker, a creator, and a designer. I needed the autonomy to build things on my own terms.

Pulling this off required proving to the business that my value wasn’t tied to the number of days I sat at my desk, but to the tangible, high-leverage outcomes I delivered.

The idea behind my system, the Leader’s Week, is simple: focus on a single week, plan those days ruthlessly at the beginning, and objectively sum up the impact at the end.

This framework isn’t just another productivity hack. It is how you shift from being a reactive implementer to a proactive problem-solver. It is the ultimate tool for earning the trust and autonomy required to take control of your schedule.

The Philosophy

Your job is to focus on the right things (and build a product, not just code)

First-level managers—Tech Leaders, Engineering Managers, or Team Leaders—are the linchpins of any tech organization. Your difficult job is to mix hands-on technical reality with strategic business thinking.

But here is where many leaders get it wrong: they treat their teams as simple implementers. A backlog of tickets comes in, and code goes out.

To truly maximize your impact, you need to cultivate a “product sense” within your team. Engineers must be treated as partners in product discovery, not just a delivery mechanism. First and foremost, you need to ensure the team is solving actual business problems, not just writing code for the sake of interesting architecture.

What if you lead a backend, DevOps, or platform team that doesn’t build customer-facing features? The mindset doesn’t change. You are still building a product; your customers are simply internal developers, and your metrics are reliability, developer experience, and deployment speed.

If the “why” behind your projects isn’t clear, you need to anchor your team to the four factors essential to the success of any business.

Every problem you solve should map to one of these:

Growth: Reaching new customers and expanding market share. (e.g., Optimizing an onboarding flow, or ensuring the backend scales to handle a 5x traffic spike).
Expansion: Offering new products and services to existing customers. (e.g., Shipping a highly requested feature, or abstracting a service so it can be reused in a new market).
Profitability: Ensuring you have the resources to continue operating. (e.g., Slashing AWS costs, migrating from expensive enterprise software to an open-source alternative, or automating manual QA).
Customer Satisfaction: Ensuring customers keep coming back. (e.g., Fixing critical bugs, reducing latency by 200ms, or achieving 99.99% uptime).

When planning key initiatives for a week, start with this assessment:

Ask yourself: Are we just writing code, or are we driving Growth, Expansion, Profitability, or Customer Satisfaction?

Outcomes, not inputs (Actionable reality over theoretical hype)

Once you know which of the four business factors you are targeting, you have to measure your success correctly.

In product-minded engineering, the effort you put in is just a cost. The code you write is a liability. The only thing that actually matters is the outcome you achieve.

You can build the most theoretically perfect, hype-driven microservice architecture in the world, but if it doesn’t solve a user problem or move a business metric, it was a waste of a week.

So, what does an actionable outcome look like?

Increased conversion by X% (e.g., The drop-out rate of the sign-up flow went down by 10% after rewriting the validation logic).
Increased revenue by Y% (e.g., Enabled Stripe integration for 3 new European markets, unlocking new MRR).
Reduced infrastructure cost by Z% (e.g., Optimized database queries and adjusted auto-scaling, dropping monthly cloud spend by 20%).
Increased iteration speed (e.g., Cut CI/CD pipeline build times from 45 minutes to 8 minutes, allowing developers to ship 5x faster).

Notice there are almost no technicalities in the primary objectives. You must always know the tangible good you are bringing to your customers or the company.

When you sit down to plan your Leader’s Week, write these outcomes at the very top. Then, list the specific engineering initiatives that will get you there.

Example:
If the outcome is “Reduce infrastructure costs by 15%,” your key initiatives for the week might be:

Identify and shut down orphaned staging environments.
Refactor the 3 most expensive database queries identified in Datadog.
Migrate image storage to a cheaper bucket tier.

At the end of the week, you don’t judge yourself on how many hours you spent looking at AWS dashboards.

You judge yourself by the outcome: Did the daily spend drop?

The Setup

The First Thing on Monday (Design Before You Execute)

When you step into the office on Monday morning, the easiest thing to do is open Slack, check your email, or look at the latest alerts.

Do not do this.

The moment you open those channels, your week is no longer yours. You become reactive, immediately pulled into other people’s priorities, urgent-but-unimportant fires, and administrative noise.

Think of this like building a physical product. A maker does not start cutting material or firing up the 3D printer without a finished CAD model. As a leader, your week is the material. You cannot start executing without a blueprint.

Planning the week must be the absolute first thing you do. Block the first 30 to 45 minutes of Monday morning in your calendar. During this time, you are unreachable. You sit down with your long-term objectives, your strategic context, and your Leader’s Week template. You define the value of the next few dozen hours before a single line of code is written or a single 1:1 is held.

If figuring out your plan takes longer than an hour—for instance, because priorities are muddy or you lack strategic clarity—then getting that clarity becomes your first actionable goal.

The “One Priority” Rule (Protecting Deep Work)

There is only one, single, non-negotiable priority for the week.

This is the hardest rule for leaders to accept, but it is the most vital lesson of this framework. When you operate with severe time constraints—like a four-day work week—you cannot afford the illusion of multitasking. If you have five priorities, you have none. You will end Friday having moved five things 10% of the way to completion, generating zero actual value.

Your One Priority is the initiative that, if absolutely nothing else gets done by the end of the week, still makes the week a success.

Because it is the most important thing, it requires the most protected resource you have: uninterrupted deep work.

You must carve out at least 8 hours in your week dedicated solely to this subject. Do not sprinkle this across 30-minute gaps between meetings. Block out massive chunks of time—like two 4-hour sessions—to immerse yourself entirely. Whether that is drafting a critical architectural document, untangling a complex scaling issue, or mapping out the next quarter’s product expansion, this time is sacred.

Timeboxing the Rest

Once the One Priority is locked in, the rest of the week falls into place through aggressive timeboxing:

Two Secondary Targets: Pick exactly two other important things that require your attention (e.g., unblocking a specific team, resolving a structural tech-debt issue). Box 2 hours of total focus for each.
The Radar: Pick three or four essential things to simply monitor. These are KPIs, deployment health checks, or team status updates. You are building awareness, not doing deep work here.
Timebox Everything: Put these blocks directly into your calendar. A task without a calendar block is just a wish.

Systems Need Slack (Do Not Plan 100% of Your Time)

It is incredibly tempting to fill your available working hours entirely with planned blocks. This is a recipe for catastrophic failure.

In engineering, systems running at 100% utilization queue up requests and eventually crash. Your calendar works the exact same way. If your week is fully booked, a single production incident or urgent stakeholder request will cascade, destroying your entire plan.
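The queueing claim can be illustrated with the classic M/M/1 formula, where the mean time a request spends waiting in the queue is utilization / (1 - utilization) service times. The Python sketch below assumes that idealized model (random arrivals, one server), which the article itself does not name. The function name is hypothetical.

```python
def mm1_mean_wait(utilization, service_time=1.0):
    """Mean time a request waits in the queue of an M/M/1 system."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization) * service_time


for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    wait = mm1_mean_wait(rho)
    print(f"{rho:.0%} busy -> waits about {wait:.1f} service times")
```

The wait grows without bound as utilization approaches 100%, which is why the function rejects a utilization of exactly 1.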

If you follow the Leader’s Week framework, your planned priorities will consume about 15 to 20 hours.

That is intentional.

The remaining time is your slack. It absorbs the shocks of firefighting, allows for unstructured learning, and most importantly, guarantees you are actually available for your people when they need you. A leader with no slack is a bottleneck.
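The arithmetic behind the plan can be tallied in a few lines of Python. In this rough sketch the planning, One Priority, and secondary-target hours come from the text above. The Radar and review hours and the 32-hour week (four 8-hour days) are assumptions added for illustration, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WeekPlan:
    planning_hours: float = 0.75      # Monday planning block (30 to 45 minutes)
    one_priority_hours: float = 8.0   # at least 8 hours of deep work
    secondary_targets: int = 2        # exactly two secondary targets
    hours_per_secondary: float = 2.0  # 2 hours of focus for each
    radar_hours: float = 2.0          # assumption: not sized in the text
    review_hours: float = 0.5         # assumption: not sized in the text

    def planned(self) -> float:
        """Total hours committed to calendar blocks."""
        return (self.planning_hours
                + self.one_priority_hours
                + self.secondary_targets * self.hours_per_secondary
                + self.radar_hours
                + self.review_hours)

    def slack(self, available_hours: float) -> float:
        """Hours left unplanned to absorb interruptions."""
        return available_hours - self.planned()


plan = WeekPlan()
available = 4 * 8  # assumption: four 8-hour days
print(f"planned {plan.planned():.2f} h, slack {plan.slack(available):.2f} h")
```

With these assumed numbers the planned blocks land inside the 15 to 20 hour range, leaving roughly half of the week as slack.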

The Review

The End-of-Week Assessment (Debugging Your Schedule)

Many leaders skip the weekly assessment. They are so exhausted by the week’s sprint that they simply slam their laptops shut the moment the weekend begins.

But skipping the review turns this entire framework into a chore rather than a high-value feedback loop. Unreflected work leads to burnout and repeated mistakes.

The end-of-week assessment is not a report card for your boss; it is a mirror for you. It is the exact mechanism that gives you the peace of mind to completely disconnect. When you know exactly what you accomplished, what went wrong, and what you are leaving for next week, you can step away and recharge.

The Leadership Mirror (Assess Your Performance)

When assessing yourself, look beyond the code, the pull requests, and the Jira boards. High-performing teams are built by leaders who exhibit specific, observable behaviors.

I use the five characteristics of transformational leadership (from the DORA research and the book Accelerate) to run a quick system check on my own behavior:

Visionary: Did I connect our current, messy technical slog to the 6-month product vision?
Inspirational Communication: Did I frame a recent failure as a system flaw to learn from, rather than a human error?
Intellectual Stimulation: Did I challenge my team to rethink an old architectural problem, or did I just hand them the answer?
Supportive Leadership: Did I actively clear a blocker for a struggling engineer?
Personal Recognition: Who did I explicitly and publicly praise this week?

Measure the Reality (KPIs and Outcomes)

Return to the outcomes you defined on Monday morning. Did the needle actually move?

You don’t need a complex dashboard for this. Mark your key initiatives simply:

🟢 On track
🟠 At risk
🔴 Off track

If your “One Priority” is constantly glowing red week after week, you don’t have a time management problem—you have a strategic alignment or resourcing problem. This reality check makes that structural issue visible before it becomes a crisis.

Document the Builds (Celebrate Small Wins)

As makers, creators, and problem-solvers, we have a terrible habit of constantly moving the goalposts. The moment a complex feature ships, a migration finishes, or a nasty bug is squashed, we immediately look at the next mountain.

“It was just a small refactor.”
“It was expected of us anyway.”

Stop rejecting your accomplishments.

The things that feel like “business as usual” today were likely complex, out-of-reach challenges just a few months ago.

Documenting these wins—no matter how small—is the ultimate antidote to imposter syndrome. It builds a historical ledger of your team’s momentum.

Identify the Noise (Review Distractors)

Distractions are bugs in the operating system of your week.

Sometimes they are loud, like a Sev-1 production incident. More often, they are silent killers: recurring status meetings with no agenda, slow CI/CD pipelines, or fragmented communication across too many Slack channels.

Naming these distractors is the first step to eliminating them. If you notice that “ad-hoc requests from stakeholders” derailed your deep-work block three weeks in a row, you now have the actionable data needed to build a new intake process. You transition from complaining about the noise to engineering it out of your system.

The Artifacts (Your Downloadable Interface)

The Leader’s Week is a simple framework, but it requires a tangible artifact to work. You cannot hold this all in your head.

Although we live in a digital-first world, I highly recommend printing the Leader’s Week template. Using a physical pen takes away browser-based distractions and gives you a permanent, physical record of your progress to reflect back on.

The Final Word: Design Your Time, Protect Your Craft

Leadership can be incredibly loud.

If you aren’t careful, the organization will consume every hour you are willing to give it. You will reach the end of Friday exhausted, looking at a calendar full of meetings, wondering what you actually built.

But you are not just a manager. You are a maker. A creator. A designer of systems.

And makers need space.

Implementing the Leader’s Week wasn’t just about becoming a more effective leader. It was the exact mechanism that allowed me to compress my corporate responsibilities into four days. It gave me the leverage to step back, protect my creative energy, and spend my Fridays building my own projects and being fully present for my family.

You don’t have to negotiate a four-day workweek to find value in this framework. But you do need to stop letting the week happen to you.

Next Monday morning, before you open Slack. Before you check your email. Sit down with a blank piece of paper.

Pick your one priority. Draw the boundary. And take your time back.

Supplemental reads

More on Human Jira Router based on Anthropic’s announcements
Developmental Leadership
Leader-Follower model based on David Marquet’s book
Your job is to focus on the right things and why the Hard Work is Overrated

JayM

18 hours ago

Atlanta, GA

From Technical Debt to Cognitive and Intent Debt
Sunday June 7th, 2026 at 11:22 PM

ACM Queue - All Queue Content

JayM

1 day ago

Atlanta, GA

Why Queues Don’t Fix Overload (And What To Do Instead) by Peter Mbanugo
Friday June 5th, 2026 at 11:09 PM

P99 CONF

This post is about the physical laws of backpressure in software systems, latency death spirals, and why unbounded queues are a bug. Editor’s note: This is a guest post by Peter Mbanugo (originally published on Peter’s blog). Peter will be speaking at P99 CONF 2026. His topic: Multi-Core Without the Trilemma: Escaping Async/Await, Mutexes, and GC. Register now — the conference is free and…

JayM

3 days ago

Atlanta, GA

Over 116,000 Minecraft systems infected in WeedHack malware campaign by Bill Toulas
Tuesday June 2nd, 2026 at 9:13 PM

BleepingComputer

A large-scale malware campaign dubbed WeedHack is targeting Minecraft players and has infected more than 116,000 systems since January. [...]

JayM

6 days ago

Doh.

Atlanta, GA
