Google's Agentic TPUs: Built for the AI That Acts, Not Just Thinks

Google’s latest chip announcement at Cloud Next ’26 isn’t just a spec bump. It’s an architectural confession and a strategic declaration of war. For the first time in the history of its custom silicon program, Google split its TPU into two separate chips: one for training, one for inference. That decision alone tells you more about where AI is heading than any press release paragraph about the “agentic era.”

What Google Actually Announced

The eighth generation of Google’s Tensor Processing Units breaks into two products with very different jobs.

TPU 8t is the training workhorse. A single superpod of 9,600 chips delivers 121 ExaFlops of compute with two petabytes of shared high-bandwidth memory. Google claims 2.7x better price-performance over its previous generation (Ironwood TPU, which itself launched just a year ago). The real party trick is near-linear scaling to one million chips in a single logical cluster — enabled by a new network fabric called Virgo combined with JAX and Google’s Pathways software. The pitch: compress frontier model training timelines from months to weeks.
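The superpod totals are easier to compare across generations once they are broken down per chip. A back-of-the-envelope sketch, assuming decimal units (1 ExaFlop = 10^18 FLOP/s, 1 PB = 10^15 bytes) and an even split across chips; the announcement as quoted here states neither, nor the numeric precision behind the ExaFlops figure:

```python
# Per-chip figures derived from the quoted superpod totals.
# Assumptions (not stated in the announcement): decimal units and an even
# split of compute and memory across all chips.
CHIPS_PER_SUPERPOD = 9_600
POD_FLOPS = 121e18       # 121 ExaFlops
POD_HBM_BYTES = 2e15     # 2 PB of shared HBM


def per_chip(total: float, chips: int = CHIPS_PER_SUPERPOD) -> float:
    """Divide a pod-wide total evenly across the chips in the pod."""
    return total / chips


flops_per_chip_pf = per_chip(POD_FLOPS) / 1e15    # PetaFlops per chip
hbm_per_chip_gb = per_chip(POD_HBM_BYTES) / 1e9   # GB of HBM per chip

print(f"{flops_per_chip_pf:.1f} PFLOPS per chip")  # 12.6 PFLOPS per chip
print(f"{hbm_per_chip_gb:.0f} GB HBM per chip")    # 208 GB HBM per chip
```

These are derived estimates, not published per-chip specs.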

TPU 8i is the inference chip, and it’s the more interesting announcement. It packs 288 GB of high-bandwidth memory alongside 384 MB of on-chip SRAM, 3x the on-chip SRAM of the previous generation, with the aim of keeping a model’s active working set on-chip rather than constantly fetching it from off-chip memory. ICI bandwidth doubles to 19.2 Tb/s, network diameter shrinks by over 50%, and a new Collectives Acceleration Engine (CAE) cuts on-chip latency by up to 5x. There’s also a new serving topology called Boardfly. The headline number: 80% better performance-per-dollar for inference versus Ironwood, which works out to 1.8x (nearly double) the customer volume at the same cost.
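A performance-per-dollar gain can be read two ways: more work for the same budget, or the same work for a smaller budget. The two are not symmetric, which is easy to get wrong. A small sketch of the conversion; the function name is illustrative, and the 0.80 input is the only figure taken from the announcement:

```python
# Convert a "performance per dollar" gain into (a) the volume multiplier at
# the same spend and (b) the fractional cost saving at the same volume.
def perf_per_dollar_effects(gain: float) -> tuple[float, float]:
    volume_multiplier = 1.0 + gain               # same budget, more work
    cost_reduction = 1.0 - 1.0 / (1.0 + gain)    # same work, smaller budget
    return volume_multiplier, cost_reduction


volume, saving = perf_per_dollar_effects(0.80)
print(f"{volume:.1f}x volume at the same cost")      # 1.8x volume at the same cost
print(f"{saving:.0%} lower cost at the same volume")  # 44% lower cost at the same volume
```

So an 80% gain nearly doubles throughput per dollar, but it cuts the bill for a fixed workload by about 44%, not 50%.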

Both chips claim 2x better performance-per-watt — a number that matters when you’re running racks of thousands of chips around the clock.

The Real Move: Specialization Is an Admission

Here’s what nobody is saying directly: Google bifurcating its TPU line is an acknowledgment that training and inference have become genuinely irreconcilable workloads.

Training is a batch problem. You want maximum throughput, massive collective communication, and the ability to synchronize gradients across thousands of chips. Memory capacity per chip matters, but latency tolerance is high.
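To make “synchronize gradients” concrete, here is a toy data-parallel step in plain Python. It is only a stand-in for the idea: real TPU training relies on hardware collectives driven by JAX and Pathways, and the model, data, and chip count below are all made up for illustration.

```python
# Toy data-parallel step: each "chip" computes a gradient on its own shard,
# then an all-reduce (here a plain average) gives every chip the same
# gradient as if it had seen the full batch.
# Model: y = w * x with mean squared-error loss. All values are illustrative.
def shard_gradient(w, xs, ys):
    # d/dw of mean((w*x - y)^2) over the given samples
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    # Stand-in for a hardware collective: average across all participants.
    return sum(values) / len(values)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]   # data generated with a true weight of 2.0
w = 0.5                      # current (wrong) weight
num_chips = 4
shard = len(xs) // num_chips

local_grads = [
    shard_gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(num_chips)
]
synced_grad = all_reduce_mean(local_grads)
full_grad = shard_gradient(w, xs, ys)

print(synced_grad, full_grad)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, which is why the whole step can wait on one collective: throughput matters, per-step latency much less.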

Inference — especially agentic inference — is the opposite. Agents run in loops. They call tools, wait for results, generate the next step, and check back. Multi-step reasoning chains with a frontier model can involve dozens of discrete inference passes per user request. Latency is everything. And Mixture-of-Experts architectures (the dominant design pattern for efficient frontier models) create wildly irregular memory access patterns where on-chip SRAM becomes a critical bottleneck.
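The reason per-pass latency dominates agentic workloads is that the steps run in sequence, so every saving is multiplied by the number of steps. A toy model of one request; every number below is a hypothetical placeholder, not a measured TPU or GPU figure:

```python
# Toy latency model for a single agentic request.
# All timings are hypothetical placeholders chosen for easy arithmetic.
def request_latency_s(steps: int, inference_s: float, tool_s: float) -> float:
    # Steps are sequential: each one needs a model pass, then a tool call.
    return steps * (inference_s + tool_s)


STEPS = 24      # "dozens" of inference passes per user request
TOOL_S = 0.25   # assumed average tool round-trip, in seconds

slow = request_latency_s(STEPS, inference_s=1.0, tool_s=TOOL_S)
fast = request_latency_s(STEPS, inference_s=0.5, tool_s=TOOL_S)
print(f"{slow:.1f}s vs {fast:.1f}s per request")  # 30.0s vs 18.0s per request
```

In this sketch, halving the per-pass inference time saves 12 seconds on one request, while the tool round-trips set a floor that no chip can remove.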

Google isn’t the first to see this. AWS has been running Trainium for training and Inferentia for inference for years. The difference is that Google’s split comes with frontier-tier numbers and the direct competitive shadow of Nvidia looming over every slide.

The Competitive Math

Nvidia’s Blackwell architecture (B200, GB200) remains the dominant benchmark. Blackwell cards deliver roughly 5x the throughput of H100s at a price tag around $40K per GPU, and they were essentially sold out through mid-2026 with a reported backlog of 3.6 million units before the year even started. That scarcity is both Nvidia’s biggest problem and its biggest moat — if you can’t get the chips, the ecosystem lock-in barely matters.

Google’s positioning is deliberately different. You can’t buy a TPU — you rent access through Google Cloud. That means Google controls the entire stack from silicon to software to pricing. For organizations already inside the Google ecosystem (BigQuery, Vertex AI, Gemini APIs), that’s a meaningful convenience. For everyone else, it means you’re still rewriting your PyTorch code into JAX/XLA to use these chips efficiently, which has historically been the wall that stops most teams from actually switching.

The 80% inference price-performance improvement is real money at scale. If you’re running hundreds of millions of inference calls per day, cutting your compute cost by roughly 44% (the same work at 1/1.8 of the spend) isn’t a marketing claim; it can be a nine-figure budget line. But the comparison is against Google’s own previous chip, not Nvidia. If and when Google benchmarks TPU 8i against a Blackwell B200 for MoE inference workloads, that’s the number worth watching.

Why the Agentic Framing Is Actually Right

Usually when a hardware company stamps “for the agentic era” on a product, you can safely ignore it as positioning. Here, the framing is surprisingly accurate.

Agentic workloads are genuinely different from the inference tasks that dominated 2023-2024. Agents don’t make a single API call and return a result. They reason, retrieve, tool-call, and iterate, and they are increasingly trained with reinforcement learning from human or environment feedback. That’s why the TPU 8i explicitly targets both inference and reinforcement learning. The RL piece is the tell. RL training loops are dominated by rollouts: fast, low-latency, repeated forward passes through large models, which is exactly what the tripled on-chip SRAM and the CAE’s up-to-5x latency reduction are designed for.

If the next generation of AI products runs on reasoning loops rather than single-shot responses, and the evidence so far suggests it will, the infrastructure bet Google is making here is coherent.

The Honest Verdict

The specs are legitimate. 121 ExaFlops in a training pod, 19.2 Tb/s ICI bandwidth, and 384 MB of on-chip SRAM are not marketing rounding errors. Google has spent a decade building custom silicon precisely so it doesn’t have to pay the Nvidia tax — and those investments are showing up in concrete performance numbers.

But three caveats matter.

First, the benchmarks compare to Ironwood TPU, not to Nvidia. Until there’s an apples-to-apples comparison against Blackwell on real MoE inference workloads, the “80% better” number exists inside a Google-defined reference frame.

Second, availability is fuzzy. “General availability later in 2026” is doing a lot of work. Google’s own TPU demand is currently outstripping supply, with older generations at 100% utilization. Running your code on TPU 8i will likely mean waiting in a queue first.

Third, the developer experience gap with Nvidia is still real. CUDA plus PyTorch is a decade-deep moat. JAX is excellent software, but “excellent software with high switching cost” has never beaten “ubiquitous software with zero switching cost” on adoption curves.

Google’s eighth-gen split is the right architectural decision. Specialized silicon beats general-purpose silicon for defined workloads. That’s not a theory; it’s how ASICs displaced CPUs for networking and cryptography. The question isn’t whether TPU 8i and 8t are good chips. They are. The question is whether Google Cloud can execute on availability and developer onboarding well enough that teams outside of Google’s own walls actually run their agents on them.

That answer arrives later in 2026. So does the hardware.

