Nvidia GPUs vs Google TPUs and AWS Trainium Explained

10 min read

AI demand has turned chip choice into a business decision, not only an engineering one. If you run model training, large-scale inference, or edge AI, the hardware mix now shapes cost, speed, power use, and lock-in.

That matters because the market is no longer centred on one chip type. Nvidia still leads with GPUs, yet Google, Amazon, Meta, Microsoft and others are building custom silicon for their own AI workloads. The split between training, inference and on-device AI is driving that change.

Why Nvidia's GPUs became the default AI engine

Nvidia's rise in AI came from a chip built for another job. The GPU started as a graphics processor for games, yet its design turned out to be a near-perfect fit for neural networks. As generative AI took off, those same chips moved from gaming PCs into dense server racks across the world's biggest data centres.

The scale is hard to miss. Nvidia said it shipped 6 million Blackwell GPUs over the last year. In one of its rack-scale systems, 72 GPUs are linked so they behave like one giant accelerator for advanced AI workloads. Those systems sell for about $3 million each, and Nvidia has said it is shipping roughly 1,000 a week.

That demand has pushed Nvidia far beyond the role of component supplier. It sells GPUs directly to AI firms, including a reported deal for at least 4 million units to OpenAI. It also sells to governments such as South Korea, Saudi Arabia and the UK, and to cloud providers including Amazon, Microsoft and Google, which then rent that compute out by the hour, minute or second. Nvidia even runs its own GPU rental programme.

Most importantly, Nvidia is selling full systems, not only chips. That matters because performance, power draw and networking all improve when the rack is designed as a whole.

What makes a GPU good at AI

A CPU has a handful of powerful, general-purpose cores and works through tasks largely in sequence. A GPU takes the opposite approach: thousands of smaller cores that excel at running many maths operations at the same time.

That is exactly what modern AI needs. Neural networks rely heavily on matrix multiplication across tensors, which are multi-dimensional data structures. Training and inference both benefit when a chip can run that work in parallel.
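
To make that concrete, here is a minimal NumPy sketch of the operation that dominates a neural network layer: a matrix multiplication over a batch of inputs. The layer sizes are arbitrary placeholders, not tied to any real model; the point is that every output element is an independent dot product, which is exactly the kind of work a GPU can run in parallel.

```python
import numpy as np

# A toy dense layer: y = x @ W + b
# Shapes are illustrative placeholders, not taken from any real model.
batch_size, in_features, out_features = 64, 1024, 4096

x = np.random.randn(batch_size, in_features).astype(np.float32)    # input activations
W = np.random.randn(in_features, out_features).astype(np.float32)  # layer weights
b = np.zeros(out_features, dtype=np.float32)                        # bias

# One forward pass through the layer is dominated by this matrix multiply.
# Each of the batch_size x out_features outputs is an independent dot
# product, which is the parallel work a GPU is built to absorb.
y = x @ W + b
print(y.shape)  # (64, 4096)
```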

AI's turning point came in 2012 with AlexNet. That model won the ImageNet image recognition contest by a wide margin, and its researchers used Nvidia GPUs to unlock the parallel compute needed for deep learning. The same hardware trick that helped draw realistic graphics also helped neural nets learn from huge datasets.

Training and inference are the two main AI phases, but they are not the same job. Training teaches a model from large volumes of data. Inference is what happens later, when the trained model responds to new prompts or inputs. That is the form most people see in daily life, from a coffee-ordering app to workplace software or voice features in earbuds.
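
The split is easy to see in code. The sketch below, assuming a toy linear model and plain gradient descent, contrasts a training step, which updates weights from data, with inference, which only runs the fixed model forward on a new input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for a linear relationship y = 3x + 1 (illustrative only).
x_train = rng.normal(size=(256, 1)).astype(np.float32)
y_train = 3.0 * x_train + 1.0

w, b, lr = 0.0, 0.0, 0.1

# Training: repeatedly adjust w and b to reduce the prediction error.
for _ in range(200):
    pred = w * x_train + b
    err = pred - y_train
    w -= lr * float(np.mean(err * x_train) * 2)  # gradient of mean squared error w.r.t. w
    b -= lr * float(np.mean(err) * 2)            # gradient w.r.t. b

# Inference: the weights are now fixed; we only run the forward pass.
x_new = np.array([[2.0]], dtype=np.float32)
print(w * x_new + b)  # roughly [[7.0]]
```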

One short table helps frame the main chip categories:

| Chip type | Best fit | Main strength | Main drawback | Common examples |
| --- | --- | --- | --- | --- |
| GPU | General AI training and broad inference | Flexible, high parallel compute | Expensive and power-hungry | Nvidia Blackwell, AMD Instinct |
| Custom ASIC | Large-scale, specific AI workloads | Better efficiency for fixed tasks | Hard to change after design | Google TPU, AWS Trainium, Inferentia |
| NPU | AI on phones, laptops and embedded devices | Low power, low latency, local privacy | Limited compute scale | Apple Neural Engine, Snapdragon NPU |
| FPGA | Reconfigurable special workloads | Can be reprogrammed after manufacture | Lower AI performance and efficiency | AMD Xilinx, Intel Altera |

The table shows why GPUs still hold such a strong position. They are costly, but they can handle many different AI jobs without redesigning the hardware.

AMD is the main GPU rival

Nvidia is not alone in rack-scale AI systems. AMD is the clearest challenger with its Instinct GPU line, and it has won large commitments from companies such as OpenAI and Oracle.

The software layer is a major difference between the two. Nvidia's GPUs are closely tied to CUDA, its proprietary software platform, and CUDA is one reason Nvidia has such a strong developer base. AMD, by contrast, leans on a more open software approach built around its ROCm stack. That gives buyers a different trade-off: they may get more openness, but Nvidia still benefits from years of tooling, optimisation and installed code.
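
A small sketch of what framework-level portability looks like in practice: ROCm builds of PyTorch expose the same `"cuda"` device string as Nvidia builds, so code written at this level tends to move across, while code written directly against CUDA kernels or CUDA-only libraries does not. The matrix sizes below are arbitrary.

```python
import torch

# On an Nvidia build of PyTorch this targets CUDA; on a ROCm (AMD) build
# the same "cuda" device string maps to the AMD GPU. Code written directly
# against CUDA kernels or CUDA-only libraries does not port this easily.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)
c = a @ b  # runs on whichever accelerator (or CPU) is available
print(c.device, c.shape)
```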

Nvidia also keeps widening its system-level pitch. Its Grace Blackwell machines combine GPUs with the Grace CPU as a host processor, because even AI accelerators still need CPUs for many control tasks. Meanwhile, Nvidia says its next-generation Rubin GPU will enter full production next year.

For readers who want a vendor view of the current Blackwell push, Nvidia has published its own Blackwell training performance details.

The market is moving from training to inference

The first wave of large language models put most attention on training. That made sense. Pre-training giant models takes vast amounts of compute, and GPUs were the obvious fit.

Now the balance is shifting. Post-training methods have made models more useful, and the next step is serving those models to users at scale. That means inference, not only training, is becoming the centre of the economics.

The AI chip market is splitting by workload. GPUs still dominate flexible training, while custom chips gain ground where inference scale and power efficiency matter most.

Inference is how AI becomes a product. It is the chatbot reply, the coding assistant suggestion, the ranking result, or the speech feature inside an app. As usage grows, cost per token, latency and power draw become harder to ignore. A flexible GPU can do that work, but it is not always the cheapest way to do it.
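
A rough back-of-envelope sketch shows why serving economics dominate at scale. Every figure below, the hourly accelerator price, the sustained tokens per second and the daily traffic, is an illustrative assumption, not a quoted price or benchmark.

```python
# Back-of-envelope inference cost model. All numbers are illustrative assumptions.
hourly_price_usd = 4.00      # assumed rental price for one accelerator
tokens_per_second = 2_500    # assumed sustained serving throughput per accelerator
daily_tokens_served = 5e9    # assumed traffic: 5 billion tokens per day

tokens_per_hour = tokens_per_second * 3600
cost_per_million_tokens = hourly_price_usd / tokens_per_hour * 1e6
accelerators_needed = daily_tokens_served / (tokens_per_hour * 24)
daily_cost = accelerators_needed * hourly_price_usd * 24

print(f"${cost_per_million_tokens:.2f} per million tokens")
print(f"~{accelerators_needed:.0f} accelerators, ~${daily_cost:,.0f} per day")
```

Under these assumptions the fleet costs a few thousand dollars a day; halving cost per token, whether through cheaper silicon or better utilisation, flows straight into that bill, which is why serving hardware gets so much scrutiny.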

That shift explains why custom ASICs are growing so quickly. A GPU is broad and adaptable. An ASIC is narrower, but much more focused. For hyperscalers with stable, repeatable workloads, that focus can save both money and electricity.

Nvidia has not ignored this change. Its public material on AI inference at scale with Blackwell makes clear that inference is now a major selling point, not a side note. Still, the more the market moves towards repeated serving work, the more room there is for purpose-built silicon.

Why Google and AWS built custom AI chips

Custom ASIC means application-specific integrated circuit. The name says most of what matters. These chips are built for a narrower set of tasks than a GPU, and that focus can make them faster and more efficient for those tasks.

The trade-off is flexibility. Once an ASIC is taped out and made in silicon, you cannot change its core design. A GPU can support many AI workloads. An ASIC is better when the workload is known in advance and will run at huge scale for years.

That is why the biggest cloud firms can justify the cost. Designing a custom AI chip can cost tens or even hundreds of millions of dollars. Start-ups usually cannot absorb that. Hyperscalers can, because they spread that cost across enormous fleets and reduce their dependence on Nvidia at the same time.
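
A simplified break-even sketch makes the hyperscaler logic concrete. The design cost, per-unit saving and fleet size below are illustrative assumptions only.

```python
# Simplified ASIC break-even model. All figures are illustrative assumptions.
design_cost_usd = 300e6      # assumed one-off cost to design and tape out the chip
saving_per_chip_usd = 8_000  # assumed saving per accelerator versus buying a GPU
fleet_size = 200_000         # assumed number of chips deployed

break_even_units = design_cost_usd / saving_per_chip_usd
net_saving = fleet_size * saving_per_chip_usd - design_cost_usd

print(f"Break-even at {break_even_units:,.0f} chips")
print(f"Net saving at {fleet_size:,} chips: ${net_saving / 1e9:.1f}B")
```

At start-up volumes the design cost never pays back. At hyperscaler volumes it becomes a rounding error against the savings, which is the whole argument for custom silicon.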

Google's TPU set the pattern

Google was first to move here. It began deploying its Tensor Processing Unit, or TPU, inside its own data centres in 2015 and revealed it publicly the following year. That gave Google an in-house AI accelerator years before most rivals had one.

The TPU effort also sat inside the same broader research environment that produced the transformer architecture at Google in 2017. Since transformers power almost all modern generative AI systems, that timing matters.

Google showed its sixth-generation TPU, Trillium, in 2024. A system can link multiple chips together with host CPUs so they act as one large supercomputer. Later, Google unveiled Ironwood as its seventh-generation TPU. It also struck a major deal with Anthropic to train Claude on up to a million TPUs.

Google has often kept TPUs close to home, even though many engineers see them as serious peers to Nvidia in technical terms. More recently, Google has opened cloud access more broadly. Its own Trillium TPU announcement shows how much emphasis it places on memory, interconnect and scale-out design.

AWS split its custom chips by job

Amazon Web Services followed after buying Israeli chip start-up Annapurna Labs in 2015. It announced Inferentia in 2018 for inference, then launched Trainium in 2022 for training. The platform is now moving towards its third generation.

AWS frames the naming clearly. Inferentia is for model serving. Trainium is for model training. Over time, though, Trainium has been positioned as useful for both jobs.

The internal architecture differs from Google's approach. Trainium is described more as a cluster of many smaller tensor engines, while the TPU has been presented as a more rigid, matrix-focused grid built around a large systolic array. Those design choices shape how each platform balances flexibility, throughput and software support.
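
As a conceptual illustration only, not a description of either vendor's hardware, the sketch below splits one large matrix multiply into independent output tiles. The `tiled_matmul` helper and the tile size are hypothetical; the point is that the same contraction can be carved into pieces for many smaller engines or kept as one block for a single large matrix unit.

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 128) -> np.ndarray:
    """Compute a @ b by independent output tiles. Each tile could, in
    principle, be handed to a separate small engine. Conceptual sketch only."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            out[i:i + tile, j:j + tile] = a[i:i + tile, :] @ b[:, j:j + tile]
    return out

a = np.random.randn(512, 512).astype(np.float32)
b = np.random.randn(512, 512).astype(np.float32)
print(np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3))  # True
```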

AWS says Trainium delivers about 30 to 40% better price-performance than other hardware options inside AWS. It has also built a massive Trainium 2 cluster in northern Indiana, where Anthropic is training models on 500,000 chips. That deployment matters because it shows a hyperscaler can build a large AI cluster without Nvidia at the centre of that specific site.

Even so, AWS keeps buying large volumes of Nvidia GPUs for other data centres, because customer demand still depends heavily on Nvidia capacity.

Broadcom, Marvell and the wider ASIC field

Many hyperscalers do not want to build every part of a chip programme on their own. That is where partners such as Broadcom and Marvell come in. They provide design IP, silicon know-how and networking support.

Broadcom is the most visible example. It helped with Google's TPU work, supported Meta's training and inference accelerator effort, and is part of OpenAI's reported plan to build custom ASICs from 2026. This part of the market often gets less attention than the cloud brands, yet it may capture a large share of the value outside the GPU vendors themselves.

The list of other players keeps growing. Microsoft has announced Maia chips for Azure, although later versions have faced delays. Intel has its Gaudi line. Tesla has announced its own ASIC. Qualcomm is pushing into data-centre AI with its AI200 accelerator. Start-ups are also trying unusual approaches, including Cerebras with full-wafer chips and Groq with inference-focused language processing units.

The pattern is clear. The biggest buyers still want Nvidia and AMD capacity, but they also want more control over cost, supply and power use.

Edge AI is a different race

Not all AI runs in a data centre. More of it now runs on the device in your hand, on your desk, or inside a car.

That changes the chip design problem. A phone or laptop cannot house a giant data-centre accelerator. It needs a compact, low-power block inside a system-on-chip, or SoC. That block is usually called an NPU, short for neural processing unit.

NPUs bring AI closer to the user

An NPU is a dedicated AI accelerator integrated into a larger chip. It shares space with CPU, graphics and other functions, because a phone or laptop needs one compact package.

The main gain is local processing. When AI runs on the device, data does not always need to go back to a remote data centre. That improves response time and can help protect privacy. It also gives the device maker tighter control over the user experience.
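
One reason small on-device accelerators can run useful models at all is aggressive quantisation. Below is a minimal sketch, assuming a simple symmetric per-tensor int8 scheme, of how weights shrink to a quarter of their float32 size at the cost of a small rounding error. Real NPU toolchains use more sophisticated methods, so treat this as illustration only.

```python
import numpy as np

# Minimal symmetric per-tensor int8 weight quantisation (illustrative only).
w = np.random.randn(1024, 1024).astype(np.float32)  # float32 weights

scale = np.abs(w).max() / 127.0                      # map the largest |w| to 127
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_restored = w_int8.astype(np.float32) * scale       # dequantised copy

print(f"size: {w.nbytes / 1e6:.1f} MB -> {w_int8.nbytes / 1e6:.1f} MB")
print(f"mean abs error: {np.abs(w - w_restored).mean():.5f}")
```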

Apple has made this a core part of its hardware story. Its M-series chips for MacBooks include a dedicated neural engine, even if Apple does not always use the term NPU. Similar neural accelerators now sit inside recent iPhone A-series chips. Qualcomm, Intel and AMD are also major NPU suppliers for AI PCs.

On Android devices, Qualcomm's Snapdragon line includes NPUs, and Samsung builds its own NPU features into Galaxy phones. The same class of chip is now moving into cars, cameras, robots, smart-home devices and other embedded systems. NXP and Nvidia are both active in those markets.

The centre of spend is still in the data centre today. Over time, more AI will spread to phones, wearables, vehicles and industrial hardware.

FPGAs still matter when flexibility wins

Field-programmable gate arrays sit in a middle ground. They can be reconfigured with software after manufacture, which makes them useful when a workload may change or when buyers do not want to design a full ASIC.

That flexibility comes with a cost. FPGAs usually deliver lower raw AI performance and weaker energy efficiency than a purpose-built ASIC. If you will deploy thousands of identical accelerators, an ASIC is often cheaper in the long run.

AMD became the largest FPGA vendor after buying Xilinx for $49 billion in 2022. Intel is also a major player because of its $16.7 billion Altera acquisition in 2015.

The AI chip race also depends on factories and power

The biggest AI chip brands do not manufacture most of these processors themselves. They depend on foundries, and one company sits at the centre of that map: TSMC in Taiwan.

That dependence has made chip supply a geopolitical issue. Advanced AI processors are strategic assets, and concentration in Taiwan has pushed the US to expand local manufacturing. TSMC now has a major fabrication plant in Arizona. Apple has committed some production there, although its latest A19 Pro remains tied to TSMC's 3-nanometre process in Taiwan.

Nvidia's Blackwell is built on TSMC's 4-nanometre node, which means Blackwell can now be manufactured in Arizona at volume. Intel is also trying to rebuild its foundry position, with advanced 18A chips at a new Arizona fab.

China is building its own path as well. Huawei, ByteDance and Alibaba are among the major Chinese groups developing custom ASICs. Yet US export controls limit access to the most advanced manufacturing tools and to top-tier AI chips such as Blackwell.

Power may become the next bottleneck. Training clusters and inference farms need massive amounts of electricity. If the US wants to keep leading in AI, it needs far more energy infrastructure. Chip design alone will not solve that problem.

What this means for cloud and platform teams

For engineering leaders, the AI chip market is no longer a one-vendor story. It is a portfolio problem. GPUs remain the default for broad workloads, especially when software portability matters. Custom ASICs make more sense when demand is predictable, power costs matter, and the buyer controls enough scale to justify a narrow design.

That split affects budgeting as much as architecture. CFOs care about utilisation and energy. CTOs care about software support, supply and placement. Platform teams care about where data sits, how workloads move, and which clouds offer the right hardware at the right time.

In practice, many organisations will end up using more than one class of chip across more than one environment. Teams comparing shared GPU pools, cloud-specific ASIC capacity and on-prem placement can Book a Meeting with our Infra Experts for a deeper discussion of capacity, data location and workload fit.

Nvidia still has the strongest developer position, and that is hard to replace. Yet the market is large enough that Google, AWS and other silicon builders do not need to beat Nvidia everywhere. They only need to win the workloads that fit their own platforms best.

The AI boom started with GPUs, but it will not end with one chip type. Training, inference, edge compute, manufacturing and power are now pulling the market in different directions.

That is why the most useful question is no longer which chip wins overall. The real question is which chip fits the workload, the software stack and the location where the work needs to run.